Abstract
Han Chinese history has been shaped by substantial demographic activity and sociocultural transmission. However, it remains challenging to assess the contributions of demic and cultural diffusion to Han culture and language, primarily because genetic–linguistic congruence has not been rigorously examined. Here we digitized a large-scale linguistic inventory comprising 1,018 lexical traits across 926 dialect varieties. Using phylogenetic analysis and admixture inference, we revealed a north–south gradient of lexical differences that probably resulted from historical migrations. Furthermore, we quantified extensive horizontal language transfers and pinpointed central China as a dialectal melting pot. Integrating genetic data from 30,408 Han Chinese individuals, we compared the lexical and genetic landscapes across 26 provinces. Our results support a hybrid model in which demic diffusion predominantly impacts central China, while cultural diffusion and language assimilation occur in southwestern and coastal regions, respectively. This interdisciplinary study sheds light on the complex social-genetic history of the Han Chinese.
Data availability
The lexical inventory of Chinese dialects and the other datasets needed to reproduce the results in the paper are available in the Supplementary Tables. The allele frequency data of genetic variants used in this study are available via the PGG.Han website at https://www.biosino.org/pgghan/data/variant. The pairwise FST for any two provincial populations and the inferred ancestry composition for each of the provincial populations are available on GitHub at https://github.com/Shuhua-Group/Genetic-characteristics-of-the-Han100K-initiative or via Zenodo at https://doi.org/10.5281/zenodo.10816923 (ref. 112). The use of genetic data in this work was approved by the National Health Commission of the People’s Republic of China (No. 2024BAT00503). Release of the summary statistics of genetic data in this work is recorded by the National Health Commission (NHC) of the People’s Republic of China at the Open Archive for Miscellaneous Data (OMIX) under accession number OMIX004518. All data generated or analysed during this study are included in this Article, its Supplementary Information and publicly available repositories.
Code availability
The code required to transform the data into the statistics and outputs reported in the paper is available in the Supplementary Information and has been deposited on GitHub at https://github.com/JoshuaThieriot/Chinese-dialect-project and on Zenodo at https://doi.org/10.5281/zenodo.10867759 (ref. 113).
References
Pagel, M. Human language as a culturally transmitted replicator. Nat. Rev. Genet. 10, 405–415 (2009).
Diamond, J. & Bellwood, P. Farmers and their languages: the first expansions. Science 300, 597–603 (2003).
Cavalli-Sforza, L. L., Menozzi, P. & Piazza, A. The History and Geography of Human Genes (Princeton Univ. Press, 1994).
Bouckaert, R. et al. Mapping the origins and expansion of the Indo-European language family. Science 337, 957–960 (2012).
Gray, R. D., Drummond, A. J. & Greenhill, S. J. Language phylogenies reveal expansion pulses and pauses in Pacific settlement. Science 323, 479–483 (2009).
Tagore, D., Aghakhanian, F., Naidu, R., Phipps, M. E. & Basu, A. Insights into the demographic history of Asia from common ancestry and admixture in the genomic landscape of present-day Austroasiatic speakers. BMC Biol. 19, 61 (2021).
de Filippo, C., Bostoen, K., Stoneking, M. & Pakendorf, B. Bringing together linguistic and genetic evidence to test the Bantu expansion. Proc. Biol. Sci. 279, 3256–3263 (2012).
Tambets, K. et al. Genes reveal traces of common recent demographic history for most of the Uralic-speaking populations. Genome Biol. 19, 139 (2018).
Robbeets, M. et al. Triangulation supports agricultural spread of the Transeurasian languages. Nature 599, 616–621 (2021).
Ge, J., Wu, S. & Cao, S. Zhongguo Yi Min Shi (History of Migrations in China) (Fujian People’s Publishing House, 1997).
Zhou, Z. & Lo, K. Migrations in Chinese history and their legacy on Chinese dialects. J. Chin. Linguist. Monogr. Ser. 3, 29–49 (1991).
Coblin, W. S. Migration history and dialect development in the lower Yangtze watershed. Bull. Sch. Orient. Afr. Stud. Univ. Lond. 65, 529–543 (2002).
Lee, J. Z. in Annales de demographie historique Vol. 1982 279–304 (Persée, 1982).
Lee, J. & Wong, R. B. Population movements in Qing China and their linguistic legacy. J. Chin. Linguist. Monogr. Ser. 3, 50–75 (1991).
Xu, S. et al. Genomic dissection of population substructure of Han Chinese and its implication in association studies. Am. J. Hum. Genet. 85, 762–774 (2009).
Wen, B. et al. Genetic evidence supports demic diffusion of Han culture. Nature 431, 302–305 (2004).
Deng, W. et al. Evolution and migration history of the Chinese population inferred from Chinese Y-chromosome evidence. J. Hum. Genet. 49, 339–348 (2004).
Ethnologue: Languages of the World (SIL International, 2023).
The Sino-Tibetan Languages (Routledge, 2016).
Norman, J. Chinese (Cambridge Univ. Press, 1988).
Yuan, J. Hanyu Fangyan Gaiyao (Shangwu Yinshuguan, 2003).
Coblin, W. S. A brief history of Mandarin. J. Am. Orient. Soc. 120, 537–552 (2000).
Hamed, M. B. Neighbour-nets portray the Chinese dialect continuum and the linguistic legacy of China’s demic history. Proc. R. Soc. B 272, 1015–1022 (2005).
Zheng, Z. & Xiong, Z. (eds) Language Atlas of China 2nd edition Vol. Chinese Dialects (Shangwu Yinshuguan, 2012).
Kurpaska, M. Chinese Language(s): A Look Through the Prism of the Great Dictionary of Modern Chinese Dialects. Chinese Language(s) (De Gruyter Mouton, 2010).
Ho, D. in The Oxford Handbook of Chinese Linguistics (eds Wang, W. S.-Y. & Sun, C.) 149–160 (Oxford Univ. Press, 2015).
LaPolla, R. J. in Areal Diffusion and Genetic Inheritance: Problems in Comparative Linguistics (eds Aikhenvald, A. Y. & Dixon, R. M. W.) 225–254 (Oxford Univ. Press, 2001).
Xue, F. et al. A spatial analysis of genetic structure of human populations in China reveals distinct difference between maternal and paternal lineages. Eur. J. Hum. Genet. 16, 705–717 (2008).
LaPolla, R. J. in The Cambridge Handbook of Language Contact Vol. 1 (eds Escobar, A. M. & Mufwene, S. S.) 64–83 (Cambridge Univ. Press, 2022).
Zhang, M. Diversity of language structure is shaped by demographic activities: comment on ‘Rethinking foundations of language from a multidisciplinary perspective’ by T. Gong et al. Phys. Life Rev. 26–27, 146–148 (2018).
Cao, Z. et al. (eds) Hanyu Fangyan Dituji (Linguistic Atlas of Chinese Dialects) Vol. Lexicon (Shangwu Yinshuguan, 2008).
Coblin, W. S. Neo-Hakka, Paleo-Hakka, and Early Southern Highlands Chinese. Yuyán Ánjiù Jíkan 21, 175–238 (2018).
Baker, H. D. R. Migration and ethnicity in Chinese history: Hakkas, Pengmin, and their neighbours. By Sow-Theng Leong edited By Tim Wright, pp. xix, 234, 1 fig., 11 maps. Stanford, California, Stanford Univ. Press. 1997. J. R. Asiat. Soc. 9, 350–351 (1999).
Hashimoto, M. J. Origin of the East Asian linguistic structure: latitudinal transitions and longitudinal developments of East and Southeast Asian languages. Comput. Anal. Asian Afr. Lang. 24, 35–42 (1984).
Hashimoto, M. in Contributions to Sino-Tibetan Studies 76–97 (Brill, 1986).
Hashimoto, M. Language diffusion on the Asian continent: problems of typological diversity in Sino-Tibetan. Comput. Anal. Asian Afr. Lang. 3, 49–65 (1976).
Yue-Hashimoto, A. The lexicon in syntactic change: lexical diffusion in Chinese syntax. J. Chin. Linguist. 21, 213–254 (1993).
Weir, B. S. & Cockerham, C. C. Estimating F-statistics for the analysis of population structure. Evolution 38, 1358–1370 (1984).
Bryant, D. & Moulton, V. Neighbor-net: an agglomerative method for the construction of phylogenetic networks. Mol. Biol. Evol. 21, 255–265 (2004).
List, J.-M., Shijulal, N.-S., Martin, W. & Geisler, H. Using phylogenetic networks to model Chinese dialect history. Lang. Dyn. Change 4, 222–252 (2014).
Pulleyblank, E. G. Chinese dialect studies. J. Chin. Linguist. Monogr. Ser. 3, 429–453 (1991).
Zhang, M.-H., Pan, W.-Y., Yan, S. & Jin, L. Phonemic evidence reveals interwoven evolution of Chinese dialects. Preprint at https://doi.org/10.48550/arXiv.1802.05820 (2018).
Coblin, W. S. A Study of Comparative Gàn (Institute of Linguistics, Academia Sinica, 2015).
Iwata, R. Chinese geolinguistics: history, current trends, and theoretical issues. Dialectologia: revista electrònica 1, 97–121 (2010).
You, R. et al. Hanyu Fangyanxue Daolun (Chinese Dialectology) (Shanghai Jiaoyu Chubanshe, 1992).
Levinson, S. C. & Gray, R. D. Tools from evolutionary biology shed new light on the diversification of languages. Trends Cogn. Sci. 16, 167–173 (2012).
Syrjänen, K., Honkola, T., Lehtinen, J., Leino, A. & Vesakoski, O. Applying population genetic approaches within languages: Finnish dialects as linguistic populations. Lang. Dyn. Change 6, 235–283 (2016).
Dor, D. & Jablonka, E. From cultural selection to genetic selection: a framework for the evolution of language. Selection 1, 33–56 (2001).
Carling, G., Cronhamn, S., Lundgren, O., Bogren Svensson, V. & Frid, J. The evolution of lexical semantics dynamics, directionality, and drift. Front. Commun. https://doi.org/10.3389/fcomm.2023.1126249 (2023).
Lupyan, G. & Dale, R. Language structure is partly determined by social structure. PLoS ONE 5, e8559 (2010).
Romano, N., Ranacher, P., Bachmann, S. & Joost, S. Linguistic traits as heritable units? Spatial Bayesian clustering reveals Swiss German dialect regions. J. Linguist. Geogr. 10, 11–22 (2022).
Jackson, D. A. Stopping rules in principal components analysis: a comparison of heuristical and statistical approaches. Ecology 74, 2204–2214 (1993).
Shen, R. in The Palgrave Handbook of Chinese Language Studies (ed. Ye, Z.) 441–456 (Palgrave Macmillan, 2021).
Norman, J. Guanyu guanhuafangyan zaoqi fazhan de yixie xiangfa (some thoughts on the early development of Mandarin). Dialect 4, 295–300 (2004).
Liu, X. Zailun hanyu beifanghua de fenqu (On the dialect areas of Northern Chinese). Zhongguo Yuwen 8, 439–452 (1995).
Hashimoto, M. J. The Hakka dialect: a linguistic study of its phonology, syntax and lexicon. Bull. Sch. Orient. Afr. Stud. 37, 278–279 (1974).
Hashimoto, M. J. Hakka in Wellentheorie perspective. J. Chin. Linguist. 20, 1–49 (1992).
Yan, M. M. Introduction to Chinese Dialectology (LINCOM Europa, 2006).
Chappell, H. in Sinitic Grammar: Synchronic and Diachronic Perspectives (ed. Chappell, H.) 3–28 (Oxford Univ. Press, 2001).
Norman, J. The Mǐn dialects in historical perspective. J. Chin. Linguist. Monogr. Ser. 3, 323–358 (1991).
Lipson, M. et al. Efficient moment-based inference of admixture parameters and sources of gene flow. Mol. Biol. Evol. 30, 1788–1802 (2013).
Sagart, L. Gan, Hakka and the Formation of Chinese Dialects (Academia Sinica, 2002).
Szeto, P. Y., Ansaldo, U. & Matthews, S. Typological variation across Mandarin dialects: an areal perspective with a quantitative approach. Linguist. Typol. 22, 233–275 (2018).
You, R. & Zhenhe, Z. Fangyan Yu Zhongguo Wenhua (Dialects and Chinese Culture) (Shanghai Renmin Chubanshe, 2006).
Wang, J., Lin, X., Bloomgarden, Z. T. & Ning, G. The Jiangnan diet, a healthy diet pattern for Chinese. J. Diabetes 12, 365–371 (2020).
He, K., Lu, H., Zhang, J., Wang, C. & Huan, X. Prehistoric evolution of the dualistic structure mixed rice and millet farming in China. Holocene 27, 1885–1898 (2017).
Valliant, J. C. D., Bruce, A. B., Houser, M., Dickinson, S. L. & Farmer, J. R. Product diversification, adaptive management, and climate change: farming and family in the U.S. Corn Belt. Front. Clim. https://doi.org/10.3389/fclim.2021.662847 (2021).
Honkola, T. et al. Evolution within a language: environmental differences contribute to divergence of dialect groups. BMC Evol. Biol. 18, 132 (2018).
Mufwene, S. Population movements and contacts in language evolution. J. Lang. Contact 1, 63–92 (2007).
Posth, C. et al. Language continuity despite population replacement in Remote Oceania. Nat. Ecol. Evol. 2, 731–740 (2018).
Szeto, P. Y. & Yurayong, C. Sinitic as a typological sandwich: revisiting the notions of Altaicization and Taicization. Linguist. Typology 25, 551–599 (2021).
Chappell, H. in Areal Diffusion and Genetic Inheritance: Problems in Comparative Linguistics (eds Aikhenvald, A. Y. & Dixon, R. M. W.) 328–357 (Oxford Univ. Press, 2001).
Jolliffe, I. in Encyclopedia of Statistics in Behavioral Science (eds Everitt, B. S. & Howell, D. C.) https://doi.org/10.1002/0470013192.bsa501 (John Wiley & Sons, 2005).
Hastie, T., Tibshirani, R., Narasimhan, B. & Chu, G. impute: imputation for microarray data. R package version 1.76.0 https://bioconductor.org/packages/impute (2023).
Novembre, J., Williams, R., Pourreza, H., Wang, Y. & Carbonetto, P. PCAviz: visualizing principal components analysis. R package version 0.3-37 http://github.com/NovembreLab/PCAviz (2019).
Gower, J. C. Generalized procrustes analysis. Psychometrika 40, 33–51 (1975).
Wang, C. et al. Comparing spatial maps of human population-genetic variation using Procrustes analysis. Stat. Appl. Genet. Mol. Biol. 9, 13 (2010).
Novembre, J. et al. Genes mirror geography within Europe. Nature 456, 98–101 (2008).
R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2021).
Hijmans, R. J. Raster: geographic data analysis and modeling. R package version 3.4-8 https://rspatial.org/raster (CRAN, 2023).
Hamming, R. W. Error detecting and error correcting codes. Bell Syst. Tech. J. 29, 147–160 (1950).
Mantel, N. & Valand, R. S. A technique of nonparametric multivariate analysis. Biometrics 26, 547–558 (1970).
Oksanen, J. et al. Vegan: Community Ecology Package (CRAN, 2022).
Evans, C. et al. The uses and abuses of tree thinking in cultural evolution. Phil. Trans. R. Soc. B 376, 20200056 (2021).
Mace, R. & Holden, C. J. A phylogenetic approach to cultural evolution. Trends Ecol. Evol. 20, 116–121 (2005).
Wu, F. & Huang, Y. in The Palgrave Handbook of Chinese Language Studies (ed. Ye, Z.) 1–28 (Springer Nature, 2020).
Hamed, M. B. & Wang, F. Stuck in the forest: trees, networks and Chinese dialects. Diachronica 23, 29–60 (2006).
Huson, D. H. & Bryant, D. Application of phylogenetic networks in evolutionary studies. Mol. Biol. Evol. 23, 254–267 (2006).
Pritchard, J. K., Stephens, M. & Donnelly, P. Inference of population structure using multilocus genotype data. Genetics 155, 945–959 (2000).
Falush, D., Stephens, M. & Pritchard, J. K. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164, 1567–1587 (2003).
Hubisz, M. J., Falush, D., Stephens, M. & Pritchard, J. K. Inferring weak population structure with the assistance of sample group information. Mol. Ecol. Resour. 9, 1322–1332 (2009).
Evanno, G., Regnaut, S. & Goudet, J. Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Mol. Ecol. 14, 2611–2620 (2005).
Reesink, G., Singer, R. & Dunn, M. Explaining the linguistic diversity of Sahul using population models. PLoS Biol. 7, e1000241 (2009).
Auderset, S., Greenhill, S. J., DiCanio, C. T. & Campbell, E. W. Subgrouping in a ‘dialect continuum’: a Bayesian phylogenetic analysis of the Mixtecan language family. J. Lang. Evol. 8, 33–63 (2023).
Jakobsson, M. & Rosenberg, N. A. CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure. Bioinformatics 23, 1801–1806 (2007).
Earl, D. A. & vonHoldt, B. M. STRUCTURE HARVESTER: a website and program for visualizing STRUCTURE output and implementing the Evanno method. Conserv. Genet. Resour. 4, 359–361 (2012).
Caye, K., Deist, T. M., Martins, H., Michel, O. & François, O. TESS3: fast inference of spatial population structure and genome scans for selection. Mol. Ecol. Resour. 16, 540–548 (2016).
Lipson, M. et al. Reconstructing Austronesian population history in Island Southeast Asia. Nat. Commun. 5, 4689 (2014).
Saitou, N. & Nei, M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406–425 (1987).
Reich, D., Thangaraj, K., Patterson, N., Price, A. L. & Singh, L. Reconstructing Indian population history. Nature 461, 489–494 (2009).
Patterson, N. et al. Ancient admixture in human history. Genetics 192, 1065–1093 (2012).
Sagart, L. in Dialect Variations in Chinese 129–154 (Academia Sinica, 2002).
Lipson, M. New Statistical Genetic Methods for Elucidating the History and Evolution of Human Populations. Ph.D. thesis, Massachusetts Institute of Technology (2014).
MATLAB version 8.6.0 (R2015b) (MathWorks, 2015).
Privé, F., Luu, K., Vilhjálmsson, B. J. & Blum, M. G. B. Performing highly efficient genome scans for local adaptation with R package pcadapt version 4. Mol. Biol. Evol. 37, 2153–2154 (2020).
Storey, J. D., Bass, A. J., Dabney, A. & Robinson, D. Qvalue: Q-value estimation for false discovery rate control. R package version 2.34.0 https://bioconductor.org/packages/qvalue (2023).
Gao, Y. et al. PGG.Han: the Han Chinese genome database and analysis platform. Nucleic Acids Res. 48, D971–D976 (2019).
Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009).
1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Cong, P.-K. et al. Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project. Nat. Commun. 13, 2939 (2022).
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
Zhang, X. Shuhua-Group/Genetic-characteristics-of-the-Han100K-initiative (v1.0). Zenodo https://doi.org/10.5281/zenodo.10816923 (2024).
Yang, C. JoshuaThieriot/Chinese-dialect-project: the first release of analytical codes for Chinese dialects (v1.0.0). Zenodo https://doi.org/10.5281/zenodo.10867759 (2024).
Acknowledgements
This research was supported by the National Natural Science Foundation of China (T2122007, 32288101, 32070577 and 32030020), National Key R&D Program of China (2023YFC2605400, 2020YFE0201600), National Social Science Foundation (23&ZD317 and 20&ZD301), the Shanghai Science and Technology Commission Program (23JS1410100), the Office of Global Partnerships (Key Projects Development Fund), and the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant Agreement No. 883700 TRAM). This work was also sponsored by the ‘Shuguang Program’ supported by the Shanghai Education Development Foundation and Shanghai Municipal Education Commission (20SG06), and by the Fundamental Research Funds for the Central Universities (2022ECNU-XWK-XK005). The computational work in this study was supported by the CFFF Computing Platform and the Human Phenome Data Center of Fudan University. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Author information
Authors and Affiliations
Contributions
C.Y., S. Yan, L.J., S.X. and M.Z. designed the research. C.Y., F.Y., Y.C., N.X., Z.W. and M.Z. collated the linguistic data of the Chinese dialects. X.Z. and S.X. assembled the genetic data of the Han Chinese and performed genetic data analysis. C.Y., S. Yang, B.W. and M.Z. performed the linguistic analyses and interdisciplinary alignment. C.Y., S. Yan, L.J., S.X. and M.Z. discussed the results. C.Y., X.Z., S.X. and M.Z. wrote and revised the paper. All authors approved the final version of the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no conflict of interest.
Peer review
Peer review information
Nature Human Behaviour thanks Randy J. LaPolla and Chuang-Chao Wang for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Note, Discussion, Figs. 1–3, and table of contents for Supplementary Tables 1–13.
Supplementary Tables 1–13
Lexical inventory, datasets generated from statistical analysis, and other relevant analytical results.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yang, C., Zhang, X., Yan, S. et al. Large-scale lexical and genetic alignment supports a hybrid model of Han Chinese demic and cultural diffusions. Nat Hum Behav (2024). https://doi.org/10.1038/s41562-024-01886-9
DOI: https://doi.org/10.1038/s41562-024-01886-9