The Connection of the Genetic, Cultural and Geographic Landscapes of Transoxiana

We have analyzed Y-chromosomal variation in populations from Transoxiana, a historical region covering the southwestern part of Central Asia. We studied 780 samples from 10 regional populations of Kazakhs, Uzbeks, Turkmens, Dungans, and Karakalpaks using 35 SNP and 17 STR markers. Analysis of haplogroup frequencies using multidimensional scaling and principal component plots, supported by an analysis of molecular variance, showed that the geographic landscape of Transoxiana, despite its distinctiveness and diversity (deserts, fertile river basins, foothills and plains) had no strong influence on the genetic landscape. The main factor structuring the gene pool was the mode of subsistence: settled agriculture or nomadic pastoralism. Investigation of STR-based clusters of haplotypes and their ages revealed that cultural and demic expansions of Transoxiana were not closely connected with each other. The Arab cultural expansion introduced Islam to the region but did not leave a significant mark on the pool of paternal lineages. The Mongol expansion, in contrast, had enormous demic success, but did not impact cultural elements like language and religion. The genealogy of Muslim missionaries within the settled agricultural communities of Transoxiana was based on spiritual succession passed from teacher to disciple. However, among Transoxianan nomads, spiritual and biological succession became merged.

tribal-clan structure. Often the name of the lineage, clan and tribe are inherited through the male line just as the Y-chromosome is. Therefore, it is important to study them together 22,33 . The areas studied here are predominantly populated by the Kazakh tribe Konyrat, the Kazakh clans Alimuly, Kozha (Khoja) and Sunak, as well as the Turkmen tribe Yomut. A brief summary of the genealogy and history of these clans is given in the Supplementary Text.
This study aims to examine the genetic landscape of Transoxiana and explore its connection to geographical and cultural landscapes. To achieve this aim, we examined Y-chromosomal variation in Kazakhs, Uzbeks, Karakalpaks, Turkmens and Dungans with reference to their cultural and geographical landscapes. The results were used to determine whether or not the two last major expansions (Arab and Mongol) were demic and influenced the Y-chromosomal variation in Transoxiana.
Scientific Reports | 7: 3085 | DOI:10.1038/s41598-017-03176-z
The prevalence of specific haplogroups is even more pronounced for tribal-clan groups than for geographic populations (Supplementary Fig. 1): C2*-M217(xM48) comprises 88% of the Y-chromosomes of the Konyrat tribe, C2b1a2-M48 reaches 75% in the Kazakh clan Alimuly, and Q-M242 accounts for 71% in the Turkmen tribe Yomut. Based on haplogroup frequency, the Konyrat tribe is the most homogeneous (HD = 0.23), while the Kozha-Sunak clan group is the most heterogeneous (HD = 0.94). The distinctive paternal lineage pools of the clans explain the distinctiveness of the geographic populations: the clan Alimuly prevails in the KAZ2 population (79% of samples are from this clan), the tribe Konyrat predominates in the KAZ1 population (62%), and the tribe Yomut predominates in the TUR1 population (88%).
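The HD values quoted above summarize how evenly haplogroups are distributed within a group. A minimal sketch follows, assuming HD denotes Nei's unbiased gene diversity (the text does not state the exact estimator); the function name and the toy samples are hypothetical and are not data from this study.

```python
from collections import Counter


def haplotype_diversity(labels):
    """Nei's unbiased gene diversity: n/(n-1) * (1 - sum of squared frequencies).

    `labels` is a list with one haplogroup (or haplotype) label per sample.
    Assumption: this is the estimator behind "HD"; the source does not say so.
    """
    n = len(labels)
    if n < 2:
        raise ValueError("need at least two samples")
    counts = Counter(labels)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


# Hypothetical toy groups (illustration only, not data from the study).
homogeneous = ["C2*"] * 9 + ["Q"]  # one lineage dominates
heterogeneous = ["C2*", "Q", "R1a", "G1", "J2", "N", "O", "R1b", "E", "L"]

hd_low = haplotype_diversity(homogeneous)
hd_high = haplotype_diversity(heterogeneous)
```

A group dominated by one lineage gives a value near 0, while a group in which every sample carries a different lineage gives a value of 1.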
The Transoxianan paternal heritage in the Asian context. The Tables 2 and 3). Clusters corresponding to geographic parts of Asia were revealed in the multidimensional scaling plot (Fig. 3). The Western Asian cluster was represented by Arab, Turkish and Iranian populations. Populations of India, Pakistan and Afghanistan made up the Southern Asian cluster. Chinese populations form the Eastern Asian cluster. All Transoxianan populations lie in the Central Asian cluster.
Analysis on a narrower geographic scale (Transoxiana and the neighboring regions) is available in Supplementary Fig. 3 (Supplementary Table 4). This PCA plot is based on a smaller number of haplogroups, but includes more Central Asian populations. Both an MDS plot of 30 haplogroups and a PC plot of 19 haplogroups (Fig. 3, Supplementary Fig. 2, Supplementary Fig. 3) demonstrate the four following patterns.
First, Uzbek and Tajik populations practicing settled agriculture, as well as Kyrgyz, are genetically distant from most nomadic populations (Mongols, Kazakhs, Hazaras). Second, despite originating from three countries (Uzbekistan, Iran, Afghanistan), Turkmen populations form their own firmly separated cluster. The reason lies
The lack of relationship between genetics and geography. To determine the driving forces that shaped the Y-chromosomal variation in Transoxiana, we examined patterns of genetic variation by AMOVA (Supplementary Table 5). Populations were arranged into groups in three ways: (a) geography by river basin: populations from the Amu Darya or Syr Darya basins; (b) geography by altitude: plain or foothill populations (400 meters altitude was used as a threshold); (c) mode of subsistence: settled agriculturalists or nomadic pastoralists.
Both ways of geographic grouping had little to no influence on the genetic structure (Table 1). A Mantel test (Yr = −0.006, p = 0.44) further supports the idea that, unlike in most other regions, genetic distances between populations in Transoxiana do not correlate with geographic distances. In contrast, the mode of subsistence explained a significant share of the genetic structure: the differences between settled and nomadic populations account for 2.85% of the total genetic variation, almost three times more than the differences between the geographic groups of populations (Table 1).
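For readers unfamiliar with the Mantel test mentioned above, the sketch below shows one common form of it: a one-sided permutation test on the Pearson correlation between two distance matrices. It is an illustration under stated assumptions, not the software or settings used in this study; the function names, the number of permutations and the toy coordinates are all hypothetical.

```python
import math
import random


def _upper(m):
    """Flatten the upper triangle (above the diagonal) of a square matrix."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]


def _pearson(x, y):
    """Pearson correlation; assumes neither vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel(gen, geo, permutations=999, seed=0):
    """Return (r, p) for a one-sided Mantel test of gen vs. geo distances.

    Population labels of `gen` are shuffled (rows and columns together),
    and p is the share of permutations with correlation >= the observed r.
    """
    rng = random.Random(seed)
    n = len(gen)
    geo_vec = _upper(geo)
    r_obs = _pearson(_upper(gen), geo_vec)
    idx = list(range(n))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(idx)
        permuted = [[gen[idx[i]][idx[j]] for j in range(n)] for i in range(n)]
        if _pearson(_upper(permuted), geo_vec) >= r_obs - 1e-12:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)


# Hypothetical toy example: five populations on a line, with genetic
# distance exactly proportional to geographic distance.
coords = [0, 1, 3, 6, 10]
geo = [[abs(a - b) for b in coords] for a in coords]
gen = [[2 * d for d in row] for row in geo]
r, p = mantel(gen, geo)
```

In this toy case the correlation is perfect and the permutation p-value is small; a correlation near zero with a large p-value, as reported in the text, is the opposite situation.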
From the 5th to 2nd millennia BC a complete transition to a cattle-raising and agricultural tribal existence occurred in Transoxiana populations 36. Since that time, the mode of subsistence (settled agriculture or nomadic pastoralism) was the main cultural distinction within Central Asia. This lets us conclude that the influence of geography on the genetic structure was mediated by a combination of subsistence and traditional culture. One may suppose that such relationships of cultural and geographical factors have persisted for thousands of years. It underlines the important role which technical innovations and culture often play in shaping the genetic landscape 37.

Arab and Mongol expansions: migration of cultures or populations?
In order to search for signs of male demic expansions, we identified four modal STR-haplotypes of Transoxiana (those present in more than 10 samples in our dataset, Table 2). For each modal haplotype we then identified related haplotypes. We considered as related those haplotypes that were fewer than 5 mutational steps from the modal haplotype and belonged to the same haplogroup. Five mutations (considering 15 Y-STRs and a mutation rate of 0.0021 per locus per generation) might occur within roughly two thousand years, which covers the time interval important for our analysis. The search for related haplotypes was performed in a database of 4495 Asian Y-STR haplotypes using the Haplomatch software 38. This methodology is similar to that applied by Balaresque and colleagues 30 in their search for Asian primary descent clusters.
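The selection rule described above (same haplogroup, fewer than 5 mutational steps from the modal haplotype) can be sketched as follows. This is not the Haplomatch implementation: the step count is assumed here to be the sum of absolute repeat-count differences across loci, and the record layout, function names and toy haplotypes are hypothetical.

```python
def mutational_steps(h1, h2):
    """Sum of absolute repeat-count differences across STR loci.

    Assumption: one repeat gained or lost counts as one step (stepwise model).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same loci")
    return sum(abs(a - b) for a, b in zip(h1, h2))


def related_haplotypes(modal, modal_haplogroup, database, max_steps=5):
    """Keep records of the same haplogroup that are < max_steps from the modal."""
    return [
        rec
        for rec in database
        if rec["haplogroup"] == modal_haplogroup
        and mutational_steps(modal, rec["strs"]) < max_steps
    ]


# Hypothetical toy database with 5 loci (the study used 15 Y-STRs).
modal = (14, 11, 13, 30, 24)
database = [
    {"id": "A", "haplogroup": "C2b1a2-M48", "strs": (14, 11, 13, 30, 24)},  # 0 steps
    {"id": "B", "haplogroup": "C2b1a2-M48", "strs": (15, 11, 12, 30, 25)},  # 3 steps
    {"id": "C", "haplogroup": "C2b1a2-M48", "strs": (17, 11, 13, 32, 24)},  # 5 steps
    {"id": "D", "haplogroup": "Q-M242", "strs": (14, 11, 13, 30, 24)},      # other haplogroup
]

related_ids = [rec["id"] for rec in related_haplotypes(modal, "C2b1a2-M48", database)]
```

Record C sits exactly at 5 steps and is excluded because the threshold is strict; record D matches the modal haplotype but belongs to a different haplogroup.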
The modal haplotype 1 (Table 2) and 257 related haplotypes belonging to haplogroup C2b1a2-M48 were used to construct a phylogenetic network (Supplementary Fig. 4A). Two clusters can be distinguished: cluster α (which includes the modal haplotype) and cluster β. Cluster α is 600 ± 200 years old and its modal haplotype is most prominent among the Kazakh clan Alimuly (33%). Cluster β is mostly present among Mongols and Mongolian-speaking Kalmyks. Cluster β is older (800 ± 200 years using the rho estimate and 660 years using ASD, Table 2), suggesting the gene flow took place from Mongolia to Transoxiana rather than in the reverse direction. The age of the cluster overlaps with the formation of the Mongol Empire (13th century AD), making this suggestion plausible.
The modal haplotype 2 (Table 2) and 138 related haplotypes belonging to the C2*-M217(xM48) haplogroup were arranged into a second phylogenetic network (Supplementary Fig. 4B). Here as well, two clusters can be distinguished: γ and σ. Cluster γ is prevalent among Mongolian-speaking Kalmyks and in Mongolia itself. The age of this cluster (600 ± 200 years) overlaps with the time of migration of the Kalmyk ancestors (Oyrats) from Mongolia and the subsequent back migration of some Kalmyk groups. Cluster σ is specific to the Kazakh tribe Konyrat, and its modal haplotype accounts for 17% of the tribal paternal pool. The age of this cluster (1100 ± 400 years) suggests a fairly early migration from Mongolia followed by an expansion within the single tribe.
The modal haplotype 3 (Table 2) and 189 related haplotypes belonging to the C2*-M217(xM48) haplogroup were plotted similarly (Supplementary Fig. 4C). This haplotype coincides with a previously described haplotype, putatively connected to Genghis Khan's relatives, collectively forming the "C3* star-cluster" (μ) 21. From Abilev et al. 26 it is known that 76.5% of the Kazakh tribe Kerey belong to the star-cluster, including the 16% that fall within the third modal haplotype in our classification. Within Transoxiana, this founder haplotype is most common among the Kazakh clan Tore (11%), tribe Uysun (6%) and Karakalpaks (5%). The estimated age of the μ cluster (1100 ± 300 years) aligns with previous estimations of ~1000 years 21,30,39. It may be assumed that modal haplotype 3 was the "proto-Mongolian haplotype", inherited, among others, by Genghis Khan, his descendants and patrilineal relatives. It is important to mention that Temujin (Genghis Khan) belonged to the Kiyat clan, which in turn is a branch of the Borjigin tribe, part of the Nirun Mongols. Subcluster λ, aged 400 ± 100 years, is specific to Hazaras from various countries and can be distinguished within the cluster.
The modal haplotype 4 (Table 2) and 97 related haplotypes belonging to the Q-M242 haplogroup were again plotted on a network (Supplementary Fig. 4D). The overwhelming majority of these haplotypes came from Turkmen populations from several countries. The cluster δ is 1400 ± 500 years old, making it older than the Mongol expansion. Although a small part of the confidence interval overlaps with the period of Arab expansion, haplogroup Q-M242 accounts for just 1.5% of the population of the Arabian Peninsula, which means that the expansion of this cluster in Turkmen populations was more likely caused by a local founder effect predating both Arab and Mongol influences.
Thus, three out of four signals of expansion in Transoxiana are connected to Mongol populations and likely reflect migration to Transoxiana from Mongolia or neighboring regions, followed by rapid growth of the migrants' descendants (Supplementary Text). Notably, such successful demic expansion was not accompanied by cultural expansion (language change): most populations of present-day Transoxiana speak not Mongolian, but Turkic languages. The factor that unifies not just most, but all, populations of Transoxiana is Islam. However, our analysis has not revealed any signs of significant demic expansions linked to the Arabs.
In a more direct attempt to uncover signs of such expansions, we have analyzed the Y-chromosomes of nomadic Islamic clergy. Y-chromosomal haplotypes of the Kozha-Sunak tribe are shown in Fig. 4. Many separate individual haplotypes, or sometimes mini-clusters, can be observed. Therefore, unlike most Transoxianan clans, the Kozha and Sunak clans do not have a predominant paternal common ancestor. This is confirmed by the high haplogroup variation (HD = 0.86) among the Kozha-Sunak, which is 2-4 times higher than in all other Transoxianan lineages studied (Supplementary Fig. 2). Because subclans can have different origins, we divided the Kozha-Sunak into four groups based on traditional genealogy, and likewise divided all other Transoxianan clans. In the course of this analysis (Supplementary Table 6) we have discovered a pattern: the Kozha-Sunak lineages are highly heterogeneous based on the mean number of pairwise differences between haplotypes (PD between 8 and 9), while subclans of other Transoxianan clans are relatively homogeneous (PD between 1 and 7). The absence of one principal male root in the Kozha-Sunak tribe indicates that the group does not descend from a single Arab ancestor. Furthermore, haplogroup J1-M267, considered a marker of Arab expansion 40, was not found among the Kozha-Sunak.
Table 2. Features of the primary descent clusters. Notes: *Number of samples carrying the modal haplotype; **number of samples carrying related haplotypes (fewer than 5 mutational steps from the modal haplotype); ***number of samples within the given cluster. ****Duplication of the DYS19 locus was observed only in some of the M48 haplotypes. Each ASD estimate falls within the confidence interval of the corresponding rho estimate, so we mention mainly rho estimates in the text.
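The mean number of pairwise differences (PD) used in the text to compare lineages can be illustrated with a short sketch. It assumes that a "difference" is a locus at which two haplotypes carry different alleles; Arlequin offers several distance options and the text does not say which was used, so this is one plausible reading. The function name and toy haplotypes are hypothetical.

```python
from itertools import combinations


def mean_pairwise_differences(haplotypes):
    """Average number of differing loci over all pairs of haplotypes.

    Assumption: each locus that differs counts once, regardless of how
    many repeats separate the two alleles.
    """
    pairs = list(combinations(haplotypes, 2))
    if not pairs:
        raise ValueError("need at least two haplotypes")
    total = sum(sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs)
    return total / len(pairs)


# Hypothetical toy lineages with 3 loci each (illustration only).
homogeneous_lineage = [(14, 11, 13), (14, 11, 13), (14, 11, 14)]
heterogeneous_lineage = [(14, 11, 13), (16, 12, 15), (13, 10, 11)]

pd_low = mean_pairwise_differences(homogeneous_lineage)
pd_high = mean_pairwise_differences(heterogeneous_lineage)
```

A lineage descending from one recent founder yields a low PD, whereas a lineage assembled from unrelated founders yields a PD close to the number of loci typed.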

In search of Arab ancestry in
In order to trace the origin of the haplotype mini-clusters identified (Fig. 4), we searched for related haplotypes (fewer than 5 mutational steps) in other Asian populations. Mini-cluster R1a (ε1) had no related haplotypes, but mini-cluster G1 (ε2) has related haplotypes among the Kazakh tribe Argyn 33. This suggests that this mini-cluster originated locally and did not arrive with the Arab expansion.
Sayeds (a lineage within the Kozha and Sunak clans) are known as descendants of the Prophet Muhammad on the paternal side, and also reside beyond Transoxiana. Sayeds of Pakistan, for example, were analyzed by Belle et al. 41, who reported that they are genetically closer to Arabs than to the surrounding populations of Pakistan and India, but did not find a founder effect. Furthermore, the Pakistani Sayeds are quite different genetically from the Transoxianan Sayeds. Thus, although these lineages traditionally attribute their paternal ancestry to one common root among Arab missionaries, the pronounced Y-chromosomal variation suggests that Transoxiana's lineages have descended from several unrelated local ancestors.
A possible explanation is that nomadic clergy genealogy was based on silsila, a spiritual legacy passed from teacher to disciple, rather than a biological relationship. In Central Asia, Islam was spread by the Sufi orders Yasawiyya, Naqshbandi and Bektashi. In these orders, the leadership was based on silsila, a sequence of teachers who taught the succeeding leader of the Sufi order. However, even among agriculturalists spiritual succession was sometimes passed from father to son 42, and when this occurred within a nomadic patronymic tradition it could become a patrilineal biological legacy. The age of mini-cluster ε1 (600 ± 200 years) corresponds to the time when the Golden Horde adopted Islam as the official religion, and to the rise in social status of the Kozha-Sunak tribal-clan group. This may have facilitated the transition from spiritual silsila to biological genealogy in order to maintain the privileged social status within the tribal-clan group. This conclusion coincides with the supposition of Heyer et al. 43 that cultural transmission of reproductive success could play an important role in shaping genetic diversity in Central Asia.

Conclusions
We have analyzed human Y-chromosomal variation in ten populations from Transoxiana, a historical region covering Uzbekistan, western Tajikistan, western Kyrgyzstan, northwestern Turkmenistan and southern Kazakhstan. Considering the peculiar features of the geographical landscape of the region, the abrupt shifts of cultural landscapes in the course of its history, and the presence of patrilineal tribal-clan groups, we jointly analyzed the patrilineal genetic variation, patrilineal genealogies and historical data. We identified three features of the genetic landscape of Transoxiana and its connection to geographical and cultural landscapes.
First, cultural and demic expansions of Transoxiana were not closely connected with each other. Arab cultural expansion introduced Islam to the region but did not leave a significant mark on the Y-chromosomal pool. The Mongol expansion, in contrast, had enormous demic success, but did not impact cultural elements like language and religion. Second, the geographic landscape of Transoxiana, despite its peculiarity and diversity (deserts, fertile river basins, foothills and plains), had no strong influence on the genetic landscape. The main factor structuring the Y-chromosomal variation was the mode of subsistence: settled agriculture or nomadic pastoralism.
Third, the genealogy of Muslim missionaries within the settled agricultural community was based on spiritual succession passed from teacher to disciple, rather than on biological relationship. However, among nomads, spiritual and biological succession merged, leading to the formation of haplotype mini-clusters among nomadic clergy.

Methods
Samples. Blood Table 1). Only unrelated individuals whose ancestors for at least three generations all descended from the specific population were selected. All 780 subjects provided their written informed consent in a form approved by the Ethics Committee of the Research Centre for Medical Genetics (Moscow, Russia). When performing genotyping and data analyses, we followed lab protocols approved by the same Ethics Committee.
The MDS analysis dataset consisted of 5,998 samples from 79 populations, including 780 samples reported here for the first time (Supplementary Table 2 Fig. 4A-D). Both the MDS and the phylogenetic dataset were extracted from the in-house Y-base database, compiling published data on Y-chromosomal variation in human populations.
In distinguishing the clusters on the networks, we followed the procedure described earlier 11,64. Briefly, we first looked for a zone in the network carrying mostly haplotypes from a single population; then the most recent common ancestral haplotype for all haplotypes in the zone was identified in the network; finally, all haplotypes downstream of this ancestral haplotype were attributed to the cluster, though some very distant haplotypes were ignored. This procedure transforms the initial arbitrary "zone" into a clade that is monophyletic to the best of the network's performance. In previous studies 64 we selected the ancestral haplotype as the cluster's founder, but ref. 65 found that using the modal haplotype works much better, so here we selected modal haplotypes as founders. Cluster age was determined using the rho-statistic 66,67 and, because rho was shown to introduce a systematic bias 68, we also used the average squared distance (ASD) estimator 69,70. Rho was calculated using Network 5, and ASD was calculated by Y TMRCA Calculator (http://ehelix.pythonanywhere.com/init/default/about), which is a derivative of the Matlab/Octave program Ytime 71. To convert the number of mutations into the number of generations, the "genealogical" mutation rate of 2.1 × 10⁻³ mutations per STR per generation was used 72,73, as the analysis in Karmin 74 indicated that for clusters younger than 30,000 years this rate is consistent with full Y-chromosomal sequence data. When converting the number of generations into an age in years, the male generation time was set to 30 years 75. To determine whether or not most haplotypes in the same genealogical lineage originated from a common ancestor, we used phylogenetic networks, haplotype variability statistics and the mean number of pairwise differences within each lineage (calculated using Arlequin 3.5.1.3 software).
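The conversion from STR variation to cluster age described above can be sketched in a few lines. This is a simplified illustration, not the Network 5 or Y TMRCA Calculator implementation: rho is approximated here by the direct stepwise distance of each haplotype from the founder (Network counts mutations along network branches), and no confidence intervals are computed. The mutation rate and generation time come from the text; the function names and the toy cluster are hypothetical.

```python
MU = 2.1e-3            # mutations per STR locus per generation (from the text)
GENERATION_YEARS = 30  # male generation time in years (from the text)


def rho_age(founder, haplotypes):
    """Age in years from rho, the mean number of mutations separating
    the haplotypes from the founder (star-like approximation)."""
    n_loci = len(founder)
    rho = sum(
        sum(abs(a - f) for a, f in zip(h, founder)) for h in haplotypes
    ) / len(haplotypes)
    generations = rho / (MU * n_loci)
    return generations * GENERATION_YEARS


def asd_age(founder, haplotypes):
    """Age in years from the average squared distance (ASD) in repeat
    counts between the haplotypes and the founder, averaged over loci."""
    n_loci = len(founder)
    asd = sum(
        sum((a - f) ** 2 for a, f in zip(h, founder)) for h in haplotypes
    ) / (len(haplotypes) * n_loci)
    generations = asd / MU
    return generations * GENERATION_YEARS


# Hypothetical toy cluster: 15 loci, 10 haplotypes, 4 of them one step
# away from the founder (modal) haplotype at a single locus.
founder = (10,) * 15
cluster = [founder] * 6 + [(11,) + (10,) * 14] * 4

age_rho = rho_age(founder, cluster)
age_asd = asd_age(founder, cluster)
```

When every mutation is a single-step change, as in this toy cluster, the two estimators agree; they diverge when some haplotypes differ from the founder by several repeats at the same locus, because ASD squares those differences.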