The genomic landscape of Nepalese Tibeto-Burmans reveals new insights into the recent peopling of Southern Himalayas

While much research attention has focused on demographic processes that enabled human diffusion on the Tibetan plateau, little is known about more recent colonization of Southern Himalayas. In particular, the history of migrations, admixture and/or isolation of populations speaking Tibeto-Burman languages, which is supposed to be quite complex and to have reshaped patterns of genetic variation on both sides of the Himalayan arc, remains only partially elucidated. We thus described the genomic landscape of previously unsurveyed Tibeto-Burman (i.e. Sherpa and Tamang) and Indo-Aryan communities from remote Nepalese valleys. Exploration of their genomic relationships with South/East Asian populations provided evidence for Tibetan admixture with low-altitude East Asians and for Sherpa isolation. We also showed that the other Southern Himalayan Tibeto-Burmans derived East Asian ancestry not from the Tibetan/Sherpa lineage, but from low-altitude ancestors who migrated from China plausibly across Northern India/Myanmar, having experienced extensive admixture that reshuffled the ancestral Tibeto-Burman gene pool. These findings improved the understanding of the impact of gene flow/drift on the evolution of high-altitude Himalayan peoples and shed light on migration events that drove colonization of the southern Himalayan slopes, as well as on the role played by different Tibeto-Burman groups in such a complex demographic scenario.


Setting GCA Populations in the South/East Asian Genomic Landscape
When PCA was extended to the 75 Asian populations selected from the "extended" dataset, PC1 (4.8% of variance) captured the known genetic differentiation between groups from the SA peninsula (i.e. India and Pakistan) and from EA (i.e. Indochina, China, Japan) 1,2 , while PC2 (0.6%) reflected clines of variation observable within these two macro-geographic areas. In particular, the wellestablished North to South gradient of SA genetic diversity [3][4][5][6] was clearly detectable. Conversely, the EA cluster appeared to be more homogeneous, with the cline described by PC2 recapitulating only to a certain degree the latitudinal extension of continental East Asia.
As regards IAR, they were substantially scattered in the PCA space, especially along PC1.
For instance, two IAR individuals clustered within the Indian Tharu people, an Indo-Aryan speaking group that resides in the Terai plain and that is known to show also EA ancestry components 3,7,8 .
Hereafter, we thus included Tharu in the bulk of Tibeto-Burman populations since the majority of SA groups presenting EA admixture actually speaks Tibeto-Burman languages.

Complex Admixture Patterns of GCA and Tibeto-Burman Populations
When ancestry proportions for each GCA subject were investigated via ADMIXTURE by considering a large set of SA and EA samples, IAR showed extremely variable levels of SA/EA admixture, with some individuals presenting a Northern Indian-like ancestry profile characterized by none or negligible EA ancestry and others showing significant proportions (up to 80%) of this genetic component ( Fig. 2A). In contrast, GCA Tibeto-Burman populations presented homogeneous ancestry profiles. Specifically, TAS were characterized by evident SA/EA admixture, with on average 23% of their ancestry being derived from SA genetic components and 77% from EA ones, through all K presenting the highest log likelihood values. The proportion of SA ancestry was considerably lower in TAT, with a mean value of approximately 9% ( Supplementary Fig. S3). Conversely, SRH showed almost no signatures of recent SA admixture, in line with what observed for the Sherpas from Khumbu.
The "Sherpa-like" component maintained relatively high proportions also in other Tibeto-Burmans. For instance, it varied from those observed in Tibeto-Burmans from the Indian subcontinent (i.e. Manipuri, 18%; Burma, 21%; Jamatia, 23%; Tripuri, 23% and Tharu, 27%) or China (i.e. Naxi, 30%; Yizu, 32%) to those proper of Nagas (42%), Tamangs (TAS, 42%; TAT, 52%) and TIB (58%), being instead detected at substantially lower values in non-Tibeto-Burman EA populations. Among them, only Tu people from the Chinese Qinghai-Gansu area and speaking Mongolic languages, but interlinked with Tibetan dialects 9 , presented a proportion comparable to that observed in Tibeto-Burmans (26%). Table S3; Supplementary Fig. S5) showed that only Burma, Yizu, Naxi and TIB failed to be confirmed as admixed thus representing the sole exceptions together with TAT, who turned out to be not admixed according to the computed f3 statistics, but resulted significantly admixed with ALDER. Results obtained for the Indian Tibeto-Burman groups (i.e. Jamatia, Tripuri, Manipuri, and Tharu) were in line with those previously described by Basu et al. 3 and achieved using a different dating method. We considered a value of 25 years per generations to derive the time since admixture from the rate of exponential LD decay. Jamatia and Tharu indeed presented the oldest time ranges, spanning from 1.6 to 0.6 thousand years ago (kya), while we identified more recent admixture events for Manipuri and Tripuri (0.8-0.2 kya). As regards TAS and TAT, the inferred dates of admixture were even more recent (0.65-0.38 kya and 0.43-0.13kya, respectively).  Table S4 and Supplementary   Table S5). Accordingly, proportions of haplotypes shared among individuals belonging to the same population were substantially higher in Sherpa communities (SHR, 54% and 32% for mtDNA and Ychromosome respectively; SHT, 85% for mtDNA) when compared to the mean values calculated for the assembled reference datasets (i.e. 27% for mtDNA and 18% for Y-chromosome). A pattern of remarkable intra-population homogeneity emerged also from the analysis of SRH haplogroup composition, which highlighted a limited number of lineages accounting for the great majority of the examined mtDNAs (cumulative percentage of A, C4a3b and M9a1a of 56%) and Y-chromosomes Table S6). These findings were in agreement with those described by Bhandari et al. 10 and corroborated the hypothesis that historical isolation and subsequent drift have actually influenced patterns of uniparental variation in Sherpa populations.

Disentangling the Impact of Admixture and Drift on the History of Sherpa People
Our ROH analyses provided a first clue supporting an appreciable impact of these demographic events also on the autosomal genome of Sherpa people, especially as regards SHT who presented remarkable values for both the average length and the number of DNA segments in homozygosity ( Supplementary Fig. S6).
According to these findings, we aimed at testing whether the "Sherpa-like" ancestry component pointed out by ADMXTURE was influenced by strong genetic drift experienced by the Sherpas. For this purpose, we applied a procedure based on CHROMOPAINTER "chunk-lengths" output on a subset of EA populations included in the "extended" dataset from which we excluded SHK (since they appeared to be admixed with TIB and other SA/EA populations, Supplementary   Table S2) and TAT (due to their low sample size). The rationale beyond the two conducted CHROMOPAINTER runs was that if Himalayan populations would share the same painting profiles it would mean that most of the differences observable between the examined groups are ascribable to recent events of drift and gene flow that differentiated the two groups. These two sets of painting profiles were then used to infer proportions of the genome that the tested population derived from all donor groups, as described in Materials and Methods.

Genomic Relationships Between GCA and South/East Asian Populations
We calculated a series of ad hoc D-statistics by testing separately the three Tibeto-Burman GCA groups and each EA population and by considering a four-population phylogeny in the form: (EA, TIB; GCA, YRI). In particular, by forcing GCA groups to branch out before the split between TIB and the rest of EA populations, we aimed at testing whether the poor fit of GCA samples in such an artificial phylogeny deviates towards their closer genetic connection with TIB or with other EA groups. Accordingly, negative D values indicated high affinity with TIB, whereas positive values suggested close relationship with the tested EA population. Overall, these analyses ( Supplementary   Fig. S8) validated results obtained by computing outgroup f3 statistics ( Supplementary Fig. S7).

The Admixed Phylogeny of Tibeto-Burman Populations
To reconstruct Tibeto-Burman phylogenetic relationships by taking into account their SA/EA admixture profiles, we used relatively non-admixed populations identified according to f3 statistics to build a scaffold tree by means of the MIXMAPPER algorithm. In detail, we selected Nagas, SHT and SRH as representative Tibeto-Burman groups. We then included in the tree also TIB, which presented significantly negative f3 values, but not when tested against the other low-altitude EA populations used to build the scaffold tree. In fact, we aimed at considering also the TIB sample because we were mainly interested in identifying the branching point between low-and high-altitude Tibeto-Burmans. Nevertheless, it is important to note that our dataset included only one Tibetan population, which could be not representative of the overall genetic variation present across the plateau. We then selected the non-admixed EA population of Dai as a proxy for groups with predominant Southern EA ancestry and the Yakuts as representative of people with Northern EA ancestry. Moreover, the Brahui and Birhor SA populations were also included to represent the two main sources of Indian ancestry (i.e. the ancestral northern and the ancestral southern ones) 3,6 , together with YRI that were used as outgroup.
Although robustness of the reconstructed topology was strongly supported by 500/500 bootstrap replications, it is important to note that when formally tested via computation of f4 and Dstatistics in the form (YRI, CDX; Nagas, Tibetans/Sherpas), the divergence between Nagas and the high-altitude clade cannot be explained by a single split from a common ancestor. In fact, significantly negative f4 and D-statistics were obtained for all tests (Supplementary Table S7), suggesting that Tibetans and Sherpas may represent a present-day proxy for a distinct branch of ancestral Tibeto-Burmans.

Isolating the East Asian Ancestry of Tibeto-Burman Populations
In a first explorative PCA performed on masked haploid data obtained by HAPMIX analysis, Central Siberian populations, such as Yakuts and Uygurs, were identified as genetic outliers, in line with their known ancient connection with a Western Eurasian ancestry source 11 . After removing such populations, we focused on those laying on the gradient described by PC1 and we noted that Tibeto-Burmans diverged from the classical South to North cline of variation present in EA, forming a distinct cluster.
Outgroup f3 statistics computed on the masked dataset (Fig. 4B) pointed out more recent genetic relationships between Tibeto-Burmans. In fact, Tharu form the Terai plains of Northern India and Nepal resulted more closely related to TAS (f3 = 0.220) as well, in accordance to their geographic proximity, but again with Nagas being responsible for the second highest f3 peak (0.219). In addition, Jamatia and Tripuri showed closer affinity with each other (f3 = 0.221) as they both reside in the Tripura region and speak sub-lineages of the Kok Borok branch of Tibeto-Burman languages 9 , and then to Nagas (f3 = 0.219 and 0.218, respectively). Instead, Manipuri turned out to be more closely related directly to Nagas (f3 = 0.219), as expected according to their linguistic affiliation to the Kuki-Chin-Naga Tibeto-Burman branch 9 .

Data Curation
Analyses of uniparental haplotypes were carried out by combining newly generated data with those publicly available for several SA and EA populations (i.e. 48 as concerns mtDNA and 25 for Ychromosome) (Supplementary Table S4). Nei's genetic diversity (H) was estimated with ARLEQUIN v.3.5.2.2 12 , while haplogroup prediction was performed using the online HAPLOGREP2.0 13,14 and HAPLOGROUP PREDICTOR 15 tools.
As QC performed on genome-wide data, we retained only autosomal loci with a genotyping success rate higher than 95% and no significant deviations from the Hardy-Weinberg equilibrium (p > 1.42x10 -8 after Bonferroni correction for multiple testing). Moreover, one individual showing more than 5% of missing data was excluded, together with ambiguous strand SNPs with A/T or G/C alleles.
The obtained high-quality dataset was submitted to LD pruning and the identified 102,980 SNPs in approximate linkage equilibrium with each other (r 2 < 0.2) were used to estimate the degree of recent shared ancestry (IBD) for each pair of subjects by calculating the genome-wide proportion of shared alleles (i.e. identity by state, IBS). Accordingly, we removed 12 individuals showing IBD kinship coefficient higher than 0.2. As a final QC step, we also applied PCA on the LD-pruned dataset to further exclude outliers that did not cluster with the majority of subjects belonging to their selfassigned ethnic group. Three individuals were thus filtered out as exceeding ± 3 standard deviations from the mean PC1 or PC2 eigenvector calculated for their respective cluster ( Supplementary Fig.   S1).

Tests Aimed at Providing Demographic Inferences
According to the equation adopted for MIXMAPPER analysis, it was possible to derive the split point of the two putative ancestral branches, the combined terminal branch length and the related mixture fraction. In particular, for each choice of the branches, the algorithm found the least-square solution from which it was possible to identify the combination of previous parameters presenting the smallest residual norm.
In order to formally test whether the clade including low-and high-altitude Tibeto-Burman groups (as resulted from MIXMAPPER analysis) is consistent with a single split of these two lineages from a common ancestor, we computed f4 and D-statistics in the form (YRI, CDX; Nagas, Tibetans/Sherpas). For this purpose, we used the functions implemented in the ADMIXTOOLS v3.0 package and we applied default parameters (Supplementary Table S7).