Genomic study of the Ket: a Paleo-Eskimo-related ethnic group with significant ancient North Eurasian ancestry

The Kets, an ethnic group in the Yenisei River basin, Russia, are considered the last nomadic hunter-gatherers of Siberia, and Ket language has no transparent affiliation with any language family. We investigated connections between the Kets and Siberian and North American populations, with emphasis on the Mal’ta and Paleo-Eskimo ancient genomes, using original data from 46 unrelated samples of Kets and 42 samples of their neighboring ethnic groups (Uralic-speaking Nganasans, Enets, and Selkups). We genotyped over 130,000 autosomal SNPs, identified mitochondrial and Y-chromosomal haplogroups, and performed high-coverage genome sequencing of two Ket individuals. We established that Nganasans, Kets, Selkups, and Yukaghirs form a cluster of populations most closely related to Paleo-Eskimos in Siberia (not considering indigenous populations of Chukotka and Kamchatka). Kets are closely related to modern Selkups and to some Bronze and Iron Age populations of the Altai region, with all these groups sharing a high degree of Mal’ta ancestry. Implications of these findings for the linguistic hypothesis uniting Ket and Na-Dene languages into a language macrofamily are discussed.


Clustering
Within the Ket population, we have found a number of subpopulations using a combination of KMEANS clustering and Kullback-Leibler distance approach (Sahu and Cheng 2003). We used the KMEANS clustering routine in R. Let N be the number of individuals. We ran the KMEANS clustering for k ranging from the N to two, using the matrix of admixture proportions as input (the matrix was calculated with ADMIXTURE (Alexander et al. 2009) for the dataset GenoChip). At each iteration, we calculated the ratio of the sum of squares between groups and the total sum of squares. If this ratio was >0.9, then we accepted the k-component model. Since KMEANS clustering cannot be implemented for k=1, to decide between two clusters or a possible single cluster, we also calculated Kullback-Leibler distance (KLD) between the k=2 and k=1 models. If the KLD <0.1 and the ratio of the sum of squares between groups and the total sum of squares for two-component model was above 0.9, then the k=1 model was selected because, in such cases, there were no subgroups in the population.

GPS
An admixture-based Geographic Population Structure (GPS) method (Elhaik et al., 2014) (Elhaik et al. 2013). The resulting distance matrix was used as an input for the hclust routine in R and displayed using package ape (Suppl. Fig. 4.1). Separation of the Ket population into five distinct clusters can be possibly explained by several factors: geography, family structure (see percentage of relatedness in Suppl. file S1), and admixture from other ethnic groups.
Then we applied GPS (Elhaik et al., 2014) and reAdmix (Kozlov et al. 2015) algorithms to infer provenance of the samples and confirm self-reported ethnic origin. For that purpose we compared the GenoChip SNP array data for the Ket, Selkup, Nganasan, and Enets populations (Suppl. file S1) to the worldwide collection of populations (Elhaik et al., 2014) based on 130K ancestry-informative markers (Elhaik et al. 2013). According to the GPS analysis, 46 of 57 (80%) self-reported Kets were identified as Kets, 9 (16%) as Selkups, one as a Khakas, and one as a Dolgan (Turkic speakers from the Taymyr Peninsula). In addition to the proposed population and geographic location, GPS also reports prediction uncertainty (the smallest distance to the nearest reference population) (Suppl. Fig. 4.2). The average prediction uncertainty was 2.5% for those Ket individuals identified as Kets; as Selkups, 4.4%; as Khakas, 5.6%; and 3.9% for the individual identified as a Dolgan. Prediction uncertainty over 4% indicates that the individual is of a mixed origin and the GPS algorithm is not applicable.
Using the reAdmix approach (in the unconditional mode) (Kozlov et al. 2015), we represented 57 Kets as weighted sums of modern reference populations (Suppl. Fig. 4.3, Suppl. Table 3). The median weight of the Ket ancestry in self-identified Kets was 94%; 39 (68%) of them had over 90% of the Ket ancestry (non-admixed Kets). Seven individuals with self-reported purely Ket origin appear to be closer to Selkups, with median 89% percent of Selkup ancestry. This closeness is not surprising, given the long shared history of Ket and Selkup people (Vajda 2004). Individuals with incorrect selfidentification were randomly distributed across sixteen birthplaces along the Yenisei River (Suppl. Table 3). 86% of GPS predictions agree with the major ancestry prediction by reAdmix (Suppl. Table   3). The Pearson's correlation between percentage of major ancestry and GPS uncertainty is -0.42, meaning that the individuals predicted by reAdmix to be of non-admixed origin are likely to be predicted to be non-admixed by GPS as well. Hence, we identified a subset of non-admixed Kets among self-identified Ket individuals, and nominated individuals for whole-genome sequencing. pink, mixed. Unrelated individuals (88 in total) selected for downstream analyses are marked with red stars, and two sequenced Ket individuals with blue crosses. All individuals kept after removing cases of mixed ethnicity and proven relatives are separated by genetic distances larger than that marked by the dashed circle (except for the pair GRC13273898 and GRC13273899).

References
Suppl. Fig. 4.2. GPS predictions for individuals with self-reported ethnicities: Selkup (pink), Nganasan (cyan), Nenets (blue), Ket (green), Evenk (red), Enets (black). Size of the circle is proportional to the prediction uncertainty and points to individuals of mixed origin. The map was plotted using QGIS v.2.8. Fig. 4.3. A. reAdmix analysis of individuals with self-reported Ket identity. The horizontal axis shows sample identifiers and the vertical axis shows a fraction of provenance attributed to a specific identity. All individuals except one were identified either as non-admixed or with unknown minor ancestry. 79% (45 out of 57) were identified as predominantly Ket. B. reAdmix analysis of individuals with self-reported Selkup and mixed Selkup identity. Only 57% of self-reported ethnic Selkups (12 out of 21) were identified as predominantly Selkup by reAdmix. Four individuals were identified as mixed. Individuals marked with the red arrow self-identified as the Ket-Selkup mix, and reAdmix identified them as Ket.

ADMIXTURE analysis
We combined the GenoChip array data with published SNP array datasets to produce a worldwide dataset of 90 populations and 1,624 individuals. The intersection dataset, containing 32,189 SNPs (Suppl . Table 1), was analyzed with the ADMIXTURE software (Alexander et al. 2009) (Fig. 1), selecting the best of 100 iterations and using 10-fold cross-validation criterion. In contrast to some of the previous studies of SNP array data , a unique admixture component, characteristic of the Ket and Selkup individuals, appeared at K≥11 components, which is a low value for a worldwide dataset (Fig. 1A). That discrepancy can be explained by differences in marker selection. The GenoChip array includes a high percentage of ancestry-informative markers that were chosen to maximize FST (Elhaik 2012, Elhaik et al. 2013. Kets and Selkups were previously modeled on a worldwide dataset as a mixture of the North European and In order to verify and explain the geographic distribution of the 'Ket' admixture component, we have performed ADMIXTURE (Alexander et al. 2009) analysis on two additional datasets, differing in populations (Suppl. Table 2) and marker selection (Suppl. Table 1). An admixture component with a geographical distribution closely resembling that discussed above was revealed in all datasets (Suppl. Probably due to the significantly increased dataset size (2,552 individuals vs 1,624 in the GenoChipbased dataset), we observed no minimum on the graph of cross-validation errors, and K=20 was chosen for the final analysis on this dataset since K=19 was used for the GenoChip-based dataset ( Fig.   1). An admixture component with a geographical distribution closely resembling that of the 'Ket' component (in the GenoChip-based dataset) was revealed at K≥13, however it reached its global peak (~98%) in Khanty, being also prominent in Selkups, reference Ket individuals, Nenets, and Kets from the present study ). Similar to the GenoChip-based dataset, secondary peaks of the 'Khanty-Ket' component were observed in the Volgo-Ural region, in South Siberia (e.g., up to ~11% in Tuvinians not included in the GenoChip-based dataset), in East Siberia, in Central and South Asia (e.g., up to ~6% in Burusho not included in GenoChip-based dataset), but not in the North Caucasus. The Saqqaq genome was not included into this dataset.
The dataset based on the HumanOrigins SNP array ) overlapped with the Ket genomic data showed a somewhat different pattern in admixture analysis ).
The minimum of cross-validation errors was reached at K=17 (Suppl. Fig. 5  In summary, our ADMIXTURE analysis of three datasets and two previous studies using Illumina array datasets    (Vajda 2004). It remains to be elucidated whether the observed geographic distribution of this ancestral component was formed by population movements of forest and tundra hunter-gatherers and steppe nomadic groups within the last two millennia, or is a hallmark of far more ancient events. However, the most intriguing is the appearance of the Ket-Uralic component in the Saqqaq Paleo-Eskimo (~4,000 YBP): at a low level of 6.3-7.2%, but consistently in both datasets containing this individual (Suppl. Table 4).
Since the Ket-Uralic admixture component appears almost exclusively in populations having both the Mal'ta (ANE) ancestry and the Siberian ancestry (Suppl. Information, Section 7), it may be tentatively viewed as a correlate of these two ancestries combined. However, the HumanOrigins-     Fig. 15).
Correlation of outgroup f3 statistics between a pair of populations was used as another approximate measure of population relatedness (see, e.g., . file S3) and f4 statistics (Suppl. Information, Section 8), and TreeMix (Suppl. Information, Section 9).
However, a Z-score of 3.4 for f4(Ket, Yoruba; Mal'ta, Loschbour) calculated on the original genomebased dataset (Suppl. Fig. 8.8) suggested that Kets were significantly closer to ANE than to WHG, and emphasized differences between the array-based and the genome-based datasets.
Afanasievo and Andronovo  and Motala12 ) are known to have high levels of Mal'ta ancestry. As compared to Mal'ta, Kets were significantly closer to: Athabaskans, Greenlanders, Late Dorset, Mayans, Mixe, and especially Saqqaq (Z < -2.9, Fig. 5B). The dataset without transitions showed a similar picture, but with generally lower Z-scores (Suppl. Fig. 8.16). Overall, these results are consistent with topologies observed for Kets and Mal'ta in the TreeMix analysis on both dataset versions (Suppl. Information, Section 9): Kets formed a sister-clade for East Asians and Native Americans (as in the trees obtained by , with Mal'ta located in the European clade (Fig. 3A). The proximity of Kets and Bronze Age Karasuk culture (see f3 statistics in Suppl. Figs. 7.5 and 7.6 and a TreeMix tree in Suppl. Figs. 9.1) was supported by f4 statistics f4(Karasuk, Yoruba; Ket, X) (Suppl. Fig. 8.17A,B). Karasuk was significantly closer to Kets with the Z-score cut-off of 2.9, as compared to all populations in the dataset except for Aleutian, Greenlanders, Iron Age Russia, Mal'ta, Mayans, Mixe, and Saqqaq, and none of the non-significant scores were negative, which means that Kets was probably the closest population for Karasuk in that dataset.
Taking into account the tree topologies and migration edges discussed in Suppl. Information, Section 9, we can tentatively model Kets as a two-way mixture of Siberians (related to East Asians) and ancient North Eurasians (represented by the Mal'ta genome). The Loschbour and Mal'ta individuals form a clade relative to East Asians (Fig. 3A), which is supported by |Z| < 2 for most statistics f4(East Asian, outgroup; Loschbour, Mal'ta) (Suppl.  , demonstrated a higher proportion of Mal'ta ancestry in Kets (43%), a bit lower proportion in Altaians (41%), and much lower proportions in Buryats and Nivkhs (9% and 15%, respectively, see Suppl. ), f4-ratios ranged from 25% to 45% (Suppl. Tsimshian. In summary, the level of Mal'ta ancestry in Kets is similar to but not higher than that in Native Americans, and this conclusion is compatible with results obtained with f3(Yoruba, Mal'ta, X) (Suppl. ).

Kets and ancient populations of South Siberia
According to statistics f4(Karasuk, Yoruba; Ket, X) on the genome-based dataset (Suppl. Fig. 8.17A Nganasan, Itelmen, Koryak, Chukchi, Eskimo (populations are sorted in the order of decreasing Z-score, from -3.6 for Tuvinian to -13.9 for Eskimo). In the genome-based dataset lacking Nganasans and closely related populations, according to statistic f4(Ket, Yoruba; Saqqaq, X) Kets were significantly closer to Saqqaq (threshold Z-score of 2.9), as compared to all other populations except for Athabaskans, Mayans, and Mixe (Fig. 5B). Late Dorset Paleo-Eskimo was not counted as the closest relative of Saqqaq. The same statistic on the dataset without transitions gave positive non-significant Z-scores < 2.9 for Athabaskans, Clovis, Greenlanders, Iron Age Russia, Mayans, and Mixe (Suppl. Fig. 8.16). Statistic f4(Saqqaq, Yoruba; Ket, X) had significantly negative Z-scores of -4.3 and -12.8 only for Greenland Inuits and Late Dorset, respectively, with other negative scores >-2 on both versions of the genome-based dataset (Suppl. Fig. 8.18).
For example, f4(Athabaskan, Yoruba; Ket, X) produced highly negative Z-scores below -7.8 for Clovis, Greenland Inuits, Karitiana, Mayans, and Mixe; and a moderately negative score for Saqqaq (Z = -1.9) (Suppl. Fig. 8.19A). On the dataset without transitions Saqqaq demonstrated a similar Z-score of -2.1 (Suppl. Fig. 8.19B). Similarly, on the dataset 'Ket genomes + HumanOrigins array' statistics f4(Chipewyan, Chimp; Ket, X) demonstrated highly negative Z-scores <-10.8 for all First Americans, and significant scores (threshold Z-score of -3.35) for the following East Eurasian populations (sorted in the order of decreasing Z-score, from -4.1 for Ulchi to -13.2 for Eskimo): Ulchi, Yukaghir, Nganasan, Itelmen, Koryak, Chukchi. TreeMix consistently placed Athabaskans as the sister-clade for First Americans, i.e. Clovis, Karitiana, Mayans, and Mixe, with very high bootstrap supports from 94 to 99 at m from 6 to 8 (Fig. 3A, Suppl. Figs. 9.1-9.5), in agreement with TreeMix results in Raghavan et al. (2014 and with the f4 statistic results shown above. Overall, we can conclude that Siberian ancestry in Chipewyans and Athabaskans is more closely related to Nganasans rather than to Kets, and the same is true for Saqqaq.
According to a published admixture graph analysis, Chipewyans were modeled as a mixture of 84% First Americans and 16% Saqqaq . Given the topology suggested in our study, i.e.
Saqqaq forming a clade with certain Asian populations, rather than with First Americans (due to its 31% to 57% of Siberian ancestry, see above), we can estimate the percentage of Saqqaq ancestry in Chipewyans and Athabaskans on the genome-based dataset without transitions and on the HumanOriginsbased datasets. The topology 'Native American, (Saqqaq, Siberian)' was supported by low Z-scores (|Z| < 2) for a number of American-Siberian population combinations in the following f4 statistic set-up: f4(Clovis or Mayan, Yoruba or San; Siberian population, Saqqaq) (Suppl. Table 6). In the case of Athabaskans, included into the genome-based dataset, the f4-ratios equaled: 4(Ket or Nivkh, San; Athabaskan, Clovis or Mayan) 4(Ket or Nivkh, San; Saqqaq, Clovis or Mayan) The Saqqaq ancestry in Athabaskans estimated using this approach was close to 0, judging by the f4-ratios ranging from -0.006 to 0.089. Similarly, the following f4-ratios were used to estimate the Saqqaq ancestry Suppl. Statistics for all edges modelled in the best trees at m from 1 to 8 are presented in Suppl. Table 7. Trees, percentage of explained variance, and residuals are shown below for the following counts of migration edges: 6, 7, and 8 for the original dataset and 6 and 8 for its version without transitions. The tree with 7 migration edges for the dataset without transitions is shown in Fig. 3A. Direction of an edge inferred by TreeMix was not considered critical as it was usually the least stable feature, depending on various dataset parameters (see, e.g., the Beringian -Paleo-Eskimo edges in Suppl.  (Fig. 1C) or 6.3% (Suppl. Fig. 5.6).
However, Siberian ancestry in Paleo-Eskimos modelled by these migration edges is much lower than the sum of all three Siberian admixture components in Saqqaq (32% at K=19 in Fig. 1A and 28% at K=23 in Suppl. Fig. 5.4), or the Siberian admixture component at K=5 in the original Saqqaq paper (~25%, , or Siberian ancestry in Saqqaq calculated using f4-ratios (ranging from 31% to 57% depending on dataset and reference populations, Suppl. Table 6).
In trees made on the original dataset with m=5 and 6 Kets form a clade with the Karasuk culture ancient population from the Bronze Age of the Altai region (Suppl. Fig. 9.1). The low bootstrap support of this clade (46 at m=6) is explained by the fact that Kets alternatively form a clade with another ancient genome from the Altai, Iron Age Russia (Suppl. Fig. 9.2), or with Mari (Suppl. Fig. 9.3), or form a paraphyletic assemblage with the Karasuk, Iron Age Altai and Iron Age Russia samples (Fig. 3A, Suppl. Figs. 9.4,9.5). Genetic proximity of the Karasuk culture to Kets was also demonstrated by f3 statistic (Yoruba; Karasuk, X) on the genome-based dataset (Suppl. Figs. 7.5,7.6), and the Karasuk culture has been tentatively associated with Yeniseian-speaking people using hydronymic data (Chlenova 1975).
Other edge groups (Suppl.

Mitochondrial haplogroups
We have determined mitochondrial and Y-chromosomal haplogroups based on SNP data from GenoChip: approximately 3,300 mitochondrial and 12,000 Y-chromosomal SNPs (see Methods for details). The frequencies of mitochondrial haplogroups in 46 putatively unrelated Kets in this study (Suppl. file S4) were similar to those reported previously for 38 Ket individuals (Derbeneva et al. 2002).
There are just few differences between this study and the previous one: i/ two times higher proportion of haplogroup F according to Derbeneva et al. (2002)

Methods
To estimate the Neanderthal gene flow influence we performed D-statistic analysis as described in Green et al. (2010). Reads for two Yoruba and two Kinh (Vietnamese) individuals were downloaded from the 1000 Genome Project database (McVean et al. 2012). We chose Yoruba samples NA19238 and NA19239, and Kinh Vietnamese samples HG01873 and HG02522 as they had read coverage similar to the Ket samples, and were not genetically related to each other. Ket, Yoruba, and Vietnamese reads were used for calling SNPs with GATK HaplotypeCaller, emitting both reference and nonreference sites, about 1 billion sites per individual. This procedure ensured that genotype calls for each individual were made in exactly the same way. Altai Neanderthal and chimpanzee genotypes were processed as described in Khrameeva et al. (2014). Coordinates of the chimpanzee genome were mapped to the human genome hg19 using UCSC liftOver tool (Rosenbloom et al. 2015).
In further analysis, we considered only homozygous sites different between the chimpanzee (A) and Neanderthal (B) genomes. Then we matched a randomly selected modern human allele to these sites. All sites where a Ket allele matched a Neanderthal allele and a Yoruba allele matched a chimpanzee allele were counted and referred to as #ABBA (termed Neanderthal-like sites). All sites where a Ket allele matched a chimpanzee allele and a Yoruba allele matched a Neanderthal allele were counted and referred to as #BABA. D-statistic = (#ABBA -#BABA)/(#ABBA + #BABA) was calculated and averaged for all possible pairs of Yoruba and Ket samples. As a control, the same analysis was repeated for Vietnamese genotypes instead of Ket genotypes.
Ket and Vietnamese sites used in the D-statistic analysis were assigned to human genes according to coordinates of the longest transcript retrieved from UCSC Genome Browser (Rosenbloom et al. 2015) plus 1,000 nucleotides upstream to include potential regulatory regions. The gene set enrichment analysis (GSEA) algorithm (Subramanian et al. 2005) ranked genes according to difference between #ABBA and #BABA, while four pairs of samples were treated as replicates. We used the MSigDB collection of 825 gene ontology (GO) biological processes (c5.bp.v3.0.symbols.gmt) (Subramanian et al. 2005) to assign genes to functional groups. GO terms with less than 15 or more than 500 genes per term were excluded. The mean and median false discovery rates (the mean FDR and median FDR) were used to estimate the significance of Neanderthal sites enrichment in the functional groups. In GSEA, the mean FDR was obtained by using the mean of the estimated number of false positives in each of 3000 permutations of the sample labels, while the median FDR was calculated as the median of the estimated number of false positives in the same permutations.

Results
To estimate the Neanderthal gene flow influence, we performed D-statistic analysis as described in Green et al. (2010). Given two Ket and two Yoruba individuals, we calculated the statistic D(Neanderthal, Chimp; Ket, Yoruba) for four different pairs of individuals. The mean D-statistic value, 3.85±0.15%, was in good agreement with other studies (Green et al. 2010, Khrameeva et al. 2014. As a control, we replaced the Ket genotypes with Vietnamese genotypes processed using the same procedure. The control D-statistic value was 3.95±0.19% (Suppl . Table 10). Positive D-statistic values reflect higher similarity of Ket rather than Yoruba genotypes to Neanderthal genotypes, as expected for any non-African individuals.
In order to find Ket functional gene groups enriched in Neanderthal alleles we applied the GSEA algorithm to 'biological process' gene ontology (GO) terms (Khrameeva et al. 2014) (Suppl. file S11). No functional groups had mean FDR < 0.05, however two groups had median FDR < 0.05: 'amino acid catabolic process' with 24 genes (mean FDR = 0.6, median FDR = 0); and 'nitrogen compound catabolic process' with 28 genes (mean FDR = 0.7, median FDR = 0). It should be noted that the mean FDR was previously reported to overestimate the true false discovery rate when the sample size was small, while the median FDR was almost unbiased (Hirakawa et al. 2008). As the second functional group includes all genes from the first group, only the top group, 'amino acid catabolic process', is discussed further. ABBA and BABA site counts and D-statistics for individual genes in this group are shown in Suppl. file S11. Detailed inspection of site counts in individual genes showed the following genes with D-statistic > 50%: ASL, argininosuccinate lyase, a urea cycle enzyme crucial for ammonia removal; FAH, fumarylacetoacetate hydrolase, involved in tyrosine catabolism; GAD2, glutamate decarboxylase 2, involved in the synthesis of an important neurotransmitter γ-aminobutirate in the brain; GOT1, cytoplasmic isoform of glutamate-oxaloacetate transaminase, playing a role in amino acid metabolism and the urea and tricarboxylic acid cycles; GSTZ1, glutathione S-transferase zeta 1, involved in tyrosine and phenylalanine catabolism. The observed enrichment of Neanderthal-like sites in catabolic pathways associated with a protein-rich diet suggests that Kets and Neanderthals (Sistiaga et al. 2014) had similar dietary preferences. Indeed, the diet of Kets until today includes a large proportion of meat and fish, and Neanderthals were previously reported to predominantly consume meat (Sistiaga et al. 2014). We suggest that Kets, who abandoned the nomadic hunting lifestyle only in the middle of the 20 th century, are a good model of genetic adaptation to protein-rich diets. However, we note that our results were obtained with a very small sample of Ket genomes, and the analysis has to be repeated when a much larger sample of Ket genomes becomes available.
We found one gene group potentially enriched in Neanderthal-like sites in control Vietnamese samples with median FDR < 0.05 ('response to nutrient' with 17 genes, mean FDR = 1, median FDR = 0). In a previous work (Khrameeva et al. 2014), genes involved in lipid catabolism (GO process 'lipid catabolic process') were shown to be significantly enriched in Neanderthal alleles in populations of European descent only. This effect was not observed for the Kets (FDR-corrected p-value = 1, see also site counts for individual genes in Suppl. file S11).