MtDNA analysis reveals enriched pathogenic mutations in Tibetan highlanders

Tibetan highlanders, including Tibetans, Monpas, Lhobas, Dengs and Sherpas, are considered highly adaptive to severe hypoxic environments. Mitochondrial DNA (mtDNA) might be important in hypoxia adaptation given its role in coding core subunits of oxidative phosphorylation. In this study, we employed 549 complete highlander mtDNA sequences (including 432 random samples) to obtain a comprehensive view of highlander mtDNA profile. In the phylogeny of a total of 36,914 sequences, we identified 21 major haplogroups representing founding events of highlanders, most of which were coalesced in 10 kya. Through founder analysis, we proposed a three-phase model of colonizing the plateau, i.e., pre-LGM Time (30 kya, 4.68%), post-LGM Paleolithic Time (16.8 kya, 29.31%) and Neolithic Time (after 8 kya, 66.01% in total). We observed that pathogenic mutations occurred far more frequently in 22 highlander-specific lineages (five lineages carrying two pathogenic mutations and six carrying one) than in the 6,857 haplogroups of all the 36,914 sequences (P = 4.87 × 10−8). Furthermore, the number of possible pathogenic mutations carried by highlanders (in average 3.18 ± 1.27) were significantly higher than that in controls (2.82 ± 1.40) (P = 1.89 × 10−4). Considering that function-altering and pathogenic mutations are enriched in highlanders, we therefore hypothesize that they may have played a role in hypoxia adaptation.

considerable effect on OXPHOS and even cause maternal genetic diseases like Leber's hereditary optic neuropathy (LHON), deafness, mitochondrial encephalomyopathy with lactic acidosis and stroke-like episodes (MELAS), and etc. 13,14 . Several studies have reported that the pathogenic mutations in lowland populations showed signals of being under positive selection in Tibetans and Sherpas, indicating that these radical and function-altering variants might have effect on OXPHOS and further contribute to hypoxia adaptation for Tibetan highlanders 11,15 . However, there is still no systematic evaluation on the function-altering and pathogenic mutations in mtDNA genomes involving different ethnic groups of the Tibetan Plateau.
In this study, we analyzed 549 highlander sequences (i.e. Tibetans, Lhobas, Dengs, Monpas and Sherpas) including 293 newly sequenced ones, to obtain a comprehensive view of mtDNA profile of Tibetan highlanders. An mtDNA phylogeny was constructed using a total of 36,914 sequences of world populations including highlander sequences. Through the founder analysis based on 432 random highlander samples, we proposed a three-phase model of colonizing the Tibetan Plateau, i.e., pre-LGM Time (30 kya, 4.68%), post-LGM Paleolithic Time (16.8 kya, 29.31%) and Neolithic Time (after 8 kya, 66.01% in total). Next, we observed that pathogenic mutations occurred far more frequently in 22 highlander-specific lineages (five lineages carrying two pathogenic mutations and six carrying one) than in the 6,857 haplogroups of all the 36,914 sequences (P = 4.87 × 10 −8 ). In addition, possible pathogenic mutations in Tibetan highlanders were significantly higher than those in controls from East Asians of 1000 Genomes Projects using different strategies. Thus, considering that function-altering and pathogenic mutations are enriched in highlanders, we hypothesize that they may have played a role in adaptation of hypoxic environments for highlanders.

Results
Overview of mtDNA in Tibetan highlanders. We collected and sequenced 285 Tibetan highlanders (i.e., Tingri Tibetans, Lhobas, Dengs and Monpas) and generated high quality data of the complete mtDNA sequences (average coverage being 3,402× and minimum coverage 186× ). In addition, we obtained 33 complete mitochondrial sequences of Tibetans from fastq files of Yi et al. (average coverage being 15× and minimum coverage 6× ) 9 , and collected all 223 Tibetan highlander sequences (including Sherpas, Monpas and Tibetans) available in the literatures (detail information of these samples was tabulated in Table S1). Eight additional Tibetan sequences generated in our lab were also included for phylogenetic analysis as Tibetan highlanders. In total, 549 mtDNA sequences of Tibetan highlanders were analyzed in this study, of which 432 individuals were randomly sampled from 7 highland populations, including 86 Tingri Tibetans, 91 Lhobas, 91 Dengs, 17 Monpas, 76 Sherpas, 38 Tibetans from Ji et al. 11 (referred as Tibetan1) and 33 Tibetans from Yi et al. 9 (referred as Tibetan2). Finally, 293 mitochondrial genomes were deposited in Genbank (KT725860-KT726152). Variations to rSRS of 33 Tibetan mtDNA from fastq files of Yi et al. were attached as supplemental data (Table S2). Haplogroup assignments were consulted to Phylotree Build 16 16 .

MtDNA phylogeny of Tibetan highlanders. To identify autochthonous lineages and variants in Tibetan
highlanders, a maximum parsimony phylogeny of 36,914 worldwide complete mtDNA sequences including 549 highlander sequences were reconstructed, which was further confirmed by median-joining network. In the reconstructed phylogeny, 21 haplogroups were considered as major haplogroups of Tibetan highlanders since these haplogroups accounted for considerably high frequencies in 432 random highlanders or appeared in at least three ethnic groups of highlanders (in total 63.19% in 432 random samples, Figures S3 and S4 and Table 1). For these major haplogroups are derived from the most recent common ancestor shared with non-highlander populations in the phylogeny respectively, they represent possible distinct founding events by ancestors migrating to the Tibetan Plateau.
We further inferred the demographic history of 432 highlander sequences of coding regions using Bayesian skyline plot (BSP) (Fig. 1A). Three phases of population expansions emerged. First, the effective population size (Ne) of Tibetan highlanders expanded from less than 1,000 to ~20,000 before ~35 kya, assuming 25 years as generation time length. The Ne did not change drastically until 20 kya, probably corresponding to the end of LGM. The second expansion occurred between 20 to 10 kya with Ne expanding to ~50,000, which was followed by a severe population reduction. Finally, the greatest expansion of highlanders began at ~5 kya, with a ~10-fold Ne increase at ~2 kya, but then transit to a recent bottleneck, with Ne decreasing to ~4,600. Further, a founder analysis was conducted by employing highlander sequences as 'sink' samples and other East Eurasians as 'source' samples (See details in Methods). As a result, the probabilistic distribution of founder clusters showed five peaks (i.e. 30.0, 16.8, 8.0, 3.6 and 1.2 kya), indicating five distinct founding events for colonizing the Tibetan Plateau (Fig. 1B). These founding events coincided well with the three phases of expansions observed in BSP.
In summary, we proposed a three-phase model of colonizing the Tibetan Plateau: pre-LGM phase (before 20 kya), post-LGM Paleolithic phase (20-10 kya) and Neolithic phase (after 10 kya). The partition analysis 17 showed that the relative proportion of the three phases were 4.68%, 29.31% and 66.01% of founding lineages in Tibetan highlanders, respectively ( Table 2).
In the phylogeny, we found that there were 22 highlander-specific lineages containing at least 4 haplotypes and 3 highlander samples, which might represent the autochthonous lineages those survived in the highlands (i.e. Haplogroups G3b3a, M13a2, M33b1a1, M62, M7b1a1j, C7a1a2, Z7, M9a1a1c1b, M9a1a2, M9a1b1c, M9a1b1d3, A7, A11a, A15c1a, N11a1, G3b1a, M49a1a1, C4a3b1, D4j1b3, D5a2c2, D5a3a2 and F1c1a1a, see Figure S4). These lineages encompassing 190 individuals accounted for considerable frequencies in 432 random highlanders (43.98%). The defining variants of these highlander-specific lineages may have played a role in the adaptation of all the individuals carrying them. Therefore, we searched the defining variants of the 22 highlander-specific lineages in the database of pathogenic mutations described in MITOMAP (updated on Jan, 8 th , 2015, with synonymous and HVS mutations removed, 568 mutations in total) to identify mutations that may lead to radical changes in RNA or protein function. As a result, of the 22 haplogroups, 5 lineages harbored 2 pathogenic mutations while 6 lineages carried one potential pathogenic mutation ( Table 3 and see networks of 4 major haplogroups in Fig. 2). Among all the 36,914 sequences analyzed, 6,857 haplogroups contained at least 4 haplotypes and 3 samples, of which 701 (10.22%) lineages had one pathogenic mutation while 68 (0.99%) had at least two pathogenic mutations. We found that the pathogenic mutations occurred far more frequently in the highlander-specific lineages (Fig. 3A, Fisher exact test, P = 4.87 × 10 −8 ). In addition, even we compared the frequency of pathogenic mutations between highlander-specific lineages and 481 young haplogroups (ρ ≤ 1) with higher ratio of deleterious mutations of above 6,857 haplogroups, the pathogenic mutations still occurred far more frequently in the highlander-specific lineages (Fisher exact test, P = 1.   Table S5), suggesting that these function-altering mutations may have played a role in adaptation of highlanders in all three phases of colonization. Among the 16 pathogenic mutations harbored in the 11 lineages, 6 of them (T3394C, T4216C, T9101C, G7598A and G13708A in 2 lineages) were implicated in LHON, 5 of them (A636G, T1005C, C1192A, T4363C and T10454C) might cause deafness, and 4 of them (T2352C, A3397G, G7697A and A15924G) might lead to several kinds of mitochondrial myopathies (Table 3). These mutations were located almost in all regions of the mitochondrial genome, including the regions coding 3 complexes (Complex I, Complex IV and Complex V), 4 tRNAs and 2 rRNAs. T3394C, which changed the tyrosine (Y) at 30 th amino acid of ND1 subunit to a histidine (H), was not only located in highlander-specific lineage G3b3a, but also occurred additional 3 times independently in Deng sample DB033 (Haplogroup M33b1a1), Lhoba sample LB345 (Haplogroup R22), and Haplogroup M9, which was the most frequent haplogroup in Tibetan highlanders. Another mutation A15924G, located on the coding region of tRNA Thr , occurred 4 times in highlander-specific lineages (Sherpa sample XEB171 of Haplogroup M9a1a1c1b1a1, Tibetan sample GQ895147 of Haplogroup A11a, Haplogroup M9a1a2a and Sherpa-specific lineage A15c1a). Thus, all the above 16 pathogenic mutations might have certain effect on metabolism and are important candidates for further functional experiments.  Pathogenic mutations in highlander individuals. The enrichment of pathogenic mutations in highlander individuals could also be examined by site-based association analysis. In the following association analysis, we used 367 East Asians (CHB, CHD, CHS and JPT from 1000 Genome Projects) as controls 18 . Among the 1,809 polymorphic sites in a total of 799 samples (432 highlanders and 367 controls), alleles at 127 sites were detected pathogenic in database from MITOMAP. Among the aforementioned 16 mutations harbored on 11 highlander-specific lineages, T3394C (P = 4.48 × 10 −29 ), G7697A (P = 3.72 × 10 −15 ), T10454C (P = 4.71 × 10 −6 ) and A15924G (P = 4.98 × 10 −5 ) reached significant enrichment in highlanders, given adjusted 0.05 level significance (0.05/127 = 3.93 × 10 −4 ) using Bonferroni correction. It should be noted that the most significant allele among all 127 sites is T3394C (P = 4.48 × 10 −29 ), given the fact that there are 28.24% of T3394C carriers in highlanders, ~17 folds to those in controls (1.63%). Next, we compared the amount of accumulated pathogenic mutations in highlander sequences with those in East Asians from 1000 Genomes Projects. First, we detect possible deleterious mutations in an individual based on pathogenic mutation database from MITOMAP. Potential pathogenic mutations in highlanders (in average 3.18 ± 1.27) were found significantly higher than those in controls from 1000 Genomes Projects (2.82 ± 1.40) (Mann-Whitney U test, P = 1.89 × 10 −4 ) (Fig. 3B). Second, we detected possible tRNA pathogenic candidates according to Kondrashov et al. 19 . We found that Tibetan highlanders carried 0.19 (± 0.39) tRNA pathogenic mutations in average, significantly more than those in controls (0.13 ± 0.34, Mann-Whitney U test, P = 0.028). Finally, we evaluated the pathogenic effects of nonsynonymous variants using Mutpred scores (ranging from 0 to 1) 20 . Higher scores indicate a greater probability that the amino acid variation might be function-altering and pathogenic. To avoid arbitrary use of cutoffs, we applied Mutpred scores 0.5, 0.6 and 0.7 as pathogenic cutoffs, respectively. Since there are a limited number of mutations with Mutpred scores over 0.8 (only 23 variants in 799 samples), we did not apply cutoffs higher than 0.8. Using all the three cutoffs, highlanders contained significantly more pathogenic nonsynonymous mutations than controls, respectively (1.73 ± 1.23 vs. 1.20 ± 0.96 for 0.5 cutoff, P = 2.69 × 10 −10 ; 1.00 ± 0.90 vs. 0.51 ± 0.72 for 0.6 cutoff, P = 1.80 × 10 −17 ; 0.36 ± 0.52 vs. 0.15 ± 0.38 for 0.7 cutoff, P = 2.56 × 10 −11 ).
In addition, we also employed two East Asian populations as lowland controls to conduct the same analyses in this section, one of which included 444 Han Chinese from Rugao, Jiangsu Province 21 , another included 550 Taiwan aborigines 22 . Generally, the results employing the two single populations were consistent to those analyses using 367 East Asians as controls (data not shown). In short, we inferred that pathogenic mutations were enriched in Tibetan highlanders. Ratio of nonsynonymous and synonymous mutations. We counted nonsynonymous (N) and synonymous mutations (S) in the highlander-specific lineages and the haplogroups of all the 36,914 sequences in this study. The coalescence time of each haplogroup was estimated based on ρ -statistic value for further grouping to calculate corresponding ratio of nonsynonymous and synonymous mutation (N/S). We showed earlier that the major expansion of Tibetan highlanders and the majority of immigrations to the Tibetan Plateau occurred in the past several thousand years. Thus, we focused on the highlander-specific lineages with recent coalescence (ρ ≤ 4).
In Fig. 3C, N/S of lineages of each ρ unit was calculated. For all the 36,914 sequences, the N/S declined with the increase of the age of lineages. In detail, the N/S of the external lineages (ρ = 0, N/S = 0.58) was significantly higher than that of internal lineages within 10 kya (1 ≤ ρ ≤ 4, N/S = 0.47, P = 1.46 × 10 −17 ). This observation suggested that mtDNA was under purifying selection against deleterious nonsynonymous mutations, coinciding with previous study [23][24][25] . As contrast, the N/S of the highlander-specific haplogroups showed an opposite trend, i.e., increasing with the ages of lineages (Fig. 3C). We found that the N/S of the external lineages (ρ = 0, N/S = 0.61) was lower than that of the internal lineages (1 ≤ ρ ≤ 4, N/S = 0.69), although not statistically significant. This observation suggested that some of the nonsynonymous mutations might have been favorably preserved in the highlander-specific haplogroups, which was a signal of positive selection.
Interestingly, the N/S of the external branches of the highlander-specific lineages (ρ = 0, N/S = 0.61) was very close to that of all the 36,914 sequences (ρ = 0, N/S = 0.58) (Fig. 3C). However, we found that the N/S in the internal branches of the highlander-specific haplogroups (1 ≤ ρ ≤ 4, N/S = 0.69) was nearly significant higher than that of all the 36,914 sequences (1 ≤ ρ ≤ 4, N/S = 0.47, P = 0.058) (Fig. 3D, Table S6). In detail, we grouped these variants according to four complexes and found that the N/S of 3 complexes of the highlander-specific lineages was higher than that of all the 36,914 sequences respectively except for Complex III (Cytochrome bc1 complex), suggesting a mitochondrial genome-wide trend.
Demographic simulations. Considering that N/S might be elevated by the recent bottleneck in Tibetan highlanders but not positive selection, we further conducted forward demographic simulations to rule out such possibility.
We selected 432 random Tibetan highlanders from the current study as a representative of highlanders and the 367 random East Asian samples from 1000 Genomes Projects as a representative of East Asian lowlanders. Time of the mutations was estimated based on the coalescence time of corresponding haplogroups in the mtDNA phylogeny of 36,914 sequences. As a result, the N/S of the external lineages of East Asian lowlanders (ρ = 0, N/S = 0.54) was significantly higher than that of the internal lineages within 10 kya (1 ≤ ρ ≤ 3, N/S = 0.28, P = 0.0005), while the N/S of the external lineages of Tibetan highlanders (ρ = 0, N/S = 0.57) was lower than that of internal lineages (1 ≤ ρ ≤ 3, N/S = 0.64), although not significantly. This result was consistent to the previous observation in the last section, indicating that the mtDNA of lowlanders was under purifying selection against  Based on the BSP plot of 367 East Asians and 432 random Tibetan highlanders, we reconstructed the evolutionary history of lowlanders and highlanders (See Fig. 1 and Zheng et al.) 18 . According to the demographic models from BSPs (Check Figure S5 for details), we employed SFS_CODE to conduct forward simulations with the effect of purifying selection 26 . The detail settings of SFS_CODE were referred in Material and Methods. The results of simulations showed that N/S of the external lineages (ρ = 0) was significantly lower than that of the internal lineages within 10 kya (1 ≤ ρ ≤ 3) under purifying selection in both East Asian lowlanders and Tibetan highlanders (Table 4), coinciding with our observation on East Asians but different with highlanders. This result indicated that it still was of little chance to observe higher N/S in internal lineages for Tibetan highlanders when  bottleneck effect was considered, inferring that the higher N/S in the internal lineages violated the simple models with purifying selection and thus positive selection might help to preserve the internal nonsynonymous mutations. In short, the highlander lineages carried more nonsynonymous and function-altering mutations in the internal branches, indicating a signal of positive selection. Considering all above, we hypothesize that the enriched function-altering and pathogenic mutations in Tibetan highlanders may have played a role in adaptation of hypoxia environment. Nevertheless, there is no direct evidence relating the pathogenic mutations to hypoxia adaptation and other explanations to our observations are possible. Further studies, especially functional experiments are necessary to reveal the mechanism how pathogenic mutations contribute to hypoxia adaptation.

Discussion
In the phylogeny, the 21 major haplogroups of Tibetan highlanders could generally be divided to two types. Some haplogroups are highlander-specific lineages, such as Haplogroups M62, M13a2, M9a1a1c1b, M9a1b1c, M9a1a2 and A11a, indicating that the ancestors of these lineages migrated to the highlands ( Figure S4). Other haplogroups derive considerable immediate non-highlander lineages, such as Haplogroups F1g and D5a2c, implying that these lineages split to lowlands and highlands respectively. Both types of the haplogroups above indicated founding events for colonizing the Tibetan Plateau, which were used as criteria for identification of founder haplotypes assuming sufficient sampling in outgroups (See Methods, and 17 ). Generally, the most recent common ancestor of highlanders and other East Eurasians would be considered as a founder. Founder analysis based on mtDNA hyper-variation regions (HVR) was employed in previous studies to reconstruct the migration history of modern humans 17,27 . With the aggregation and accumulation of mtDNA complete sequences, we tried to conduct founder analysis on whole mtDNA sequences for their high phylogenic resolution. As a result, we proposed a three-phase model of colonizing the Tibetan Plateau, i.e. pre-LGM Time (30 kya, 4.68%), post-LGM Paleolithic Time (16.8 kya, 29.31%) and Neolithic Time (after 8 kya, 66.01% in total). Lines of archeological evidence revealed that modern humans colonized the Tibetan Plateau before LGM 2 and thrived with the advent of agriculture 28 . Previous genetic studies on mtDNA and Y chromosomes also showed that the course of colonization was throughout the Paleolithic and Neolithic Times [29][30][31][32][33] .
Radical mtDNA variations would have considerable effect on OXPHOS and cause mitochondrial diseases, especially in the tissues highly dependent on oxidative metabolism, like retinas, ears, skeleton muscles, hearts and etc. 13,14 . In this study, we found that 16 pathogenic mutations were harbored in 11 highlander-specific lineages, which might have certain effect on OXPHOS and further impact on metabolism. These 16 mutations might have pathogenic effect on lowland people for impairment on OXPHOS function 35 . However, some studies also showed that these mutations occurred in healthy subjects, indicating that these pathogenic mutations might not result in diseases under certain conditions (e.g. autosome backgrounds and environments) 36,37 . In recent years, genomic studies have revealed numerous genes contributing to hypoxia adaptation for highlanders, indicating the complex mechanism underlying the adaptation, including interaction of multiple genes 38,39 , environment effect 28 and even introgression from archaic humans 40 .
A recent paper 41 showed that hypoxia was protective against mitochondrial toxicity in cultured cells and zebrafish models with respiratory chain deficiency. In addition, chronic hypoxia leaded to a marked improvement in survival and disease biomarkers in a genetic mouse model of Leigh syndrome. The normal concentration of oxygen was toxic to patients with respiratory chain deficiency. However, hypoxia might improve the symptoms by triggering innate adaptive programs (such as HIF pathway) and by simultaneously limiting the substrate for oxygen toxicity. The results indicated a natural match for hypoxia environments in Tibetan highlands and mild mitochondrial dysfunction caused by mtDNA deleterious mutation, which provide functional evidence for our work.
HIF pathway, involving two genes (EGLN1 and EPAS1) of significant signals of positive selection, was under extensive focus. Mitochondria could generate ROS and have further effect on HIF pathway. Thus, we hypothesize that function-altering and pathogenic mutations of mtDNA may affect OXPHOS function to generate and accumulate ROS in mitochondria, subsequently inhibit PHD2 or stabilize HIF, and result in contribution in hypoxia adaptation. Nevertheless, there is no direct evidence relating the pathogenic mutations to hypoxia adaptation. Further studies, especially functional experiments are necessary to reveal the detailed mechanism.

Methods
Population and samples. Four ethnic groups (in total 285 samples) residing in the Tibetan Plateau were collected in this study, including 86 Tibetans from Tingri County in Shigatse Prefecture, 91 Lhobas from Mainling and Gongbo'gyamda Counties in Nyingtri Prefecture, 91 Dengs from Zayü County in Nyingrtri Prefecture, and 17 Monpas from Mainling and Gongbo'gyamda Counties in Nyingtri Prefecture. All these samples were maternally unrelated. This research was approved by both the Human Ethics Committees of Tibet University for Nationalities and School of Life Sciences of Fudan University, and was carried out in accordance with the approved guidelines.
Informed consent was obtained from all subjects collected in the current study. The sequencing and variant calling methods for these samples were according to those described previously 15,21 . Thirty-three Tibetan mtDNA sequences were obtained from fastq files of Yi et al. 9 , of which missing sites were further imputed based on the phylogeny. We also collected all 223 Tibetan highlander mtDNA sequences available in the literature 4,11,15,30,31,42 . Eight additional Tibetan sequences generated in our lab were also included for phylogenetic analysis as Tibetan highlanders. In total, 549 mitochondrial sequences of Tibetan highlanders were analyzed in this study, of which 432 sequences composed a random highlander dataset (See Table S1). Finally, 293 mitochondrial genomes were deposited in Genbank (KT725860-KT726152). Variations to rSRS of 33 Tibetan mtDNA from fastq files of Yi et al. were attached as supplemental data (Table S2). A total of 36,914 world-wide mtDNA sequences were analyzed in this study, which included all available sequences from Phylotree database 16,43 and unpublished data in our lab. Data analysis. Complete mtDNA sequences were assigned to haplogroups generally according to PhyloTree Build 16 16,43 . The complete median-joining network of Tibetan highlanders ( Figure S3) was reconstructed by Network v4.6 44 . Considering the network and PhyloTree Build 16, we generated the final phylogeny ( Figure S4). Haplogroups Z3b, Z7, A7 and D5a2 were modified according to updated sequences compared to PhyloTree Build 16. Haplogroups A15c1a, G3b3, G3b1a, G2a1a, M33b1a, D4j1b3, D4h1c1a, D5a3a2, M49a1a, M7b1a1j, M9a1b1d, F1c1a2, F1d1a and F1d5 were newly named in the current study.
Then, the phylogeny of Tibetan highlanders was used to examine the assumption of a molecular clock under the HKY + G mutation model. The null hypothesis of a molecular clock cannot be rejected (P = 1.00) using PAML package v4.4 45 . According to the phylogeny, the coalescence time of lineages of interest was estimated using ρ statistic-based method and maximum likelihood method implemented with all available sequences in these lineages. For ρ statistic-based method, standard deviation was calculated following Saillard et al. 46 . Then the time to TMRCA of each lineage was estimated using Soares rate for complete mtDNA sequences 25 .
Bayesian skyline plots (BSP) for the 7 populations of the Tibetan Plateau and 432 random highlander sequences were generated by BEAST 1.8.1 47 and Tracer 1.5.1 using mtDNA coding regions (576-16023). Each MCMC sample was based on a run of 100 million generations sampled every 10,000 steps with the first 10 million generations regarded as burn-in. We used the HKY+ G model of nucleotide substitution without partitioning the coding region. A strict clock was used and prior substitution rate was set 1.691 × 10 −8 subs/site/year 48 .
Φ ST distances between populations were calculated in Arlequin 3.11 49 using coding regions, and plotted in PAST 1.85 50 with a non-metric multidimensional scaling method (MDS), showing that highlander populations were clustered together and close to other East Asian populations. Tajima's D as well as Fu & Li's D* and F* tests were performed using DnaSP v5.10 51 .
To clarify the colonizing history of the Tibetan Plateau, we conducted a founder analysis, using all 549 highlander sequences as 'sink' samples and other East Eurasians as 'source' samples in the phylogeny. We used an f0 criterion to identify founders 17 , for the majority of highland founding events occurred within several thousand years. We identified candidate founders by searching for 1) identical haplotypes in Tibetan highlanders and other East Eurasians, and 2) reconstructed ancestor haplotype with both derivatives of highlanders and other East Eurasians, which are either a) unsampled haplotypes with both immediate highlander and non-highlander derivatives, or b) haplotypes sampled only in Tibetan highlanders and whose immediate derivatives include at least one other East Eurasian, or c) haplotypes sampled only in non-highlanders and whose immediate derivatives include at least one highlander. Generally, the most recent common ancestor of highlander and other East Eurasians would be considered as a founder.
In the phylogeny, 21 haplogroups were considered as major haplogroups of founding the Tibetan Plateau since these haplogroups accounted for considerably high frequencies in 432 random highlanders or appeared in at least three ethnic groups of Tibetan highlanders ( Figures S3 and S4 and Table 1). Most of these major haplogroups are derived from a single founder respectively except that Haplogroup D4j1a1 might be derived from several but phylogenetically near founders.
We estimated the migration times of all candidate founders using the ρ statistic 52 and a complete mutation rate 25 . Here, the ρ statistic represented the average number of mutations between the founder haplotype and every highlander sequence. Then, the sample size of each founder was adjusted according to 432 random Tibetan highlanders. We scanned the distribution of founder ages for Tibetan highlanders defining equally spaced 200-year intervals for each migration from 0 to 50 kya. A second partitioning analysis was conducted using defined migration time based on the previous scan. This second analysis allowed us to fractionate the extant highlander lineages into each migration and estimate the percentage of lineages in each migration event. Non-highlander samples in Haplogroups A11a, M13b2, M67, M62, N11a1, M9a1a1c1b, M9a1a2 and M9a1b1c were considered to occur due to back migrations. Ruling out the back migrations, these haplogroups were still analyzed as autochthonous in the Tibetan Plateau and highlander-specific.
In the phylogeny, we found that there were 22 highlander-specific lineages containing at least 4 haplotypes and 3 highlander samples, which might represent the lineages those survived in the highlands. We searched the defining variants of the 22 highlander-specific lineages in the database of pathogenic mutations described in MITOMAP (568 pathogenic mutations in total, updated on Jan, 8 th , 2015, with synonymous and HVS mutations removed) 35 to identify mutations that lead to radical changes in RNA or protein function. Among all the 36,914 sequences analyzed, 6,857 haplogroups contained at least 4 haplotypes and 3 samples, of which defining variants were also detected for pathogenic mutations. Next, we compared the frequencies of highlander-specific lineages carrying different numbers of pathogenic mutations to those of the haplogroups of all the 36,914 sequences.
In addition, site-based association analysis was conducted using 432 random Tibetan highlanders as cases and 367 East Asians from 1000 Genomes Projects as controls 18 . Accumulated possible pathogenic mutations in Tibetan highlanders were also compared to those in 367 East Asians from 1000 Genomes Projects based on database of pathogenic mutations described in MITOMAP 35 , tRNA pathogenicity 19 and Mutpred scores 20 , Scientific RepoRts | 6:31083 | DOI: 10.1038/srep31083 respectively. Conservation index of mutations (CI) was consulted to MitoTool 53 . Nonsynonymous/synonymous mutation ratio in highlander-specific lineages was compared to that in the contemporary haplogroups of all the 36,914 sequences according to the coalescence time estimated by the ρ statistic-based method. The nonsynonymous and synonymous mutations were counted based on the branches of the phylogeny and the mutations on a branch were only counted once.
According to the demographic models from BSPs, we employed SFS_CODE 26 to conduct forward simulations with the effect of purifying selection. Generally, we simulated the protein coding regions with length of 11367 base pairs and sampled 300 individuals for each run. Each model was simulated 100 times for East Asians and 1000 times for Tibetan highlanders, respectively. When evaluating models with different strengths of purifying selection, we assumed that selection coefficients for the nonsynonymous sites were drawn from a gamma distribution. To access the distribution of selection coefficients for mutations in 367 East Asians, we employed the strategy referred in Tamuri et al. 54 and found that ~80% of the nonsynonymous mutations were strongly deleterious with population scaled coefficient less than − 10 (data not shown). Thus, we set the selection coefficients in SFS_CODE as − 10, − 20 and − 50 respectively for each demographic model and 80% of the nonsynonymous mutations under the purifying selection. In addition, considering the lethal effect for mitochondrial mutations, we set the non-lethality parameter in SFS_CODE as 0.9.