Altitude adaptation in Tibetans caused by introgression of Denisovan-like DNA

Journal name: Nature
Volume: 512
Pages: 194–197
DOI: 10.1038/nature13408

As modern humans migrated out of Africa, they encountered many new environmental conditions, including greater temperature extremes, different pathogens and higher altitudes. These diverse environments are likely to have acted as agents of natural selection and to have led to local adaptations. One of the most celebrated examples in humans is the adaptation of Tibetans to the hypoxic environment of the high-altitude Tibetan plateau (refs 1–3). A hypoxia pathway gene, EPAS1, was previously identified as having the most extreme signature of positive selection in Tibetans (refs 4–10), and was shown to be associated with differences in haemoglobin concentration at high altitude. Re-sequencing the region around EPAS1 in 40 Tibetan and 40 Han individuals, we find that this gene has a highly unusual haplotype structure that can only be convincingly explained by introgression of DNA from Denisovan or Denisovan-related individuals into humans. Scanning a larger set of worldwide populations, we find that the selected haplotype is only found in Denisovans and in Tibetans, and at very low frequency among Han Chinese. Furthermore, the length of the haplotype, and the fact that it is not found in any other populations, makes it unlikely that the haplotype sharing between Tibetans and Denisovans was caused by incomplete ancestral lineage sorting rather than introgression. Our findings illustrate that admixture with other hominin species has provided genetic variation that helped humans to adapt to new environments.

Figures

  1. Figure 1: Genome-wide FST versus maximal allele frequency difference.

    The relationship between genome-wide FST (x axis) computed for each pair of the 26 populations and maximal allele frequency difference (y axis), first explored in ref. 19. Maximal allele frequency difference is defined as the largest frequency difference observed for any SNP between a population pair. The 26 populations are from the Human Genome Diversity Panel (HGDP). The labels highlight genes that harbour SNPs previously identified as having strong local adaptation. The grey points represent the observed relationship between population differentiation (FST) and maximal allele frequency difference; the more differentiated populations tend to have mutations with larger frequency differences. The star symbol and the yellow symbols represent outliers; these are populations that are not highly differentiated but where we find some mutations that have higher frequency differences than expected (light blue line).

  2. Figure 2: Haplotype pattern in a region defined by SNPs that are at high frequency in Tibetans and at low frequency in Han Chinese.

    Each column is a polymorphic genomic location (95 in total), each row is a phased haplotype (80 Han and 80 Tibetan haplotypes), and the coloured column on the left denotes the population identity of the individuals. Haplotypes of the Denisovan individual are shown in the top two rows (green). The black cells represent the presence of the derived allele and the grey space represents the presence of the ancestral allele (see Methods). The first and last columns correspond to the first and last positions in Supplementary Table 3, respectively. The red and blue arrows indicate the 32 sites in Supplementary Table 3. The blue arrows represent a five-SNP haplotype block defined by the first five SNPs in the 32.7-kb region. Asterisks indicate sites at which Tibetans share a derived allele with the Denisovan individual.

  3. Figure 3: A haplotype network based on the number of pairwise differences between the 40 most common haplotypes.

    The haplotypes were defined from all the SNPs present in the combined 1000 Genomes and Tibetan samples: 515 SNPs in total within the 32.7-kb EPAS1 region. The Denisovan haplotypes were added to the set of the common haplotypes. The R software package pegas (ref. 22) was used to generate the figure, using pairwise differences as distances. Each pie chart represents one unique haplotype, labelled with Roman numerals, and the radius of the pie chart is proportional to log2(number of chromosomes with that haplotype) plus a minimum size so that it is easier to see the Denisovan haplotype. The sections in the pie provide the breakdown of the haplotype representation amongst populations. The width of the edges is proportional to the number of pairwise differences between the joined haplotypes; the thinnest edge represents a difference of one mutation. The legend shows all the possible haplotypes among these populations. The numbers (1, 9, 35 and 40) next to an edge (the line connecting two haplotypes) in the bottom right are the number of pairwise differences between the corresponding haplotypes. We added an edge afterwards between the Tibetan haplotype XXXIII and its closest non-Denisovan haplotype (XXI) to indicate its divergence from the other modern human groups. Extended Data Fig. 5a contains all the pairwise differences between the haplotypes presented in this figure. ASW, African Americans from the south western United States; CEU, Utah residents with northern and western European ancestry; GBR, British; FIN, Finnish; JPT, Japanese; LWK, Luhya; CHS, southern Han Chinese; CHB, Han Chinese from Beijing; MXL, Mexican; PUR, Puerto Rican; CLM, Colombian; TSI, Toscani; YRI, Yoruban. Where there is only one line within a pie chart, this indicates that only one population contains the haplotype.

  4. Extended Data Fig. 1: FST calculated for each SNP between Tibetan and Han populations.

    Each dot represents the FST value for each SNP in EPAS1. The x axis is the physical position in the gene. Positions are based on the hg18 build of the human genome. The green box defines a 32.7-kb region where we observe the largest genetic differentiation between Han Chinese and Tibetans. The first and last positions of this 32.7-kb region correspond to the first and last position of the SNPs listed in Supplementary Table 3. For comparison, in ref. 4 the genome-wide FST between Han and Tibetans is 0.02. The site with the largest frequency difference (and therefore largest FST) is circled.

  5. Extended Data Fig. 2: Distribution of fixed differences.

    The left panel is the distribution of fixed differences between two haplotype groups under a scenario of selection on a de novo mutation (see Methods), and the right panel is the distribution under a scenario of selection on standing variation (see Methods) for a region of size ~32.7 kb. The initial frequency of the selected allele in the SSV model is 1%. Each row of panels corresponds to different selection strengths (2Ns) from 200 to 1,000. The red lines mark the number of fixed differences observed between the two haplotype classes in the real data for the given window size.

  6. Extended Data Fig. 3: Haplotype frequencies for Tibetans, our Han samples and the populations from the 1000 Genomes Project for the five-SNP motif in the EPAS1 region.

    The y axis is the haplotype frequency. The legend shows all the possible haplotypes for the region considered among these populations: ASW, African American from the south western United States; CEU, Utah Residents with Northern and Western European ancestry; CHB, Han Chinese from Beijing; CHS, Southern Han Chinese; CLM, Colombian; FIN, Finnish; GBR, British; HAN, Han Chinese from Beijing; IBS, Iberian; JPT, Japanese; MXL, Mexican; PUR, Puerto Rican; LWK, Luhya; TSI, Toscani; TIB, Tibetan; YRI, Yoruban (see Methods).

  7. Extended Data Fig. 4: Derived allele frequency of the SNPs with the largest frequency difference between Tibetans and the 1000 Genomes Project populations.

    At these SNPs, the frequency difference between Tibetans and the 1000 Genomes Project populations is 0.65 or larger. Positions 46571435, 46579689, 46584859 and 46600358 were not called as SNPs in the 1000 Genomes data, so we assume these positions were fixed for the human reference allele. Note that even though positions 46577251, 46588331, 46594122 and 46598025 appear to have a frequency of 0.0 for the populations in the 1000 Genomes data, the derived alleles at these SNPs are observed at very low frequency in at least one population (for example, CHB).

  8. Extended Data Fig. 5: Differences between haplotypes.

    a, The full matrix of pairwise differences between all the unique haplotypes in Fig. 3, for the 40 most common haplotypes identified in the 1000 Genomes and the Tibetan samples in the 32.7-kb region of EPAS1. The Denisovan haplotype (of frequency two) was added afterwards for comparison. The unique haplotypes are labelled with Roman numerals (here and in Fig. 3), and the Denisovan haplotype is the first column, haplotype I. Refer to Fig. 3 in the main text and the supplementary material for the representation of populations for each haplotype. b, Illustration of the genealogical structure in a model with gene flow from Denisovans to Tibet. Letters a–k are the labels for the branch lengths and are adjacent to their corresponding branches. The divergence between modern human haplotypes and the introgressed haplotype in Tibetans would be larger than the haplotypes in other modern human populations and the Denisovan haplotype (see Methods and Supplementary Information). TIB, CEU and YRI denote Tibetan, European and Yoruban populations. Note that the lengths i and k are unknown as we do not know when these populations went extinct.

  9. Extended Data Fig. 6: Other haplotype networks.

    a, A haplotype network based on the number of pairwise differences between 43 unique haplotypes defined from the 20 most differentiated SNPs between Tibetans and the 14 populations from the 1000 Genomes Project. The R software package pegas (ref. 22) was used to generate the figure. The haplotype distances are from pairwise differences. Each pie chart represents one unique haplotype and the size of the pie chart is proportional to log2(number of chromosomes with that haplotype). The sections in the pie provide the breakdown of the haplotypes amongst populations. The width of the edges is proportional to the number of pairwise differences between the joined haplotypes; the thinnest edge width represents a difference of one mutation. The number 57 next to a Tibetan haplotype is the number of Tibetan chromosomes with that haplotype. Similarly, the number 1,912 is the number of chromosomes (across several populations) with that haplotype. b, The number of pairwise differences between the Denisovan haplotype and the 43 unique haplotypes defined from the 20 most differentiated SNPs between Tibetans and the 14 populations from the 1000 Genomes Project (same haplotypes as in a). Each bar is a unique haplotype, and they are sorted in increasing order of pairwise differences. The colours within each bar represent the proportion of chromosomes with that haplotype broken down by populations. The numbers on top of each bar represent the total number of chromosomes within the 1000 Genomes data set and Tibetans that have the haplotype. Note this is the same data set used to create the haplotype network in panel a. Supplementary Tables 5 and 6 contain the 43 haplotypes and the frequencies within each of the populations.

  10. Extended Data Fig. 7: Number of pairwise differences.

    Red bars are the histograms of the number of pairwise differences between Denisovan and Tibetans. Blue bars are the histograms of the number of pairwise differences between Denisovan and GBR, CHS, FIN, PUR, CLM, IBS, CEU, YRI, CHB, JPT, LWK, ASW, MXL or TSI. All comparisons are within the 32.7-kb region of high differentiation (green box in Extended Data Fig. 1).

  11. Extended Data Fig. 8: Divergence distributions.

    Modern human–Denisovan divergence (see Methods) for intronic regions of size 32.7 kb is plotted in red. Modern human–modern human divergence for the same intronic regions is plotted in blue. At the EPAS1 32.7-kb region, in green, is plotted the Tibetan–Han divergence. The black arrow points to the number of nucleotide differences between the Denisovan and the most common Tibetan haplotype (0.0038). This value is significantly lower than what we observe between modern human–Denisovan (red curve, P = 0.0028).

  12. Extended Data Fig. 9: Null distributions of D for an assumed Tibet–Han divergence of 3,000 years.

    Each histogram corresponds to the D values obtained under null models without gene flow, and the red vertical bar corresponds to the D values observed in the real data. The observed D values are significant (P < 0.001) even when we assume Tibet–Han divergence of 5,000 or 10,000 years (see Methods and Supplementary Tables 8–10) (model abbreviations are given in the Supplementary Information; section on D statistics under models of no gene flow).

  13. Extended Data Fig. 10: S* statistics and PCA plot.

    a, A measure of introgression, S*, from ref. 23. Distributions are for 1,000 simulations under the four demographic models described in the Supplementary Information (section on D statistics under models of no gene flow). S* for the Tibetan individuals is shown as a vertical grey line. The empirical P values are 0.035, 0.028, 0.019 and 0.017 for the four models, respectively (top to bottom). b, Plot of the first and second principal components using all the CHS (100 individuals) and CHB (97 individuals) samples from the 1000 Genomes Project and the 77 Tibetan individuals from ref. 45 (see Methods). The black circle and the black triangle represent the single CHB and CHS individuals carrying the five-SNP Tibetan–Denisovan haplotype (Extended Data Fig. 3). All chromosome 2 SNPs in the intersection between the 1000 Genomes populations and the 77 Tibetan individuals were used for this analysis.
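
Two quantities recur in the figure legends above: the maximal allele frequency difference between a pair of populations (Fig. 1) and the number of pairwise differences between haplotypes (Fig. 3 and Extended Data Figs 5–7). The Python sketch below illustrates both definitions on toy data. It is not the authors' code (the legends state that the haplotype networks were drawn with the R package pegas); the function names, the 0/1 string encoding of ancestral and derived alleles, and all values are assumptions made for illustration only.

```python
from itertools import combinations


def max_allele_freq_diff(freqs_a, freqs_b):
    """Largest absolute allele-frequency difference over SNPs typed in both populations."""
    if len(freqs_a) != len(freqs_b):
        raise ValueError("frequency lists must cover the same SNPs")
    return max(abs(a - b) for a, b in zip(freqs_a, freqs_b))


def pairwise_differences(hap_a, hap_b):
    """Number of sites at which two equal-length haplotypes carry different alleles."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must have the same length")
    return sum(x != y for x, y in zip(hap_a, hap_b))


def difference_matrix(haplotypes):
    """Pairwise-difference count for every unordered pair of named haplotypes."""
    return {
        (name_a, name_b): pairwise_differences(haplotypes[name_a], haplotypes[name_b])
        for name_a, name_b in combinations(sorted(haplotypes), 2)
    }


# Toy haplotypes, invented for illustration (0 = ancestral allele, 1 = derived allele).
toy_haplotypes = {
    "hap_A": "0010110",
    "hap_B": "0010010",
    "hap_C": "1101001",
}
toy_matrix = difference_matrix(toy_haplotypes)

# Toy per-SNP allele frequencies for two hypothetical populations.
toy_max_diff = max_allele_freq_diff([0.10, 0.87, 0.50], [0.12, 0.09, 0.45])
```

A matrix of this kind is the sort of distance input a haplotype network is built from; in this toy set, a haplotype that differs from all the others at most sites would sit on a long edge, as the Denisovan-like haplotype does in Fig. 3.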

Accession codes

Primary accessions

Sequence Read Archive

Change history

Corrected online 13 August 2014
The affiliations list has been updated to correct the address of author Kui Li.

References

  1. Moore, L. G., Young, D., McCullough, R. E., Droma, T. & Zamudio, S. Tibetan protection from intrauterine growth restriction (IUGR) and reproductive loss at high altitude. Am. J. Hum. Biol. 13, 635–644 (2001)
  2. Niermeyer, S. et al. Child health and living at high altitude. Arch. Dis. Child. 94, 806–811 (2009)
  3. Wu, T. et al. Hemoglobin levels in Qinghai-Tibet: different effects of gender for Tibetans vs. Han. J. Appl. Physiol. 98, 598–604 (2005)
  4. Yi, X. et al. Sequencing of 50 human exomes reveals adaptation to high altitude. Science 329, 75–78 (2010)
  5. Bigham, A. et al. Identifying signature of natural selection in Tibetan and Andean populations using dense genome scan data. PLoS Genet. 6, e1001116 (2010)
  6. Simonson, T. S. et al. Genetic evidence for high-altitude adaptation in Tibet. Science 329, 72–75 (2010)
  7. Beall, C. M. et al. Natural selection on EPAS1 (HIF2α) associated with low hemoglobin concentration in Tibetan highlanders. Proc. Natl Acad. Sci. USA 107, 11459–11464 (2010)
  8. Peng, Y. et al. Genetic variations in Tibetan populations and high-altitude adaptation at the Himalayas. Mol. Biol. Evol. 28, 1075–1081 (2011)
  9. Xu, S. et al. A genome-wide search for signals of high-altitude adaptation in Tibetans. Mol. Biol. Evol. 28, 1003–1011 (2011)
  10. Wang, B. et al. On the origin of Tibetans and their genetic basis in adapting high-altitude environments. PLoS ONE 6, e17002 (2011)
  11. Moore, L. G. et al. Maternal adaptation to high-altitude pregnancy: an experiment of nature—a review. Placenta 25, S60–S71 (2004)
  12. Vargas, E. & Spielvogel, H. Chronic mountain sickness, optimal hemoglobin, and heart disease. High Alt. Med. Biol. 7, 138–149 (2006)
  13. Yip, R. Significance of an abnormally low or high hemoglobin concentration during pregnancy: special consideration of iron nutrition. Am. J. Clin. Nutr. 72, 272S–279S (2000)
  14. Meyer, M. et al. A high-coverage genome sequence from an archaic Denisovan individual. Science 338, 222–226 (2012)
  15. Li, J. Z. et al. Worldwide human relationships inferred from genome-wide patterns of variation. Science 319, 1100–1104 (2008)
  16. Rosenberg, N. A. Standardized subsets of the HGDP-CEPH Human Genome Diversity Cell Line Panel, accounting for atypical and duplicated samples and pairs of close relatives. Ann. Hum. Genet. 70, 841–847 (2006)
  17. Soejima, M. & Koda, Y. Population differences of two coding SNPs in pigmentation-related genes SLC24A5 and SLC45A2. Int. J. Legal Med. 121, 36–39 (2007)
  18. Sulem, P. et al. Genetic determinants of hair, eye and skin pigmentation in Europeans. Nature Genet. 39, 1443–1452 (2007)
  19. Coop, G. et al. The role of geography in human adaptation. PLoS Genet. 5, e1000500 (2009)
  20. Pickrell, J. K. et al. Signals of recent positive selection in a worldwide sample of human populations. Genome Res. 19, 826–837 (2009)
  21. The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012)
  22. Paradis, E. Pegas: an R package for population genetics with an integrated–modular approach. Bioinformatics 26, 419–420 (2010)
  23. Vernot, B. & Akey, J. Resurrecting surviving Neandertal lineages from modern human genomes. Science (2014)
  24. Plagnol, V. & Wall, J. D. Possible ancestral structure in human populations. PLoS Genet. 2, e105 (2006)
  25. Reich, D. et al. Genetic history of an archaic hominin group from Denisova cave in Siberia. Nature 468, 1053–1060 (2010)
  26. Prüfer, K. et al. The complete genome sequence of a Neanderthal from the Altai Mountains. Nature 505, 43–49 (2014)
  27. Skoglund, P. & Jakobsson, M. Archaic human ancestry in East Asia. Proc. Natl Acad. Sci. USA 108, 18301–18306 (2011)
  28. Abi-Rached, L. et al. The shaping of modern human immune systems by multiregional admixture with archaic humans. Science 334, 89–94 (2011)
  29. Mendez, F. L., Watkins, J. C. & Hammer, M. F. A haplotype at STAT2 introgressed from Neanderthals and serves as a candidate of positive selection in Papua New Guinea. Am. J. Hum. Genet. 91, 265–274 (2012)
  30. Sankararaman, S. et al. The genomic landscape of Neanderthal ancestry in present-day humans. Nature (2014)
  31. Li, R., Li, Y., Kristiansen, K. & Wang, J. SOAP: short oligonucleotide alignment program. Bioinformatics 24, 713–714 (2008)
  32. Li, R. et al. SNP detection for massively parallel whole-genome resequencing. Genome Res. 19, 1124–1132 (2009)
  33. Browning, B. L. & Browning, S. R. A fast, powerful method for detecting identity by descent. Am. J. Hum. Genet. 88, 173–182 (2011)
  34. Coop, G. et al. The role of geography in human adaptation. PLoS Genet. 5, e1000500 (2009)
  35. Reynolds, J., Weir, B. S. & Cockerham, C. C. Estimation of the coancestry coefficient: basis for a short-term genetic distance. Genetics 105, 767–779 (1983)
  36. R Development Core Team R: A language and environment for statistical computing http://www.R-project.org/ (R Foundation for Statistical Computing, 2011)
  37. Ewing, G. & Hermisson, J. MSMS: a coalescent simulation program including recombination, demographic structure, and selection at a single locus. Bioinformatics 26, 2064–2065 (2010)
  38. Myers, S. et al. A fine-scale map of recombination rates and hotspots across the human genome. Science 310, 321–324 (2005)
  39. Hinch, A. G. et al. The landscape of recombination in African Americans. Nature 476, 170–175 (2011)
  40. Scally, A. & Durbin, R. Revising the human mutation rate: implications for understanding human evolution. Nature Rev. Genet. 13, 745–753 (2012)
  41. Teshima, K. M. & Innan, H. mbs: modifying Hudson’s ms software to generate samples of DNA sequences with a biallelic site under selection. BMC Bioinformatics 10, 166 (2009)
  42. Hudson, R. R. Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics 18, 337–338 (2002)
  43. Sankararaman, S. et al. The date of interbreeding between Neandertals and modern humans. PLoS Genet. 8, e1002947 (2012)
  44. Durand, E. Y. et al. Testing for ancient admixture between closely related populations. Mol. Biol. Evol. 28, 2239–2252 (2011)
  45. Simonson, T. S. et al. Genetic evidence for high-altitude adaptation in Tibet. Science 329, 72–75 (2010)

Author information

  1. These authors contributed equally to this work.

    • Emilia Huerta-Sánchez,
    • Xin Jin,
    • Asan &
    • Zhuoma Bianba

Affiliations

  1. BGI-Shenzhen, Shenzhen 518083, China

    • Emilia Huerta-Sánchez,
    • Xin Jin,
    • Asan,
    • Yu Liang,
    • Xin Yi,
    • Mingze He,
    • Peixiang Ni,
    • Bo Wang,
    • Xiaohua Ou,
    • Huasang,
    • Jiangbai Luosang,
    • Ye Yin,
    • Wei Wang,
    • Xiuqing Zhang,
    • Xun Xu,
    • Huanming Yang,
    • Yingrui Li,
    • Jian Wang,
    • Jun Wang &
    • Rasmus Nielsen
  2. Department of Integrative Biology, University of California, Berkeley, California 94720, USA

    • Emilia Huerta-Sánchez,
    • Benjamin M. Peter,
    • Nicolas Vinckenbosch &
    • Rasmus Nielsen
  3. School of Natural Sciences, University of California, Merced, California 95343, USA

    • Emilia Huerta-Sánchez
  4. School of Bioscience and Bioengineering, South China University of Technology, Guangzhou 510006, China

    • Xin Jin
  5. Binhai Genomics Institute, BGI-Tianjin, Tianjin 300308, China

    • Asan,
    • Yu Liang &
    • Xin Yi
  6. Tianjin Translational Genomics Center, BGI-Tianjin, Tianjin 300308, China

    • Asan,
    • Yu Liang &
    • Xin Yi
  7. The People’s Hospital of Lhasa, Lhasa 850000, China

    • Zhuoma Bianba
  8. Bioinformatics and Computational Biology Program, Iowa State University, Ames, Iowa 50011, USA

    • Mingze He
  9. Department of Biological Sciences, Middle East Technical University, 06800 Ankara, Turkey

    • Mehmet Somel
  10. The Second People’s Hospital of Tibet Autonomous Region, Lhasa 850000, China

    • Zha Xi Ping Cuo
  11. The People’s Hospital of the Tibet Autonomous Region, Lhasa 850000, China

    • Kui Li
  12. The hospital of XiShuangBanNa Dai Nationalities, Autonomous Jinghong, 666100 Yunnan, China

    • Guoyi Gao
  13. The Guangdong Enterprise Key Laboratory of Human Disease Genomics, BGI-Shenzhen, 518083 Shenzhen, China

    • Xiuqing Zhang
  14. Shenzhen Key Laboratory of Transomics Biotechnologies, BGI-Shenzhen, 518083 Shenzhen, China

    • Xiuqing Zhang
  15. Princess Al Jawhara Center of Excellence in the Research of Hereditary Disorders, King Abdulaziz University, Jeddah 21589, Saudi Arabia

    • Huanming Yang &
    • Jun Wang
  16. James D. Watson Institute of Genome Science, 310008 Hangzhou, China

    • Huanming Yang &
    • Jian Wang
  17. Department of Biology, University of Copenhagen, Ole Maaløes Vej 5, 2200 Copenhagen, Denmark

    • Jun Wang
  18. Macau University of Science and Technology, Avenida Wai Long, Taipa, Macau 999078, China

    • Jun Wang
  19. Department of Medicine, University of Hong Kong 999077, Hong Kong

    • Jun Wang
  20. Department of Statistics, University of California, Berkeley, California 94720, USA

    • Rasmus Nielsen
  21. Department of Biology, University of Copenhagen, 2200 Copenhagen, Denmark

    • Rasmus Nielsen

Contributions

R.N., Ji.W. and Ju.W. supervised the project. X.J., A., Z.B., Y.L., X.Y., M.H., P.N., B.W., X.O., H., J.L., Z.X.P.C., K.L., G.G., Y.Y., W.W., X.Z., X.X., H.Y., Y.L., Ji.W. and Ju.W. collected and generated the data, and performed the preliminary bioinformatic analyses to call SNPs and indels from the raw data. E.H.-S. and N.V. filtered the data and B.M.P. phased the data. E.H.-S. performed the majority of the population genetic analysis with some contributions from B.M.P. and M.S. E.H.-S. and R.N. wrote the manuscript with critical input from all the authors.

Competing financial interests

The authors declare no competing financial interests.

Corresponding authors

Correspondence to:

Sequence data have been deposited in the Sequence Read Archive under accession number SRP041218.

Supplementary information

PDF files

  1. Supplementary Information (343 KB)

    This file contains Supplementary Text, Supplementary References and Supplementary Tables 1-11.
