  • Analysis

Building the sequence map of the human pan-genome

Abstract

Here we integrate the de novo assembly of an Asian and an African genome with the NCBI reference human genome, as a step toward constructing the human pan-genome. We identified 5 Mb of novel sequences not present in the reference genome in each of these assemblies. Most novel sequences are individual or population specific, as revealed by their comparison to all available human DNA sequence and by PCR validation using the human genome diversity cell line panel. We found novel sequences present in patterns consistent with known human migration paths. Cross-species conservation analysis of predicted genes indicated that the novel sequences contain potentially functional coding regions. We estimate that a complete human pan-genome would contain 19–40 Mb of novel sequence not present in the extant reference genome. The extensive amount of novel sequence contributing to the genetic variation of the pan-genome indicates the importance of using complete genome sequencing and de novo assembly.
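The pan-genome estimate above depends on how much previously unseen sequence each additional genome contributes. The following is a minimal sketch of that bookkeeping, not the authors' method, which this page does not describe. It assumes novel sequences have already been clustered across genomes so that the same sequence carries the same identifier in each individual. The function name, the identifiers and the lengths are hypothetical toy values, not data from the study.

```python
from typing import Dict, List, Set


def cumulative_novel_length(genomes: List[Dict[str, int]]) -> List[int]:
    """Return the running total of novel sequence (in bp) as genomes are added.

    Each genome is a mapping from a novel-sequence identifier to its length.
    Assumption: identifiers are shared across genomes, so a sequence already
    seen in an earlier genome is counted only once.
    """
    seen: Set[str] = set()
    total = 0
    curve: List[int] = []
    for genome in genomes:
        for seq_id, length in genome.items():
            if seq_id not in seen:
                seen.add(seq_id)
                total += length
        curve.append(total)
    return curve


# Hypothetical toy input: three genomes with partly overlapping novel sequences.
toy = [
    {"n1": 1200, "n2": 800},
    {"n2": 800, "n3": 500},
    {"n1": 1200, "n3": 500, "n4": 300},
]
print(cumulative_novel_length(toy))  # [2000, 2500, 2800]
```

In this toy input each added genome contributes less new sequence than the one before it, which is the kind of flattening curve that would let a total be extrapolated.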

Figure 1: Population-specific patterns in novel sequences.
Figure 2: Examples of novel sequences with variant frequencies across populations.
Figure 3: Cumulative length of individual-specific sequences resulting from sequentially adding genomes to the pan-genome.
Figure 4: Distribution of sequence identity (in percentage) calculated from multiple alignments between human, chimpanzee, macaque and mouse genomes.

Acknowledgements

This project is supported by the Chinese Academy of Sciences (GJHZ0701-6), the National Natural Science Foundation of China (30725008; 30890032), the Shenzhen local government, the Danish Platform for Integrative Biology and the Ole Rømer grant from the Danish Natural Science Research Council. L. Goodman edited the manuscript. J. Sun, M. Zhao, Y. Liu, Y. Zheng and H. Wang helped design the primers. W. Jin helped with experimental validation. San A, J. Wang, Y. Huang, M. Jian, M. Chen, Y. Huang, Xiaoli Ren, H. Liang, H. Zheng and S. Lin helped with data production.

Author information

Contributions

Ruiq. L., Y.L., Ha. Z. and Ruib. L. contributed equally to this work. H.Y., Ju. W. and Ji. W. managed the project. Ju. W., Ruiq. L., L.B. and Y.L. designed the analyses. Ju. W., Ruiq. L., Y.L., Ha. Z., Ruib. L., Ho. Z., Q.L., W.Q., G.Z., H.W., J.Q., X.J., D.L., Hon. C., S.L. and K.K. performed the data analyses. H.B. and How. C. contributed the DNA samples. Y.R., X.H. and Xu. Z. performed PCR validation. G.T., J.L. and Xi. Z. performed sequencing. Ju. W., Ruiq. L., Y.L. and Ruib. L. wrote the paper.

Corresponding authors

Correspondence to Jun Wang or Jian Wang.

Cite this article

Li, R., Li, Y., Zheng, H. et al. Building the sequence map of the human pan-genome. Nat Biotechnol 28, 57–63 (2010). https://doi.org/10.1038/nbt.1596

  • DOI: https://doi.org/10.1038/nbt.1596
