  • Review Article
Genetic variation and the de novo assembly of human genomes

Key Points

  • Complete de novo assembly of a genome is guaranteed to allow assessment of the full range of genetic variation, although the only mammalian genome assemblies completed to date are for human and mouse. Assemblies using massively parallel sequencing (MPS) have increased the diversity of draft genomes that are available but do not completely resolve genomes.

  • When designing a de novo assembly project, the most-suitable assembly approach to use differs depending on the characteristics of the sequencing reads. MPS methods have relied on de Bruijn graphs, whereas single-molecule sequencing (SMS) reads require pairwise overlaps encoded in overlap or string graphs.

  • A component of 'missing heritability' is missed sequence variation. Approximately 5–40 Mb of sequence is absent from any given human reference genome owing to structural polymorphism, and standard resequencing has failed to detect the mutations underlying diseases such as medullary cystic kidney disease type 1, amyotrophic lateral sclerosis and facioscapulohumeral muscular dystrophy.

  • Single-molecule long-read sequencing is currently driving gains in genome assembly accuracy and completeness, but new technologies are being developed to generate long-range information, such as optical maps and dilution pool sequencing, that may aid in scaffolding complex regions.
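The contrast between the two graph models named in the key points can be sketched in a few lines of Python. This is a minimal illustration, not any published assembler: the function names, the toy reads and the choice of k are all assumptions made for the example, and the overlap check uses exact matching, whereas real single-molecule reads would need error-tolerant alignment.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Build a de Bruijn graph: each k-mer links its (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def suffix_prefix_overlap(a, b, min_len):
    """Return the length of the longest exact suffix of `a` that is a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    Exact matching is a simplification assumed for this sketch.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

# Toy reads (hypothetical data)
reads = ["ACGTAC", "GTACGG", "ACGGTT"]

# Short-read style: decompose reads into k-mers and connect them
graph = de_bruijn_edges(reads, k=4)

# Long-read style: compare reads pairwise and record their overlap
overlap = suffix_prefix_overlap(reads[0], reads[1], min_len=3)
```

In the de Bruijn sketch the reads themselves are discarded once their k-mers are recorded, whereas the overlap sketch keeps whole reads as nodes and stores only the pairwise overlap lengths, which is the information an overlap or string graph encodes.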

Abstract

The discovery of genetic variation and the assembly of genome sequences are both inextricably linked to advances in DNA-sequencing technology. Short-read massively parallel sequencing has revolutionized our ability to discover genetic variation but is insufficient to generate high-quality genome assemblies or resolve most structural variation. Full resolution of variation is only guaranteed by complete de novo assembly of a genome. Here, we review approaches to genome assembly, the nature of gaps or missing sequences, and biases in the assembly process. We describe the challenges of generating a complete de novo genome assembly using current technologies and the impact that being able to perfectly sequence the genome would have on understanding human disease and evolution. Finally, we summarize recent technological advances that improve both contiguity and accuracy and emphasize the importance of complete de novo assembly as opposed to read mapping as the primary means to understanding the full range of human genetic variation.

Figure 1: Types of genome assembly gaps.
Figure 2: Sequencing and assembly statistics from different platforms.
Figure 3: Genome assembly algorithms.
Figure 4: Assembly of complex regions of human genetic variation.
Figure 5: Human genetic variation detected with local assembly of single molecules.

Acknowledgements

The authors thank T. Brown for assistance in editing this manuscript. This work was supported, in part, by a US National Institutes of Health grant (2R01HG002385) to E.E.E. E.E.E. is an investigator of the Howard Hughes Medical Institute.

Author information

Corresponding authors

Correspondence to Mark J. P. Chaisson or Evan E. Eichler.

Ethics declarations

Competing interests

E.E.E. is on the scientific advisory board of DNAnexus, Inc. and is a consultant for Kunming University of Science and Technology (KUST), as part of the 1,000 China Talent Program. M.J.P.C. is a former employee and shareholder of Pacific Biosciences. R.K.W. declares no competing interests.

Glossary

Resequencing

Characterizing a sample genome and its associated variation by mapping and aligning sequence reads to a reference genome sequence.

Massively parallel sequencing

(MPS). A general term for a form of DNA sequencing that measures trace signals from millions to hundreds of millions of amplified sequences at once, most frequently referring to sequencing produced by Illumina, Life Technologies and Complete Genomics platforms. Often referred to as next-generation or second-generation sequencing to distinguish it from long-read sequencing approaches (for example, single-molecule sequencing), which are sometimes referred to as third-generation sequencing.

Structural variation

Large insertion, deletion or inversion differences between homologous chromosomes, or translocation differences involving non-homologous chromosomes. Operationally defined as events >50 bp in size to distinguish from smaller insertion and deletion events.

Coverage bias

Regions with an excess or deficiency in the number of sequence reads originating as a result of platform differences in sequence chemistry, amplification or cloning.

Phase

The assignment of genetic variants or alleles to one of two homologous chromosomes.

De novo assembly

The action of constructing the sequence of a genome from overlapping DNA sequences without guidance from a reference genome.

Haplotypes

Sets of genetic variants or alleles found on the same chromosome that are inherited together until disrupted by recombination.

Whole-genome shotgun sequencing and assembly

(WGSA). The reconstruction of a genome from reads redundantly sampled at random, often with the aid of paired-end sequencing.

Contigs

Continuous (or 'contiguous') sequences produced in a de novo assembly, free of any gaps.

Scaffolds

Sets of ordered and oriented contigs, with the approximate distances between contigs estimated by traversing paired-end sequences that anchor to different contigs. Scaffolds consist of both sequence contigs and gaps.
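As a rough illustration of how contigs and gaps combine into a scaffold, the sketch below joins contigs with runs of 'N' whose lengths stand in for the estimated gap sizes. Representing gaps as N-runs is a common convention assumed here; the function names and toy sequences are hypothetical.

```python
import re

def build_scaffold(contigs, gap_sizes):
    """Join ordered, oriented contigs with runs of 'N' of the estimated gap sizes."""
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of adjacent contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append("N" * gap)  # gap of unknown sequence, estimated length
        parts.append(contig)
    return "".join(parts)

def split_contigs(scaffold):
    """Recover the gap-free contigs by splitting the scaffold at runs of 'N'."""
    return [piece for piece in re.split(r"N+", scaffold) if piece]

# Toy example (hypothetical data): three contigs separated by two gaps
scaffold = build_scaffold(["ACGT", "GG", "TTA"], [3, 2])
recovered = split_contigs(scaffold)
```

The scaffold is longer than the sum of its contigs because it also counts the gap estimates.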

Bacterial artificial chromosomes

(BACs). Vectors with an F-plasmid origin of replication used to clonally propagate an organism's DNA (typically 150–250 kb) by transformation into Escherichia coli.

Single-molecule sequencing

(SMS). A form of DNA sequencing in which signals are derived from single molecules, frequently referring to sequencing produced by Pacific Biosciences and Oxford Nanopore Technologies platforms.

Paired-end

Two reads sequenced from opposite ends of the same fragment.

N50 length

A statistic in genomics defined as the shortest contig at which half the total length of the assembly is made of contigs of that length or greater. It is commonly used as a metric to summarize the contiguity of an assembly.
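The definition above translates directly into a short calculation. The following is a minimal sketch; the function name and the example contig lengths are assumptions made for illustration.

```python
def n50(contig_lengths):
    """Return the N50: the length of the shortest contig in the smallest set of
    longest contigs that together cover at least half of the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:  # reached at least half of the total length
            return length
    return 0  # empty assembly

# Toy assembly (hypothetical contig lengths, in kb)
example = n50([80, 70, 50, 40, 30, 20])
```

For these lengths the total is 290, and the two longest contigs (80 + 70 = 150) already cover more than half, so the N50 is 70.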

Fragment library

A set of DNA fragments of approximately the same length that are paired-end sequenced.

Segmental duplication

When a sequence is represented two or more times in a genome with high sequence identity and did not arise by retrotransposition. Often defined as paralogous sequences that share ≥90% sequence identity and are ≥1 kb in length.

Short tandem repeats

(STRs). Tandem repeats in which the individual unit of repetition is less than 10 bp long and varies in length between different individuals in a population.

Variable number of tandem repeats

(VNTR). Any tandem array of repeated sequence motifs that is highly variable among individuals of a population. Historically, the term referred to tandem repeats that varied on the scale of thousands of base pairs over the length of the array.

Centromeric

Referring to the primary cytogenetic constriction on metaphase chromosomes where the kinetochore forms and spindle fibre attaches during cell division. In humans the centromere is made up primarily of repetitions of higher-order alpha-satellite DNA.

Heterochromatic DNA

Portions of chromosomes that stain densely, are typically gene poor and are rich in satellite sequences.

Acrocentric

Relating to a type of chromosome in which the centromere maps very close to the short arm. Acrocentric chromosomes in humans are enriched in beta-satellite and ribosomal DNA sequences, which are repeated as hundreds of copies.

Secondary constrictions

A cytogenetic term referring to metaphase chromosome constrictions outside the centromere, typically rich in satellites and used to help identify chromosomes.

Satellite DNA

Highly repetitive DNA composed of thousands to tens of thousands of tandem repeats, each usually 100–300 bp in length, and frequently associated with heterochromatin.

Muted gaps

Regions that have been incorrectly closed in a genome assembly despite additional sequences being present at these sites in the source genome.

Coalescence

The genealogy of a region of the genome in which all alleles trace back to a common ancestral sequence.

Missing heritability

The observation that only a portion of estimated genetic contribution to disease (for example, heritability of a trait from twin studies) can be explained by our current understanding of genetic variation and its transmission properties.

Exome sequencing

A method for enrichment and targeted sequencing of the protein-coding portions of the genome using massively parallel sequencing.

Cite this article

Chaisson, M., Wilson, R. & Eichler, E. Genetic variation and the de novo assembly of human genomes. Nat Rev Genet 16, 627–640 (2015). https://doi.org/10.1038/nrg3933
