Review Article

Genetic variation and the de novo assembly of human genomes

Nature Reviews Genetics volume 16, pages 627–640 (2015)

Abstract

The discovery of genetic variation and the assembly of genome sequences are both inextricably linked to advances in DNA-sequencing technology. Short-read massively parallel sequencing has revolutionized our ability to discover genetic variation but is insufficient to generate high-quality genome assemblies or resolve most structural variation. Full resolution of variation is only guaranteed by complete de novo assembly of a genome. Here, we review approaches to genome assembly, the nature of gaps or missing sequences, and biases in the assembly process. We describe the challenges of generating a complete de novo genome assembly using current technologies and the impact that being able to perfectly sequence the genome would have on understanding human disease and evolution. Finally, we summarize recent technological advances that improve both contiguity and accuracy and emphasize the importance of complete de novo assembly as opposed to read mapping as the primary means to understanding the full range of human genetic variation.

Key points

  • Complete de novo assembly of a genome is guaranteed to allow assessment of the full range of genetic variation, although the only mammalian genome assemblies completed to date are for human and mouse. Assemblies using massively parallel sequencing (MPS) have increased the diversity of draft genomes that are available but do not completely resolve genomes.

  • When designing a de novo assembly project, the most-suitable assembly approach to use differs depending on the characteristics of the sequencing reads. MPS methods have relied on de Bruijn graphs, whereas single-molecule sequencing (SMS) reads require pairwise overlaps encoded in overlap or string graphs.

  • A component of 'missing heritability' is missed sequence variation. Approximately 5–40 Mb of sequence is absent from any given human reference genome owing to structural polymorphism, and standard resequencing has failed to detect the mutations underlying diseases such as medullary cystic kidney disease type 1, amyotrophic lateral sclerosis and facioscapulohumeral muscular dystrophy.

  • Single-molecule long-read sequencing is currently driving gains in genome assembly accuracy and completeness, but new technologies are being developed to generate long-range information, such as optical maps and dilution pool sequencing, that may aid in scaffolding complex regions.
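
To make the contrast between the two graph formulations in the key points concrete, the following is a minimal sketch rather than a description of any published assembler. The function names `de_bruijn_graph` and `pairwise_overlap` are hypothetical, and exact string matching is an assumption made for brevity: real SMS reads contain sequencing errors, so practical overlap detection relies on error-tolerant alignment.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph from reads.

    Nodes are (k-1)-mers; each distinct k-mer contributes one edge from
    its (k-1)-base prefix to its (k-1)-base suffix.
    """
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].add(kmer[1:])
    return edges


def pairwise_overlap(a, b, min_len):
    """Return the length of the longest suffix of `a` that exactly matches
    a prefix of `b`, or 0 if no such overlap of at least `min_len` exists.

    Assumes min_len >= 1. An overlap or string graph would store one edge
    per read pair with a non-zero result.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACGG"]
    graph = de_bruijn_graph(reads, k=4)
    for node in sorted(graph):
        print(node, "->", sorted(graph[node]))
    print("overlap:", pairwise_overlap(reads[0], reads[1], min_len=3))
```

The de Bruijn construction never compares reads with each other, which suits very large numbers of short reads, whereas the overlap computation keeps each read intact and so preserves the long-range information that long reads carry.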



Acknowledgements

The authors thank T. Brown for assistance in editing this manuscript. This work was supported, in part, by a US National Institutes of Health grant (2R01HG002385) to E.E.E. E.E.E. is an investigator of the Howard Hughes Medical Institute.

Author information

Affiliations

  1. Department of Genome Sciences, University of Washington, Foege Building S-413A, Box 355065, 3720 15th Ave NE, Seattle, Washington 98195, USA.

    • Mark J. P. Chaisson
    • Evan E. Eichler
  2. McDonnell Genome Institute, Department of Medicine, Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63108, USA.

    • Richard K. Wilson
  3. Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA.

    • Evan E. Eichler

Competing interests

E.E.E. is on the scientific advisory board of DNAnexus, Inc. and is a consultant for Kunming University of Science and Technology (KUST), as part of the 1,000 China Talent Program. M.J.P.C. is a former employee and shareholder of Pacific Biosciences. R.K.W. declares no competing interests.

Corresponding authors

Correspondence to Mark J. P. Chaisson or Evan E. Eichler.

Glossary

Resequencing

Characterizing a sample genome and its associated variation by mapping and aligning sequence reads to a reference genome sequence.

Massively parallel sequencing

(MPS). A general term for a form of DNA sequencing that measures trace signals from millions to hundreds of millions of amplified sequences at once, most frequently referring to sequencing produced by Illumina, Life Technologies and Complete Genomics platforms. Often referred to as next-generation or second-generation sequencing to distinguish it from long-read sequencing approaches (for example, single-molecule sequencing), which are sometimes referred to as third-generation sequencing.

Structural variation

Large insertion, deletion or inversion differences between homologous chromosomes, or translocation differences involving non-homologous chromosomes. Operationally defined as events >50 bp in size to distinguish from smaller insertion and deletion events.
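
The operational size cutoff in this definition can be illustrated with a short sketch. The function name `classify_event` and the constant `SV_MIN_SIZE_BP` are hypothetical, and the sketch assumes events are given as reference and alternate allele strings; it handles only insertions and deletions, because inversions and translocations cannot be recognized from allele lengths alone.

```python
SV_MIN_SIZE_BP = 50  # operational cutoff from the definition: events > 50 bp


def classify_event(ref_allele, alt_allele):
    """Classify an insertion or deletion by its size in base pairs.

    Returns a (kind, scale, size) tuple. Alleles of equal length are
    reported as substitutions; this sketch does not detect inversions.
    """
    size = abs(len(alt_allele) - len(ref_allele))
    if size == 0:
        return ("substitution", "not an indel", 0)
    kind = "insertion" if len(alt_allele) > len(ref_allele) else "deletion"
    scale = "structural variant" if size > SV_MIN_SIZE_BP else "small indel"
    return (kind, scale, size)


if __name__ == "__main__":
    print(classify_event("A", "A" + "T" * 60))  # 60 bp insertion
    print(classify_event("A" + "C" * 50, "A"))  # 50 bp deletion, at the cutoff
    print(classify_event("ACG", "A"))           # 2 bp deletion
```

Because the definition uses a strict inequality, an event of exactly 50 bp is treated here as a small indel.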

Coverage bias

An excess or deficiency in the number of sequence reads in particular regions, arising from platform differences in sequencing chemistry, amplification or cloning.

Phase

The assignment of genetic variants or alleles to one of two homologous chromosomes.

De novo assembly

The action of constructing the sequence of a genome from overlapping DNA sequences without guidance from a reference genome.

Haplotypes

Sets of genetic variants or alleles found on the same chromosome that are inherited together until disrupted by recombination.

Whole-genome shotgun sequencing and assembly

(WGSA). The reconstruction of a genome from reads redundantly sampled at random, often with the aid of paired-end sequencing.

Contigs

Continuous (or 'contiguous') sequences produced in a de novo assembly, free of any gaps.

Scaffolds

Sets of ordered and oriented contigs, with the approximate distances between contigs estimated by traversing paired-end sequences that anchor to different contigs. Scaffolds consist of both sequence contigs and gaps.

Bacterial artificial chromosomes

(BACs). Vectors with an F-plasmid origin of replication used to clonally propagate an organism's DNA (typically 150–250 kb) by transfection into Escherichia coli.

Single-molecule sequencing

(SMS). A form of DNA sequencing in which signals are derived from single molecules, frequently referring to sequencing produced by Pacific Biosciences and Oxford Nanopore Technologies platforms.

Paired-end

Two reads sequenced from opposite ends of the same fragment.

N50 length

A statistic in genomics defined as the length of the shortest contig such that contigs of that length or greater make up at least half the total length of the assembly. It is commonly used as a metric to summarize the contiguity of an assembly.
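
As a worked illustration of this definition, the following minimal sketch computes N50 from a list of contig lengths. The function name `n50` and the example lengths are hypothetical and are not taken from any assembly discussed in the article.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    Contigs are sorted from longest to shortest and their lengths are
    accumulated; the N50 is the length of the contig at which the running
    total first reaches half of the total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("N50 is undefined for an assembly with no contigs")


if __name__ == "__main__":
    # Total length is 290, so half is 145; 80 + 70 = 150 reaches it.
    print(n50([80, 70, 50, 40, 30, 20]))
```

For the example lengths the total is 290, and the two longest contigs together reach 150, so the N50 is 70. The same function applies to scaffold lengths when a scaffold N50 is wanted.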

Fragment library

A set of DNA fragments of approximately the same length that are paired-end sequenced.

Segmental duplication

A sequence that is represented two or more times in a genome with high sequence identity and that did not arise by retrotransposition. Often defined as paralogous sequences that share ≥90% sequence identity and are ≥1 kb in length.

Short tandem repeats

(STRs). Tandem repeats in which the individual unit of repetition is less than 10 bp long and varies in length between different individuals in a population.

Variable number of tandem repeats

(VNTR). Any tandem array of repeated sequence motifs that is highly variable among different individuals of a population. The term was originally used in reference to tandem repeats that varied on the scale of thousands of base pairs over the length of the array.

Centromeric

Referring to the primary cytogenetic constriction on metaphase chromosomes where the kinetochore forms and spindle fibre attaches during cell division. In humans the centromere is made up primarily of repetitions of higher-order alpha-satellite DNA.

Heterochromatic DNA

Portions of chromosomes that stain densely, are typically gene poor and are rich in satellite sequences.

Acrocentric

Relating to a type of chromosome in which the centromere maps very close to the short arm. Acrocentric chromosomes in humans are enriched in beta-satellite and ribosomal DNA sequences, which are repeated as hundreds of copies.

Secondary constrictions

A cytogenetic term referring to metaphase chromosome constrictions outside the centromere, typically rich in satellites and used to help identify chromosomes.

Satellite DNA

Highly repetitive DNA composed of thousands to tens of thousands of tandem repeats, each usually 100–300 bp in length, and frequently associated with heterochromatin.

Muted gaps

Regions that have been incorrectly closed in a genome assembly despite additional sequences being present at these sites in the source genome.

Coalescence

The genealogy of a region of the genome in which all alleles trace back to a common ancestral sequence.

Missing heritability

The observation that only a portion of estimated genetic contribution to disease (for example, heritability of a trait from twin studies) can be explained by our current understanding of genetic variation and its transmission properties.

Exome sequencing

A method for enrichment and targeted sequencing of the protein-coding portions of the genome using massively parallel sequencing.

About this article

DOI

https://doi.org/10.1038/nrg3933
