Genetic variation and the de novo assembly of human genomes

Chaisson, Mark J. P.; Wilson, Richard K.; Eichler, Evan E.

doi:10.1038/nrg3933

Review Article
Published: 07 October 2015

Nature Reviews Genetics volume 16, pages 627–640 (2015)

Key Points

Complete de novo assembly of a genome is guaranteed to allow assessment of the full range of genetic variation, although the only mammalian genome assemblies completed to date are for human and mouse. Assemblies using massively parallel sequencing (MPS) have increased the diversity of draft genomes that are available but do not completely resolve genomes.
The most suitable assembly approach for a de novo assembly project depends on the characteristics of the sequencing reads. MPS methods have relied on de Bruijn graphs, whereas single-molecule sequencing (SMS) reads require pairwise overlaps encoded in overlap or string graphs.
A component of 'missing heritability' is missed sequence variation. Approximately 5–40 Mb of sequence are absent from any given human reference genome owing to structural polymorphism, and standard resequencing has failed to detect the variants underlying diseases such as medullary cystic kidney disease type 1, amyotrophic lateral sclerosis and facioscapulohumeral muscular dystrophy.
Single-molecule long-read sequencing is currently driving gains in genome assembly accuracy and completeness, but new technologies are being developed to generate long-range information, such as optical maps and dilution pool sequencing, that may aid in scaffolding complex regions.
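
The contrast between de Bruijn graphs and overlap graphs in the Key Points can be illustrated with a minimal sketch. It is not taken from the article or from any particular assembler: the function names are hypothetical, reads are assumed to be error-free strings, and overlaps are found by exact suffix-prefix matching, whereas real SMS reads would need error-tolerant alignment.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, one edge per distinct k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer links its (k-1)-prefix to its (k-1)-suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def suffix_prefix_overlap(a, b, min_len):
    """Longest suffix of a that exactly equals a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlap_graph(reads, min_len):
    """Build an overlap graph: nodes are whole reads, edges are pairwise overlaps."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = suffix_prefix_overlap(a, b, min_len)
                if n:
                    edges[(i, j)] = n
    return edges


reads = ["ACGTAC", "GTACGG"]  # toy, error-free reads (assumption)
dbg = de_bruijn_graph(reads, k=3)
olg = overlap_graph(reads, min_len=3)
```

In the de Bruijn graph the reads are broken into k-mers and lose their identity, which suits large numbers of short, accurate reads; in the overlap graph each read stays intact as a node, which is why it fits long reads whose length is the main source of information.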

Abstract

The discovery of genetic variation and the assembly of genome sequences are both inextricably linked to advances in DNA-sequencing technology. Short-read massively parallel sequencing has revolutionized our ability to discover genetic variation but is insufficient to generate high-quality genome assemblies or resolve most structural variation. Full resolution of variation is only guaranteed by complete de novo assembly of a genome. Here, we review approaches to genome assembly, the nature of gaps or missing sequences, and biases in the assembly process. We describe the challenges of generating a complete de novo genome assembly using current technologies and the impact that being able to perfectly sequence the genome would have on understanding human disease and evolution. Finally, we summarize recent technological advances that improve both contiguity and accuracy and emphasize the importance of complete de novo assembly as opposed to read mapping as the primary means to understanding the full range of human genetic variation.

**Figure 1: Types of genome assembly gaps.**

**Figure 2: Sequencing and assembly statistics from different platforms.**

**Figure 3: Genome assembly algorithms.**

**Figure 4: Assembly of complex regions of human genetic variation.**

**Figure 5: Human genetic variation detected with local assembly of single molecules.**

Accession codes

Accessions

GenBank/EMBL/DDBJ

GCA_000002125.2
GCA_000001515.4
GCA_000001635.6
GCA_000001895.4
GCA_000004335.1
GCA_000004665.1
GCA_000004845.2
GCA_000146795.3
GCA_000151865.3
GCA_000164805.2
GCA_000181295.3
GCA_000181335.3
GCA_000185165.1
GCA_000235385.1
GCA_000241425.1
GCA_000258655.1
GCA_000264685.1
GCA_000298355.1
GCA_000325575.1
GCA_000442215.1
GCA_000464555.1
GCA_000472045.1
GCA_000493695.1
GCA_000687225.1
GCA_000772465.1
GCA_000772585
GCF_000001545.4

Acknowledgements

The authors thank T. Brown for assistance in editing this manuscript. This work was supported, in part, by a US National Institutes of Health grant (2R01HG002385) to E.E.E. E.E.E. is an investigator of the Howard Hughes Medical Institute.

Author information

Authors and Affiliations

Department of Genome Sciences, University of Washington, Foege Building S-413A, Box 355065, 3720 15th Ave NE, Seattle, 98195, Washington, USA
Mark J. P. Chaisson & Evan E. Eichler
Department of Medicine, Department of Genetics, McDonnell Genome Institute, Washington University School of Medicine, St. Louis, 63108, Missouri, USA
Richard K. Wilson
Howard Hughes Medical Institute, University of Washington, Seattle, 98195, Washington, USA
Evan E. Eichler

Corresponding authors

Correspondence to Mark J. P. Chaisson or Evan E. Eichler.

Ethics declarations

Competing interests

E.E.E. is on the scientific advisory board of DNAnexus, Inc. and is a consultant for Kunming University of Science and Technology (KUST), as part of the 1,000 China Talent Program. M.J.P.C. is a former employee and shareholder of Pacific Biosciences. R.K.W. declares no competing interests.

Glossary

Resequencing: Characterizing a sample genome and its associated variation by mapping and aligning sequence reads to a reference genome sequence.
Massively parallel sequencing (MPS): A general term for a form of DNA sequencing that measures trace signals from millions to hundreds of millions of amplified sequences at once, most frequently referring to sequencing produced by Illumina, Life Technologies and Complete Genomics platforms. Often referred to as next-generation or second-generation sequencing to distinguish it from long-read sequencing approaches (for example, single-molecule sequencing), which are sometimes referred to as third-generation sequencing.
Structural variation: Large insertion, deletion or inversion differences between homologous chromosomes, or translocation differences involving non-homologous chromosomes. Operationally defined as events >50 bp in size to distinguish from smaller insertion and deletion events.
Coverage bias: Regions with an excess or deficiency in the number of sequence reads originating as a result of platform differences in sequence chemistry, amplification or cloning.
Phase: The assignment of genetic variants or alleles to one of two homologous chromosomes.
De novo assembly: The action of constructing the sequence of a genome from overlapping DNA sequences without guidance from a reference genome.
Haplotypes: Sets of genetic variants or alleles found on the same chromosome that are inherited together until disrupted by recombination.
Whole-genome shotgun sequencing and assembly (WGSA): The reconstruction of a genome from reads redundantly sampled at random, often with the aid of paired-end sequencing.
Contigs: Continuous (or 'contiguous') sequences produced in a de novo assembly, free of any gaps.
Scaffolds: Sets of ordered and oriented contigs, with the approximate distances between contigs estimated by traversing paired-end sequences that anchor to different contigs. Scaffolds consist of both sequence contigs and gaps.
Bacterial artificial chromosomes (BACs): Vectors with an F-plasmid origin of replication used to clonally propagate an organism's DNA (typically 150–250 kb) by transfection into Escherichia coli.
Single-molecule sequencing (SMS): A form of DNA sequencing in which signals are derived from single molecules, frequently referring to sequencing produced by Pacific Biosciences and Oxford Nanopore Technologies platforms.
Paired-end: Two reads sequenced from opposite ends of the same fragment.
N50 length: A statistic in genomics defined as the length of the shortest contig in the smallest set of longest contigs that together make up at least half of the total assembly length; equivalently, half of the assembly is contained in contigs of this length or greater. It is commonly used as a metric to summarize the contiguity of an assembly.
Fragment library: A set of DNA fragments of approximately the same length that are paired-end sequenced.
Segmental duplication: When a sequence is represented two or more times in a genome with high sequence identity and did not arise by retrotransposition. Often defined as paralogous sequences that share ≥90% sequence identity and are ≥1 kb in length.
Short tandem repeats (STRs): Tandem repeats in which the individual unit of repetition is less than 10 bp long and varies in length between different individuals in a population.
Variable number of tandem repeats (VNTR): Any tandem array of repeated sequence motifs that are highly variable in different individuals of a population. Historically, these were originally used in reference to tandem repeats that varied on the scale of thousands of base pairs over the length of the array.
Centromeric: Referring to the primary cytogenetic constriction on metaphase chromosomes where the kinetochore forms and spindle fibre attaches during cell division. In humans the centromere is made up primarily of repetitions of higher-order alpha-satellite DNA.
Heterochromatic DNA: Portions of chromosomes that stain densely, are typically gene poor and are rich in satellite sequences.
Acrocentric: Relating to a type of chromosome in which the centromere maps very close to the short arm. Acrocentric chromosomes in humans are enriched in beta-satellite and ribosomal DNA sequences, which are repeated as hundreds of copies.
Secondary constrictions: A cytogenetic term referring to metaphase chromosome constrictions outside the centromere, typically rich in satellites and used to help identify chromosomes.
Satellite DNA: Highly repetitive DNA composed of thousands to tens of thousands of tandem repeats, usually between 100 and 300 bp in length, and frequently associated with heterochromatin.
Muted gaps: Regions that have been incorrectly closed in a genome assembly despite additional sequences being present at these sites in the source genome.
Coalescence: The genealogy of a region of the genome in which all alleles trace back to a common ancestral sequence.
Missing heritability: The observation that only a portion of estimated genetic contribution to disease (for example, heritability of a trait from twin studies) can be explained by our current understanding of genetic variation and its transmission properties.
Exome sequencing: A method for enrichment and targeted sequencing of the protein-coding portions of the genome using massively parallel sequencing.
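
The N50 length defined in the glossary can be computed as in the following minimal sketch. The function name and the example contig lengths are illustrative assumptions, not values from the article.

```python
def n50(contig_lengths):
    """Return the N50: the contig length at which the running total of
    contigs, taken from longest to shortest, first reaches half the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must contain at least one contig")


# Hypothetical assembly of seven contigs totalling 300 kb.
example_contigs_kb = [80, 70, 50, 40, 30, 20, 10]
example_n50 = n50(example_contigs_kb)  # 80 + 70 = 150 reaches half of 300
```

Because the statistic depends only on the sorted contig lengths, it summarizes contiguity but says nothing about whether the contigs are correctly assembled.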

About this article

Cite this article

Chaisson, M., Wilson, R. & Eichler, E. Genetic variation and the de novo assembly of human genomes. Nat Rev Genet 16, 627–640 (2015). https://doi.org/10.1038/nrg3933

Issue Date: November 2015
DOI: https://doi.org/10.1038/nrg3933
