The discovery of genetic variation and the assembly of genome sequences are both inextricably linked to advances in DNA-sequencing technology. Short-read massively parallel sequencing has revolutionized our ability to discover genetic variation but is insufficient to generate high-quality genome assemblies or resolve most structural variation. Full resolution of variation is only guaranteed by complete de novo assembly of a genome. Here, we review approaches to genome assembly, the nature of gaps or missing sequences, and biases in the assembly process. We describe the challenges of generating a complete de novo genome assembly using current technologies and the impact that being able to perfectly sequence the genome would have on understanding human disease and evolution. Finally, we summarize recent technological advances that improve both contiguity and accuracy and emphasize the importance of complete de novo assembly as opposed to read mapping as the primary means to understanding the full range of human genetic variation.
Complete de novo assembly of a genome is guaranteed to allow assessment of the full range of genetic variation, although the only mammalian genome assemblies completed to date are for human and mouse. Assemblies using massively parallel sequencing (MPS) have increased the diversity of draft genomes that are available but do not completely resolve genomes.
When designing a de novo assembly project, the most-suitable assembly approach to use differs depending on the characteristics of the sequencing reads. MPS methods have relied on de Bruijn graphs, whereas single-molecule sequencing (SMS) reads require pairwise overlaps encoded in overlap or string graphs.
A component of 'missing heritability' is missed sequence variation. Approximately 5–40 Mb of sequence are absent from any given human reference genome owing to structural polymorphism, and standard resequencing has failed to detect the variation underlying diseases such as medullary cystic kidney disease type 1, amyotrophic lateral sclerosis and facioscapulohumeral muscular dystrophy.
Single-molecule long-read sequencing is currently driving gains in genome assembly accuracy and completeness, but new technologies are being developed to generate long-range information, such as optical maps and dilution pool sequencing, that may aid in scaffolding complex regions.
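The two graph formulations named in the key points above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the method of any particular assembler: the function names `de_bruijn_edges` and `pairwise_overlap`, the reads and the parameters `k` and `min_len` are all hypothetical, and sequencing errors, reverse complements and graph simplification are ignored.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Build a de Bruijn graph: each k-mer in a read adds an edge from its
    (k-1)-base prefix to its (k-1)-base suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def pairwise_overlap(a, b, min_len):
    """Return the length of the longest suffix of `a` that is also a prefix
    of `b`, or 0 if no overlap of at least `min_len` bases exists."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


# Toy reads (hypothetical data, error-free, same strand).
reads = ["ACGTAC", "GTACGG"]
graph = de_bruijn_edges(reads, k=4)
overlap = pairwise_overlap(reads[0], reads[1], min_len=3)
```

In the de Bruijn formulation the reads are broken into k-mers and never compared with each other directly, which suits very large numbers of short reads. In the overlap formulation each read stays intact and edges come from read-to-read comparisons, which preserves the long-range information carried by longer reads.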
The authors thank T. Brown for assistance in editing this manuscript. This work was supported, in part, by a US National Institutes of Health grant (2R01HG002385) to E.E.E. E.E.E. is an investigator of the Howard Hughes Medical Institute.
- Resequencing
Characterizing a sample genome and its associated variation by mapping and aligning sequence reads to a reference genome sequence.
- Massively parallel sequencing
(MPS). A general term for a form of DNA sequencing that measures trace signals from millions to hundreds of millions of amplified sequences at once, most frequently referring to sequencing produced by Illumina, Life Technologies and Complete Genomics platforms. Often referred to as next-generation or second-generation sequencing to distinguish it from long-read sequencing approaches (for example, single-molecule sequencing), which are sometimes referred to as third-generation sequencing.
- Structural variation
Large insertion, deletion or inversion differences between homologous chromosomes, or translocation differences involving non-homologous chromosomes. Operationally defined as events >50 bp in size to distinguish from smaller insertion and deletion events.
- Coverage bias
Regions with an excess or deficiency of sequence reads that arises from platform-specific differences in sequencing chemistry, amplification or cloning.
- Phasing
The assignment of genetic variants or alleles to one of two homologous chromosomes.
- De novo assembly
The action of constructing the sequence of a genome from overlapping DNA sequences without guidance from a reference genome.
- Haplotypes
Sets of genetic variants or alleles found on the same chromosome that are inherited together until disrupted by recombination.
- Whole-genome shotgun sequencing and assembly
(WGSA). The reconstruction of a genome from reads redundantly sampled at random, often with the aid of paired-end sequencing.
- Contigs
Continuous (or 'contiguous') sequences produced in a de novo assembly, free of any gaps.
- Scaffolds
Sets of ordered and oriented contigs, with the approximate distances between contigs estimated by traversing paired-end sequences that anchor to different contigs. Scaffolds consist of both sequence contigs and gaps.
- Bacterial artificial chromosomes
(BACs). Vectors with an F-plasmid origin of replication used to clonally propagate an organism's DNA (typically 150–250 kb) by transformation into Escherichia coli.
- Single-molecule sequencing
(SMS). A form of DNA sequencing in which signals are derived from single molecules, frequently referring to sequencing produced by Pacific Biosciences and Oxford Nanopore Technologies platforms.
- Paired-end reads
Two reads sequenced from opposite ends of the same fragment.
- N50 length
A statistic in genomics defined as the length of the shortest contig such that contigs of that length or greater together make up at least half the total length of the assembly. It is commonly used as a metric to summarize the contiguity of an assembly.
- Fragment library
A set of DNA fragments of approximately the same length that are paired-end sequenced.
- Segmental duplication
A sequence that is represented two or more times in a genome with high sequence identity and that did not arise by retrotransposition. Often defined as paralogous sequences that share ≥90% sequence identity and are ≥1 kb in length.
- Short tandem repeats
(STRs). Tandem repeats in which the individual unit of repetition is less than 10 bp long and varies in length between different individuals in a population.
- Variable number of tandem repeats
(VNTR). Any tandem array of repeated sequence motifs that is highly variable between individuals of a population. Historically, the term referred to tandem repeats that varied on the scale of thousands of base pairs over the length of the array.
- Centromeric
Referring to the primary cytogenetic constriction on metaphase chromosomes where the kinetochore forms and spindle fibres attach during cell division. In humans the centromere is made up primarily of repetitions of higher-order alpha-satellite DNA.
- Heterochromatic DNA
Portions of chromosomes that stain densely, are typically gene poor and are rich in satellite sequences.
- Acrocentric
Relating to a type of chromosome in which the centromere maps very close to the end of the short arm. Acrocentric chromosomes in humans are enriched in beta-satellite and ribosomal DNA sequences, which are repeated as hundreds of copies.
- Secondary constrictions
A cytogenetic term referring to metaphase chromosome constrictions outside the centromere, typically rich in satellites and used to help identify chromosomes.
- Satellite DNA
Highly repetitive DNA composed of thousands to tens of thousands of tandem repeats, each usually 100–300 bp in length, and frequently associated with heterochromatin.
- Muted gaps
Regions that have been incorrectly closed in a genome assembly despite additional sequences being present at these sites in the source genome.
- Coalescent
The genealogy of a region of the genome in which all alleles trace back to a common ancestral sequence.
- Missing heritability
The observation that only a portion of estimated genetic contribution to disease (for example, heritability of a trait from twin studies) can be explained by our current understanding of genetic variation and its transmission properties.
- Exome sequencing
A method for enrichment and targeted sequencing of the protein-coding portions of the genome using massively parallel sequencing.
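The N50 length defined in the glossary above can be computed as in the following minimal Python sketch. The function name `n50` and the example contig lengths are hypothetical and chosen only for illustration; the calculation follows the glossary definition directly.

```python
def n50(contig_lengths):
    """Return the N50: the length of the shortest contig such that contigs
    of that length or greater make up at least half the assembly length."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig downwards, accumulating length until
    # half of the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (in kb) summing to 290.
example = [80, 70, 50, 40, 30, 20]
result = n50(example)
```

For these lengths the two longest contigs (80 and 70) together cover 150 of 290, which is the first point at which half the assembly is reached, so the N50 is 70. Because the statistic depends only on the length distribution, it summarizes contiguity and says nothing about whether the contigs are correctly assembled.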
About this article
Nature Communications (2019)