Advances in sequencing technologies and increased access to sequencing have renewed interest in sequence assembly algorithms and tools.
Assembly continues to be a computationally challenging problem in which engineering 'details' play a more important part than the choice of a specific assembly paradigm in defining the performance and accuracy of assemblers.
Modern sequence assemblers continue to explore new ways to capture and to analyse graph structures to carry out assembly in a time- and memory-efficient manner.
Most assembly programs are based on heuristics and ad hoc techniques and provide no guarantees on the correctness of the reconstructed sequence. Recent tools have sought to address this need by focusing on assembly tasks in which exact algorithms are feasible.
The availability of multiple sequencing technologies and library preparation protocols has brought into focus the importance of experimental design in sequence assembly.
Coupling of experimental design with the development of assembly algorithms may be key to optimizing assembly results in the future.
A combination of in silico assessment and validation using independent experimental data is currently used to assess the reliability of sequence assembly, although computational tools for assembly validation are still limited in number.
Sequence assembly is increasingly used for applications other than the traditional role of assembling genomes, including transcriptome analysis, reconstruction of microbial communities (metagenomics) and the discovery of genomic variants.
Application-specific assemblers, which exploit characteristics of the sequences to be reconstructed, have emerged as an important area of focus for assembly research.
Advances in sequencing technologies and increased access to sequencing services have led to renewed interest in sequence and genome assembly. Concurrently, new applications for sequencing have emerged, including gene expression analysis, discovery of genomic variants and metagenomics, and each of these has different needs and challenges in terms of assembly. We survey the theoretical foundations that underlie modern assembly and highlight the options and practical trade-offs that need to be considered, focusing on how individual features address the needs of specific applications. We also review key software and the interplay between experimental design and efficacy of assembly.
N.N. was supported by the Agency for Science, Technology and Research (A*STAR), Singapore. M.P. was supported in part by the US National Science Foundation (grants IIS-1117247 and IIS-0844494) and by the Bill and Melinda Gates Foundation.
The authors declare no competing financial interests.
- Paired-end data
Data from a pair of reads sequenced from ends of the same DNA fragment. The genomic distance between the reads is approximately known and is used to constrain assembly solutions. See also 'mate-pair read'.
- Mate-pair data
Data from a pair of reads sequenced from the same circularized DNA fragment. The circularization step allows larger fragment sizes to be used. Mate-pair reads provide the same information to the assembler as paired-end reads.
- Contiguous sequence
(Contig). A sequence reconstructed by assembling together multiple reads.
- Read
The sequence generated by a sequencing machine from a DNA fragment.
- Overlap
The relationship between two reads, the ends of which have highly similar sequences. The minimum length allowed for the corresponding sequence is an important parameter in assembly.
- Scaffold
An ordered collection of contiguous sequences (contigs), the relative placement of which is typically inferred from mate-pair reads and other information. The sequence within the gaps between the contigs is usually not known.
- Library
A collection of paired-end or mate-pair reads derived from DNA fragments with a tightly controlled size range.
- Depth of coverage
The average number of reads covering a particular base in the sequence being assembled.
- N50
A statistic used for assessing the contiguity of a genome assembly. The contigs in an assembly are sorted by size and their lengths are added, starting with the largest. The N50 is the size of the contig that brings the running total to at least 50% of the genome size.
- Isolate genome
The genome of a single organism isolated through culture, for which a substantial quantity of DNA can be obtained.
- k-mers
Strings of k consecutive letters extracted from a longer sequence, such as a read or a reference assembly.
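Several of the glossary definitions above are simple enough to compute directly. The following is a minimal Python sketch of three of them: k-mer extraction, depth of coverage and the contig-based contiguity statistic (N50). The function names and parameters are hypothetical and chosen for illustration only; they are not taken from the article or from any assembler. Two assumptions are made: depth of coverage is estimated as total read bases divided by genome size, and the N50 falls back to the total assembly length when the genome size is not supplied.

```python
from typing import Iterable, List, Optional


def kmers(sequence: str, k: int) -> List[str]:
    """Return every string of k consecutive letters in `sequence`, in order."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def depth_of_coverage(read_lengths: Iterable[int], genome_size: int) -> float:
    """Estimate the average number of reads covering a base.

    Assumption: every read base falls within the genome, so the average
    is the total number of read bases divided by the genome size.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size


def n50(contig_lengths: Iterable[int], genome_size: Optional[int] = None) -> int:
    """Return the size of the contig that brings the running total to >= 50%.

    Contigs are sorted from largest to smallest and their lengths are added.
    Assumption: when `genome_size` is None, the total assembly length is
    used as the reference size. Returns 0 if the assembly never reaches
    half of the reference size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    target = genome_size if genome_size is not None else sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if 2 * running >= target:  # running total is at least 50% of target
            return length
    return 0
```

For example, with contigs of length 80, 70, 50, 40, 30, 20 and 10 (300 in total), the two largest contigs already sum to 150, so the statistic is 70; if the genome size is instead given as 400, the third contig is needed to reach 200 and the statistic drops to 50.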
About this article
Cite this article
Nagarajan, N., Pop, M. Sequence assembly demystified. Nat Rev Genet 14, 157–167 (2013). https://doi.org/10.1038/nrg3367