The most important first step in understanding next-generation sequencing data is the initial alignment or assembly that determines whether an experiment has succeeded and provides a first glimpse into the results. In parallel with the growth of new sequencing technologies, several algorithms that align or assemble the large data output of today's sequencing machines have been developed. We discuss the current algorithmic approaches and future directions of these fundamental tools and provide specific examples for some commonly used tools.
Subscribe to Journal
Get full journal access for 1 year
only $20.17 per issue
All prices are NET prices.
VAT will be added later in the checkout.
Rent or Buy article
Get time limited or full article access on ReadCube.
All prices are NET prices.
Medvedev, P., Stanciu, M. & Brudno, M. Computational methods for discovering structural variation with next-generation sequencing. Nat. Methods 6, S13–S20 (2009).
Pepke, S., Wold, B. & Mortazavi, A. Computational approaches to the analysis of ChIP-seq and RNA-seq data. Nat. Methods 6, S22–S32 (2009).
Boyle, A.P. et al. High-resolution mapping and characterization of open chromatin across the genome. Cell 132, 311–322 (2008).
Margulies, M. et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376–380 (2005).
McKernan, K.J. et al. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Res. 19, 1527–1541 (2009)
Bentley, D.R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).
Batzoglou, S. The many faces of sequence alignment. Brief Bioinform. 6, 6–22 (2005).
Li, H., Ruan, J. & Durbin, R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18, 1851–1858 (2008).
Li, R., Li, Y., Kristiansen, K. & Wang, J. SOAP: short oligonucleotide alignment program. Bioinformatics 24, 713–714 (2008).
Rumble, S.M. et al. SHRiMP: accurate mapping of short color-space reads. PLOS Comput. Biol. 5, e1000386 (2009).
Lin, H., Zhang, Z., Zhang, M.Q., Ma, B. & Li, M. ZOOM! Zillions of oligos mapped. Bioinformatics 24, 2431–2437 (2008).
Ma, B., Tromp, J. & Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics 18, 440–445 (2002). PatternHunter was the first alignment program to implement the method of finding alignments by scanning with 'spaced seeds' that require exact matching positions to seed the alignments but do not require these seeds to be consecutive. This method is extremely effective for the mapping short sequencing reads and has been adopted by most hash-based alignment methods.
Rasmussen, K.R., Stoye, J. & Myers, E.W. Efficient q-gram filters for finding all epsilon-matches over a given length. J. Comput. Biol. 13, 296–308 (2006).
Langmead, B., Trapnell, C., Pop, M. & Salzberg, S.L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Li, R. et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25, 1966–1967 (2009).
Burrows, M. & Wheeler, D.J. A block-sorting lossless data compression algorithm. Technical report 124, Digital Equipment Corporation (1994).
Ferragina, P. & Manzini, G. Opportunistic data structures with applications; doi:10.1109/SFCS.2000.892127 in Proceedings of the 41st Symposium on Foundation of Computer Science (FOCS 2000) 390–398 (IEEE Computer Society, 2000). The FMindex of the BWT sequence first described in this paper is the fundamental result that has been leveraged by each of BWT-based alignment programs. The sequencing matching algorithm described here has been incorporated into each of the methods, with extensions to handle the specific problems of mismatches, gaps and paired reads.
Gräf, S. et al. Optimized design and assessment of whole genome tiling arrays. Bioinformatics 23, i195–i204 (2007).
Kärkkäinen, J. Fast BWT in small space by blockwise suffix sorting. Theor. Comput. Sci. 387, 249–257 (2007).
Flicek, P. The need for speed. Genome Biol. 10, 212 (2009).
Staden, R. A strategy of DNA sequencing employing computer programs. Nucleic Acids Res. 6, 2601–2610 (1979).
Staden, R., Beal, K.F. & Bonfield, J.K. in Computer methods in molecular biology. in Bioinformatics Methods and Protocols vol. 132 (eds. Misener, S. & Krawetz, S.A.) 115–130 (Humana, Totowa, New Jersey, USA, 1998).
Pevzner, P.A., Borodovsky, M.Y. & Mironov, A.A. Linguistics of nucleotide sequences. II: Stationary words in genetic texts and the zonal structure of DNA. J. Biomol. Struct. Dyn. 6, 1027–1038 (1989).
Idury, R.M. & Waterman, M.S. A new algorithm for DNA sequence assembly. J. Comput. Biol. 2, 291–306 (1995). Idury and Waterman first presented the fundamental algorithm for sequence assembly by k-mer extension. The representation of algorithm with the de Bruijn graph data structure is at the heart of the assembly method described here.
Pevzner, P.A. & Tang, H. Fragment assembly with double-barreled data. Bioinformatics 17 (suppl. 1), S225–S233 (2001).
Dohm, J.C., Lottaz, C., Borodina, T. & Himmelbauer, H. SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Res. 17, 1697–1706 (2007).
Jeck, W.R. et al. Extending assembly of short DNA sequences to handle error. Bioinformatics 23, 2942–2944 (2007).
Zerbino, D.R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008).
Chaisson, M.J. & Pevzner, P.A. Short read fragment assembly of bacterial genomes. Genome Res. 18, 324–330 (2008).
Hernandez, D., François, P., Farinelli, L., Osterås, M. & Schrenzel, J. De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res. 18, 802–809 (2008).
Simpson, J.T. et al. ABySS: a parallel assembler for short read sequence data. Genome Res. 19, 1117–1123 (2009).
Butler, J. et al. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 18, 810–820 (2008).
Korf, I. Serial BLAST searching. Bioinformatics 19, 1492–1496 (2003).
Li, H. et al. The Sequence Alignment/Map (SAM) format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
The authors acknowledge D. Zerbino and support by the Wellcome Trust and the European Molecular Biology Laboratory.
The authors declare no competing financial interests.
About this article
Cite this article
Flicek, P., Birney, E. Sense from sequence reads: methods for alignment and assembly. Nat Methods 6, S6–S12 (2009). https://doi.org/10.1038/nmeth.1376
Journal of Clinical Medicine (2020)
Human Genomics (2019)
BMC Genomics (2019)