Sense from sequence reads: methods for alignment and assembly

Flicek, Paul; Birney, Ewan

doi:10.1038/nmeth.1376

Review Article
Published: 15 October 2009

Nature Methods volume 6, pages S6–S12 (2009)

A Corrigendum to this article was published on 01 June 2010

This article has been updated

Abstract

The most important first step in understanding next-generation sequencing data is the initial alignment or assembly that determines whether an experiment has succeeded and provides a first glimpse into the results. In parallel with the growth of new sequencing technologies, several algorithms that align or assemble the large data output of today's sequencing machines have been developed. We discuss the current algorithmic approaches and future directions of these fundamental tools and provide specific examples for some commonly used tools.

**Figure 1: Schematic of a hash table–based alignment strategy.**
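
The figure itself is not reproduced here. As a rough illustration of the general idea the caption names, the following minimal sketch indexes every k-mer of a reference in a hash table, uses k-mers of a read as seeds, and verifies each candidate position by counting mismatches. The function names, the toy sequences and the `max_mismatches` parameter are assumptions made for this example and are not taken from the article or from any particular aligner.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map each k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def align_read(read, reference, index, k, max_mismatches=1):
    """Seed with each k-mer of the read, then verify the whole read in place."""
    hits = set()
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            start = pos - offset  # implied start of the read on the reference
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add((start, mismatches))
    return sorted(hits)


reference = "TTGACCATGCAAGT"  # toy reference, hypothetical
index = build_index(reference, k=4)
print(align_read("GACCATGC", reference, index, k=4))  # exact match: [(2, 0)]
print(align_read("CCATGGAA", reference, index, k=4))  # one mismatch: [(4, 1)]
```

Contiguous seeds are used here for brevity; a hash-based method could equally use non-contiguous seed patterns at the cost of a slightly more involved index.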

**Figure 2: The Burrows-Wheeler transform for genomic sequence data.**
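
The figure itself is not reproduced here. As a hedged illustration of the transform the caption names, the sketch below computes the Burrows-Wheeler transform of a short DNA string by sorting its rotations, then counts exact occurrences of a pattern with a backward search over the transformed string. This is a toy under stated assumptions: the quadratic rotation sort and the linear-time `occ` helper are chosen for readability, the `$` sentinel and all function names are conventions of this example, and mismatches, gaps and paired reads are not handled.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy-sized inputs only)."""
    assert "$" not in text
    text += "$"  # sentinel; sorts before A, C, G and T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_text, pattern):
    """Count exact matches of pattern in the original text by backward search."""
    # first_pos[c]: number of characters in the text that sort before c
    first_pos, total = {}, 0
    for c in sorted(set(bwt_text)):
        first_pos[c] = total
        total += bwt_text.count(c)

    def occ(c, i):
        # occurrences of c in the first i characters of the BWT
        return bwt_text[:i].count(c)

    lo, hi = 0, len(bwt_text)  # current range of sorted rotations
    for c in reversed(pattern):
        if c not in first_pos:
            return 0
        lo = first_pos[c] + occ(c, lo)
        hi = first_pos[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("ACAACG")
print(transformed)                           # GC$AAAC
print(count_occurrences(transformed, "AC"))  # 2
```

A practical index would precompute the occurrence counts rather than rescanning the string on every step.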

**Figure 3: Constructing and visualizing a de Bruijn graph of a DNA sequence.**
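
The figure itself is not reproduced here. As a rough illustration of the data structure the caption names, the sketch below builds a de Bruijn graph from a set of reads, with (k-1)-mers as nodes and one edge per distinct k-mer, and then spells a sequence by following a non-branching path. The function names, the toy reads and the choice of k are assumptions made for this example; error correction, reverse complements and coverage are ignored.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def spell_path(graph, start):
    """Extend from start while the current node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop rather than loop around a cycle
            break
        seen.add(node)
        contig += node[-1]
    return contig


reads = ["ACGTC", "GTCAG"]  # two toy reads overlapping in "GTC"
graph = de_bruijn(reads, k=3)
print(spell_path(graph, "AC"))  # ACGTCAG
```

Because the shared k-mer "GTC" maps to a single edge, the two overlapping reads collapse into one path without any pairwise read comparison.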

Change history

06 May 2010
In the version of this article initially published online, the caption to Figure 3b was mislabeled. It shows a de Bruijn graph of two plasmids partially overlapping in sequence. The error has been corrected online in the HTML and PDF versions of the article.

Acknowledgements

The authors acknowledge D. Zerbino and support by the Wellcome Trust and the European Molecular Biology Laboratory.

Author information

Authors and Affiliations

European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
Paul Flicek & Ewan Birney

Corresponding author

Correspondence to Paul Flicek.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

About this article

Cite this article

Flicek, P., Birney, E. Sense from sequence reads: methods for alignment and assembly. Nat Methods 6 (Suppl 11), S6–S12 (2009). https://doi.org/10.1038/nmeth.1376

Issue Date: November 2009
