Abstract
Long-read, single-molecule real-time (SMRT) sequencing is routinely used to finish microbial genomes, but available assembly methods have not scaled well to larger genomes. We introduce the MinHash Alignment Process (MHAP) for overlapping noisy, long reads using probabilistic, locality-sensitive hashing. Integrating MHAP with the Celera Assembler enabled reference-grade de novo assemblies of Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster and a human hydatidiform mole cell line (CHM1) from SMRT sequencing. The resulting assemblies are highly continuous, include fully resolved chromosome arms and close persistent gaps in these reference genomes. Our assembly of D. melanogaster revealed previously unknown heterochromatic and telomeric transition sequences, and we assembled low-complexity sequences from CHM1 that fill gaps in the human GRCh38 reference. With MHAP and the Celera Assembler, single-molecule sequencing can produce de novo, near-complete eukaryotic assemblies that are 99.99% accurate when compared with available reference genomes.
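As a rough illustration of the locality-sensitive hashing idea named in the abstract, the sketch below estimates the k-mer Jaccard similarity of two reads from fixed-size MinHash sketches. It is a minimal textbook MinHash, not the MHAP implementation: the function names, the seeded BLAKE2b hash, and the values k=16 and num_hashes=128 are all assumptions chosen for the example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Seeded 64-bit hash of a k-mer (illustrative choice of hash family)."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k=16, num_hashes=128):
    """Keep, for each seeded hash function, the minimum value over all k-mers."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of matching minima, an estimate of the k-mer Jaccard index."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)
```

Two reads that overlap share many k-mers, so their sketches agree at many positions, while unrelated reads agree at almost none. Comparing short sketches instead of full reads is what makes an all-against-all overlap search cheaper; candidate pairs with a high estimate would still need a proper alignment step.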
Change history
06 October 2015
In the version of this article initially published, equation 9 appeared incorrectly. The equation has been corrected in the HTML and PDF versions of the article.
Acknowledgements
We are indebted to C. Bergman of the University of Manchester for his considered advice throughout this project and editing of an early version of this manuscript. We also thank Pacific Biosciences and all those involved in generating and freely releasing the data analyzed here. The contributions of S.K. and A.M.P. were funded under Agreement No. HSHQDC-07-C-00020 awarded by the Department of Homeland Security Science and Technology Directorate (DHS/S&T) for the management and operation of the National Biodefense Analysis and Countermeasures Center (NBACC), a Federally Funded Research and Development Center. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the US Department of Homeland Security. In no event shall the DHS, NBACC or Battelle National Biodefense Institute (BNBI) have any responsibility or liability for any use, misuse, inability to use, or reliance upon the information contained herein. The Department of Homeland Security does not endorse any products or commercial services mentioned in this publication.
Author information
Contributions
K.B. and S.K. conceived, designed and implemented the MHAP algorithm. C.S.C. and J.P.D. conceived, designed and implemented the consensus algorithms. S.K. ran and analyzed the genome assemblies. J.M.L. coordinated data release and assisted with pipeline executions. C.S.C. and S.K. performed cloud-computing experiments. K.B., S.K. and A.M.P. drafted the manuscript. A.M.P. coordinated the project. All authors read and approved the final manuscript.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–13, Supplementary Tables 1, 3–5, and Supplementary Notes 1–11 (PDF 3100 kb)
About this article
Cite this article
Berlin, K., Koren, S., Chin, CS. et al. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat Biotechnol 33, 623–630 (2015). https://doi.org/10.1038/nbt.3238
DOI: https://doi.org/10.1038/nbt.3238