Assembling large genomes with single-molecule sequencing and locality-sensitive hashing

Journal name:
Nature Biotechnology


Long-read, single-molecule real-time (SMRT) sequencing is routinely used to finish microbial genomes, but available assembly methods have not scaled well to larger genomes. We introduce the MinHash Alignment Process (MHAP) for overlapping noisy, long reads using probabilistic, locality-sensitive hashing. Integrating MHAP with the Celera Assembler enabled reference-grade de novo assemblies of Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster and a human hydatidiform mole cell line (CHM1) from SMRT sequencing. The resulting assemblies are highly continuous, include fully resolved chromosome arms and close persistent gaps in these reference genomes. Our assembly of D. melanogaster revealed previously unknown heterochromatic and telomeric transition sequences, and we assembled low-complexity sequences from CHM1 that fill gaps in the human GRCh38 reference. Using MHAP and the Celera Assembler, single-molecule sequencing can produce de novo near-complete eukaryotic assemblies that are 99.99% accurate when compared with available reference genomes.

At a glance


    Figure 1: Rapid overlapping of noisy reads using MinHash sketches.

    (a) To create a MinHash sketch of a DNA sequence S, the sequence is first decomposed into its constituent k-mers. In the example shown, k = 3, resulting in 12 k-mers each for S1 and S2. (b) All k-mers are then converted to integer fingerprints by multiple hash functions. The number of hash functions determines the resulting sketch size H. Here, where H = 4, four independent hash sets are generated for each sequence (Γ1...H). In MHAP, after the initial hash (Γ1), subsequent fingerprints are generated using an XORShift pseudo-random number generator (Γ2...H). The k-mer generating the minimum value for each hash is referred to as the min-mer for that hash. (c) The sketch of a sequence is composed of the ordered set of its H min-mer fingerprints, which is much smaller than the set of all k-mers. In this example, the sketches of S1 and S2 share the same minimum fingerprints (underlined) for Γ1 and Γ2. (d) The fraction of entries shared between the sketches of two sequences S1 and S2 (0.5) serves as an estimate of their true Jaccard similarity (0.22), with the error bound controlled by H. In practice, H >> 4 is required to obtain accurate estimates. (e) If sufficient similarity is detected between two sketches, the shared min-mers (ACC and CCG in this case) are located in the original sequences, and the median difference in their positions is computed to determine the overlap offset (0) for S1 and S2.
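The steps in panels (a) to (e) can be illustrated with a minimal Python sketch. This is not the MHAP implementation: the function names are hypothetical, the initial hash (BLAKE2b truncated to 64 bits) is an assumption chosen for convenience, and the XORShift shift constants (13, 7, 17) are one standard 64-bit choice rather than values taken from the article.

```python
import hashlib
from statistics import median

MASK64 = (1 << 64) - 1


def kmers(seq, k):
    """Panel (a): decompose a sequence into its overlapping k-mers."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def base_hash(kmer):
    """First fingerprint (Gamma_1). BLAKE2b is an assumption, not MHAP's hash."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    # XORShift maps 0 to 0, so avoid a zero starting state.
    return int.from_bytes(digest, "big") or 1


def xorshift64(x):
    """One step of a 64-bit XORShift generator (assumed shifts 13, 7, 17)."""
    x ^= (x << 13) & MASK64
    x ^= x >> 7
    x ^= (x << 17) & MASK64
    return x


def sketch(seq, k, H):
    """Panels (b, c): for each of H hashes keep (min fingerprint, min-mer position).

    Assumes len(seq) >= k.
    """
    mins = [None] * H
    for pos, kmer in enumerate(kmers(seq, k)):
        x = base_hash(kmer)
        for j in range(H):
            if mins[j] is None or x < mins[j][0]:
                mins[j] = (x, pos)
            x = xorshift64(x)  # next fingerprint (Gamma_2 ... Gamma_H)
    return mins


def jaccard_estimate(sk1, sk2):
    """Panel (d): fraction of sketch entries with identical min fingerprints."""
    shared = sum(a[0] == b[0] for a, b in zip(sk1, sk2))
    return shared / len(sk1)


def exact_jaccard(s1, s2, k):
    """True Jaccard similarity of the two k-mer sets, for comparison."""
    a, b = set(kmers(s1, k)), set(kmers(s2, k))
    return len(a & b) / len(a | b)


def overlap_offset(sk1, sk2):
    """Panel (e): median position difference of the shared min-mers."""
    diffs = [a[1] - b[1] for a, b in zip(sk1, sk2) if a[0] == b[0]]
    return median(diffs) if diffs else None
```

Calling `sketch(s1, 16, 512)` and `sketch(s2, 16, 512)` on two reads and passing the results to `jaccard_estimate` and `overlap_offset` gives a similarity estimate and a candidate offset without comparing every k-mer of one read against every k-mer of the other. Because each sketch stores only H entries, the cost of comparing two reads depends on H rather than on read length.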

    Figure 2: Simulated MHAP performance for various sketch sizes and read lengths.

    10 kbp reads were randomly extracted from the human reference genome, and errors were introduced to simulate a SMRT sequencing error model (10% insertion, 2% deletion and 1% substitution; ref. 13). (a) For k = 10, it is common to find at least three min-mer matches by chance. (b) For k = 16, at least three min-mer matches are sufficient to separate random matches from true matches. Three scenarios are shown for ≥1 and ≥3 matching min-mers: unrelated sequences (Rand), reads overlapping by exactly 2 kbp (Olap) and reads mapped to a perfect reference (Map). The expected Jaccard similarity between a pair of random and nonrandom reads was estimated based on 50,000 independent trials and equation (9) in Online Methods. (c) The total number of 16-mers processed by MHAP decreases exponentially relative to the direct approach (shown for sketch sizes of 512 and 1,256). The number of accesses is normalized by the maximum value observed during the simulations, and given on a log-scaled y axis. Random fluctuations are an artifact of the random read sampling.
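A read simulation of the kind described above could look like the following Python sketch. It is only one simple reading of the stated error rates: errors are assumed to be independent per base, inserted and substituted bases are assumed to be drawn uniformly, and the function name and parameters are hypothetical rather than taken from the article or its software.

```python
import random


def simulate_smrt_errors(seq, rng, p_ins=0.10, p_del=0.02, p_sub=0.01):
    """Return a noisy copy of seq under a simple per-base error model.

    Assumptions (not from the article): errors are independent per base,
    and inserted or substituted bases are drawn uniformly at random.
    """
    out = []
    for base in seq:
        # Insertion: emit a random extra base before the current one.
        if rng.random() < p_ins:
            out.append(rng.choice("ACGT"))
        r = rng.random()
        if r < p_del:
            continue  # deletion: drop the current base
        if r < p_del + p_sub:
            # Substitution: replace with one of the three other bases.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


# Usage: extract a 10 kbp "read" from a random stand-in genome and add noise.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(50_000))
start = rng.randrange(len(genome) - 10_000)
noisy_read = simulate_smrt_errors(genome[start:start + 10_000], rng)
```

Under these rates a simulated read is on average about 8% longer than its source, because insertions outnumber deletions.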

    Figure 3: Single-contig assembly of D. melanogaster chromosome arm 3L.

    A single ~25 Mbp contig from the MHAP D. melanogaster assembly covers the full euchromatic region of chromosome arm 3L. (Top) All 100-bp exact repeats across the length of the 3L assembly are shown using a self-alignment dotplot. Red dots indicate forward repeats, and blue dots indicate inverted repeats. Points nearer to the hypotenuse indicate repeat copies nearer to each other in the genome. (Bottom left) All 20-bp exact repeats are shown for the first 12 kbp of the assembly, illustrating a peritelomeric tandem repeat. (Bottom right) All 20-bp exact repeats are shown for the last 2 Mbp of the assembly, which has an elevated repeat density characteristic of the pericentromeric region.
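The points of such a self-alignment dotplot are pairs of positions that share an exact k-bp match, either on the same strand (forward) or as a reverse complement (inverted). The Python sketch below shows one simple way to enumerate them with a k-mer index. It is an illustration only and not the tool used to draw the figure; the function names are hypothetical, and palindromic k-mers are skipped for simplicity.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(s):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return s.translate(COMPLEMENT)[::-1]


def exact_repeats(seq, k):
    """Return (forward, inverted) lists of position pairs sharing an exact k-mer.

    Each pair is (smaller position, larger position) on the forward strand.
    Pair enumeration is quadratic in copy number, so this is only suitable
    for small examples.
    """
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)

    forward, inverted = [], []
    for kmer, positions in index.items():
        # Forward repeats: the same k-mer at two different positions.
        for a in range(len(positions)):
            for b in range(a + 1, len(positions)):
                forward.append((positions[a], positions[b]))
        # Inverted repeats: the reverse complement occurs elsewhere.
        # The kmer < rc test reports each pair once and skips palindromes.
        rc = revcomp(kmer)
        if kmer < rc and rc in index:
            for i in positions:
                for j in index[rc]:
                    inverted.append((min(i, j), max(i, j)))
    return forward, inverted
```

Plotting the forward pairs in one color and the inverted pairs in another, with both coordinates on the same sequence axis, gives a dotplot of the kind described in the caption.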

    Figure 4: Continuity and putative GRCh38 gap closures of the CHM1 assembly.

    Human chromosomes are painted with assembled CHM1 contigs using the coloredChromosomes package (ref. 59). Alternating shades indicate adjacent contigs, so each vertical transition from gray to black represents a contig boundary or alignment breakpoint. The left half of each chromosome shows the MHAP assembly of the SMRT data set and the right half shows the Illumina-based assembly (ref. 42). The SMRT assembly is considerably more continuous, with an average of fewer than 100 contigs per chromosome. Putative GRCh38 gap closures are shown as red dashes next to the spanned gap position. Most SMRT closures fall near the telomeres and centromeres. The three gaps spanned by the Illumina assembly are not associated with the primary chromosomes and cannot be displayed. Tiling gaps, in white, coincide with missing assembly sequence and/or regions of uncharacterized reference sequence (i.e., long stretches of N's) to which no contigs could be mapped. A full list of putative gap closures is given in Supplementary Table 6.

Change history

Corrected online 06 October 2015
In the version of this article initially published, equation 9 appeared incorrectly. The equation has been corrected in the HTML and PDF versions of the article.


  1. Miller, J.R., Koren, S. & Sutton, G. Assembly algorithms for next-generation sequencing data. Genomics 95, 315–327 (2010).
  2. Nagarajan, N. & Pop, M. Sequence assembly demystified. Nat. Rev. Genet. 14, 157–167 (2013).
  3. Denton, J.F. et al. Extensive error in the number of genes inferred from draft genome assemblies. PLOS Comput. Biol. 10, e1003998 (2014).
  4. Ukkonen, E. Approximate string-matching with q-grams and maximal matches. Theor. Comput. Sci. 92, 191–211 (1992).
  5. Schatz, M.C., Delcher, A.L. & Salzberg, S.L. Assembly of large genomes using second-generation sequencing. Genome Res. 20, 1165–1173 (2010).
  6. Clarke, J. et al. Continuous base identification for single-molecule nanopore DNA sequencing. Nat. Nanotechnol. 4, 265–270 (2009).
  7. Eid, J. et al. Real-time DNA sequencing from single polymerase molecules. Science 323, 133–138 (2009).
  8. Lee, H. et al. Error correction and assembly complexity of single molecule sequencing reads. bioRxiv 10.1101/006395 (2014).
  9. Quick, J., Quinlan, A.R. & Loman, N.J. A reference bacterial genome dataset generated on the MinION portable single-molecule nanopore sequencer. GigaScience 3, 22 (2014).
  10. Koren, S. et al. Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biol. 14, R101 (2013).
  11. Koren, S. & Phillippy, A.M. One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Curr. Opin. Microbiol. 23, 110–120 (2015).
  12. English, A.C. et al. Mind the gap: upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS ONE 7, e47768 (2012).
  13. Koren, S. et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat. Biotechnol. 30, 693–700 (2012).
  14. Ribeiro, F.J. et al. Finished bacterial genomes from shotgun sequence data. Genome Res. 22, 2270–2277 (2012).
  15. Chin, C.S. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods 10, 563–569 (2013).
  16. Ross, M.G. et al. Characterizing and measuring bias in sequence data. Genome Biol. 14, R51 (2013).
  17. Chaisson, M.J. et al. Resolving the complexity of the human genome using single-molecule sequencing. Nature 517, 608–611 (2014).
  18. Lam, K.K., Khalak, A. & Tse, D. Near-optimal assembly for shotgun sequencing with noisy reads. BMC Bioinformatics 15 (suppl. 9), S4 (2014).
  19. PacBio. Data Release: Preliminary de novo Haploid and Diploid Assemblies of Drosophila melanogaster (2014).
  20. Broder, A.Z. On the resemblance and containment of documents. Compression and Complexity of Sequences 1997. Proceedings 21–29 (1997).
  21. Broder, A.Z. Identifying and filtering near-duplicate documents. Combinatorial pattern matching 1–10 (2000).
  22. Chum, O., Philbin, J. & Zisserman, A. Near duplicate image detection: min-Hash and tf-idf weighting. British Machine Vision Conference 810, 812–815 (2008).
  23. Buhler, J. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics 17, 419–428 (2001).
  24. Narayanan, M. & Karp, R.M. Gapped local similarity search with provable guarantees. Algorithms Bioinform. 3240, 74–86 (2004).
  25. Yang, X. et al. De novo assembly of highly diverse viral populations. BMC Genomics 13, 475 (2012).
  26. Rasheed, Z. & Rangwala, H. Mc-minh: Metagenome clustering using minwise based hashing. SIAM International Conference in Data Mining (2013).
  27. Roberts, M., Hayes, W., Hunt, B.R., Mount, S.M. & Yorke, J.A. Reducing storage requirements for biological sequence comparison. Bioinformatics 20, 3363–3369 (2004).
  28. Chaisson, M.J. & Tesler, G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13, 238 (2012).
  29. Myers, G. Efficient local alignment discovery amongst noisy long reads. Algorithms Bioinform. 8701, 52–67 (2014).
  30. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997 (2013).
  31. Zaharia, M. et al. Faster and more accurate sequence alignment with SNAP. arXiv preprint arXiv:1111.5572 (2011).
  32. Weese, D., Holtgrewe, M. & Reinert, K. RazerS 3: faster, fully sensitive read mapping. Bioinformatics 28, 2592–2599 (2012).
  33. Myers, E.W. An O(ND) difference algorithm and its variations. Algorithmica 1, 251–266 (1986).
  34. Myers, E.W. et al. A whole-genome assembly of Drosophila. Science 287, 2196–2204 (2000).
  35. Kim, K.E. et al. Long-read, whole-genome shotgun sequence data for five model organisms. Scientific Data 1, 140045 (2014).
  36. Ralser, M. et al. The Saccharomyces cerevisiae W303–K6001 cross-platform genome sequence: insights into ancestry and physiology of a laboratory mutt. Open Biol. 2, 120093 (2012).
  37. Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815 (2000).
  38. Hoskins, R.A. et al. Sequence finishing and mapping of Drosophila melanogaster heterochromatin. Science 316, 1625–1628 (2007).
  39. Weber, J.L. & Myers, E.W. Human whole-genome shotgun sequencing. Genome Res. 7, 401–409 (1997).
  40. Lander, E.S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
  41. Venter, J.C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).
  42. Steinberg, K.M. et al. Single haplotype assembly of the human genome from a hydatidiform mole. Genome Res. 24, 2066–2076 (2014).
  43. The MHC Sequencing Consortium. Complete sequence and gene map of a human major histocompatibility complex. Nature 401, 921–923 (1999).
  44. Huddleston, J. et al. Reconstructing complex regions of genomes using long-read sequencing technology. Genome Res. 24, 688–696 (2014).
  45. Phillippy, A.M., Schatz, M.C. & Pop, M. Genome assembly forensics: finding the elusive mis-assembly. Genome Biol. 9, R55 (2008).
  46. Adams, M.D. et al. The genome sequence of Drosophila melanogaster. Science 287, 2185–2195 (2000).
  47. Salzberg, S.L. et al. GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 22, 557–567 (2012).
  48. Ewing, B., Hillier, L., Wendl, M.C. & Green, P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8, 175–185 (1998).
  49. Kaminker, J.S. et al. The transposable elements of the Drosophila melanogaster euchromatin: a genomics perspective. Genome Biol. 3, research0084 (2002).
  50. McCoy, R.C. et al. Illumina TruSeq synthetic long-reads empower de novo assembly and resolve complex, highly-repetitive transposable elements. PLoS ONE 9, e106689 (2014).
  51. Mewes, H.W. et al. Overview of the yeast genome. Nature 387, 765 (1997).
  52. Blasco, M.A. Telomeres and human disease: ageing, cancer and beyond. Nat. Rev. Genet. 6, 611–622 (2005).
  53. George, J.A., DeBaryshe, P.G., Traverse, K.L., Celniker, S.E. & Pardue, M.L. Genomic organization of the Drosophila telomere retrotransposable elements. Genome Res. 16, 1231–1240 (2006).
  54. Jurka, J. et al. Repbase Update, a database of eukaryotic repetitive elements. Cytogenet. Genome Res. 110, 462–467 (2005).
  55. Koch, P., Platzer, M. & Downie, B.R. RepARK–de novo creation of repeat libraries from whole-genome NGS reads. Nucleic Acids Res. 42, e80 (2014).
  56. Schwartz, D.C. et al. Ordered restriction maps of Saccharomyces cerevisiae chromosomes constructed by optical mapping. Science 262, 110–114 (1993).
  57. Burton, J.N. et al. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat. Biotechnol. 31, 1119–1125 (2013).
  58. Kaplan, N. & Dekker, J. High-throughput genome scaffolding from in vivo DNA interaction frequency. Nat. Biotechnol. 31, 1143–1147 (2013).
  59. Böhringer, S., Gödde, R., Böhringer, D., Schulte, T. & Epplen, J.T. A software package for drawing ideograms automatically. Online J. Bioinform. 1, 51–61 (2002).
  60. Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477 (2012).
  61. PacBio DevNet. Pacific Biosciences DevNet Datasets (2014).
  62. Smith, T.F. & Waterman, M.S. Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981).
  63. Rasmussen, K.R., Stoye, J. & Myers, E.W. Efficient q-gram filters for finding all epsilon-matches over a given length. J. Comput. Biol. 13, 296–308 (2006).
  64. Manber, U. & Myers, G. Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22, 935–948 (1993).
  65. Cheng, R.C.H. & Amin, N.A.K. Estimating parameters in continuous univariate distributions with a shifted origin. J. R. Stat. Soc., B 45, 394–403 (1983).
  66. Lee, C., Grasso, C. & Sharlow, M.F. Multiple sequence alignment using partial order graphs. Bioinformatics 18, 452–464 (2002).
  67. Anson, E.L. & Myers, E.W. ReAligner: a program for refining DNA sequence multi-alignments. J. Comput. Biol. 4, 369–383 (1997).
  68. Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome Biol. 5, R12 (2004).


Author information

  1. These authors contributed equally to this work.

    • Konstantin Berlin &
    • Sergey Koren


  1. Department of Chemistry and Biochemistry, University of Maryland, College Park, Maryland, USA.

    • Konstantin Berlin
  2. Institute for Advanced Computer Studies, University of Maryland, College Park, Maryland, USA.

    • Konstantin Berlin
  3. Invincea Labs, Arlington, Virginia, USA.

    • Konstantin Berlin
  4. National Biodefense Analysis and Countermeasures Center, Frederick, Maryland, USA.

    • Sergey Koren &
    • Adam M Phillippy
  5. Pacific Biosciences of California, Inc., Menlo Park, California, USA.

    • Chen-Shan Chin,
    • James P Drake &
    • Jane M Landolin


K.B. and S.K. conceived, designed and implemented the MHAP algorithm. C.S.C. and J.P.D. conceived, designed and implemented the consensus algorithms. S.K. ran and analyzed the genome assemblies. J.M.L. coordinated data release and assisted with pipeline executions. C.S.C. and S.K. performed cloud-computing experiments. K.B., S.K. and A.M.P. drafted the manuscript. A.M.P. coordinated the project. All authors read and approved the final manuscript.

Supplementary information

PDF files

  1. Supplementary Text and Figures (4.2 MB)

    Supplementary Figures 1–13, Supplementary Tables 1, 3–5, and Supplementary Notes 1–11

Excel files

  1. Supplementary Table 2 (196 KB)
  2. Supplementary Table 6 (47 KB)

Tape archive files

  1. Supplementary Software (51.7 MB)
