Assembling large genomes with single-molecule sequencing and locality-sensitive hashing

A Corrigendum to this article was published on 08 October 2015

This article has been updated

Abstract

Long-read, single-molecule real-time (SMRT) sequencing is routinely used to finish microbial genomes, but available assembly methods have not scaled well to larger genomes. We introduce the MinHash Alignment Process (MHAP) for overlapping noisy, long reads using probabilistic, locality-sensitive hashing. Integrating MHAP with the Celera Assembler enabled reference-grade de novo assemblies of Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster and a human hydatidiform mole cell line (CHM1) from SMRT sequencing. The resulting assemblies are highly continuous, include fully resolved chromosome arms and close persistent gaps in these reference genomes. Our assembly of D. melanogaster revealed previously unknown heterochromatic and telomeric transition sequences, and we assembled low-complexity sequences from CHM1 that fill gaps in the human GRCh38 reference. Using MHAP and the Celera Assembler, single-molecule sequencing can produce de novo near-complete eukaryotic assemblies that are 99.99% accurate when compared with available reference genomes.
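
As a rough illustration of the idea behind MinHash-based overlapping, the sketch below estimates the Jaccard similarity of the k-mer sets of two sequences by comparing their per-hash-function minimum values. It is a minimal sketch and not the MHAP implementation: the function names, the use of BLAKE2b as the hash family, and the values of `k` and `num_hashes` are all assumptions chosen for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Hypothetical hash family: a 64-bit BLAKE2b digest keyed by seed."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k=16, num_hashes=128):
    """Keep the minimum hash value of the k-mer set for each hash function."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of positions where the two sketches hold the same minimum."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return matches / len(sketch_a)


if __name__ == "__main__":
    read_a = "ACGTTGCAAGCTTGACCGTAGGCTAACGTTAGCCGATAGGCTTAACG"
    read_b = "TTGACCGTAGGCTAACGTTAGCCGATAGGCTTAACGGATCCGATTAC"
    sa = minhash_sketch(read_a)
    sb = minhash_sketch(read_b)
    print(f"estimated Jaccard similarity: {estimate_jaccard(sa, sb):.3f}")
```

Two reads whose sketches agree in many positions probably share many k-mers, so they are candidates for a more careful overlap check. The sketch size trades memory and time against the variance of the estimate.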

Figure 1: Rapid overlapping of noisy reads using MinHash sketches.
Figure 2: Simulated MHAP performance for various sketch sizes and read lengths.
Figure 3: Single-contig assembly of D. melanogaster chromosome arm 3L.
Figure 4: Continuity and putative GRCh38 gap closures of the CHM1 assembly.

Change history

  • 06 October 2015

    In the version of this article initially published, equation 9 appeared incorrectly. The equation has been corrected in the HTML and PDF versions of the article.


Acknowledgements

We are indebted to C. Bergman of the University of Manchester for his considered advice throughout this project and editing of an early version of this manuscript. We also thank Pacific Biosciences and all those involved in generating and freely releasing the data analyzed here. The contributions of S.K. and A.M.P. were funded under Agreement No. HSHQDC-07-C-00020 awarded by the Department of Homeland Security Science and Technology Directorate (DHS/S&T) for the management and operation of the National Biodefense Analysis and Countermeasures Center (NBACC), a Federally Funded Research and Development Center. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the US Department of Homeland Security. In no event shall the DHS, NBACC or Battelle National Biodefense Institute (BNBI) have any responsibility or liability for any use, misuse, inability to use, or reliance upon the information contained herein. The Department of Homeland Security does not endorse any products or commercial services mentioned in this publication.

Author information

Contributions

K.B. and S.K. conceived, designed and implemented the MHAP algorithm. C.S.C. and J.P.D. conceived, designed and implemented the consensus algorithms. S.K. ran and analyzed the genome assemblies. J.M.L. coordinated data release and assisted with pipeline executions. C.S.C. and S.K. performed cloud-computing experiments. K.B., S.K. and A.M.P. drafted the manuscript. A.M.P. coordinated the project. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Sergey Koren.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–13, Supplementary Tables 1, 3–5, and Supplementary Notes 1–11 (PDF 3100 kb)

Supplementary Table 2 (XLSX 195 kb)

Supplementary Table 6 (XLSX 46 kb)

Supplementary Software (TAR 52520 kb)

Cite this article

Berlin, K., Koren, S., Chin, CS. et al. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat Biotechnol 33, 623–630 (2015). https://doi.org/10.1038/nbt.3238

  • DOI: https://doi.org/10.1038/nbt.3238
