  • Letter

De novo sequencing and variant calling with nanopores using PoreSeq

Abstract

The accuracy of sequencing single DNA molecules with nanopores is continually improving, but de novo genome sequencing and assembly using only nanopore data remain challenging. Here we describe PoreSeq, an algorithm that identifies and corrects errors in nanopore sequencing data and improves the accuracy of de novo genome assembly with increasing coverage depth. The approach relies on modeling the possible sources of uncertainty that occur as DNA transits through the nanopore and finds the sequence that best explains multiple reads of the same region. PoreSeq increases nanopore sequencing read accuracy of M13 bacteriophage DNA from 85% to 99% at 100× coverage. We also use the algorithm to assemble Escherichia coli with 30× coverage and the λ genome at a range of coverages from 3× to 50×. Additionally, we classify sequence variants at an order of magnitude lower coverage than is possible with existing methods.

Figure 1: Nanopore sequencing fundamentals.
Figure 2: PoreSeq algorithm.
Figure 3: PoreSeq performance.

Accession codes

Accessions

European Nucleotide Archive

Acknowledgements

We would like to thank E. Brandin for molecule preparation, D. Branton for obtaining MinION sequencers, S. Fleming for helpful algorithmic discussions and Figure 1a, and A. Kuan and M. Burns for feedback on this manuscript. The computations in this paper were run on the Odyssey cluster supported by the Faculty of Arts and Sciences Division of Science, Research Computing Group at Harvard University, and the work was supported by the National Institutes of Health Award no. R01HG003703 to J.A. Golovchenko and D. Branton.

Author information

Contributions

T.S.: algorithm development, data analysis and interpretation, writing of manuscript; J.A.G.: data analysis and interpretation, writing of manuscript.

Corresponding author

Correspondence to Jene A Golovchenko.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Allowed 5-mer transitions

An illustration of all of the 5-mer transitions from a single state (GCTAT), including normal steps (black), skips (red) and stays (green).
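
The transition types named in this caption can be enumerated directly. The following is a minimal sketch, assuming that a "stay" leaves the 5-mer unchanged, a "step" advances the strand by one base, and a "skip" advances it by two bases (one missed level). The function name `kmer_transitions` is hypothetical and is not taken from PoreSeq.

```python
from itertools import product

BASES = "ACGT"

def kmer_transitions(state):
    """Enumerate the k-mers reachable from `state` by a stay, a step, or a single skip."""
    # Stay: the strand does not advance, so the same k-mer is observed again.
    stay = [state]
    # Step: the strand advances by one base; one new base enters the k-mer.
    step = [state[1:] + b for b in BASES]
    # Skip (assumed to mean one missed level): the strand advances by two bases.
    skip = [state[2:] + a + b for a, b in product(BASES, repeat=2)]
    return {"stay": stay, "step": step, "skip": skip}

transitions = kmer_transitions("GCTAT")
for kind, targets in transitions.items():
    print(kind, len(targets), targets[:4])
```

Under these assumptions a single 5-mer has 1 stay, 4 step and 16 skip targets, which is why restricting the allowed transitions keeps the state space tractable.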

Supplementary Figure 2 Local likelihood alignment and mutation finding

a) The local (differential) log-likelihood between the candidate and an alternate strand, showing the high degree of noise in the data. b) The cumulative sum approach applied to the data in (a) highlights the regions containing beneficial mutations, shown with green shading.
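
A cumulative sum suppresses point-to-point noise because isolated fluctuations cancel, while a sustained positive difference accumulates. The sketch below is one possible CUSUM-style detector and not necessarily the procedure used in PoreSeq: it assumes a list of per-position differential log-likelihoods (alternate minus candidate) and a hypothetical `threshold` on the accumulated gain.

```python
def beneficial_regions(diff, threshold):
    """Return (start, end, gain) spans where the running sum of `diff` rises by >= threshold.

    `end` is exclusive. A span opens at the first positive value after a reset and
    closes when the running sum falls back to zero or below; the span is reported
    up to the position where the running sum peaked.
    """
    regions = []
    start, running, peak, peak_end = None, 0.0, 0.0, None
    for i, d in enumerate(diff):
        if start is None:
            if d <= 0:
                continue  # nothing to accumulate yet
            start, running, peak, peak_end = i, 0.0, 0.0, i
        running += d
        if running > peak:
            peak, peak_end = running, i + 1
        if running <= 0:  # the gain has been lost again: close the span
            if peak >= threshold:
                regions.append((start, peak_end, peak))
            start = None
    if start is not None and peak >= threshold:
        regions.append((start, peak_end, peak))
    return regions

# Illustrative values only, not data from the paper.
diff = [0.2, -0.3, 0.1, -0.1, 1.0, 0.8, -0.2, 1.1, -1.5, -2.0, 0.3, -0.4, 0.0]
regions = beneficial_regions(diff, threshold=2.0)
print(regions)
```

Small positive blips at the start and end of the example never accumulate enough gain to pass the threshold, whereas the sustained rise in the middle is reported as a single span.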

Supplementary Figure 3 Optimizations used in PoreSeq

a) Matrix banding optimization, in which cells are calculated only in the blue band near the previous or estimated alignment (black line), while the rest of the matrix is implicitly set to 0 (white). b) Forward-backward optimization, in which the matrix is calculated in both directions so that the full alignment score can be obtained from a single column of the two matrices. To test a mutation in the orange region, only that handful of columns needs to be recalculated in the forward direction.
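
The forward-backward idea in (b) can be illustrated with a deliberately simplified stand-in. The sketch below assumes a plain global alignment score (match +1, mismatch −1, gap −1) between two base strings instead of PoreSeq's likelihood model over current levels, and it fills the full matrix rather than a band. All function names are hypothetical.

```python
MATCH, MISMATCH, GAP = 1, -1, -1

def next_column(prev_col, read, base):
    """Forward DP column after appending one candidate base to the previous column."""
    col = [prev_col[0] + GAP]
    for i in range(1, len(read) + 1):
        sub = MATCH if read[i - 1] == base else MISMATCH
        col.append(max(prev_col[i - 1] + sub,  # align read[i-1] with base
                       prev_col[i] + GAP,      # base has no partner in the read
                       col[i - 1] + GAP))      # read[i-1] has no partner in the candidate
    return col

def forward_columns(read, candidate):
    """cols[j][i] = best score aligning read[:i] with candidate[:j]."""
    cols = [[i * GAP for i in range(len(read) + 1)]]
    for base in candidate:
        cols.append(next_column(cols[-1], read, base))
    return cols

def backward_columns(read, candidate):
    """cols[j][i] = best score aligning read[i:] with candidate[j:]."""
    rev = forward_columns(read[::-1], candidate[::-1])
    return [col[::-1] for col in rev[::-1]]

def score_with_substitution(read, fwd, bwd, pos, new_base):
    """Score after replacing candidate[pos], recomputing a single forward column."""
    new_col = next_column(fwd[pos], read, new_base)  # columns 0..pos are unaffected
    return max(f + b for f, b in zip(new_col, bwd[pos + 1]))  # suffix is unaffected

read, candidate = "GATTACA", "GATCACA"
fwd = forward_columns(read, candidate)
bwd = backward_columns(read, candidate)
full_score = fwd[-1][-1]
# Every column yields the same full score when forward and backward are combined.
column_scores = [max(f + b for f, b in zip(fc, bc)) for fc, bc in zip(fwd, bwd)]
mutated_score = score_with_substitution(read, fwd, bwd, pos=3, new_base="T")
print(full_score, column_scores, mutated_score)
```

Because an optimal alignment path crosses every column, combining the forward and backward values in any one column recovers the full score. A substitution therefore costs one recomputed column here, not a full matrix refill.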

Supplementary Figure 4 Flowcell runs used in this work

Details of all MinION flowcell runs used in the manuscript. Note in particular the larger spread in error for the λ DNA run, which used an older sequencing kit (SQK-MAP003), and its wider distribution of read lengths, which results from the g-Tube shearing protocol. The λ run had 6,831 reads in total, of which 761 had 2D sequences and 700 aligned to the reference, for a total coverage of around 125×. The M13 and CS runs had pass/fail filtering that selected only 2D reads: M13 had 1,195 reads in total with 1,113 aligned, for a depth of 1,086×, while the CS run had 907 reads with 860 aligned, for an average depth of 720×. Although other MAP participants have seen better performance and higher yield from certain flowcell runs, we found that the number of reads was sufficient for our purposes, and many of the issues we encountered (bubbles in the flowcells) have since been fixed by the manufacturer.
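
The read counts and depths quoted in this caption can be cross-checked with simple arithmetic, since mean depth is approximately the total number of aligned bases divided by the reference length. The sketch below is a rough consistency check only. The reference lengths (48,502 bp for λ and 7,249 bp for M13mp18) are not stated in the caption and are assumed here; the CS reference length is omitted because it is not given.

```python
# Read counts and depths are taken from the caption; for the λ run the
# 761 reads with 2D sequences are used as the comparable starting count.
runs = {
    "lambda": {"candidate_reads": 761, "aligned": 700, "depth": 125},
    "M13": {"candidate_reads": 1195, "aligned": 1113, "depth": 1086},
    "CS": {"candidate_reads": 907, "aligned": 860, "depth": 720},
}

# ASSUMPTION: reference lengths in base pairs, not stated in the caption.
REFERENCE_LENGTH_BP = {"lambda": 48502, "M13": 7249}

def aligned_fraction(run):
    """Fraction of candidate reads that aligned to the reference."""
    return run["aligned"] / run["candidate_reads"]

def implied_mean_aligned_length(run, reference_length_bp):
    """Mean aligned length implied by depth = aligned bases / reference length."""
    return run["depth"] * reference_length_bp / run["aligned"]

for name, run in runs.items():
    line = f"{name}: {aligned_fraction(run):.1%} of reads aligned"
    if name in REFERENCE_LENGTH_BP:
        mean_len = implied_mean_aligned_length(run, REFERENCE_LENGTH_BP[name])
        line += f", implied mean aligned length ~{mean_len:.0f} bp"
    print(line)
```

Under these assumptions the implied mean aligned length for M13 comes out close to the full M13mp18 genome length, which is what near-full-length reads of that genome would give.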

Supplementary Figure 5 Error analysis results for M13mp18

a) All errors are shown (labeled as % of total), binned by the base in the de novo sequence from all trials at 50× coverage against the true base in M13. b) The fraction of all deletions of a particular base that are part of a homopolymer region, defined as 5 or more identical contiguous bases (top), and the total number of each base belonging to homopolymer regions in M13 (bottom). Note that the distribution of homopolymer-related errors in (a) and (b) is largely a result of the underlying base-specific prevalence of homopolymer regions.
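
The homopolymer definition used here (5 or more identical contiguous bases) and the binning of errors by called versus true base are both easy to express in code. The sketch below is illustrative only: the function names are hypothetical, the sequences are toy examples and not M13 data, and the error tally assumes two equal-length aligned strings with `-` marking a gap.

```python
from collections import Counter
from itertools import groupby

def homopolymer_runs(seq, min_len=5):
    """Yield (start, base, length) for each run of at least `min_len` identical bases."""
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len:
            yield pos, base, length
        pos += length

def bases_in_homopolymers(seq, min_len=5):
    """Count, per base, how many positions of `seq` lie inside homopolymer runs."""
    counts = Counter()
    for _, base, length in homopolymer_runs(seq, min_len):
        counts[base] += length
    return counts

def confusion_counts(called, truth):
    """Tally (called base, true base) pairs from two aligned strings; '-' marks a gap."""
    return Counter(zip(called, truth))

toy_reference = "ACGTTTTTTACAAAAAGGGGC"
runs_found = list(homopolymer_runs(toy_reference))
per_base = bases_in_homopolymers(toy_reference)
errors = confusion_counts("ACG-TA", "ACGTTT")
print(runs_found, dict(per_base), errors[("-", "T")])
```

Comparing the per-base homopolymer counts with the per-base deletion counts is one way to check the caption's point that the error distribution largely follows how often each base occurs in homopolymer regions.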

Supplementary information

Supplementary Text and Figures

Supplementary Notes 1–6; Supplementary Figures 1–5 (PDF 4266 kb)

Supplementary Data 1 (XLS 30 kb)

About this article

Cite this article

Szalay, T., Golovchenko, J. De novo sequencing and variant calling with nanopores using PoreSeq. Nat Biotechnol 33, 1087–1091 (2015). https://doi.org/10.1038/nbt.3360

  • DOI: https://doi.org/10.1038/nbt.3360
