Hybrid error correction and de novo assembly of single-molecule sequencing reads

Abstract

Single-molecule sequencing instruments can generate multikilobase sequences with the potential to greatly improve genome and transcriptome assembly. However, the error rates of single-molecule reads are high, which has limited their use thus far to resequencing bacteria. To address this limitation, we introduce a correction algorithm and assembly strategy that uses short, high-fidelity sequences to correct the error in single-molecule sequences. We demonstrate the utility of this approach on reads generated by a PacBio RS instrument from phage, prokaryotic and eukaryotic whole genomes, including the previously unsequenced genome of the parrot Melopsittacus undulatus, as well as for RNA-Seq reads of the corn (Zea mays) transcriptome. Our long-read correction achieves >99.9% base-call accuracy, leading to substantially better assemblies than current sequencing strategies: in the best example, the median contig size was quintupled relative to high-coverage, second-generation assemblies. Greater gains are predicted if read lengths continue to increase, including the prospect of single-contig bacterial chromosome assembly.
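
The core idea of correcting a long, noisy read with short, accurate reads can be illustrated with a deliberately simplified sketch. It is not the PBcR implementation: it assumes substitution-only errors (real single-molecule errors also include insertions and deletions), places each short read by brute-force mismatch counting rather than by overlap alignment, and uses hypothetical names (`best_offset`, `correct_long_read`, `max_mismatch_frac`) chosen only for this example.

```python
from collections import Counter


def best_offset(long_read, short_read, max_mismatch_frac=0.3):
    """Return the offset where short_read best matches long_read, or None.

    Toy placement step: tries every ungapped offset and counts mismatches.
    The threshold max_mismatch_frac is an illustrative assumption.
    """
    best, best_mm = None, None
    k = len(short_read)
    for off in range(len(long_read) - k + 1):
        window = long_read[off:off + k]
        mm = sum(a != b for a, b in zip(window, short_read))
        if best_mm is None or mm < best_mm:
            best, best_mm = off, mm
    if best is None or best_mm > max_mismatch_frac * k:
        return None  # read is too divergent (or too long) to place
    return best


def correct_long_read(long_read, short_reads):
    """Majority-vote consensus of placed short reads over the long read.

    Positions not covered by any short read keep the original base.
    """
    votes = [Counter() for _ in long_read]
    for read in short_reads:
        off = best_offset(long_read, read)
        if off is None:
            continue
        for i, base in enumerate(read):
            votes[off + i][base] += 1
    return "".join(
        v.most_common(1)[0][0] if v else original
        for v, original in zip(votes, long_read)
    )


if __name__ == "__main__":
    truth = "ACGTTGCAAGGCTTAACCGT"
    noisy = "ACGATGCAAGGCGTAACCGT"  # substitutions at positions 3 and 12
    shorts = [truth[i:i + 8] for i in (0, 2, 8, 10, 12)]
    print(correct_long_read(noisy, shorts))
```

In this toy setting every position is covered by at least one accurate short read, so the consensus recovers the error-free sequence. A gapped aligner would be needed to handle insertion and deletion errors.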

Figure 1: The PBcR single-molecule read correction and assembly method.
Figure 2: Long-reads yield assembly improvements, even at low coverage.
Figure 3: Contig sizes for various combinations of sequencing technologies.
Figure 4: Error correction of RNA-Seq data provides more accurate mapping of transcripts.

Accession codes

Primary accessions

Sequence Read Archive

Acknowledgements

We thank Pacific Biosciences, Roche 454, Illumina, BGI and the Duke Genome Center for the generation and/or release of many of the data sets examined herein, and the Assemblathon working group for the coordination and release of the parrot genome data. This publication was developed and funded in part under Agreement No. HSHQDC-07-C-00020 awarded by the US Department of Homeland Security for the management and operation of the National Biodefense Analysis and Countermeasures Center (NBACC), a Federally Funded Research and Development Center. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the US Department of Homeland Security. The Department of Homeland Security does not endorse any products or commercial services mentioned in this publication. The work conducted by the US Department of Energy Joint Genome Institute is supported by the Office of Science of the US Department of Energy under Contract No. DE-AC02-05CH11231. This work was also funded in part by the US National Institutes of Health (NIH) R01-HG006677-12 (M.C.S.), NIH 2R01GM077117-04A1 (B.P.W.), the state of Maryland (D.A.R.), National Science Foundation IOS-1032105 to W.R.M., and the Howard Hughes Medical Institute and an NIH Director's Pioneer Award to E.D.J.

Author information

S.K. and A.M.P. conceived and designed the algorithm. S.K. implemented the algorithm and carried out the de novo assembly experiments. S.K., M.C.S. and A.M.P. drafted the manuscript, ran experiments and contributed analysis. B.P.W. modified the Celera Assembler to support long sequencing reads and developed the BOGART unitigger. J.M. and Z.W. sequenced Z. mays cDNA and performed analysis. J.H., G.G. and E.D.J. sequenced M. undulatus and performed analysis of vocal learning genes. D.A.R. provided and sequenced E. coli strains. W.R.M. sequenced S. cerevisiae S228c. All authors read and approved the final manuscript.

Correspondence to Sergey Koren or Adam M Phillippy.

Ethics declarations

Competing interests

W.R.M. has participated in Illumina-sponsored meetings over the past four years and received travel reimbursement and honoraria for presenting at these events.

Supplementary information

Supplementary Text and Figures

Supplementary Tables 1–7, Supplementary Notes 1–6 and Supplementary Figures 1–16 (PDF 2414 kb)

Supplementary Dataset 1

Celera Assembler and AMOS source code utilized for PacBio correction experiments described in this publication. (ZIP 9510 kb)

Supplementary Dataset 2

Celera Assembler source code utilized for PacBio assembly experiments described in this publication. (ZIP 3700 kb)

About this article

Cite this article

Koren, S., Schatz, M., Walenz, B. et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol 30, 693–700 (2012) doi:10.1038/nbt.2280
