  • Brief Communication

TruSPAdes: barcode assembly of TruSeq synthetic long reads

Abstract

The recently introduced TruSeq synthetic long read (TSLR) technology generates long and accurate virtual reads from an assembly of barcoded pools of short reads. The TSLR method provides an attractive alternative to existing sequencing platforms that generate long but inaccurate reads. We describe the truSPAdes algorithm (http://bioinf.spbau.ru/spades) for TSLR assembly and show that it results in a dramatic improvement in the quality of metagenomics assemblies.

Figure 1: The TSLR technology.
Figure 2: Contig length.

Accession codes

Primary accessions

European Nucleotide Archive

References

  1. Chin, C.S. et al. Nat. Methods 10, 563–569 (2013).
  2. Lam, K.K., Khalak, A. & Tse, D. BMC Bioinformatics 15, S4 (2014).
  3. Koren, S. et al. Genome Biol. 14, R101 (2013).
  4. Huddleston, J. et al. Genome Res. 24, 688–696 (2014).
  5. Salmela, L. & Rivals, E. Bioinformatics 30, 3506–3514 (2014).
  6. Ummat, A. & Bashir, A. Bioinformatics 30, 3491–3498 (2014).
  7. Lam, K.-K., LaButti, K., Khalak, A. & Tse, D. Bioinformatics 31, 3207–3209 (2015).
  8. Berlin, K. et al. Nat. Biotechnol. 33, 623–630 (2015).
  9. McCoy, R.C. et al. PLoS ONE 9, e106689 (2014).
  10. Tilgner, H. et al. Nat. Biotechnol. 33, 736–742 (2015).
  11. Li, R. et al. Sci. Rep. 5, 10814 (2015).
  12. Sharon, I. et al. Genome Res. 25, 534–543 (2015).
  13. Kuleshov, V. et al. Nat. Biotechnol. 34, 64–69 (2015).
  14. Chitsaz, H. et al. Nat. Biotechnol. 29, 915–921 (2011).
  15. Bankevich, A. et al. J. Comput. Biol. 19, 455–477 (2012).
  16. Peng, Y., Leung, H.C.M., Yiu, S.M. & Chin, F.Y.L. Bioinformatics 28, 1420–1428 (2012).
  17. Compeau, P.E., Pevzner, P.A. & Tesler, G. Nat. Biotechnol. 29, 987–991 (2011).
  18. Kuleshov, V. et al. Nat. Biotechnol. 32, 261–266 (2014).
  19. Simpson, J.T. & Durbin, R. Genome Res. 22, 549–556 (2012).
  20. Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. Bioinformatics 29, 1072–1075 (2013).
  21. Peng, Y., Leung, H.C.M., Yiu, S.M. & Chin, F.Y.L. Bioinformatics 27, i94–i101 (2011).
  22. Boisvert, S., Raymond, F., Godzaridis, E., Laviolette, F. & Corbeil, J. Genome Biol. 13, R122 (2012).
  23. Haider, B. et al. Bioinformatics 30, 2717–2722 (2014).
  24. Howe, A.C. et al. Proc. Natl. Acad. Sci. USA 111, 4904–4909 (2014).
  25. Marcy, Y. et al. Proc. Natl. Acad. Sci. USA 104, 11889–11894 (2007).
  26. McLean, J.S. et al. Genome Res. 23, 867–877 (2013).
  27. Nurk, S. et al. J. Comput. Biol. 20, 714–737 (2013).
  28. Myers, E.W. et al. Science 287, 2196–2204 (2000).
  29. Treangen, T.J. et al. Genome Biol. 14, R2 (2013).
  30. Peters, B.A., Liu, J. & Drmanac, R. Front. Genet. 5, 466 (2015).
  31. Dean, F.B., Nelson, J.R., Giesler, T.L. & Lasken, R.S. Genome Res. 11, 1095–1099 (2001).
  32. Lasken, R.S. & Stockwell, T.B. BMC Biotechnol. 7, 19 (2007).
  33. Zerbino, D.R. & Birney, E. Genome Res. 18, 821–829 (2008).
  34. Simpson, J.T. et al. Genome Res. 19, 1117–1123 (2009).
  35. Prjibelski, A. et al. Bioinformatics 30, 293–301 (2014).
  36. Zimin, A.V., Smith, D.R., Sutton, G. & Yorke, J.A. Bioinformatics 24, 42–45 (2008).
  37. Vasilinetc, I., Prjibelski, A.D., Gurevich, A., Korobeynikov, A. & Pevzner, P.A. Bioinformatics 30, 293–301 (2015).
  38. Antipov, D., Korobeynikov, A., McLean, J.S. & Pevzner, P.A. Bioinformatics doi:10.1093/bioinformatics/btv688 (2015).
  39. Ashton, P.M. et al. Nat. Biotechnol. 33, 296–300 (2015).

Acknowledgements

We are indebted to V. Montel, J. Stuzka and O. Schulz-Trieglaff at Illumina for many helpful discussions, sample preparation and TSLR data. We thank J. Banfield and I. Sharon for providing their metagenomics TSLR data. This study was supported by the Russian Science Foundation (grant 14-50-00069 to A.B. and P.A.P.).

Author information

Contributions

A.B. developed and implemented the truSPAdes algorithm and performed benchmarking. A.B. and P.A.P. conceived the study, designed the computational experiments and wrote the manuscript.

Corresponding author

Correspondence to Anton Bankevich.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 k-mer coverage histograms.

Histograms of k-mer coverage (k = 55) for the E. coli standard isolate dataset from Bankevich et al.16 (a), the E. coli MDA-amplified single cell dataset from Bankevich et al.16 (b), one of the barcodes of TSLR data (c) and a single 10 Kb long fragment of a barcode (d). Conventional assemblers select a coverage threshold to separate correct from erroneous k-mers. The histogram for data from the standard isolate features a smaller peak on the left (formed by largely erroneous k-mers with low coverage) and a larger peak on the right (formed by largely correct k-mers with high coverage). Thus, one can choose a proper threshold that separates correct from false k-mers38. However, for both MDA and TSLR, there is no threshold separating correct and false k-mers.
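
The thresholding idea in this caption can be illustrated with a minimal Python sketch. It is not the truSPAdes implementation: the function names (kmer_counts, coverage_histogram, valley_threshold) and the toy histograms are hypothetical, and the valley rule (first coverage value where the histogram stops falling and later rises again) is an assumed simplification of what conventional assemblers do.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count how many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def coverage_histogram(counts):
    """Map a coverage value to the number of distinct k-mers with that coverage."""
    return Counter(counts.values())

def valley_threshold(hist):
    """Return the first coverage value that sits in a valley between two peaks.

    Returns None when the histogram never rises again after its initial
    decline, i.e. when no threshold separates low- from high-coverage k-mers.
    """
    if not hist:
        return None
    # values[i] is the number of k-mers with coverage i + 1
    values = [hist.get(c, 0) for c in range(1, max(hist) + 1)]
    for i in range(1, len(values) - 1):
        falls = values[i] < values[i - 1]
        bottom = values[i] <= values[i + 1]
        rises_later = values[i] < max(values[i + 1:])
        if falls and bottom and rises_later:
            return i + 1
    return None

# Toy histograms (invented numbers, not data from the paper).
bimodal = Counter({1: 50, 2: 10, 3: 2, 10: 5, 11: 20, 12: 30, 13: 18})
decaying = Counter({1: 50, 2: 20, 3: 5})
```

On the bimodal toy histogram the function finds a valley, which mirrors the standard isolate case. On the monotonically decaying one it returns None, which is the situation the caption describes for MDA and TSLR data.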

Supplementary Figure 2 Barcode span.

Construction of the barcode span: red regions have rather uniform read coverage and length close to 10 Kb. Black reads that do not belong to the selected barcode spans represent read mapping artifacts and are ignored.
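
As a rough illustration of this construction, the Python sketch below merges overlapping read alignments from one barcode into candidate spans, keeps spans whose length is near 10 kb, and sets aside reads that fall outside them. Everything here is an assumption made for the example: the function names, the half-open (start, end) interval representation, and the 8,000 to 12,000 bp length window are hypothetical, and the uniformity of coverage mentioned in the caption is not checked.

```python
def barcode_spans(alignments, min_len=8000, max_len=12000):
    """Merge overlapping (start, end) alignments and keep spans of plausible length.

    The length window is a hypothetical stand-in for "close to 10 Kb".
    """
    merged = []
    for start, end in sorted(alignments):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous span: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged if min_len <= e - s <= max_len]

def split_reads(alignments, spans):
    """Separate reads contained in a selected span from those to be ignored."""
    kept, ignored = [], []
    for start, end in alignments:
        inside = any(s <= start and end <= e for s, e in spans)
        (kept if inside else ignored).append((start, end))
    return kept, ignored

# Toy alignments: three overlapping reads tile a 10 kb region, one is isolated.
alignments = [(1000, 4000), (3500, 7000), (6500, 11000), (50000, 50150)]
spans = barcode_spans(alignments)
kept, ignored = split_reads(alignments, spans)
```

In this toy input the isolated read plays the role of a black read in the figure: it lies outside every selected span and is treated as a mapping artifact.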

Supplementary Figure 3 Typical misassemblies.

Two common types of misassemblies: false (a,b,c) and chimeric (d,e,f) connections. (a) Two unrelated instances of the blue repeat are located in red (left) and yellow (right) genome fragments. These instances are flanked by short dotted segments. (b) These short dotted segments correspond to short dotted edges (tips) in the de Bruijn graph. (c) Tip trimming results in a single (misassembled) edge in the de Bruijn graph representing a false connection. (d) A region of the genome formed by consecutive yellow and green segments. (e) Since the yellow fragment has been erroneously amplified from the opposite strand, the reverse complementary copy is added to the end of this region, resulting in a chimeric fragment. (f) In the de Bruijn graph, the corresponding yellow solid edge has two outgoing edges: one for each connection between the yellow and green parts of the genome fragment. One of these connections represents an erroneous chimeric connection (transition from solid yellow to dashed green). We note that our explanation for the experimental cause of the chimeric connection is just a hypothesis that accurately reflects the computational artifacts we observe.

Supplementary Figure 4 Iterative assembly.

A fragment of a genome along with four reads (1st panel) and de Bruijn graphs of these reads constructed for k = 3 (2nd panel), k = 4 (3rd panel), and k = 5 (4th panel). The parameter k = 4 represents the “sweet spot” in the iterative assembly since the de Bruijn graph for k = 3 is over-tangled while the de Bruijn graph for k = 5 is over-fragmented.
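
The trade-off in this caption can be reproduced on a toy example. The Python sketch below is not taken from the paper: the genome fragment ACGTTCGA and its three reads are invented for illustration (the figure uses four reads that are not reproduced here), reverse complements are ignored, and the graph convention (edges are k-mers, nodes are their (k-1)-mer prefixes and suffixes) is an assumption. The helper names de_bruijn and summarize are hypothetical.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph: each distinct k-mer is an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
            graph.setdefault(kmer[1:], set())  # make sure sink nodes exist
    return graph

def summarize(graph):
    """Return (number of branching nodes, number of weakly connected components)."""
    indeg = defaultdict(int)
    undirected = {node: set(targets) for node, targets in graph.items()}
    for src, targets in graph.items():
        for dst in targets:
            indeg[dst] += 1
            undirected[dst].add(src)
    branching = sum(
        1 for node in graph if len(graph[node]) > 1 or indeg[node] > 1
    )
    seen, components = set(), 0
    for node in undirected:
        if node in seen:
            continue
        components += 1
        stack = [node]
        while stack:  # depth-first traversal of one component
            cur = stack.pop()
            if cur not in seen:
                seen.add(cur)
                stack.extend(undirected[cur] - seen)
    return branching, components

# Toy reads from the invented fragment ACGTTCGA; CG is a repeat of length 2.
reads = ["ACGTT", "GTTCG", "TTCGA"]
stats = {k: summarize(de_bruijn(reads, k)) for k in (3, 4, 5)}
```

For these reads, k = 3 yields one branching node (the repeated CG), k = 4 yields a single unbranched path, and k = 5 splits the graph into two components because the first two reads overlap by only three bases. This matches the over-tangled, sweet-spot and over-fragmented cases described above.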

Supplementary Figure 5 TruSPAdes pseudocode.

Outline of the truSPAdes pipeline. TruSPAdes-specific modifications are highlighted in blue.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–5 (PDF 598 kb)

About this article

Cite this article

Bankevich, A., Pevzner, P. TruSPAdes: barcode assembly of TruSeq synthetic long reads. Nat Methods 13, 248–250 (2016). https://doi.org/10.1038/nmeth.3737

  • DOI: https://doi.org/10.1038/nmeth.3737
