  • Brief Communication

Time- and memory-efficient genome assembly with Raven

A preprint version of the article is available at bioRxiv.

Abstract

Whole-genome sequencing technologies cannot always read DNA molecules intact, a shortcoming that assemblers try to resolve by stitching the obtained fragments back together. Here, we present methods for improving de novo genome assembly from erroneous long reads, incorporated into a tool called Raven. Raven maintains similar performance for various genomes and has accuracy on par with other assemblers that support third-generation sequencing data. It is one of the fastest options while having the lowest memory consumption on the majority of benchmarked datasets.

Fig. 1: Bacterial assembly graph drawn with the force-directed placement algorithm.

Data availability

The ONT dataset for A. thaliana is available under accession no. ERR2173373, for D. melanogaster under SRR6702603, for H. sapiens NA12878 at https://github.com/nanopore-wgs-consortium/NA12878 (release 6), for H. sapiens CHM13 at https://github.com/marbl/CHM13 (release 6), for H. sapiens HG002 at https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/Ultralong_OxfordNanopore/guppy-V3.4.5/ and for H. sapiens HG00733 at https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=NHGRI_UCSC_panel/HG00733/nanopore/. The PacBio CLR dataset for A. thaliana is available at https://downloads.pacbcloud.com/public/SequelData/ArabidopsisDemoData/, for D. melanogaster under accession no. SRR5439404, for H. sapiens CHM13 at https://github.com/marbl/CHM13 (extracted from draft v1.0 bam), for H. sapiens HG002 at https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/PacBio_MtSinai_NIST/PacBio_fasta/ and for H. sapiens HG00733 under SRR7615963. The PacBio HiFi dataset for H. sapiens CHM13 is available from accession nos. SRR11292120–SRR11292123, for H. sapiens HG002 under SRR10382244, SRR10382245, SRR10382248 and SRR10382249, and for H. sapiens HG00733 under ERX3831682. Illumina reads for yak evaluation are available from accession nos. SRX1049768–SRX1049782 for H. sapiens NA12878, from https://github.com/marbl/CHM13 (extracted from draft v1.0 bam) for H. sapiens CHM13, from https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/NHGRI_Illumina300X_AJtrio_novoalign_bams/ (extracted from 60x bam) for H. sapiens HG002 and under accession no. SRR7782677 for H. sapiens HG00733. ONT plant datasets are available under accession nos. ERR2564160–ERR2564170 for B. rapa, from ERR2564373–ERR2564382 for B. oleracea, from ERR2571286–ERR2571303 for M. schizocarpa, from ERR3476478–ERR3476482 for O. sativa basmati 334 and from ERR3476463–ERR3476466 for O. sativa dom sufid. All generated assemblies in this research are available at Zenodo (ref. 26).
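Several datasets above are given as accession ranges (for example ERR2564160–ERR2564170), which usually need to be expanded into individual run accessions before download. The following is a minimal Python sketch of that expansion, assuming each range is inclusive and its numeric suffixes are consecutive. The function name `expand_accession_range` is hypothetical and is not part of Raven or of any archive's tooling.

```python
import re
from typing import List


def expand_accession_range(span: str) -> List[str]:
    """Expand a range such as 'ERR2564160–ERR2564170' into single accessions.

    Assumption: the range is inclusive and both ends share the same
    alphabetic prefix (e.g. ERR, SRR, SRX).
    """
    # Accept either an en dash (as printed in the article) or a plain hyphen.
    parts = re.split(r"[–-]", span.strip())
    if len(parts) != 2:
        raise ValueError(f"expected exactly one range separator in {span!r}")

    first = re.fullmatch(r"([A-Z]+)(\d+)", parts[0].strip())
    last = re.fullmatch(r"([A-Z]+)(\d+)", parts[1].strip())
    if first is None or last is None:
        raise ValueError(f"unrecognised accession format in {span!r}")
    if first.group(1) != last.group(1):
        raise ValueError(f"prefix mismatch in {span!r}")

    prefix = first.group(1)
    width = len(first.group(2))  # keep any zero padding of the numeric part
    start, end = int(first.group(2)), int(last.group(2))
    if end < start:
        raise ValueError(f"range end precedes range start in {span!r}")

    return [f"{prefix}{number:0{width}d}" for number in range(start, end + 1)]


if __name__ == "__main__":
    # B. rapa ONT runs listed in the data availability statement.
    for accession in expand_accession_range("ERR2564160–ERR2564170"):
        print(accession)
```

The resulting list can then be passed to whichever download client is preferred; the sketch itself performs no network access.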

Code availability

The Raven source code is available under an MIT license on GitHub at https://github.com/lbcb-sci/raven. Source code for version 1.3.0 used in this manuscript is also available at Zenodo (ref. 27).

References

  1. Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).

  2. Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).

  3. Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).

  4. Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110 (2016).

  5. Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat. Biotechnol. 38, 1044–1053 (2020).

  6. Ruan, J. & Li, H. Fast and accurate long-read assembly with wtdbg2. Nat. Methods 17, 155–158 (2020).

  7. Kamath, G. M., Shomorony, I., Xia, F., Courtade, T. A. & Tse, D. N. HINGE: long-read assembly achieves optimal repeat resolution. Genome Res. 27, 747–756 (2017).

  8. Vaser, R., Sović, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).

  9. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).

  10. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).

  11. Broder, A. Z. On the resemblance and containment of documents. In Proc. Compression and Complexity of SEQUENCES 1997 (cat. no. 97TB100171) (eds. Carpentieri, B. et al.) 21–29 (IEEE, 1997); https://doi.org/10.1109/SEQUEN.1997.666900

  12. Jain, C., Dilthey, A., Koren, S., Aluru, S. & Phillippy, A. M. A fast approximate algorithm for mapping long reads to large reference databases. In Research in Computational Molecular Biology (ed. Sahinalp, S. C.) 66–81 (Springer, 2017).

  13. Chin, C.-S. & Khalak, A. Human genome assembly in 100 minutes. Preprint at bioRxiv https://doi.org/10.1101/705616 (2019).

  14. Fruchterman, T. M. J. & Reingold, E. M. Graph drawing by force-directed placement. Softw. Pract. Exp. 21, 1129–1164 (1991).

  15. Barnes, J. & Hut, P. A hierarchical O(NlogN) force-calculation algorithm. Nature 324, 446–449 (1986).

  16. Wick, R. R. & Holt, K. E. Benchmarking of long-read assemblers for prokaryote whole genome sequencing. F1000Res. 8, 2138 (2020).

  17. Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).

  18. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).

  19. Belser, C. et al. Chromosome-scale assemblies of plant genomes using nanopore long reads and optical maps. Nat. Plants 4, 879–887 (2018).

  20. Choi, J. Y. et al. Nanopore sequencing-based genome assembly and evolutionary genomics of circum-basmati rice. Genome Biol. 21, 21 (2020).

  21. Vaser, R. & Šikić, M. Yet another de novo genome assembler. In Proc. 2019 11th International Symposium on Image and Signal Processing and Analysis (ISPA) (eds. Lončarić, S. et al.) 147–151 (IEEE, 2019); https://doi.org/10.1109/ISPA.2019.8868909

  22. Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).

  23. Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).

  24. Mikheenko, A., Prjibelski, A., Saveliev, V., Antipov, D. & Gurevich, A. Versatile genome assembly evaluation with QUAST-LG. Bioinformatics 34, i142–i150 (2018).

  25. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

  26. Vaser, R. & Sikic, M. Assemblies generated in the manuscript ‘Time and memory efficient genome assembly with Raven’. Zenodo https://doi.org/10.5281/zenodo.4443062 (2021).

  27. Vaser, R. & Sikic, M. Raven source code used in the manuscript ‘Time and memory efficient genome assembly with Raven’. Zenodo https://doi.org/10.5281/zenodo.4672196 (2021).

Acknowledgements

This work has been supported in part by the Croatian Science Foundation under the project ‘Single genome and metagenome assembly’ (IP-2018-01-5886, to M.Š.), the European Regional Development Fund under grant no. KK.01.1.1.01.0009 (DATACROSS, to M.Š.) and the A*STAR Computational Resource Centre through the use of their high-performance computing facilities. R.V. and M.Š. have been partially supported by funding from A*STAR, Singapore. We acknowledge Intel Corporation for allowing us to test with the Intel Optane persistent memory server and providing us with high-quality technical support. Finally, we thank G. Žužić from Carnegie Mellon University for valuable discussions about graph drawings.

Author information

Contributions

M.Š. devised the project. R.V. designed and implemented Raven, and benchmarked it with other assemblers. Both authors drafted and revised the manuscript.

Corresponding author

Correspondence to Mile Šikić.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Computational Science thanks the anonymous reviewers for their contribution to the peer review of this work. Handling editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Figs. 1–4 and Tables 1–5.

About this article

Cite this article

Vaser, R., Šikić, M. Time- and memory-efficient genome assembly with Raven. Nat Comput Sci 1, 332–336 (2021). https://doi.org/10.1038/s43588-021-00073-4

  • DOI: https://doi.org/10.1038/s43588-021-00073-4
