metaFlye: scalable long-read metagenome assembly using repeat graphs

Abstract

Long-read sequencing technologies have substantially improved the assemblies of many isolate bacterial genomes as compared to fragmented short-read assemblies. However, assembling complex metagenomic datasets remains difficult even for state-of-the-art long-read assemblers. Here we present metaFlye, which addresses important long-read metagenomic assembly challenges, such as uneven bacterial composition and intra-species heterogeneity. First, we benchmarked metaFlye using simulated and mock bacterial communities and show that it consistently produces assemblies with better completeness and contiguity than state-of-the-art long-read assemblers. Second, we performed long-read sequencing of the sheep microbiome and applied metaFlye to reconstruct 63 complete or nearly complete bacterial genomes within single contigs. Finally, we show that long-read assembly of human microbiomes enables the discovery of full-length biosynthetic gene clusters that encode biomedically important natural products.

Fig. 1: metaFlye repeat annotation and examples of simple bubbles, superbubbles and roundabouts.
Fig. 2: Comparison of Canu, Flye, metaFlye, miniasm and wtdbg2 assemblies of the individual genomes in the SYNTH181 dataset.
Fig. 3: Per-species reference coverage and NGA50 statistics for the mock community datasets (HMP, ZymoEven GridION and ZymoLog GridION) computed using metaQUAST.
Fig. 4: Information about strains in the sheep microbiome revealed by metaFlye.

Data availability

Sequencing data for the sheep gut sample are available under the NCBI BioProject PRJNA595610. HMP mock dataset is available at: https://github.com/PacificBiosciences/DevNet/wiki/Human_Microbiome_Project_MockB_Shotgun. Zymo datasets are at: https://github.com/LomanLab/mockcommunity. Cow rumen dataset is at: NCBI SRA repository under BioProject PRJNA507739. Human stool samples are at: ENA project PRJEB29152. NCBI accession codes for the sequences used in the NRPS analysis are: AM229678.1, AB101202.1, FP929054.1 and FP929054.1. All assemblies that were evaluated in this study, as well as SYNTH64 and SYNTH181 datasets are available at: https://doi.org/10.5281/zenodo.3986210 (ref. 66).

Code availability

metaFlye is freely available as a part of the Flye package at: https://github.com/fenderglass/Flye. The pbclip tool for PacBio subread splitting is available from https://github.com/fenderglass/pbclip.

References

  1. Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338 (2018).
  2. Miga, K. H. et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature https://doi.org/10.1038/s41586-020-2547-7 (2020).
  3. Tsai, Y. C. et al. Resolving the complexity of human skin metagenomes using single-molecule sequencing. MBio 7, e01948–15 (2016).
  4. Driscoll, C. B., Otten, T. G., Brown, N. M. & Dreher, T. W. Towards long-read metagenomics: complete assembly of three novel genomes from bacteria dependent on a diazotrophic cyanobacterium in a freshwater lake co-culture. Stand. Genom. Sci. 12, 9 (2017).
  5. Nicholls, S. M., Quick, J. C., Tang, S. & Loman, N. J. Ultra-deep, long-read nanopore sequencing of mock microbial community standards. GigaScience 8, 1–9 (2019).
  6. Bertrand, D. et al. Hybrid metagenomic assembly enables high-resolution analysis of resistance determinants and mobile elements in human microbiomes. Nat. Biotechnol. 37, 937–944 (2019).
  7. Somerville, V. et al. Long read-based de novo assembly of low complex metagenome samples results in finished genomes and reveals insights into strain diversity and an active phage system. BMC Microbiol. 19, 143 (2019).
  8. Moss, E. L., Maghini, D. G. & Bhatt, A. S. Complete, closed bacterial genomes from microbiomes using nanopore sequencing. Nat. Biotechnol. 38, 701–707 (2020).
  9. Stewart, R. D. et al. Compendium of 4,941 rumen metagenome-assembled genomes for rumen microbiome biology and enzyme discovery. Nat. Biotechnol. 37, 953–961 (2019).
  10. Arumugam, K. et al. Annotated bacterial chromosomes from frame-shift-corrected long read metagenomic data. Microbiome 7, 61 (2019).
  11. Hiraoka, S. et al. Metaepigenomic analysis reveals the unexplored diversity of DNA methylation in an environmental prokaryotic community. Nat. Commun. 10, 159 (2019).
  12. Bickhart, D. M. et al. Assignment of virus and antimicrobial resistance genes to microbial hosts in a complex microbial community by combined long-read assembly and proximity ligation. Genome Biol. 20, 1–18 (2019).
  13. Chin, C. S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).
  14. Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110 (2016).
  15. Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).
  16. Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
  17. Ruan, J. & Li, H. Fast and accurate long-read assembly with wtdbg2. Nat. Methods 17, 155–158 (2020).
  18. Li, D., Liu, C. M., Luo, R., Sadakane, K. & Lam, T. W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31, 1674–1676 (2015).
  19. Nurk, S., Meleshko, D., Korobeynikov, A. & Pevzner, P. A. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 27, 824–834 (2017).
  20. Truong, D. T., Tett, A., Pasolli, E., Huttenhower, C. & Segata, N. Microbial strain-level population structure and genetic diversity from metagenomes. Genome Res. 27, 626–638 (2017).
  21. Ghurye, J., Treangen, T., Fedarko, M., Hervey, W. J. & Pop, M. MetaCarvel: linking assembly graph motifs to biological variants. Genome Biol. 20, 174 (2019).
  22. Goltsman, D. S. A. et al. Metagenomic analysis with strain-level resolution reveals fine-scale variation in the human pregnancy microbiome. Genome Res. 28, 1467–1480 (2018).
  23. Guo, J. et al. Horizontal gene transfer in an acid mine drainage microbial community. BMC Genomics 16, 496 (2015).
  24. Eloe-Fadrosh, E. A. et al. Global metagenomic survey reveals a new bacterial candidate phylum in geothermal springs. Nat. Commun. 7, 10476 (2016).
  25. Suzuki, Y. et al. Long-read metagenomic exploration of extrachromosomal mobile genetic elements in the human gut. Microbiome 7, 119 (2019).
  26. Stevenson, L. J., Owen, J. G. & Ackerley, D. F. Metagenome driven discovery of nonribosomal peptides. ACS Chem. Biol. 14, 2115–2126 (2019).
  27. Nijkamp, J. F., Pop, M., Reinders, M. J. T. & de Ridder, D. Exploring variation-aware contig graphs for (comparative) metagenomics using MaryGold. Bioinformatics 29, 2826–2834 (2013).
  28. Onodera, T., Sadakane, K. & Shibuya, T. Detecting superbubbles in assembly graphs. In International Workshop on Algorithms in Bioinformatics, 338–348 (Springer, 2013).
  29. Garg, S. et al. A haplotype-aware de novo assembly of related individuals using pedigree sequence graph. Bioinformatics 36, 2385–2392 (2020).
  30. Sczyrba, A. et al. Critical assessment of metagenome interpretation - a benchmark of metagenomics software. Nat. Methods 14, 1063–1071 (2017).
  31. Jain, C., Rodriguez-R, L. M., Phillippy, A. M., Konstantinidis, K. T. & Aluru, S. High throughput ANI analysis of 90 K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9, 1–8 (2018).
  32. Wick, R. Badread: simulation of error-prone long reads. J. Open Source Softw. 4, 1316 (2019).
  33. Mikheenko, A., Prjibelski, A., Saveliev, V., Antipov, D. & Gurevich, A. Versatile genome assembly evaluation with QUAST-LG. Bioinformatics 34, i142–i150 (2018).
  34. Vaser, R., Sović, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).
  35. Antipov, D., Raiko, M., Lapidus, A. & Pevzner, P. A. Plasmid detection and assembly in genomic and metagenomic data sets. Genome Res. 29, 961–968 (2019).
  36. Latorre-Pérez, A., Villalba-Bermell, P., Pascual, J. & Vilanova, C. Assembly methods for nanopore-based metagenomic sequencing: a comparative study. Sci. Rep. 10, 1–14 (2020).
  37. Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
  38. Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinf. 11, 119 (2010).
  39. Laetsch, D. R. & Blaxter, M. L. BlobTools: interrogation of genome assemblies. F1000Research 6, 1287 (2017).
  40. Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).
  41. UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 43, D204–D212 (2014).
  42. Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE 9, e112963 (2014).
  43. Minkin, I. & Medvedev, P. Scalable multiple whole-genome alignment and locally collinear block construction with SibeliaZ. Preprint at bioRxiv https://doi.org/10.1101/548123 (2019).
  44. Kersten, R. D. et al. A mass spectrometry-guided genome mining approach for natural product peptidogenomics. Nat. Chem. Biol. 7, 794–802 (2011).
  45. Ling, L. L. et al. A new antibiotic kills pathogens without detectable resistance. Nature 517, 455–459 (2015).
  46. Meleshko, D. et al. BiosyntheticSPAdes: reconstructing biosynthetic gene clusters from assembly graphs. Genome Res. 29, 1352–1362 (2019).
  47. Behsaz, B. et al. De novo peptide sequencing reveals many cyclopeptides in the human gut and other environments. Cell Syst. 10, 99–108 (2020).
  48. Wilson, M. R. et al. The human gut bacterial genotoxin colibactin alkylates DNA. Science 363, eaar7785 (2019).
  49. Mohimani, H. & Pevzner, P. A. Dereplication, sequencing and identification of peptidic natural products: from genome mining to peptidogenomics to spectral networks. Nat. Prod. Rep. 33, 73–86 (2016).
  50. Quast, C. et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 41, D590–D596 (2012).
  51. Wick, R. R., Schultz, M. B., Zobel, J. & Holt, K. E. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics 31, 3350–3352 (2015).
  52. Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90 (2007).
  53. Dolev, S., Ghanayim, M., Binun, B., Frenkel, S. & Sun, Y. S. Relationship of Jaccard and edit distance in malware clustering and online identification. In 2017 IEEE 16th International Symposium on Network Computing and Applications (NCA), 1–5 (IEEE, 2017).
  54. Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
  55. Rognes, T., Flouri, T., Nichols, B., Quince, C. & Mahé, F. VSEARCH: a versatile open source tool for metagenomics. PeerJ 4, e2584 (2016).
  56. Pruitt, K. D., Tatusova, T. & Maglott, D. R. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 35, D61–D65 (2007).
  57. Li, X., Andersen, D. G., Kaminsky, M. & Freedman, M. J. Algorithmic improvements for fast concurrent cuckoo hashing. In Proceedings of the Ninth European Conference on Computer Systems, 27 (ACM, 2014).
  58. Jiang, Z. et al. Ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution. Nat. Genet. 39, 1361–1368 (2007).
  59. Bankevich, A. & Pevzner, P. A. mosaicFlye: resolving long mosaic repeats using long error-prone reads. Preprint at bioRxiv https://doi.org/10.1101/2020.01.15.908285 (2020).
  60. Koren, S., Treangen, T. J. & Pop, M. Bambus 2: scaffolding metagenomes. Bioinformatics 27, 2964–2971 (2011).
  61. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
  62. Nurk, S. et al. Assembling genomes and mini-metagenomes from highly chimeric reads. J. Comput. Biol. 20, 714–737 (2013).
  63. Brankovic, L. et al. Linear-time superbubble identification algorithm for genome assembly. Theor. Comput. Sci. 609, 374–383 (2016).
  64. Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008).
  65. Paten, B. et al. Superbubbles, ultrabubbles, and cacti. J. Comput. Biol. 25, 649–663 (2018).
  66. Supporting data for the manuscript “metaFlye: scalable long-read metagenome assembly using repeat graphs” (version 3.0) (Dataset). Zenodo https://doi.org/10.5281/zenodo.3986210 (2020).

Acknowledgements

We are grateful to Denis Bertrand and Niranjan Nagarajan for sharing the metagenomic datasets before journal publication. M.K. and P.A.P. were supported by the NSF/MCB-BSF grant 1715911. B.B. was supported by the US National Institutes of Health grant 2-P41-GM103484. D.B. was funded by USDA CRIS project 5090-31000-026-00-D and K.K., S.S. and T.S. by project 3040-31000-100-00D. A.G. and M.R. were supported by the Russian Science Foundation (grant 19-16-00049). Computational resources were provided in part by the Research Park Computer Center at St. Petersburg State University.

Author information

Contributions

M.K., J.Y. and P.P. developed the metaFlye concept. M.K. implemented and maintained metaFlye. E.P. implemented the short plasmid analysis module. D.B., S.B.S., K.K. and T.S. performed sheep gut sequencing. M.K., D.B., A.G. and M.R. benchmarked metaFlye and analyzed results. A.G. and M.K. performed analysis of synthetic datasets. M.R. analyzed plasmid and virus content. B.B. performed analysis of biosynthetic gene clusters. M.K., D.B., B.B., A.G., M.R., T.S. and P.P. edited the manuscript. P.P. supervised the project. All authors read and approved the manuscript.

Corresponding author

Correspondence to Pavel A. Pevzner.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Lei Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Information about metaFlye, Flye, Canu, miniasm, and wtdbg2 assemblies of the individual genomes in the SYNTH64 dataset.

NGA50 (in megabases) and reference coverage (in percentages) reported for all genomes from the SYNTH64 dataset. Genomes are ordered by increasing mean NGA50 across all assemblers. Challenging genomes that have closely related species or strains in the metagenome are marked with (!). Grey bars on the NGA50 plot represent the length of the longest chromosome in the reference sequence for each genome (a theoretical upper bound for NGA50). NGA50 is shown on a logarithmic scale (not shown for values lower than 100 kb or if the reference coverage is below 50%). The full metaQUAST report for the SYNTH64 dataset is provided in Supplementary Table 1.

Extended Data Fig. 2 NGAx plots for the mock community datasets (HMP mock, ZymoEven GridION, ZymoLog GridION).

NGA(x) is computed on contigs that have been broken at their misassembly breakpoints (if any). It is the largest length L such that the broken contigs of length at least L together cover at least x% of the reference. Plots were generated by metaQUAST using all available references for each dataset. Flye failed to assemble the ZymoLog datasets due to poor k-mer indexing (Methods).
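The NGA(x) definition above can be illustrated with a minimal Python sketch. It is not the metaQUAST implementation: the function name `nga` and its inputs are hypothetical, and it assumes the contigs have already been broken at misassembly breakpoints and that the resulting blocks do not overlap on the reference, so that summed block length stands in for reference coverage.

```python
def nga(block_lengths, reference_length, x=50.0):
    """Return the largest L such that blocks of length >= L sum to at
    least x% of the reference, or None if the blocks never reach x%.

    block_lengths: lengths of contigs already broken at misassemblies
    (assumed non-overlapping on the reference).
    """
    threshold = reference_length * x / 100.0
    covered = 0
    # Walk from the longest block down; the block that first pushes the
    # running total past the threshold defines L.
    for length in sorted(block_lengths, reverse=True):
        covered += length
        if covered >= threshold:
            return length
    return None


blocks = [400, 300, 200, 100]
nga50 = nga(blocks, reference_length=1000, x=50)  # 400 + 300 reaches 50%
```

Returning `None` when the blocks never reach x% mirrors the captions' convention of leaving the statistic undefined for poorly covered references.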

Extended Data Fig. 3 Base-pair accuracy analysis for assemblies of the mock community datasets (HMP, ZymoEven GridION, and ZymoLog GridION).

Heatmaps showing the number of mismatches and short indels per 100 kbp for each species reference, computed using metaQUAST. Blue and red colors correspond to the values higher and lower than the median, respectively. Statistics were not computed for genomes with no assembled sequence (“-” symbol). Flye failed to assemble the ZymoLog datasets due to poor k-mer indexing (Methods).

Extended Data Fig. 4 The ORF length distribution and the GC content distribution of metaFlye and Canu assemblies of the sheep microbiome.

The ORF length distribution suggests similar base-level accuracy for both assemblies.

Extended Data Fig. 5 Taxonomic assignments of sheep microbiome assemblies.

a, Assignment of metaFlye contigs at the phylum level, visualized with BlobTools. b, Length distributions of metaFlye and Canu contigs within each assigned superkingdom.

Extended Data Fig. 6 Statistics of simple bubbles for the metaFlye assemblies of the human gut and cow rumen datasets.

(Left) the human gut dataset with 615 bubbles, and (right) the cow rumen dataset with 1510 bubbles. Bubble counts exclude loops, and include roundabouts with two edges.

Extended Data Fig. 7 Analysis of sequence overlap between 19 human gut samples.

Multi-way sequence alignments were computed using SibeliaZ. (left) The proportions of unique and shared sequences in each sample. An assembled segment within a sample is called unique if it has no alignments to sequences from any other sample. Otherwise, the segment is shared. (right) The total amount of sequence for each multiplicity bin. A sequence fragment belongs to the multiplicity bin X if it is shared by exactly X samples.
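The unique/shared and multiplicity-bin definitions above can be illustrated with a short Python sketch. This is an assumption-laden toy, not the authors' pipeline: it presumes the multi-way alignment has already been reduced to a list of (length, set of sample names) pairs, and the function names and sample labels are hypothetical.

```python
from collections import defaultdict


def multiplicity_bins(segments):
    """Sum segment lengths by the number of samples containing each segment."""
    totals = defaultdict(int)
    for length, samples in segments:
        totals[len(samples)] += length
    return dict(totals)


def unique_fraction(segments, sample):
    """Fraction of one sample's sequence that aligns to no other sample."""
    own = [(length, s) for length, s in segments if sample in s]
    total = sum(length for length, _ in own)
    unique = sum(length for length, s in own if len(s) == 1)
    return unique / total if total else 0.0


# Hypothetical toy input: segment length in bp and the samples it occurs in.
segments = [
    (1000, {"A"}),
    (500, {"A", "B"}),
    (250, {"A", "B", "C"}),
    (750, {"C"}),
]
bins = multiplicity_bins(segments)
```

In this toy input, bin 1 holds the unique sequence and bins 2 and 3 hold sequence shared by exactly two and three samples.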

Supplementary information

Supplementary Information

Supplementary Tables 4–10 and Notes 1–8

Reporting Summary

Supplementary Table 1

Detailed information about metaFlye, Flye, Canu, miniasm and wtdbg2 assemblies of the SYNTH64 dataset

Supplementary Table 2

Detailed information about metaFlye, Flye, Canu, miniasm and wtdbg2 assemblies of the SYNTH181 dataset

Supplementary Table 3

Detailed information about metaFlye, Flye, Canu, miniasm and wtdbg2 assemblies of all mock community datasets

About this article

Cite this article

Kolmogorov, M., Bickhart, D.M., Behsaz, B. et al. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat Methods (2020). https://doi.org/10.1038/s41592-020-00971-x
