Generating lineage-resolved, complete metagenome-assembled genomes from complex microbial communities

Abstract

Microbial communities might include distinct lineages of closely related organisms that complicate metagenomic assembly and prevent the generation of complete metagenome-assembled genomes (MAGs). Here we show that deep sequencing using long (HiFi) reads combined with Hi-C binning can address this challenge even for complex microbial communities. Using existing methods, we sequenced the sheep fecal metagenome and identified 428 MAGs with more than 90% completeness, including 44 MAGs in single circular contigs. To resolve closely related strains (lineages), we developed MAGPhase, which separates lineages of related organisms by discriminating variant haplotypes across hundreds of kilobases of genomic sequence. MAGPhase identified 220 lineage-resolved MAGs in our dataset. The ability to resolve closely related microbes in complex microbial communities improves the identification of biosynthetic gene clusters and the precision of assigning mobile genetic elements to host genomes. We identified 1,400 complete and 350 partial biosynthetic gene clusters, most of which are novel, as well as 424 potential host–viral and 298 potential host–plasmid associations using Hi-C data.
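
The idea of discriminating variant haplotypes from long reads can be illustrated with a toy sketch. This is not the MAGPhase implementation; the read format (a dict mapping SNP position to observed base), the function name `count_haplotypes` and the `min_reads` threshold are all assumptions made for illustration. The sketch groups reads by the alleles they carry across a set of variant positions and keeps only haplotypes supported by several reads.

```python
from collections import Counter


def count_haplotypes(reads, positions, min_reads=2):
    """Group reads by the alleles they carry at the given SNP positions.

    reads: list of dicts mapping position -> observed base (hypothetical format).
    Only reads covering every position are counted, so each haplotype is
    supported by single reads that span all of the variants.
    """
    counts = Counter()
    for read in reads:
        if all(pos in read for pos in positions):
            counts["".join(read[pos] for pos in positions)] += 1
    # Discard haplotypes with too little read support (likely sequencing errors).
    return {hap: n for hap, n in counts.items() if n >= min_reads}


reads = [
    {100: "A", 250: "C", 900: "G"},
    {100: "A", 250: "C", 900: "G"},
    {100: "T", 250: "C", 900: "A"},
    {100: "T", 250: "C", 900: "A"},
    {100: "T", 250: "C", 900: "A"},
    {100: "A", 250: "G", 900: "A"},  # singleton haplotype, filtered out
    {100: "A", 250: "C"},            # does not span position 900, skipped
]
haplotypes = count_haplotypes(reads, positions=[100, 250, 900])
print(haplotypes)
```

In this toy input, two haplotypes survive the support filter, which would correspond to two candidate lineages sharing one reference sequence.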

Fig. 1: HiFi complete MAGs were assembled for species of low relative abundance in the sample.
Fig. 2: Phased SNP haplotype detection in metagenomes with MAGPhase.
Fig. 3: MAG representation and assembly at different depths of coverage.
Fig. 4: HiFi reads improve associations of mobile genetic elements with candidate host species.

Data availability

The HiFi sheep dataset, Hi-C reads and WGS short reads are available under National Center for Biotechnology Information (NCBI) BioProject PRJNA595610 at accession IDs SRX7628648, SRX10704191 and SRX7649993, respectively. Whole-metagenome assemblies and MAG bins for the pCLR and HiFi datasets are available at https://doi.org/10.5281/zenodo.4729049. The ‘kaiju_db_nr_euk_2021-02-24’ database was used for Kaiju classification (https://kaiju.binf.ku.dk/server). The ‘2017-07’ version of the UniProt database was used for BlobTools classification (https://ftp.uniprot.org/pub/databases/uniprot/previous_major_releases/release-2017_07/).

Code availability

The MAGPhase script and codebase are part of the https://github.com/Magdoll/cDNA_Cupcake GitHub repository. Scripts to replicate the analysis of the manuscript and to implement the MAGPhase workflow are located at this centralized repository: https://github.com/njdbickhart/SheepHiFiManuscript (ref. 61). A listing of all analysis software packages used in this study can be found in Supplementary Table 10.

References

  1. Bowers, R. M. et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 35, 725–731 (2017).

  2. Chen, L.-X., Anantharaman, K., Shaiber, A., Eren, A. M. & Banfield, J. F. Accurate and complete genomes from metagenomes. Genome Res. 30, 315–333 (2020).

  3. Pasolli, E. et al. Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell 176, 649–662 (2019).

  4. Singleton, C. M. et al. Connecting structure to function with the recovery of over 1000 high-quality metagenome-assembled genomes from activated sludge using long-read sequencing. Nat. Commun. 12, 2009 (2021).

  5. Vollger, M. R. et al. Long-read sequence and assembly of segmental duplications. Nat. Methods 16, 88–94 (2019).

  6. Bickhart, D. M. et al. Assignment of virus and antimicrobial resistance genes to microbial hosts in a complex microbial community by combined long-read assembly and proximity ligation. Genome Biol. 20, 153 (2019).

  7. Moss, E. L., Maghini, D. G. & Bhatt, A. S. Complete, closed bacterial genomes from microbiomes using nanopore sequencing. Nat. Biotechnol. 38, 701–707 (2020).

  8. Zhang, L. et al. A comprehensive investigation of metagenome assembly by linked-read sequencing. Microbiome 8, 156 (2020).

  9. Nurk, S., Meleshko, D., Korobeynikov, A. & Pevzner, P. A. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 27, 824–834 (2017).

  10. Kolmogorov, M. et al. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat. Methods 17, 1103–1110 (2020).

  11. Watson, M. & Warr, A. Errors in long-read assemblies can critically affect protein prediction. Nat. Biotechnol. 37, 124–126 (2019).

  12. Latorre-Pérez, A., Villalba-Bermell, P., Pascual, J. & Vilanova, C. Assembly methods for nanopore-based metagenomic sequencing: a comparative study. Sci. Rep. 10, 13588 (2020).

  13. Olm, M. R. et al. inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains. Nat. Biotechnol. 39, 727–736 (2021).

  14. Quince, C. et al. STRONG: metagenomics strain resolution on assembly graphs. Genome Biol. 22, 214 (2021).

  15. Kang, D. D., Froula, J., Egan, R. & Wang, Z. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ 3, e1165 (2015).

  16. Burton, J. N., Liachko, I., Dunham, M. J. & Shendure, J. Species-level deconvolution of metagenome assemblies with Hi-C–based contact probability maps. G3 (Bethesda) 4, 1339–1346 (2014).

  17. Parks, D. H. et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat. Microbiol. 2, 1533–1542 (2017).

  18. Lapierre, P. & Gogarten, J. P. Estimating the size of the bacterial pan-genome. Trends Genet. 25, 107–110 (2009).

  19. Vicedomini, R., Quince, C., Darling, A. E. & Chikhi, R. Strainberry: automated strain separation in low-complexity metagenomes using long reads. Nat. Commun. 12, 4485 (2021).

  20. O’Brien, J. D. et al. A Bayesian approach to inferring the phylogenetic structure of communities from metagenomic data. Genetics 197, 925–937 (2014).

  21. Quince, C. et al. DESMAN: a new tool for de novo extraction of strains from metagenomes. Genome Biol. 18, 181 (2017).

  22. Nicholls, S. M. et al. On the complexity of haplotyping a microbial community. Bioinformatics https://doi.org/10.1093/bioinformatics/btaa977 (2020).

  23. Vicedomini, R., Quince, C., Darling, A. E. & Chikhi, R. Strainberry: automated strain separation in low-complexity metagenomes using long reads. Nat. Commun. 12, 4485 (2021).

  24. Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).

  25. Ebert, P. et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science 372, eabf7117 (2021).

  26. Garg, S. et al. Chromosome-scale, haplotype-resolved assembly of human genomes. Nat. Biotechnol. 39, 309–312 (2021).

  27. Porubsky, D. et al. Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads. Nat. Biotechnol. 39, 302–308 (2021).

  28. Miga, K. H. et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature 585, 79–84 (2020).

  29. Menzel, P., Ng, K. L. & Krogh, A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. 7, 11257 (2016).

  30. Kolmogorov, M. Supporting data for the manuscript ‘Generation of lineage-resolved complete metagenome-assembled genomes in complex microbial communities’. https://doi.org/10.5281/zenodo.5138306 (2021).

  31. Chaumeil, P.-A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics 36, 1925–1927 (2020).

  32. Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 132 (2016).

  33. Wang, B. et al. Variant phasing and haplotypic expression from long-read sequencing in maize. Commun. Biol. 3, 1–11 (2020).

  34. Tseng, E. cDNA_cupcake v24.0.0. https://github.com/Magdoll/cDNA_Cupcake

  35. Nei, M. & Rooney, A. P. Concerted and birth-and-death evolution of multigene families. Annu. Rev. Genet. 39, 121–152 (2005).

  36. Meleshko, D. et al. BiosyntheticSPAdes: reconstructing biosynthetic gene clusters from assembly graphs. Genome Res. 29, 1352–1362 (2019).

  37. Blin, K. et al. antiSMASH 5.0: updates to the secondary metabolite genome mining pipeline. Nucleic Acids Res. 47, W81–W87 (2019).

  38. Pellow, D. et al. SCAPP: an algorithm for improved plasmid assembly in metagenomes. Microbiome 9, 144 (2021).

  39. He, C. et al. Genome-resolved metagenomics reveals site-specific diversity of episymbiotic CPR bacteria and DPANN archaea in groundwater ecosystems. Nat. Microbiol. 6, 354–365 (2021).

  40. Wick, R. R., Judd, L. M. & Holt, K. E. Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome Biol. 20, 129 (2019).

  41. Guo, C.-J. et al. Discovery of reactive microbiota-derived metabolites that inhibit host proteases. Cell 168, 517–526 (2017).

  42. Press, M. O. et al. Hi-C deconvolution of a human gut microbiome yields high-quality draft genomes and reveals plasmid-genome interactions. Preprint at https://www.biorxiv.org/content/10.1101/198713v1 (2017).

  43. Bentley, D. R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).

  44. Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE 9, e112963 (2014).

  45. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).

  46. Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110 (2016).

  47. DeMaere, M. Z. & Darling, A. E. bin3C: exploiting Hi-C sequencing data to accurately resolve metagenome-assembled genomes. Genome Biol. 20, 46 (2019).

  48. Laetsch, D. R. & Blaxter, M. L. BlobTools: interrogation of genome assemblies. F1000Research 6, 1287 (2017).

  49. Chan, P. P. & Lowe, T. M. tRNAscan-SE: searching for tRNA genes in genomic sequences. Methods Mol. Biol. 1962, 1–14 (2019).

  50. Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).

  51. Chin, C.-S. & Khalak, A. Human genome assembly in 100 minutes. Preprint at https://www.biorxiv.org/content/10.1101/705616v1 (2019).

  52. Nayfach, S. et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat. Biotechnol. 39, 578–585 (2021).

  53. Ondov, B. D., Bergman, N. H. & Phillippy, A. M. Interactive metagenomic visualization in a web browser. BMC Bioinformatics 12, 385 (2011).

  54. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

  55. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Methodol. 57, 289–300 (1995).

  56. Robinson, J. T. et al. Integrative Genomics Viewer. Nat. Biotechnol. 29, 24–26 (2011).

  57. Chen, Z., Erickson, D. L. & Meng, J. Polishing the Oxford Nanopore long-read assemblies of bacterial pathogens with Illumina short reads to improve genomic analyses. Genomics 113, 1366–1377 (2021).

  58. Stewart, R. D. et al. Compendium of 4,941 rumen metagenome-assembled genomes for rumen microbiome biology and enzyme discovery. Nat. Biotechnol. 37, 953–961 (2019).

  59. Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).

  60. Kautsar, S. A. et al. MIBiG 2.0: a repository for biosynthetic gene clusters of known function. Nucleic Acids Res. 48, D454–D458 (2020).

  61. Bickhart, D. M. SheepHiFiManuscript. https://doi.org/10.5281/zenodo.5120910 (2021).

Acknowledgements

We thank K. McClure, K. Kuhn, B. Lee, J. Carnahan and W. Thompson for technical support. D.M.B. was supported by appropriated USDA CRIS Project 5090-31000-026-00-D. T.P.L.S. and S.B.S. were supported by appropriated USDA CRIS Project 3040-31000-100-00D. I.L., S.T.S. and G.U. were supported, in part, by NIH grants R44AI150008 and R44AI162570 to Phase Genomics. I.M. was supported by grants from the European Research Council (no. 640384) and from the Israel Science Foundation (no. 1947/19). M.K. and P.A.P. were supported by NSF/MCB-BSF grant 1715911. V.P.A. was supported by the US Defense Advanced Research Projects Agency’s Living Foundries program award HR0011-15-C-0084. A.K. and I.T. were supported by St. Petersburg State University (grant ID PURE 73023672). K.P. was supported by appropriated USDA CRIS Project 5090-21000-071-000-D. We thank P. J. Weimer for helpful comments and suggestions on the manuscript. The USDA does not endorse any products or services. Mention of trade names is for informational purposes only. The USDA is an equal opportunity employer.

Author information

Contributions

T.P.L.S. and D.M.B. conceived the project, with extensive modifications introduced on the advice of I.L. and P.A.P. S.B.S. and T.P.L.S. were responsible for collecting the sample and generating the sequence data. D.M.B. and M.K. produced the assemblies and conducted a large proportion of the reported analyses. G.U. and S.T.S. conducted analyses related to Hi-C linkage data. V.P.A. and M.H.M. identified biosynthetic gene clusters in the dataset. D.M.B., A.Z. and I.M. identified mobile genetic elements in the sample. E.T. developed the MAGPhase algorithm, with input from D.M.B. D.M.B., T.P.L.S., M.K. and P.A.P. wrote the manuscript. All authors read and contributed to the final manuscript.

Corresponding authors

Correspondence to Pavel A. Pevzner or Timothy P. L. Smith.

Ethics declarations

Competing interests

The authors declare the following competing interests: M.H.M. is a co-founder of Design Pharmaceuticals and a member of the scientific advisory board of Hexagon Bio. E.T. and D.M.P. are employees of Pacific Biosciences. G.U. is an employee of Amazon. S.T.S. and I.L. are co-founders and the CTO and CEO, respectively, of Phase Genomics. The remaining authors declare no competing interests.

Additional information

Peer review information Nature Biotechnology thanks Mads Albertsen, C. Titus Brown and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Contig-level comparison of pCLR and HiFi assemblies.

a. Strategy for generating the read sets for the three pCLR assemblies and the HiFi assembly. b. Comparison of contig length distributions in the four assemblies, demonstrating a tendency for pCLR assembly to create longer contigs. c. Comparison of the total length of each assembly after separation of contigs into predicted Superkingdoms, demonstrating an increased length from HiFi assembly among contigs with an assigned Superkingdom and a reduced length in the unassigned bin. d. Comparison of the completeness of pCLR and HiFi assemblies based on the presence of >90% of expected single-copy genes with <5% redundancy.
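
The completeness criterion in panel d can be expressed as a simple filter. The following is a minimal sketch, assuming each bin already carries completeness and redundancy percentages estimated from single-copy genes; the `Bin` record, the `is_complete` helper and the example values are hypothetical and do not come from the study's data.

```python
from dataclasses import dataclass


@dataclass
class Bin:
    name: str
    completeness: float  # percent of expected single-copy genes found
    redundancy: float    # percent of single-copy genes found more than once


def is_complete(b, min_completeness=90.0, max_redundancy=5.0):
    """Apply the >90% completeness and <5% redundancy thresholds."""
    return b.completeness > min_completeness and b.redundancy < max_redundancy


# Made-up example bins, used only to exercise the filter.
bins = [
    Bin("bin_a", 97.2, 1.1),
    Bin("bin_b", 92.0, 7.5),  # complete enough, but too redundant
    Bin("bin_c", 88.9, 0.4),  # clean, but not complete enough
]
passing = [b.name for b in bins if is_complete(b)]
print(passing)
```

Both thresholds are strict inequalities here to mirror the ">90%" and "<5%" wording of the caption.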

Extended Data Fig. 2 Assembled MAG taxonomy.

A circular dendrogram showing the presence (blue) and absence (black) of GTDB-Tk-assigned taxonomy for assembly bins from the HiFi (outermost ring) and CLR (innermost rings, descending) assemblies. Branch nodes were consolidated to genus-level affiliations when possible. Branch colors were assigned based on phylum-level classification, with the exception of the Firmicutes, which were subdivided into separate classes owing to their greater diversity relative to other phyla.

Extended Data Fig. 3 Read depth across orthologous, collapsed pCLR bins.

Each bin from separate, replicate pCLR assemblies corresponds to all three HiFi bins displayed in Supplementary Figure 6. Read depth that can be attributed to the reference sequence is labeled in blue, whereas phased alternative haplotypes identified via MAGPhase are labeled in alternating colors (see legend). Contig ends are denoted by vertical black bars, and the x-axis represents the total length of the entire MAG with contigs placed randomly from end to end.

Extended Data Fig. 4 Read depth across three closely related HiFi Complete MAGs.

Read depth that can be attributed to the reference sequence is labeled in blue, whereas phased alternative haplotypes identified via MAGPhase are labeled in alternating colors (see legend). Contig ends are denoted by vertical black bars, and the x-axis represents the total length of the entire MAG with contigs placed randomly from end to end.

Extended Data Fig. 5 Biosynthetic Gene Cluster Analysis.

The HiFi assembly revealed approximately 25% more complete biosynthetic gene clusters (BGCs) than the average pCLR assembly (a). This increase was observed in all identified BGC classes (colors in legend) and was not exclusive to one particular class. As found in other metagenome assembly datasets, the majority of identified BGCs were novel in all assemblies (b), but the HiFi assembly had a higher proportion of novel BGCs than the other assemblies. Additionally, the HiFi assembly contained the most partial BGCs of any assembly (c).

Extended Data Fig. 6 CLR1 viral association network plot.

Viral contigs identified from BlobTools-assigned taxonomy estimates are represented as hexagonal nodes with black borders, whereas non-viral host contigs are represented as circular nodes with white borders. Edges represent the associations identified between contigs, with colors indicating support from partial HiFi read overlap (blue), Hi-C read links (green) or both types of data (red).

Extended Data Fig. 7 CLR2 viral association network plot.

Viral contigs identified from BlobTools-assigned taxonomy estimates are represented as hexagonal nodes with black borders, whereas non-viral host contigs are represented as circular nodes with white borders. Edges represent the associations identified between contigs, with colors indicating support from partial HiFi read overlap (blue), Hi-C read links (green) or both types of data (red).

Extended Data Fig. 8 CLR3 viral association network plot.

Viral contigs identified from BlobTools-assigned taxonomy estimates are represented as hexagonal nodes with black borders, whereas non-viral host contigs are represented as circular nodes with white borders. Edges represent the associations identified between contigs, with colors indicating support from partial HiFi read overlap (blue), Hi-C read links (green) or both types of data (red).
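
The edge color scheme used in the three network plots can be summarized as a small lookup. This is only an illustrative sketch of the legend: the function name `edge_color` and its boolean inputs are assumptions, and it does not reproduce the scripts that generated the figures.

```python
def edge_color(has_hifi_overlap, has_hic_links):
    """Map the evidence supporting a virus-host association to a legend color.

    Returns None when neither data type supports the association,
    in which case no edge would be drawn.
    """
    if has_hifi_overlap and has_hic_links:
        return "red"    # supported by both types of data
    if has_hifi_overlap:
        return "blue"   # partial HiFi read overlap only
    if has_hic_links:
        return "green"  # Hi-C read links only
    return None


# Hypothetical association records: (viral contig, host contig, HiFi, Hi-C).
associations = [
    ("virus_1", "host_a", True, False),
    ("virus_1", "host_b", True, True),
    ("virus_2", "host_c", False, True),
]
colors = [edge_color(hifi, hic) for _, _, hifi, hic in associations]
print(colors)
```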

Supplementary information

Supplementary Information

Supplementary Figs. 1–11, Tables 3, 4 and 8–10 and Notes 1–3.

Reporting Summary.

Supplementary Tables

Supplementary Tables 1, 2 and 5–7.

About this article

Cite this article

Bickhart, D.M., Kolmogorov, M., Tseng, E. et al. Generating lineage-resolved, complete metagenome-assembled genomes from complex microbial communities. Nat Biotechnol 40, 711–719 (2022). https://doi.org/10.1038/s41587-021-01130-z

  • DOI: https://doi.org/10.1038/s41587-021-01130-z
