Strainy: phasing and assembly of strain haplotypes from long-read metagenome sequencing

Abstract

Bacterial species in microbial communities are often represented by mixtures of strains, distinguished by small variations in their genomes. Short-read approaches can be used to detect small-scale variation between strains but fail to phase these variants into contiguous haplotypes. Long-read metagenome assemblers can generate contiguous bacterial chromosomes but often suppress strain-level variation in favor of species-level consensus. Here we present Strainy, an algorithm for strain-level metagenome assembly and phasing from Nanopore and PacBio reads. Strainy takes a de novo metagenomic assembly as input and identifies strain variants, which are then phased and assembled into contiguous haplotypes. Using simulated and mock Nanopore and PacBio metagenome data, we show that Strainy assembles accurate and complete strain haplotypes, outperforming current Nanopore-based methods and performing comparably to PacBio-based algorithms in completeness and accuracy. We then use Strainy to assemble strain haplotypes of a complex environmental metagenome, revealing distinct strain distribution and mutational patterns in bacterial species.

Fig. 1: Overview of the Strainy workflow and the strain-clustering algorithm.
Fig. 2: Benchmarking using mock microbial communities.
Fig. 3: Benchmarking using simulated Nanopore and PacBio sequencing data.
Fig. 4: Assembly and phasing of an activated sludge metagenome with variable phasing stringency and variant callers.
Fig. 5: Analysis of phasing of individual MAGs assembled from the activated sludge metagenome.
Fig. 6: Intra-species small and structural variation provide evolutionary insights.

Data availability

Nanopore mock community sequencing, PRJNA804004; PacBio HiFi mock community sequencing, https://github.com/PacificBiosciences/pb-metagenomics-tools/blob/master/docs/PacBio-Data.md; activated sludge sequencing, PRJEB48021; NDARO, https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/; NDARO catalog, https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/. Real and mock assemblies are available at https://doi.org/10.5281/zenodo.11149518 (ref. 62), and simulated reads and results are available at https://doi.org/10.5281/zenodo.11142288 (ref. 63).

Code availability

Strainy is freely available at https://github.com/katerinakazantseva/strainy.

References

  1. Zhao, S. et al. Adaptive evolution within gut microbiomes of healthy people. Cell Host Microbe 25, 656–667 (2019).

  2. Kaper, J. B., Nataro, J. P. & Mobley, H. L. Pathogenic Escherichia coli. Nat. Rev. Microbiol. 2, 123–140 (2004).

  3. Schloissnig, S. et al. Genomic variation landscape of the human gut microbiome. Nature 493, 45–50 (2013).

  4. Good, B. H., McDonald, M. J., Barrick, J. E., Lenski, R. E. & Desai, M. M. The dynamics of molecular evolution over 60,000 generations. Nature 551, 45–50 (2017).

  5. Yan, Y., Nguyen, L. H., Franzosa, E. A. & Huttenhower, C. Strain-level epidemiology of microbial communities and the human microbiome. Genome Med. 12, 71 (2020).

  6. Zimmermann, M., Zimmermann-Kogadeeva, M., Wegmann, R. & Goodman, A. L. Mapping human microbiome drug metabolism by gut bacteria and their genes. Nature 570, 462–467 (2019).

  7. Albanese, D. & Donati, C. Strain profiling and epidemiology of bacterial species from metagenomic sequencing. Nat. Commun. 8, 2260 (2017).

  8. Olm, M. R. et al. inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains. Nat. Biotechnol. 39, 727–736 (2021).

  9. Quince, C. et al. STRONG: metagenomics strain resolution on assembly graphs. Genome Biol. 22, 214 (2021).

  10. Ghurye, J. et al. MetaCarvel: linking assembly graph motifs to biological variants. Genome Biol. 20, 174 (2019).

  11. Li, D., Liu, C.-M., Luo, R., Sadakane, K. & Lam, T.-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31, 1674–1676 (2015).

  12. Nurk, S., Meleshko, D., Korobeynikov, A. & Pevzner, P. A. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 27, 824–834 (2017).

  13. Bertrand, D. et al. Hybrid metagenomic assembly enables high-resolution analysis of resistance determinants and mobile elements in human microbiomes. Nat. Biotechnol. 37, 937–944 (2019).

  14. Kim, C. Y., Ma, J. & Lee, I. HiFi metagenomic sequencing enables assembly of accurate and complete genomes from human gut microbiota. Nat. Commun. 13, 6367 (2022).

  15. Dai, D. et al. Long-read metagenomic sequencing reveals shifts in associations of antibiotic resistance genes with mobile genetic elements from sewage to activated sludge. Microbiome 10, 20 (2022).

  16. Beaulaurier, J. et al. Assembly-free single-molecule sequencing recovers complete virus genomes from natural microbial communities. Genome Res. 30, 437–446 (2020).

  17. Van Goethem, M. W. et al. Long-read metagenomics of soil communities reveals phylum-specific secondary metabolite dynamics. Commun. Biol. 4, 1302 (2021).

  18. Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).

  19. Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).

  20. Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat. Biotechnol. 38, 1044–1053 (2020).

  21. Kolmogorov, M. et al. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat. Methods 17, 1103–1110 (2020).

  22. Bickhart, D. M. et al. Generating lineage-resolved, complete metagenome-assembled genomes from complex microbial communities. Nat. Biotechnol. 40, 711–719 (2022).

  23. Meyer, F. et al. Critical assessment of metagenome interpretation: the second round of challenges. Nat. Methods 19, 429–440 (2022).

  24. Curry, K. D. et al. Reference-free structural variant detection in microbiomes via long-read coassembly graphs. Bioinformatics 40, i58–i67 (2024).

  25. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).

  26. Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).

  27. Rautiainen, M. et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nat. Biotechnol. 41, 1474–1482 (2023).

  28. Feng, X., Cheng, H., Portik, D. & Li, H. Metagenome assembly of high-fidelity long reads with hifiasm-meta. Nat. Methods 19, 671–674 (2022).

  29. Benoit, G. et al. High-quality metagenome assembly from long accurate reads with metaMDBG. Nat. Biotechnol. 42, 1378–1383 (2024).

  30. Fedarko, M. W., Kolmogorov, M. & Pevzner, P. A. Analyzing rare mutations in metagenomes assembled using long and accurate reads. Genome Res. 32, 2119–2133 (2022).

  31. Kolmogorov, M. et al. Scalable Nanopore sequencing of human genomes provides a comprehensive view of haplotype-resolved variation and methylation. Nat. Methods 20, 1483–1492 (2023).

  32. Chen, L. et al. Short- and long-read metagenomics expand individualized structural variations in gut microbiomes. Nat. Commun. 13, 3175 (2022).

  33. Jin, H. et al. A high-quality genome compendium of the human gut microbiome of Inner Mongolians. Nat. Microbiol. 8, 150–161 (2023).

  34. Martin, M. et al. WhatsHap: fast and accurate read-based phasing. Preprint at bioRxiv https://doi.org/10.1101/085050 (2016).

  35. Edge, P., Bafna, V. & Bansal, V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. 27, 801–812 (2017).

  36. Shafin, K. et al. Haplotype-aware variant calling with PEPPER-Margin-DeepVariant enables high accuracy in nanopore long-reads. Nat. Methods 18, 1322–1332 (2021).

  37. Schrinner, S. D. et al. Haplotype threading: accurate polyploid phasing from long reads. Genome Biol. 21, 252 (2020).

  38. Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).

  39. Garg, S. et al. A haplotype-aware de novo assembly of related individuals using pedigree sequence graph. Bioinformatics 36, 2385–2392 (2020).

  40. Faure, R., Guiglielmoni, N. & Flot, J.-F. GraphUnzip: unzipping assembly graphs with long reads and Hi-C. Preprint at bioRxiv https://doi.org/10.1101/2021.01.29.428779 (2021).

  41. Nicholls, S. M. et al. On the complexity of haplotyping a microbial community. Bioinformatics 37, 1360–1366 (2021).

  42. Vicedomini, R., Quince, C., Darling, A. E. & Chikhi, R. Strainberry: automated strain separation in low-complexity metagenomes using long reads. Nat. Commun. 12, 4485 (2021).

  43. Feng, Z., Clemente, J. C., Wong, B. & Schadt, E. E. Detecting and phasing minor single-nucleotide variants from long-read sequencing data. Nat. Commun. 12, 3032 (2021).

  44. Knyazev, S., Hughes, L., Skums, P. & Zelikovsky, A. Epidemiological data analysis of viral quasispecies in the next-generation sequencing era. Brief. Bioinform. 22, 96–108 (2021).

  45. Jablonski, K. P. & Beerenwinkel, N. in Virus Bioinformatics 51–64 (Chapman and Hall/CRC, 2021).

  46. Warwick-Dugdale, J. et al. Long-read viral metagenomics captures abundant and microdiverse viral populations and their niche-defining genomic islands. PeerJ 7, e6800 (2019).

  47. Zhou, Z., Luhmann, N., Alikhan, N.-F., Quince, C. & Achtman, M. Accurate reconstruction of microbial strains from metagenomic sequencing using representative reference genomes. In Research in Computational Molecular Biology 225–240 (Springer, 2018).

  48. Liu, L., Yang, Y., Deng, Y. & Zhang, T. Nanopore long-read-only metagenomics enables complete and high-quality genome reconstruction from mock and complex metagenomes. Microbiome 10, 209 (2022).

  49. Luo, X., Kang, X. & Schönhuth, A. VeChat: correcting errors in long reads using variation graphs. Nat. Commun. 13, 6657 (2022).

  50. Mikheenko, A., Saveliev, V. & Gurevich, A. MetaQUAST: evaluation of metagenome assemblies. Bioinformatics 32, 1088–1090 (2015).

  51. Shaw, J. & Yu, Y. W. Fast and robust metagenomic sequence comparison through sparse chaining with skani. Nat. Methods 20, 1661–1665 (2023).

  52. Wick, R. R., Schultz, M. B., Zobel, J. & Holt, K. E. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics 31, 3350–3352 (2015).

  53. Sereika, M. et al. Oxford Nanopore R10.4 long-read sequencing enables the generation of near-finished bacterial genomes from pure cultures and metagenomes without short-read or reference polishing. Nat. Methods 19, 823–826 (2022).

  54. Zheng, Z. et al. Symphonizing pileup and full-alignment for deep learning-based long-read variant calling. Nat. Comput. Sci. 2, 797–803 (2022).

  55. Kang, D. D. et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359 (2019).

  56. Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).

  57. Jee, J. et al. Rates and mechanisms of bacterial mutagenesis from maximum-depth sequencing. Nature 534, 693–696 (2016).

  58. Huang, H. et al. Tigecycline resistance-associated mutations in the MepA efflux pump in Staphylococcus aureus. Microbiol. Spectr. 11, e0063423 (2023).

  59. Jagdmann, J., Andersson, D. I. & Nicoloff, H. Low levels of tetracyclines select for a mutation that prevents the evolution of high-level resistance to tigecycline. PLoS Biol. 20, e3001808 (2022).

  60. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

  61. Raghavan, U. N., Albert, R. & Kumara, S. Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 76, 036106 (2007).

  62. Kazantseva, E., Donmez, A. & Kolmogorov, M. Strainy: phasing and assembly of strain haplotypes from long-read metagenome sequencing—real and mock datasets. Zenodo https://doi.org/10.5281/zenodo.11149518 (2024).

  63. Kazantseva, E., Donmez, A. & Kolmogorov, M. Strainy: phasing and assembly of strain haplotypes from long-read metagenome sequencing—simulated datasets. Zenodo https://doi.org/10.5281/zenodo.11142288 (2024).

Acknowledgements

M.K. and A.D. were supported by the Intramural Research Program of the Center for Cancer Research, National Cancer Institute, National Institutes of Health. This work used the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

Contributions

E.K., A.D. and M.K. developed Strainy and performed the benchmarking. M.F. performed the functional analysis of strain variation; M.P. and M.K. supervised the work.

Corresponding authors

Correspondence to Mihai Pop or Mikhail Kolmogorov.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Mads Albertsen, Jue Ruan and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Lei Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Informative and non-informative SNPs.

(A) An example of the graph multi-phasing challenge with a set of four closely related strains and their corresponding phylogenetic tree. SNP positions are shown as different shapes; hollow and solid shapes indicate different genotypes. SNP1 is not informative for the highlighted subset of strains (S2, S3, S4), but SNP2 and SNP3 are informative for this subset. (B) An example of sequence graph phasing with three strains. Strain-specific variants are shown with yellow rectangles. After phasing and graph simplification, most of the graph nodes are strain-resolved, but some nodes remain strain-collapsed.
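
The notion of an informative SNP in panel A can be illustrated with a minimal sketch. It assumes that an SNP is informative for a subset of strains when the strains in that subset do not all carry the same genotype; this reading is inferred from the caption and is not stated as a formal definition in the text. The function name `informative_snps` and the genotype table are hypothetical and are not taken from the figure or from the Strainy code.

```python
def informative_snps(genotypes, subset):
    """Return the SNPs at which the strains in `subset` do not all share one allele.

    genotypes: dict mapping SNP name -> dict mapping strain name -> allele.
    subset: iterable of strain names to compare.
    """
    subset = list(subset)
    return [
        snp
        for snp, alleles in genotypes.items()
        if len({alleles[strain] for strain in subset}) > 1
    ]


# Hypothetical genotypes (0 = one allele, 1 = the other); illustrative only.
genotypes = {
    "SNP1": {"S1": 1, "S2": 0, "S3": 0, "S4": 0},
    "SNP2": {"S1": 0, "S2": 1, "S3": 0, "S4": 0},
    "SNP3": {"S1": 0, "S2": 0, "S3": 1, "S4": 1},
}

print(informative_snps(genotypes, ["S2", "S3", "S4"]))  # ['SNP2', 'SNP3']
```

In this toy table SNP1 separates S1 from the other strains but cannot distinguish S2, S3 and S4 from one another, which mirrors the situation described for the highlighted subset.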

Extended Data Fig. 2 Completeness per strain of the simulated dataset assemblies.

Assembled strain genome fraction computed using metaQUAST for every bacterial strain in the simulated Nanopore (A) and PacBio (B) datasets. Heatmaps show the number of misassemblies for each strain. Strains are sorted by decreasing mean value across all tools.

Extended Data Fig. 3 Additional evaluations of simulated assemblies.

Analysis of assembly size, strain completeness and duplication rates of the simulated dataset assemblies for Nanopore (A) and PacBio (B) datasets.

Extended Data Fig. 4 Correlation between strain similarities and Strainy switch errors.

Pairwise average nucleotide identity (ANI) of strain genomes in the simulated datasets and the number of inter-species misassemblies computed for E. coli (A), P. aeruginosa (B), L. monocytogenes (C) and S. aureus (D).

Extended Data Fig. 5 Benchmarking using simulated sequencing data, stratified by number of strains.

Plots show distribution of strain completeness, strain switch errors, intra-strain errors and NGA50 aggregated by the number of strains in the dataset for Nanopore (A) and PacBio HiFi data (B). Boxes show the quartiles of 8 data points.

Extended Data Fig. 6 Benchmarking using simulated sequencing data, stratified by coverage mode.

Plots show distribution of strain completeness, strain switch errors, intra-strain errors and NGA50 aggregated by the coverage mode for Nanopore (A) and PacBio HiFi data (B). Boxes show the quartiles of 16 data points.

Extended Data Fig. 7 Read-based evaluation of Strainy phasing.

For every simulated read, its strain of origin is known. For each phasing cluster, we define a “major” strain as the most frequent strain in the cluster. Cluster quality is defined as the proportion of reads coming from the major strain.
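
The cluster quality measure described above can be sketched in a few lines. This is an illustrative computation under the stated definition only; the function name `cluster_quality` and the example labels are hypothetical and do not come from the Strainy code base.

```python
from collections import Counter


def cluster_quality(read_strains):
    """Return (major_strain, quality) for one phasing cluster.

    read_strains: list with the known strain of origin of each read in the cluster.
    The major strain is the most frequent label; quality is its share of the reads.
    """
    if not read_strains:
        raise ValueError("cluster has no reads")
    counts = Counter(read_strains)
    major_strain, major_count = counts.most_common(1)[0]
    return major_strain, major_count / len(read_strains)


# Hypothetical cluster: 8 reads simulated from strain "S1" and 2 from strain "S2".
example_cluster = ["S1"] * 8 + ["S2"] * 2
major, quality = cluster_quality(example_cluster)
print(major, quality)  # S1 0.8
```

A quality of 1.0 means that every read in the cluster originates from a single strain; lower values indicate that reads from several strains were grouped together.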

Extended Data Fig. 8 Additional simulations to establish Strainy phasing limits.

(A) Completeness and assembly size of individual E. coli strains for different levels of strain heterozygosity. Each data point shows the mean of 14 values and the error bands show a 95% confidence interval. (B, C) Completeness of individual E. coli strain pairs for a given read coverage depth to establish the minimum depth (B) and maximum strain abundance ratio (C) required for Strainy assembly.

Extended Data Fig. 9 Additional examples of IGV visualization of strain structural variants.

(A) A 63 bp indel in a region of MAG 122. (B) A 50 bp indel in a region of MAG 87. (C) A 247 bp indel in a region of MAG 87.

Extended Data Fig. 10 Additional analysis of strain mutations.

(A, B) Substitution mutation signatures of the activated sludge dataset derived from Oxford Nanopore Technologies (ONT) and PacBio HiFi strain assemblies. (C) dN/dS substitution ratio for AMR-related genes inside individual MAGs assembled from the activated sludge dataset.

About this article

Cite this article

Kazantseva, E., Donmez, A., Frolova, M. et al. Strainy: phasing and assembly of strain haplotypes from long-read metagenome sequencing. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02424-1

  • DOI: https://doi.org/10.1038/s41592-024-02424-1
