Abstract
Bacterial species in microbial communities are often represented by mixtures of strains, distinguished by small variations in their genomes. Short-read approaches can be used to detect small-scale variation between strains but fail to phase these variants into contiguous haplotypes. Long-read metagenome assemblers can generate contiguous bacterial chromosomes but often suppress strain-level variation in favor of species-level consensus. Here we present Strainy, an algorithm for strain-level metagenome assembly and phasing from Nanopore and PacBio reads. Strainy takes a de novo metagenomic assembly as input and identifies strain variants, which are then phased and assembled into contiguous haplotypes. Using simulated and mock Nanopore and PacBio metagenome data, we show that Strainy assembles accurate and complete strain haplotypes, outperforming current Nanopore-based methods and performing comparably to PacBio-based algorithms in completeness and accuracy. We then use Strainy to assemble strain haplotypes of a complex environmental metagenome, revealing distinct strain distribution and mutational patterns in bacterial species.
Data availability
Nanopore mock community sequencing, PRJNA804004; PacBio HiFi mock community sequencing, https://github.com/PacificBiosciences/pb-metagenomics-tools/blob/master/docs/PacBio-Data.md; activated sludge sequencing, PRJEB48021; NDARO, https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/; NDARO catalog, https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/. Real and mock assemblies are available at https://doi.org/10.5281/zenodo.11149518 (ref. 62), and simulated reads and results are available at https://doi.org/10.5281/zenodo.11142288 (ref. 63).
Code availability
Strainy is freely available at https://github.com/katerinakazantseva/strainy.
References
Zhao, S. et al. Adaptive evolution within gut microbiomes of healthy people. Cell Host Microbe 25, 656–667 (2019).
Kaper, J. B., Nataro, J. P. & Mobley, H. L. Pathogenic Escherichia coli. Nat. Rev. Microbiol. 2, 123–140 (2004).
Schloissnig, S. et al. Genomic variation landscape of the human gut microbiome. Nature 493, 45–50 (2013).
Good, B. H., McDonald, M. J., Barrick, J. E., Lenski, R. E. & Desai, M. M. The dynamics of molecular evolution over 60,000 generations. Nature 551, 45–50 (2017).
Yan, Y., Nguyen, L. H., Franzosa, E. A. & Huttenhower, C. Strain-level epidemiology of microbial communities and the human microbiome. Genome Med. 12, 71 (2020).
Zimmermann, M., Zimmermann-Kogadeeva, M., Wegmann, R. & Goodman, A. L. Mapping human microbiome drug metabolism by gut bacteria and their genes. Nature 570, 462–467 (2019).
Albanese, D. & Donati, C. Strain profiling and epidemiology of bacterial species from metagenomic sequencing. Nat. Commun. 8, 2260 (2017).
Olm, M. R. et al. inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains. Nat. Biotechnol. 39, 727–736 (2021).
Quince, C. et al. STRONG: metagenomics strain resolution on assembly graphs. Genome Biol. 22, 214 (2021).
Ghurye, J. et al. MetaCarvel: linking assembly graph motifs to biological variants. Genome Biol. 20, 174 (2019).
Li, D., Liu, C.-M., Luo, R., Sadakane, K. & Lam, T.-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31, 1674–1676 (2015).
Nurk, S., Meleshko, D., Korobeynikov, A. & Pevzner, P. A. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 27, 824–834 (2017).
Bertrand, D. et al. Hybrid metagenomic assembly enables high-resolution analysis of resistance determinants and mobile elements in human microbiomes. Nat. Biotechnol. 37, 937–944 (2019).
Kim, C. Y., Ma, J. & Lee, I. HiFi metagenomic sequencing enables assembly of accurate and complete genomes from human gut microbiota. Nat. Commun. 13, 6367 (2022).
Dai, D. et al. Long-read metagenomic sequencing reveals shifts in associations of antibiotic resistance genes with mobile genetic elements from sewage to activated sludge. Microbiome 10, 20 (2022).
Beaulaurier, J. et al. Assembly-free single-molecule sequencing recovers complete virus genomes from natural microbial communities. Genome Res. 30, 437–446 (2020).
Van Goethem, M. W. et al. Long-read metagenomics of soil communities reveals phylum-specific secondary metabolite dynamics. Commun. Biol. 4, 1302 (2021).
Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).
Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat. Biotechnol. 38, 1044–1053 (2020).
Kolmogorov, M. et al. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat. Methods 17, 1103–1110 (2020).
Bickhart, D. M. et al. Generating lineage-resolved, complete metagenome-assembled genomes from complex microbial communities. Nat. Biotechnol. 40, 711–719 (2022).
Meyer, F. et al. Critical assessment of metagenome interpretation: the second round of challenges. Nat. Methods 19, 429–440 (2022).
Curry, K. D. et al. Reference-free structural variant detection in microbiomes via long-read coassembly graphs. Bioinformatics 40, i58–i67 (2024).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).
Rautiainen, M. et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nat. Biotechnol. 41, 1474–1482 (2023).
Feng, X., Cheng, H., Portik, D. & Li, H. Metagenome assembly of high-fidelity long reads with hifiasm-meta. Nat. Methods 19, 671–674 (2022).
Benoit, G. et al. High-quality metagenome assembly from long accurate reads with metaMDBG. Nat. Biotechnol. 42, 1378–1383 (2024).
Fedarko, M. W., Kolmogorov, M. & Pevzner, P. A. Analyzing rare mutations in metagenomes assembled using long and accurate reads. Genome Res. 32, 2119–2133 (2022).
Kolmogorov, M. et al. Scalable Nanopore sequencing of human genomes provides a comprehensive view of haplotype-resolved variation and methylation. Nat. Methods 20, 1483–1492 (2023).
Chen, L. et al. Short- and long-read metagenomics expand individualized structural variations in gut microbiomes. Nat. Commun. 13, 3175 (2022).
Jin, H. et al. A high-quality genome compendium of the human gut microbiome of Inner Mongolians. Nat. Microbiol. 8, 150–161 (2023).
Martin, M. et al. WhatsHap: fast and accurate read-based phasing. Preprint at bioRxiv https://doi.org/10.1101/085050 (2016).
Edge, P., Bafna, V. & Bansal, V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. 27, 801–812 (2017).
Shafin, K. et al. Haplotype-aware variant calling with PEPPER-Margin-DeepVariant enables high accuracy in nanopore long-reads. Nat. Methods 18, 1322–1332 (2021).
Schrinner, S. D. et al. Haplotype threading: accurate polyploid phasing from long reads. Genome Biol. 21, 252 (2020).
Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).
Garg, S. et al. A haplotype-aware de novo assembly of related individuals using pedigree sequence graph. Bioinformatics 36, 2385–2392 (2020).
Faure, R., Guiglielmoni, N. & Flot, J.-F. GraphUnzip: unzipping assembly graphs with long reads and Hi-C. Preprint at bioRxiv https://doi.org/10.1101/2021.01.29.428779 (2021).
Nicholls, S. M. et al. On the complexity of haplotyping a microbial community. Bioinformatics 37, 1360–1366 (2021).
Vicedomini, R., Quince, C., Darling, A. E. & Chikhi, R. Strainberry: automated strain separation in low-complexity metagenomes using long reads. Nat. Commun. 12, 4485 (2021).
Feng, Z., Clemente, J. C., Wong, B. & Schadt, E. E. Detecting and phasing minor single-nucleotide variants from long-read sequencing data. Nat. Commun. 12, 3032 (2021).
Knyazev, S., Hughes, L., Skums, P. & Zelikovsky, A. Epidemiological data analysis of viral quasispecies in the next-generation sequencing era. Brief. Bioinform. 22, 96–108 (2021).
Jablonski, K. P. & Beerenwinkel, N. in Virus Bioinformatics 51–64 (Chapman and Hall/CRC, 2021).
Warwick-Dugdale, J. et al. Long-read viral metagenomics captures abundant and microdiverse viral populations and their niche-defining genomic islands. PeerJ 7, e6800 (2019).
Zhou, Z., Luhmann, N., Alikhan, N.-F., Quince, C. & Achtman, M. Accurate reconstruction of microbial strains from metagenomic sequencing using representative reference genomes. In Research in Computational Molecular Biology 225–240 (Springer, 2018).
Liu, L., Yang, Y., Deng, Y. & Zhang, T. Nanopore long-read-only metagenomics enables complete and high-quality genome reconstruction from mock and complex metagenomes. Microbiome 10, 209 (2022).
Luo, X., Kang, X. & Schönhuth, A. VeChat: correcting errors in long reads using variation graphs. Nat. Commun. 13, 6657 (2022).
Mikheenko, A., Saveliev, V. & Gurevich, A. MetaQUAST: evaluation of metagenome assemblies. Bioinformatics 32, 1088–1090 (2015).
Shaw, J. & Yu, Y. W. Fast and robust metagenomic sequence comparison through sparse chaining with skani. Nat. Methods 20, 1661–1665 (2023).
Wick, R. R., Schultz, M. B., Zobel, J. & Holt, K. E. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics 31, 3350–3352 (2015).
Sereika, M. et al. Oxford Nanopore R10.4 long-read sequencing enables the generation of near-finished bacterial genomes from pure cultures and metagenomes without short-read or reference polishing. Nat. Methods 19, 823–826 (2022).
Zheng, Z. et al. Symphonizing pileup and full-alignment for deep learning-based long-read variant calling. Nat. Comput. Sci. 2, 797–803 (2022).
Kang, D. D. et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359 (2019).
Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).
Jee, J. et al. Rates and mechanisms of bacterial mutagenesis from maximum-depth sequencing. Nature 534, 693–696 (2016).
Huang, H. et al. Tigecycline resistance-associated mutations in the MepA efflux pump in Staphylococcus aureus. Microbiol. Spectr. 11, e0063423 (2023).
Jagdmann, J., Andersson, D. I. & Nicoloff, H. Low levels of tetracyclines select for a mutation that prevents the evolution of high-level resistance to tigecycline. PLoS Biol. 20, e3001808 (2022).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Raghavan, U. N., Albert, R. & Kumara, S. Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 76, 036106 (2007).
Kazantseva, E., Donmez, A. & Kolmogorov, M. Strainy: phasing and assembly of strain haplotypes from long-read metagenome sequencing—real and mock datasets. Zenodo https://doi.org/10.5281/zenodo.11149518 (2024).
Kazantseva, E., Donmez, A. & Kolmogorov, M. Strainy: phasing and assembly of strain haplotypes from long-read metagenome sequencing—simulated datasets. Zenodo https://doi.org/10.5281/zenodo.11142288 (2024).
Acknowledgements
M.K. and A.D. were supported by the Intramural Research Program of the Center for Cancer Research, National Cancer Institute, National Institutes of Health. This work used the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Author information
Contributions
E.K., A.D. and M.K. developed Strainy and performed the benchmarking. M.F. performed the functional analysis of strain variation; M.P. and M.K. supervised the work.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Mads Albertsen, Jue Ruan and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Lei Tang, in collaboration with the Nature Methods team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Informative and non-informative SNPs.
(A) An example of the graph multi-phasing challenge with a set of four closely related strains and their corresponding phylogenetic tree. SNP positions are shown in different shapes; hollow and solid shapes indicate different genotypes. SNP1 is not informative for the highlighted subset of strains (S2, S3, S4), but SNP2 and SNP3 are informative for this subset. (B) An example of sequence graph phasing with three strains. Strain-specific variants are shown with yellow rectangles. After phasing and graph simplification, most of the graph nodes are strain-resolved, but some nodes remain strain-collapsed.
Extended Data Fig. 2 Completeness per strain of the simulated dataset assemblies.
Assembled strain genome fraction computed using metaQUAST for every bacterial strain in the simulated Nanopore (A) and PacBio (B) datasets. Heatmaps show the number of misassemblies for each strain. Strains are sorted by decreasing mean value across all tools.
Extended Data Fig. 3 Additional evaluations of simulated assemblies.
Analysis of assembly size, strain completeness and duplication rates of the simulated dataset assemblies for Nanopore (A) and PacBio (B) datasets.
Extended Data Fig. 4 Correlation between strain similarities and Strainy switch errors.
Pairwise average nucleotide identity (ANI) of strain genomes in the simulated datasets and the number of inter-species misassemblies computed for E. coli (A), P. aeruginosa (B), L. monocytogenes (C), and S. aureus (D).
Extended Data Fig. 5 Benchmarking using simulated sequencing data, stratified by number of strains.
Plots show distribution of strain completeness, strain switch errors, intra-strain errors and NGA50 aggregated by the number of strains in the dataset for Nanopore (A) and PacBio HiFi data (B). Boxes show the quartiles of 8 data points.
Extended Data Fig. 6 Benchmarking using simulated sequencing data, stratified by coverage mode.
Plots show distribution of strain completeness, strain switch errors, intra-strain errors and NGA50 aggregated by the coverage mode for Nanopore (A) and PacBio HiFi data (B). Boxes show the quartiles of 16 data points.
Extended Data Fig. 7 Read-based evaluation of Strainy phasing.
For every simulated read, its strain of origin is known. For each phasing cluster, we define a “major” strain as the most frequent strain in the cluster. Cluster quality is defined as the proportion of reads coming from the major strain.
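The cluster quality definition above can be illustrated with a minimal sketch. This is not Strainy's own code: the function name `cluster_quality` and the strain labels are hypothetical, and the input is assumed to be a list holding the known strain of origin for each read in one phasing cluster.

```python
from collections import Counter


def cluster_quality(read_strains):
    """Return (major_strain, quality) for one phasing cluster.

    read_strains: true strain-of-origin labels, one per read in the cluster
    (an assumed input format, chosen for illustration only).
    """
    if not read_strains:
        raise ValueError("cluster has no reads")
    counts = Counter(read_strains)
    # The "major" strain is the most frequent strain in the cluster.
    # On a tie, Counter.most_common returns the label seen first.
    major_strain, major_count = counts.most_common(1)[0]
    # Quality is the proportion of reads that come from the major strain.
    return major_strain, major_count / len(read_strains)


# Hypothetical cluster: three reads from strain S1, one from strain S2.
major, quality = cluster_quality(["S1", "S1", "S1", "S2"])
print(major, quality)  # S1 0.75
```

A quality of 1.0 means every read in the cluster originates from a single strain; lower values indicate reads from other strains mixed into the cluster.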
Extended Data Fig. 8 Additional simulations to establish Strainy phasing limits.
(A) Completeness and assembly size of individual E. coli strains for different levels of strain heterozygosity. Each data point shows the mean of 14 values and the error bands show a 95% confidence interval. (B, C) Completeness of individual E. coli strain pairs for a given read coverage depth to establish minimum depth (B) and maximum strain abundance ratio (C) required for Strainy assembly.
Extended Data Fig. 9 Additional examples of IGV visualization of strain structural variants.
(A) A 63 bp indel in a region of MAG 122. (B) A 50 bp indel in a region of MAG 87. (C) A 247 bp indel in a region of MAG 87.
Extended Data Fig. 10 Additional analysis of strain mutations.
(A, B) Substitution mutation signatures of the activated sludge dataset derived from Oxford Nanopore Technologies (ONT) and PacBio HiFi strain assemblies. (C) dN/dS substitution ratio for AMR-related genes inside individual MAGs assembled from the activated sludge dataset.
About this article
Cite this article
Kazantseva, E., Donmez, A., Frolova, M. et al. Strainy: phasing and assembly of strain haplotypes from long-read metagenome sequencing. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02424-1
DOI: https://doi.org/10.1038/s41592-024-02424-1