BugSplit enables genome-resolved metagenomics through highly accurate taxonomic binning of metagenomic assemblies

Chandrakumar, Induja; Gauthier, Nick P. G.; Nelson, Cassidy; Bonsall, Michael B.; Locher, Kerstin; Charles, Marthe; MacDonald, Clayton; Krajden, Mel; Manges, Amee R.; Chorlton, Samuel D.

doi:10.1038/s42003-022-03114-4

Article
Open access
Published: 22 February 2022

Communications Biology volume 5, Article number: 151 (2022)

Abstract

A large gap remains between sequencing a microbial community and characterizing all of the organisms inside of it. Here we develop a novel method to taxonomically bin metagenomic assemblies through alignment of contigs against a reference database. We show that this workflow, BugSplit, bins metagenome-assembled contigs to species with a 33% absolute improvement in F1-score when compared to alternative tools. We perform nanopore mNGS on patients with COVID-19, and using a reference database predating COVID-19, demonstrate that BugSplit’s taxonomic binning enables sensitive and specific detection of a novel coronavirus not possible with other approaches. When applied to nanopore mNGS data from cases of Klebsiella pneumoniae and Neisseria gonorrhoeae infection, BugSplit’s taxonomic binning accurately separates pathogen sequences from those of the host and microbiota, and unlocks the possibility of sequence typing, in silico serotyping, and antimicrobial resistance prediction of each organism within a sample. BugSplit is available at https://bugseq.com/academic.

Introduction

Automated genome-resolved metagenomics, including the identification and characterization of members of a microbial community, has remained challenging despite improvements in sequencing technology and bioinformatic analysis¹. A community initiative for the Critical Assessment of Metagenomics Interpretation (CAMI) has tracked progress toward this goal over time and posited two main challenges for metagenomic next-generation sequencing (mNGS): profiling and binning². Metagenomic profiling aims to quantify the presence/absence and abundance of organisms in a microbial community, and has seen a marked improvement in the number and performance of tools from the first to the second CAMI challenge³.

Metagenomic binning aims to place sequences (usually assembled contigs) from the same organism in a unique bin, enabling the study of each organism within complex microbial communities. Metagenomic binning can further be divided into supervised and unsupervised approaches, where unsupervised approaches use sequence information such as tetranucleotide frequencies and sequencing depth to bin sequences, while supervised approaches use reference sequence databases and previously generated information to bin sequences⁴. While unsupervised binners have improved over recent years, their inability to assign taxonomic labels to bins precludes downstream analyses such as profiling the presence/absence of specific organisms, performing organism-specific analyses, or identifying sequences of concern (e.g., novel pathogens with sequence homology to known pathogens)³. The COVID-19 pandemic has highlighted the need for such improvements, with the aim of ensuring early availability of pathogen-agnostic diagnostics when outbreaks of novel strains emerge⁵.

Earlier work on taxonomic binning has relied on amino acid alignments of assembled contigs to a universal protein database^6,7,8,9. These workflows allow for identification of divergent sequences, but do not leverage the non-coding and synonymous variation within contigs, nor do they incorporate the positional relationship of classifier features (e.g., co-localization of proteins) into taxonomic classification. An alternative approach relies on a search for taxa-specific conserved features; however, this approach is limited by recall, where contigs missing the conserved marker cannot be classified, and to date, no approaches encompass conserved markers spanning archaea, bacteria, viruses and eukaryotes^10,11. Finally, k-mer and minimizer-based approaches suffer from a lack of positional relationship between k-mers, and from an inability to resolve uncertainty when using a single k-mer or even base to break lowest common ancestor ties, as demonstrated in previous evaluations⁹.

In addition to software improvements, advancements in sequencing technology have enabled the production of higher quality metagenomic assemblies through long-read approaches, which have been posited to enable genome-resolved metagenomics^12,13. Third-generation sequencers, such as those by Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio), can generate reads up to megabases in length and contigs spanning full microbial genomes¹⁴. Traditionally, there has been a tradeoff between read length and sequencing quality; however, with the advent of high fidelity (HiFi) PacBio sequencing and improvements in ONT sequencing kits and basecallers, that gap is quickly disappearing^15,16,17. Additionally, the small size and cost-effectiveness of third-generation sequencers is poised to enable broad-scale adoption of this technology for mNGS, which has already been used across clinical, food, defense and environmental applications^18,19,20.

Here, we describe BugSplit, an automated, supervised method to bin contigs by taxonomic identity. The first and main difference between BugSplit and other workflows is that it utilizes local nucleotide alignments of contigs against a universal reference database, such as the NCBI nucleotide (nt) database or Refseq, for taxonomic binning of assemblies. Using full alignments captures the synonymous and non-coding variation, as well as the positional relationship of features, without needing to annotate protein coding regions or translate sequences. Several authors have posited that nucleotide alignment lacks sensitivity to classify divergent taxa^6,7; however, we show this assumption to be false when classifying contigs. Second, our approach leverages the average nucleotide identity (ANI) of alignments to collapse taxonomic assignments to higher ranks based on accepted ANI thresholds for defining taxa. Third, we incorporate uncertainty between alignments using a voting algorithm to collapse assignments up the taxonomic tree, and fourth, we adjust for known inaccuracies in reference databases by detecting and correcting misannotated contigs. We show that these advancements, especially when paired with long-read sequencing, enable the automated identification and characterization of microbes in complex microbial communities.

Results

Evaluation of taxonomic binning accuracy with mock microbial communities and known organisms

We first evaluate BugSplit using three commonly used benchmarking datasets generated with third-generation sequencers: the ZymoBIOMICS Even²¹ and Log²² datasets are mock microbial communities of eight bacteria and two yeasts, with varying abundance, sequenced on an ONT GridION²³, and the ZymoBIOMICS Gut Microbiome Standard²⁴ containing 19 bacteria (five of which are strains of the same species) and two yeasts sequenced on a PacBio Sequel II. We compare BugSplit with DIAMOND+MEGAN-LR and MMseqs2, two popular tools for taxonomic binning of contigs, and use the official CAMI AMBER tool to assess taxonomic binning performance^6,7,22,25. AMBER’s calculation of performance metrics has been previously reported and is summarized in the methods, along with methods for assembly and polishing of each metagenomic community. A graphic overview of the BugSplit workflow is shown in Fig. 1.

**Fig. 1: Overview of full BugSplit workflow and example of contig classification algorithm.**

On average, BugSplit binned contigs to species bins with an absolute F1-score 33.0% better than DIAMOND+MEGAN-LR and 54.9% better than MMseqs2. These advantages are maintained across all taxonomic ranks, but decrease in magnitude at higher ranks where the performance of DIAMOND+MEGAN-LR and MMseqs2 improves (Fig. 2 and Supplementary Table 1). BugSplit largely demonstrates superior accuracy by improving classification completeness while maintaining or surpassing the classification purity of alternative tools. At the species level, average classification purity was within 6% of DIAMOND+MEGAN-LR and MMseqs2, while average completeness exceeded DIAMOND+MEGAN-LR by 42.3% and MMseqs2 by 58.4%. Similarly, BugSplit was able to classify, to a species level, 36.6% more contigs than DIAMOND+MEGAN-LR and 50.6% more than MMseqs2, while misclassifying 0.5% fewer contigs than DIAMOND+MEGAN-LR and 3.3% more than MMseqs2. In general, all tools performed better on bins with higher sequencing coverage, as these bins contained less fragmented assemblies, and small bins were more likely to have lower completeness and purity than large bins produced by BugSplit (Supplementary Figs. 1–3). Finally, BugSplit achieves these improvements in accuracy without sacrificing execution time: BugSplit executed these processes faster than DIAMOND+MEGAN-LR and MMseqs2 on all three datasets by 100 min or more (Supplementary Table 2).

**Fig. 2: Performance of contig taxonomic classifiers across four datasets (Zymo Even, Zymo Log, Zymo Gut, and CAMI high complexity).**

To highlight the classification performance of BugSplit on a species with close sequence homology yet very different implications for public safety, we apply BugSplit to a case of human anthrax (Bacillus anthracis strain Ba0914), sequenced on a MinION at the Centers for Disease Control and Prevention^26,27. Our taxonomic binning pipeline classified the complete assembled genome in 1 h and 20 min as Bacillus anthracis. MMseqs2 took 3 h and 48 min to identify the bacterial genome as belonging to the family Bacillaceae, and DIAMOND+MEGAN-LR took 6 h 38 min to identify the genome as Bacillus (most specific taxonomic classification presented for all classifiers).

Evaluation of taxonomic binning accuracy with novel organisms

To demonstrate that nucleotide alignment retains and even improves on the ability to bin sequences divergent from those in public databases, we apply BugSplit and the other taxonomic binners to two datasets simulating novel organisms. We first analyze the CAMI High-Complexity dataset²⁸, a well characterized microbial community comprising 2.80 Gb of contigs from 596 novel organisms generated from a simulated Illumina HiSeq sequencing run, which has been used in other benchmarking studies². We use the CAMI version of Refseq from 2015, predating the publication of organisms in this dataset. BugSplit binned contigs with a genus-level absolute F1-score 16.0% and 15.2% better than DIAMOND+MEGAN-LR and MMseqs2, respectively, which was driven by both superior completeness and purity of bins (7–17.5% better purity, 13.5–14.3% better completeness) (Fig. 2). These advantages are maintained at the family rank, where BugSplit exhibited superior purity and completeness by 1% and 3%, respectively.

To demonstrate the impact of BugSplit’s performance on detecting a novel pathogen, we performed nanopore mNGS using sequence-independent single-primer amplification of two nasopharyngeal swabs from patients with COVID-19 and from one viral culture of SARS-CoV-2. We generated ~5000 to 24,000 reads per sample meeting stringent quality control settings, and assembled 6, 2, and 1 contigs spanning 2000 to 14,000 base pairs from each sample. We analyzed these contigs with BugSplit, DIAMOND+MEGAN-LR and MMseqs2, replacing their databases with an archive of the NCBI nucleotide database²⁹ from 2019, predating the emergence of SARS-CoV-2.

BugSplit successfully classified all contigs to the genus Betacoronavirus and, correctly, assigned none of them to a species, since no SARS-CoV-2 reference existed in the database. DIAMOND+MEGAN-LR overclassified two contigs, classifying them as Bat SARS-like coronavirus (parent species: Severe acute respiratory syndrome-related coronavirus), and failed to classify the other seven contigs. MMseqs2 classified eight of nine contigs as the less-specific Coronaviridae family.

Improved taxonomic binning enables highly accurate taxonomic profiling of third-generation metagenomic data

We hypothesized that accurate taxonomic assignment of contigs would improve compositional estimates derived from metagenome sequencing datasets. We have previously shown that alternative long-read taxonomic profilers vastly overestimate the number of species in a sample when applied to mNGS data: in our previous benchmarking study, BugSeq version 1, which was the top performing tool, estimated five times more species than the number of species truly present in a simple mock community (52 vs. 10 true species present)³⁰. This result was two orders of magnitude better than Centrifuge (>5000 species detected), the tool that currently drives ONT’s official platform EPI2ME^30,31. As metaFlye³² produces length and coverage output of each contig in a metagenomic assembly, we can combine this information with BugSplit taxonomic bin labels to compute the relative abundance of each taxon in a sample in <5 min for all third-generation datasets. We additionally include two read-level taxonomic profilers in our comparison: Centrifuge (given its popularity for third-generation metagenomic analysis) and BugSeq version 1 (which has previously been shown to outperform other classifiers including MetaMaps³³ and CDKAM³⁴ on ONT data). Comparison of tools was performed with standardized metrics calculated with OPAL³⁵.

Indeed, we find that BugSplit was the top OPAL-ranked tool among all of the metagenomic profilers assessed, and exhibited the top F1-score (0.81 vs. 0.63 for second-place MMseqs2) when calling species as present or absent across the three benchmark communities (Fig. 3 and Supplementary Table 3). Completeness was also highest after read-based profilers Centrifuge and BugSeq 1; however, Centrifuge called greater than 900 species as present across all datasets (true value: 10–21) and therefore had purity approaching 0, while BugSeq 1 had purity approaching 10%. In comparison, the purity of BugSplit was 89.4%. BugSplit assigned <0.02% of bases in each assembly to Saccharomyces pastorianus, a hybrid organism of Saccharomyces cerevisiae (truly present in each dataset) and Saccharomyces eubayanus; filtering these contigs out results in 100% purity for BugSplit across all three long-read datasets. We also find that BugSplit was most accurate in identifying the abundance of each species in two of three datasets, and was second to BugSeq 1 for the third dataset, the Zymo Log community (L1 norm error: 0.109 versus 0.073). Organisms that BugSplit failed to detect had the lowest abundance in each dataset, with the exception of Faecalibacterium prausnitzii, Veillonella rogosae and Prevotella corporis from the Zymo Gut dataset, which had their assembled contigs assigned to the genus level. Difficulty in detecting the lowest abundance organisms reflects difficulty in assembling them rather than classifying their contigs; in the Zymo Log dataset, multiple organisms have sequencing coverage below 0.005X and therefore contribute no assembled contigs.

**Fig. 3: Taxonomic profiling accuracy of five tools across three mock microbial communities sequenced with a long-read sequencer.**

Binning by taxonomic identity enables targeted downstream analysis using tools designed for single-organism whole-genome sequencing data

We demonstrate the importance of BugSplit’s species-level taxonomic binning by applying it to a previously reported case of hypervirulent Klebsiella pneumoniae liver abscess and its detection using mock blood cultures^36,37. Hypervirulence typing of K. pneumoniae is important for clinical care, as hypervirulent strains are more likely to cause life-threatening infections and are increasingly resistant to antimicrobials³⁸. Two mock blood culture samples from this patient were created by spiking the isolate into blood at 30 CFU/mL, incubating it in an automated blood culture system and sequencing positive blood culture bottles using a nanopore sequencer as previously described³⁶. BugSplit successfully detected the only microorganism as K. pneumoniae in both samples, and binned sequences belonging to K. pneumoniae together, producing bins 97.3% and 94.6% complete with <1% contamination using CheckM³⁹. In contrast, MMseqs2 produced bins 94.3% and 44.8% complete, and DIAMOND+MEGAN-LR produced bins 29.3% and 25.7% complete. To demonstrate the impact of binning completeness, we implement automatic bioinformatic analyses developed for K. pneumoniae whole-genome sequencing data (Kleborate⁴⁰) on each bin labeled as K. pneumoniae. Kleborate has recently been reported to accurately predict sequence type, serotype and other features of K. pneumoniae isolates from WGS data. Kleborate correctly called each BugSplit K. pneumoniae bin from both blood culture bottles as sequence type 11 and serotype K47, concordant with bacterial culture and traditional serotyping results. In contrast, Kleborate was unable to determine sequence type for both DIAMOND+MEGAN-LR bins and one of two MMseqs2 bins, and was additionally unable to determine serotype for one of two DIAMOND+MEGAN-LR bins.

By binning at the species level, we can also attribute antimicrobial resistance genes and mutations to specific organisms in a sample. We incorporate ResFinder⁴¹ into BugSplit and execute it automatically on all taxonomic bins, looking for variants that cause AMR specifically in the taxon assigned to each bin. We additionally implement a modification to ResFinder that automatically adjusts for the higher error rate of nanopore metagenomic assemblies, and corrects indels and erroneous stop codons in resistance conferring loci (see Methods, and publicly available in ResFinder version 4.2). We apply BugSplit and our modified ResFinder tool to a recent dataset⁴² of 10 urine samples from patients infected with Neisseria gonorrhoeae. Urine underwent mNGS using an ONT GridION, and cultured isolates from the same samples underwent Illumina sequencing for comparison, as previously reported⁴³.

BugSplit successfully identified N. gonorrhoeae in all 10 samples, ranging from 1.1 to 58.1% of sequenced bases (2.0 to 72.8% of microbial sequenced bases). In contrast to the original Centrifuge-based analysis of this data, and in accordance with the authors’ conclusions that reads classified as Neisseria lactamica likely reflect taxonomic misclassification of N. gonorrhoeae reads, we find no N. lactamica in any of the samples⁴³. Interestingly, we find one sample (original identifier: 301_UB_U) with co-infection with Ureaplasma urealyticum, another potential cause of urethritis. The original metagenomic analysis of this data does not discuss this finding, and this sequence was not found in any other samples.

BugSplit successfully assembled and binned together a median of 97.1% (range: 67.0–98.3%) of the N. gonorrhoeae genome, as determined by CheckM on the N. gonorrhoeae bins, with a maximum of 0.9% contamination. DIAMOND+MEGAN-LR and MMseqs2 binned a median of 11.4% (range: 0% to 26.7%) and 22.4% (range: 0% to 65.5%) of the N. gonorrhoeae genome, respectively, with a maximum of 0.2% contamination. When our modified ResFinder is run on the N. gonorrhoeae bins created with BugSplit, we identify variants conferring AMR to cephalosporins, quinolones, penicillins, macrolides and tetracyclines with 96.6% sensitivity (28/29 variants detected) and 100% specificity as compared with Illumina sequencing of cultured isolates from the same patients, and the authors’ original pathogen-specific analysis⁴⁴ (Table 1). In comparison, DIAMOND+MEGAN-LR detected 13.8% (4/29) and MMseqs2 detected 24.1% (7/29) of variants conferring AMR.

Table 1 Antimicrobial resistance prediction of BugSplit applied to nanopore mNGS of urine, compared with Illumina isolate sequencing, of Neisseria gonorrhoeae infections.

Discussion

As mNGS becomes more ubiquitous and is applied to clinical, environmental, biodefense and other samples encompassing incredible microbial diversity, methods to characterize the genome and functional capacity of each organism in a sample become increasingly important. Here, we demonstrate that nucleotide alignment of contigs against a reference database enables significantly improved taxonomic binning of metagenomic assemblies when compared to tools that rely on protein alignments. To evaluate the performance of this approach, we performed taxonomic binning of simulated and real sequencing data from three sequencing technologies and microbial communities containing one to over 500 organisms. BugSplit can classify organisms with available reference genomes to the species level with significantly greater completeness than alternative approaches while preserving bin purity. Our results on simulated novel organisms demonstrate that nucleotide alignment retains sensitivity to classify divergent organisms, with precision to place them on the taxonomic tree at the appropriate rank.

We demonstrate through several use cases how these improvements in taxonomic binning unlock downstream analyses not feasible with current taxonomic binners. Our application of mNGS and BugSplit to the detection of SARS-CoV-2 before any reference sequences were available highlights the power of broadly deploying mNGS with optimal taxonomic binning for pathogen surveillance and pandemic prevention by detecting pathogens earlier. The robust binning provided by BugSplit also allows for automated analysis of taxonomic bins using tools designed for single-organism whole-genome sequencing data. We demonstrate that it is possible to accurately predict the sequence type, serotype and antimicrobial resistance of organisms directly from clinical samples such as blood cultures and urine. We additionally demonstrate how taxonomic binning may be used to identify unknown organisms of bioterrorism potential, such as Bacillus anthracis. As diagnostic microbiology laboratories adopt NGS for bacterial isolate and metagenomic sequencing, automated tools to detect pathogens and leverage the existing body of pathogen-specific bioinformatic analyses will enable a faster, easier transition to NGS. We anticipate that this technology will be broadly useful for the detection and characterization of organisms from diverse samples, and can be greatly expanded upon to support analyses that are application and domain specific.

We anticipate several improvements that will further refine taxonomic bins produced by BugSplit. The most substantial impact is likely to come from including assembly graph topology in the binning algorithm to improve strain-level resolution. Currently, metagenomic assemblers only output a single contig for conserved regions between different strains of a single species; using BugSplit, these contigs are assigned to the common species and placed in a species-level bin. By incorporating graph topology and linkage of contigs, we will be able to mitigate this limitation and place the contig in multiple strain-level taxonomic bins. Further exploration of the parameter space of BugSplit may also result in improved binning. For example, minimap2 could be tuned for greater alignment recall than its default “map-ont” setting provides while preserving precision, and voting coverage thresholds could be tuned for improved classification of contigs across the taxonomic hierarchy. Ultimately, we expect to adopt a strategy that will allow optimal values for key parameters to be determined by the taxonomic lineage of alignments.

BugSplit is a highly accurate tool for taxonomic binning and profiling of third-generation metagenomic data with computing speeds faster than comparable workflows. We show that using BugSplit to bin metagenomic assemblies has several substantial downstream effects, including enabling highly similar species discrimination and identification, novel species identification and universal, pathogen-agnostic taxonomic profiling. When combined with automated assembly, polishing and post-processing of bins, we demonstrate that detecting pathogens, strain-typing them and accurately predicting their antimicrobial resistance directly from complex samples with mNGS becomes feasible.

Methods

BugSplit preprocessing

BugSplit uses Nextflow running on AWS Batch to orchestrate processing of sequencing data in the cloud. In brief, nanopore reads undergo demultiplexing and adapter trimming with qcat⁴⁵. Nanopore reads and PacBio reads (now also supported by the pipeline) next undergo quality control with prinseq-lite⁴⁶, filtering reads with a mean Phred score <7, a DUST complexity score <7 or a read length <100 base pairs. Finally, reads are aligned with minimap2⁴⁷ using default parameters against a database containing common non-microbial host genomes, including human, mouse, rat, pig, cow and chicken, to focus assembly on microorganisms. Reads unaligned to host genomes are retained and progress to assembly.
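The read-level quality thresholds above can be illustrated with a minimal Python sketch. This is not prinseq-lite's implementation: it assumes Phred+33 quality encoding, takes the arithmetic mean of per-base scores, and leaves out the DUST complexity filter entirely. The function names `mean_phred` and `passes_basic_qc` are hypothetical.

```python
def mean_phred(quality_string: str, offset: int = 33) -> float:
    """Arithmetic mean of per-base Phred scores in a FASTQ quality string.

    Assumes Phred+33 encoding (an assumption, not stated in the text).
    """
    if not quality_string:
        raise ValueError("empty quality string")
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)


def passes_basic_qc(sequence: str, quality_string: str,
                    min_mean_phred: float = 7.0, min_length: int = 100) -> bool:
    """Keep a read only if it meets the length and mean-quality thresholds.

    The DUST complexity filter described in the text is not modelled here.
    """
    if len(sequence) != len(quality_string):
        raise ValueError("sequence and quality string differ in length")
    if len(sequence) < min_length:
        return False
    return mean_phred(quality_string) >= min_mean_phred
```

A read of 100 bases with uniformly high quality passes, while a read that is too short or whose mean score falls below 7 is discarded.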

After preprocessing, reads are assembled with metaFlye³², preserving strain heterogeneity. Assemblies built from ONT R9.4.1 or R10.3 reads undergo four rounds of Racon⁴⁸ polishing, one round of Medaka⁴⁹ and one round of Homopolish⁵⁰, in accordance with recent assembly benchmarking⁵¹. A mash database⁵², published by the mash authors and comprising all genomes and plasmid sequences in Refseq (https://gembox.cbcb.umd.edu/mash/refseq.genomes%2Bplasmid.k21s1000.msh), is used for homology search with Homopolish. Racon and Medaka are executed on GPU-equipped g4dn-class instances via AWS Batch. PacBio HiFi and ONT Q20+ assemblies do not undergo polishing beyond that included in metaFlye. An entire GridION flowcell using R9.4.1 pores can be assembled and polished to Q40 in <6.5 h, ~5 h faster than using CPUs alone (Supplementary Table 4).

Taxonomic binning of contigs

Contigs are first aligned to a reference database such as RefSeq or the NCBI nucleotide database (nt) with minimap2. We use the default ‘map-ont’ preset of minimap2, as it provides the greatest sensitivity for nucleotide alignment out of all minimap2 presets and performs comparably to nucleotide BLAST⁵³. We evaluated replacing minimap2 with an alternative local nucleotide aligner, discontiguous nucleotide megaBLASTN⁵⁴; however, this approach was too slow for practical purposes. As alignments are made to individual genomes representing a single strain of an organism, the taxonomic identification of each retained alignment is reassigned to internal nodes on the taxonomic tree based on average nucleotide identity. Based on the previous identification of ANI thresholds to define a species and genus, as well as the current error rates for metagenomic assembly and the lack of strain representation in public reference databases, we reassign any alignment to the reference database with 95–99% ANI to the species level, with 62–94.9% ANI to the genus level, and with <62% ANI to the superkingdom (highest rank before root) level^55,56,57,58,59. Alignments with greater than 99% ANI are retained at the strain level.
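As a minimal sketch of the ANI-based reassignment described above, the thresholds can be written as a small lookup function. The function name `rank_for_ani` is hypothetical, and the handling of boundary values is an assumption: the text does not say where identities between 94.9% and 95% fall, so this sketch treats anything below 95% (and at least 62%) as genus, and exactly 99% as species.

```python
def rank_for_ani(ani_percent: float) -> str:
    """Return the rank at which an alignment's taxonomic label is kept.

    Thresholds follow the text; boundary handling is an assumption.
    """
    if not 0.0 <= ani_percent <= 100.0:
        raise ValueError("ANI must be a percentage between 0 and 100")
    if ani_percent > 99.0:
        return "strain"        # >99% ANI: label retained at the strain level
    if ani_percent >= 95.0:
        return "species"       # 95-99% ANI
    if ani_percent >= 62.0:
        return "genus"         # 62-94.9% ANI
    return "superkingdom"      # <62% ANI: highest rank before root
```

For example, an alignment at 97% identity to a reference strain would contribute its bases to that strain's species rather than to the strain itself.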

As minimap2 randomly picks a primary alignment if there are multiple alignments with equal top score, we collapse equally good top hits to their lowest common ancestor. Alignments to collapse are identified as secondary alignments with equal dynamic programming score of the max scoring segment in the alignment (“ms” minimap2 SAM tag) to a non-secondary alignment, covering the exact same region of a query contig as the non-secondary alignment.
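The tie-collapsing step can be sketched as follows. This is an illustration under stated assumptions rather than the BugSplit code: each alignment is represented by a hypothetical `Hit` record carrying its query interval, its "ms" score and a root-to-leaf lineage as a list of names, and the lowest common ancestor is taken as the longest shared prefix of those lineages.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Hit:
    """One alignment of a contig region (hypothetical record layout)."""
    query_start: int
    query_end: int
    ms_score: int            # value of minimap2's "ms" tag
    lineage: List[str]       # taxon names, root-most first


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Longest root-to-node path shared by every lineage."""
    shared: List[str] = []
    for nodes in zip(*lineages):
        if len(set(nodes)) == 1:
            shared.append(nodes[0])
        else:
            break
    return shared


def collapse_tied_hits(primary: Hit, secondaries: Sequence[Hit]) -> List[str]:
    """Collapse a primary hit with equally scoring secondaries over the same region."""
    tied = [primary.lineage]
    for hit in secondaries:
        same_region = (hit.query_start, hit.query_end) == (primary.query_start, primary.query_end)
        if same_region and hit.ms_score == primary.ms_score:
            tied.append(hit.lineage)
    return lowest_common_ancestor(tied)
```

If a contig region aligns equally well to Bacillus anthracis and Bacillus cereus, the collapsed label is the genus Bacillus; a lower-scoring secondary alignment does not affect the result.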

Next, we implement a voting algorithm to assign contigs to the taxonomic node encompassing a certain percentage of all bases in the contig, aggregating the alignments from above. We again parameterize this vote using accepted definitions of species and previous studies utilizing ANI, requiring 95% and 70% of bases in a contig to map to a strain or species for the contig to be assigned to that strain or species^60,61,62, respectively. For ranks above species (e.g., genus), we use a majority vote, assigning the contig to the deepest taxon encompassing at least 50% of all bases in a contig, as this approach has previously been reported to perform well⁶³. In summary, for a contig to be assigned to a species, it must have at least 70% of its bases with 95% or more ANI mapped to a reference sequence.
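A minimal sketch of this voting scheme is given below. It assumes that alignments have already been reduced to non-overlapping segments, each with a base count and a lineage of (rank, name) pairs ending at the ANI-adjusted node, and that lineages share a consistent depth per rank. The name `vote` and the data layout are hypothetical; only the 95%, 70% and 50% thresholds come from the text.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Lineage = List[Tuple[str, str]]   # (rank, name) pairs, root-most first

THRESHOLDS = {"strain": 0.95, "species": 0.70}   # from the text
DEFAULT_THRESHOLD = 0.50                          # majority vote above species


def vote(contig_length: int, segments: List[Tuple[int, Lineage]]) -> Optional[Tuple[str, str]]:
    """Assign a contig to the deepest taxon supported by enough of its bases.

    Returns (rank, name), or None if no taxon reaches its threshold.
    """
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    support: Dict[Tuple[int, str, str], int] = defaultdict(int)
    for n_bases, lineage in segments:
        # A segment supports its own node and every ancestor of that node.
        for depth, (rank, name) in enumerate(lineage):
            support[(depth, rank, name)] += n_bases
    # Try the deepest nodes first; break ties by the amount of support.
    ordered = sorted(support.items(), key=lambda kv: (-kv[0][0], -kv[1]))
    for (_, rank, name), bases in ordered:
        if bases / contig_length >= THRESHOLDS.get(rank, DEFAULT_THRESHOLD):
            return rank, name
    return None
```

With a 1000 bp contig where 600 bp support one strain of a species and a further 200 bp support the species only, the strain falls short of 95% but the species reaches 80%, so the contig is assigned to the species.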

NCBI assigns plasmids to the taxonomic identity of the host bacteria in which they were first sequenced^64,65. This can be misleading due to plasmid conjugation and the ability of plasmids to replicate in organisms across phylogenetic subgroups. We implement a mechanism to recover and correct taxon labels of plasmid sequences. In brief, plasmid sequences are identified with PlasmidFinder⁶⁶, and their taxonomic identities are overridden to that of “plasmid sequences” (NCBI taxon 36549). Full commands and versions for each program are available in Supplementary Note 1.
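The label override itself is simple once plasmid contigs are known. The sketch below assumes the set of plasmid contig identifiers has already been obtained (for example from PlasmidFinder results, whose parsing is not shown) and that contig labels are stored as a mapping from contig ID to NCBI taxon ID; the function name and data layout are hypothetical.

```python
from typing import Dict, Set

PLASMID_TAXID = 36549  # "plasmid sequences" taxon, as given in the text


def override_plasmid_labels(contig_taxids: Dict[str, int],
                            plasmid_contigs: Set[str]) -> Dict[str, int]:
    """Return a copy of the labels with plasmid contigs reassigned."""
    return {
        contig: PLASMID_TAXID if contig in plasmid_contigs else taxid
        for contig, taxid in contig_taxids.items()
    }
```

Contigs not flagged as plasmids keep the label assigned by the voting step.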

Evaluation of alternative alignment

We attempted to align the CAMI high-complexity assembly against a DUST-masked CAMI Refseq database from 2015 using the following command:

“blastn -task dc-megablast -template_type optimal -template_length 18 -best_hit_overhang 0.1 -best_hit_score_edge 0.1 -num_threads 32 -db dust/refseq_2015_db -out CAMI_blast.txt -evalue 0.01 -db_soft_mask 11 -query CAMI.fna”. Execution was not complete at seven days of runtime, and was therefore terminated.

Abundance calculation

The relative abundance of a taxon t, in terms of percent of total nucleic acid in a sample, can be approximated using the sequencing depth and length of all contigs c assigned to t (c_t), divided by the total size of sequencing data (Equation 1).

$$\mathrm{Relative\ abundance}(t)=\frac{\sum \left[\mathrm{depth}\left(c_{t}\right)\times \mathrm{length}\left(c_{t}\right)\right]}{\sum \mathrm{read\ lengths}}$$

Equation 1: Calculation of taxon abundance based on contig classification, length and depth of sequencing.

Abundance in bases can be summed up the taxonomic tree to calculate cumulative bases assigned to each taxon, yielding relative abundance at all ranks, in an approach similar to Sczyrba et al.².
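Equation 1 and the roll-up of bases to higher ranks can be sketched in a few lines of Python. The input layout is an assumption: each contig is a (taxon, depth, length) tuple, as could be derived from metaFlye's per-contig length and coverage output combined with bin labels, and the taxonomy is a child-to-parent dictionary. Function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Contig = Tuple[str, float, int]   # (assigned taxon, sequencing depth, length in bp)


def bases_per_taxon(contigs: Iterable[Contig]) -> Dict[str, float]:
    """Numerator of Equation 1: sum of depth x length over each taxon's contigs."""
    totals: Dict[str, float] = defaultdict(float)
    for taxon, depth, length in contigs:
        totals[taxon] += depth * length
    return dict(totals)


def relative_abundance(contigs: Iterable[Contig], total_read_bases: float) -> Dict[str, float]:
    """Equation 1: bases attributed to each taxon over the sum of read lengths."""
    if total_read_bases <= 0:
        raise ValueError("total_read_bases must be positive")
    return {t: b / total_read_bases for t, b in bases_per_taxon(contigs).items()}


def roll_up(bases: Dict[str, float], parent: Dict[str, Optional[str]]) -> Dict[str, float]:
    """Add each taxon's bases to itself and to every ancestor in the tree."""
    cumulative: Dict[str, float] = defaultdict(float)
    for taxon, n_bases in bases.items():
        node: Optional[str] = taxon
        while node is not None:
            cumulative[node] += n_bases
            node = parent.get(node)
    return dict(cumulative)
```

Dividing the rolled-up totals by the sum of read lengths then gives relative abundance at every rank.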

Accuracy assessment and comparison to alternative tools

We use AMBER²⁵ version 2.0.2 to assess the performance of each tool binning contigs to taxa. Bin completeness was calculated as the average fraction of true-positive base pairs in each predicted bin from the true bin size. Bin purity was calculated as the average fraction of true-positive base pairs in each predicted bin. We use OPAL³⁵ version 1.0.10 to assess the taxonomic profiling performance of each tool. The default OPAL ranking scheme was used to identify the top taxonomic profiler. OPAL’s purity is calculated as the number of taxa correctly predicted as present in a sample divided by all predicted taxa at that rank. OPAL’s completeness is calculated as the number of taxa correctly predicted as present in a sample divided by all taxa present at that rank. Completeness and purity for both AMBER²⁵ and OPAL³⁵ range from 0 (worst) to 1 (best). For both tools, completeness is analogous to recall and purity is analogous to precision. A related metric, contamination, can be regarded as the opposite of purity and reflects the fraction of incorrect sequence data assigned to a bin. Further calculation details are available in their respective original publications.
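To make the metric definitions concrete, the sketch below computes purity and completeness for a single bin from base-pair counts and combines them into an F1-score as their harmonic mean. It is an illustration of the definitions summarized above, not AMBER's or OPAL's code; how those tools average across bins and samples is described in their publications. Function names are hypothetical.

```python
def bin_purity(true_positive_bp: int, predicted_bin_bp: int) -> float:
    """Fraction of a predicted bin's base pairs that belong to the correct taxon."""
    if predicted_bin_bp <= 0:
        raise ValueError("predicted bin must contain at least one base pair")
    return true_positive_bp / predicted_bin_bp


def bin_completeness(true_positive_bp: int, true_bin_bp: int) -> float:
    """Fraction of the true bin's base pairs recovered in the predicted bin."""
    if true_bin_bp <= 0:
        raise ValueError("true bin must contain at least one base pair")
    return true_positive_bp / true_bin_bp


def f1_score(purity: float, completeness: float) -> float:
    """Harmonic mean of purity (precision) and completeness (recall)."""
    if purity + completeness == 0:
        return 0.0
    return 2 * purity * completeness / (purity + completeness)
```

A bin holding 100 bp, of which 80 bp are correct, drawn from a true genome of 160 bp has purity 0.8, completeness 0.5 and contamination 0.2 (one minus purity).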

MMseqs2 and DIAMOND were run with the NCBI non-redundant amino acid database as suggested by their authors (Supplementary Note 1). All databases were downloaded on May 15, 2021, and all tools were run with 96 threads and 768 GB of RAM available to them. The CAMI comparison used the 2015 RefSeq database and NCBI taxonomy as provided by the CAMI authors². The built-in taxonomy of MEGAN-LR was replaced by placing ncbi.tre and ncbi.map, a Newick-formatted NCBI taxonomy, in the working directory. These files were generated by converting the NCBI taxonomy files (names.dmp and nodes.dmp) provided with the CAMI datasets into Newick format with the Python taxonomy package⁶⁷.

Ground truth classifications were generated for all datasets except for CAMI, which used the gold standard contig classifications provided by CAMI. Ground truths were generated by comparing each contig in our metagenomic assembly to the reference genome of each organism contained within the mock microbial community using MegaBLASTN. The taxonomic identification of the top BLAST hit for each contig was determined to be its gold standard assignment. Sequencing depth of each bin for the ZymoBIOMICS Even and Log datasets was extracted from data presented by Nicholls et al.²¹, and sequencing depth for the ZymoBIOMICS Gut dataset was calculated with CoverM using the “-x map-hifi --secondary=no” minimap2 preset⁶⁸.
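
The top-hit assignment can be sketched in Python as below. This sketch assumes the BLAST results are in the standard 12-column tabular format (output format 6, with the bit score in the last column) and that the "top" hit is the one with the highest bit score; both are assumptions made for illustration, as are the function name, the reference identifiers and the subject-to-taxon mapping.

```python
import csv
import io


def top_hit_taxa(blast_tsv, subject_to_taxon):
    """Assign each contig the taxon of its highest-scoring BLAST hit.

    blast_tsv: iterable of lines in 12-column BLAST tabular format (assumed).
    subject_to_taxon: dict mapping reference sequence ID -> taxon (hypothetical).
    """
    best = {}
    for row in csv.reader(blast_tsv, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        contig, subject, bitscore = row[0], row[1], float(row[11])
        # keep the first hit seen with the highest bit score for each contig
        if contig not in best or bitscore > best[contig][1]:
            best[contig] = (subject, bitscore)
    return {contig: subject_to_taxon[hit[0]] for contig, hit in best.items()}


# Illustrative alignments (not real data)
example = io.StringIO(
    "contig_1\tref_A\t99.1\t5000\t40\t5\t1\t5000\t10\t5009\t0.0\t9000\n"
    "contig_1\tref_B\t90.0\t3000\t300\t0\t1\t3000\t1\t3000\t0.0\t4000\n"
    "contig_2\tref_B\t98.5\t2000\t30\t0\t1\t2000\t50\t2049\t0.0\t3500\n"
)
gold = top_hit_taxa(
    example,
    {"ref_A": "Bacillus subtilis", "ref_B": "Listeria monocytogenes"},
)
```

Here contig_1 aligns to both references but is labelled with the taxon of ref_A, its higher-scoring hit.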

Application to detection of an emerging coronavirus, hypervirulent Klebsiella pneumoniae, and Neisseria gonorrhoeae infections

For the detection of an emerging coronavirus, nasopharyngeal swabs (n = 2) were collected as part of routine testing at Vancouver General Hospital during Fall 2020 (ORF1ab C_t values = 14.7, 20.6) and cultured SARS-CoV-2 viral particles (RdRp C_t value = 18.3) were obtained from the BC Centre for Disease Control Public Health Laboratory. Both clinical samples and cultured virions were extracted and randomly amplified through sequence-independent single-primer amplification as previously described⁶⁹. Samples were sequenced on Oxford Nanopore MinION devices and basecalling was performed with Guppy (Oxford Nanopore Technologies). Ethics approval for collection of nasopharyngeal swabs was obtained from the University of British Columbia (H20-02152).

For the application to human anthrax, hypervirulent K. pneumoniae, and N. gonorrhoeae infections, raw data were downloaded from the NCBI accessions listed below in Data Availability and submitted to BugSplit. In brief, reads were preprocessed, assembled and polished as detailed above in BugSplit preprocessing. Bin completeness and contamination were assessed with CheckM using the default CheckM database. The NCBI nucleotide database from 2019 was downloaded from the second CAMI challenge (https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_2_DATABASES/ncbi_blast/nt.gz) and used in place of BugSplit’s default database for the emerging coronavirus application.

Modifications to ResFinder to accommodate for insertions, deletions, and stop codons in assemblies with high error rates

By default, ResFinder⁴¹ (which includes the PointFinder⁷⁰ module) performs a BLASTN⁵⁴ alignment of the query assembly against a database of resistance loci. PointFinder scans each alignment and identifies all differences, including insertions and deletions (indels), between the query assembly and reference locus for further annotation. In the event of a stop codon (nucleotides TAG, TAA or TGA) within a locus, PointFinder terminates its search for variants in the region upstream to the stop codon. Full details are available in the original PointFinder⁷⁰ methods. We base our modifications to PointFinder on the previously demonstrated observation that frameshifts and stop codons in third-generation assemblies are more likely to reflect sequencing and assembly errors than true sequence variation⁷¹,⁷². We modify PointFinder to not halt its search for variants along a resistance locus if it encounters a stop codon. We additionally modify PointFinder to shift alignments around indels, maintaining the reading frame, in an approach similar to more general frameshift correction tools⁷¹,⁷². Our modified PointFinder has been incorporated into ResFinder version 4.2 and can be activated with the “-ii” (Ignore Indels) and “-ic” (Ignore stop Codons) flags.
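
The effect of ignoring stop codons can be illustrated with a toy Python sketch. This is not PointFinder's code: it assumes two gap-free, in-frame sequences of equal length, halts at the first premature stop codon in the query unless told otherwise, and does not reproduce the indel-shifting step or the exact search behaviour of PointFinder. The function name and parameter are hypothetical.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}


def codon_differences(reference, query, ignore_stop_codons=False):
    """List codon-level differences between two in-frame, gap-free sequences.

    Returns (codon_number, reference_codon, query_codon) tuples. Unless
    ignore_stop_codons is True, scanning halts at the first stop codon that
    appears in the query but not in the reference.
    """
    differences = []
    usable = min(len(reference), len(query))
    for start in range(0, usable - 2, 3):
        ref_codon = reference[start:start + 3]
        qry_codon = query[start:start + 3]
        if ref_codon != qry_codon:
            differences.append((start // 3 + 1, ref_codon, qry_codon))
        premature_stop = qry_codon in STOP_CODONS and ref_codon not in STOP_CODONS
        if premature_stop and not ignore_stop_codons:
            break  # default-like behaviour: stop scanning at the stop codon
    return differences


reference = "ATGGCTGAAACC"  # ATG GCT GAA ACC
query = "ATGGCTTAAACT"      # ATG GCT TAA ACT (stop codon at codon 3)

halted = codon_differences(reference, query)
tolerant = codon_differences(reference, query, ignore_stop_codons=True)
```

With the default setting only the stop codon at codon 3 is reported, whereas the tolerant mode also reports the substitution at codon 4 that lies beyond it.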

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

Source data for the figures presented in this paper are available in Supplementary Data 1–3. A public instance of BugSplit is freely available for academic use at https://bugseq.com/academic. Acceptable inputs include FASTQ files from one or more samples sequenced on an Illumina, PacBio or ONT sequencer. Paired Illumina FASTQ files are also accepted. Outputs comprise taxonomic profiling in visual (HTML) and Kraken-report format, taxonomic bins in FASTA format, and additional bin-specific analyses as detailed above in textual and visual formats. Benchmarking data were downloaded from: Bacillus anthracis whole-genome nanopore sequencing: SRA accession SRR10088696; ZymoBIOMICS Even nanopore mNGS: SRA accession ERR3152364; ZymoBIOMICS Log nanopore mNGS: SRA accession ERR3152366; ZymoBIOMICS Gut PacBio HiFi mNGS: SRA accession SRR13128014; CAMI High Complexity gold standard assembly and ground truth labels: https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_I_HIGH using the CAMI downloader; hypervirulent Klebsiella pneumoniae nanopore mNGS data: NCBI BioProject PRJNA663005; Neisseria gonorrhoeae nanopore mNGS data: NCBI BioProject PRJEB35173; NCBI nt database from 2019: https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_2_DATABASES/ncbi_blast/nt.gz. Newly generated COVID-19 nanopore mNGS data have been deposited under NCBI BioProject accession number PRJNA766077. Our error-tolerant mode has been integrated into ResFinder/PointFinder and is freely available at: https://bitbucket.org/genomicepidemiology/resfinder/src/master/. Error-tolerant mode can be activated with the “-ii” and “-ic” command line flags.

Code availability

Standalone executable code of BugSplit (for academic use only) has been deposited on Zenodo⁷³ at https://doi.org/10.5281/zenodo.5826348.

References

Kayani, M. U. R., Huang, W., Feng, R. & Chen, L. Genome-resolved metagenomics using environmental and clinical samples. Brief. Bioinform. 22, bbab030 (2021).
Sczyrba, A. et al. Critical assessment of metagenome interpretation—a benchmark of metagenomics software. Nat. Methods 14, 1063–1071 (2017).
Meyer, F. et al. Critical Assessment of Metagenome Interpretation-the second round of challenges. Preprint at bioRxiv https://doi.org/10.1101/2021.07.12.451567 (2021).
Breitwieser, F. P., Lu, J. & Salzberg, S. L. A review of methods and databases for metagenomic classification and assembly. Brief. Bioinform. 20, 1125–1136 (2019).
Vandenberg, O., Martiny, D., Rochas, O., van Belkum, A. & Kozlakidis, Z. Considerations for diagnostic COVID-19 tests. Nat. Rev. Microbiol. 19, 171–183 (2021).
Mirdita, M., Steinegger, M., Breitwieser, F., Söding, J. & Levy Karin, E. Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics 37, 3029–3031 (2021).
Huson, D. H. et al. MEGAN-LR: new algorithms allow accurate binning and easy interactive exploration of metagenomic long reads and contigs. Biol. Direct 13, 6 (2018).
Bağcı, C., Patz, S. & Huson, D. H. DIAMOND+MEGAN: fast and easy taxonomic and functional analysis of short and long microbiome sequences. Curr. Protoc. 1, e59 (2021).
von Meijenfeldt, F. A. B., Arkhipova, K., Cambuy, D. D., Coutinho, F. H. & Dutilh, B. E. Robust taxonomic classification of uncharted microbial sequences and bins with CAT and BAT. Genome Biol. 20, 217 (2019).
Gregor, I., Dröge, J., Schirmer, M., Quince, C. & McHardy, A. C. PhyloPythiaS+: a self-training method for the rapid reconstruction of low-ranking taxonomic bins from metagenomes. PeerJ 4, e1603 (2016).
Chaumeil, P. A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics 36, 1925–1927 (2020).
Gehrig, J. L. et al. Finding the right fit: A comprehensive evaluation of short-read and long-read sequencing approaches to maximize the utility of clinical microbiome data. Preprint at bioRxiv https://doi.org/10.1101/2021.08.31.458285 (2021).
Malmstrom, R. R. & Eloe-Fadrosh, E. A. Advancing genome-resolved metagenomics beyond the shotgun. mSystems 4, e00118–e00119 (2019).
Payne, A., Holmes, N., Rakyan, V. & Loose, M. BulkVis: a graphical viewer for Oxford nanopore bulk FAST5 files. Bioinformatics 35, 2193–2198 (2019).
Lang, D. et al. Comparison of the two up-to-date sequencing technologies for genome assembly: HiFi reads of pacific biosciences sequel II system and ultralong reads of Oxford Nanopore. GigaScience 9, giaa123 (2020).
Lal, A. et al. Improving long-read consensus sequencing accuracy with deep learning. Preprint at bioRxiv https://doi.org/10.1101/2021.06.28.450238 (2021).
Wick, R. R., Judd, L. M. & Holt, K. E. Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome Biol. 20, 129 (2019).
Petersen, L. M., Martin, I. W., Moschetti, W. E., Kershaw, C. M. & Tsongalis, G. J. Third-generation sequencing in the clinical laboratory: exploring the advantages and challenges of nanopore sequencing. J. Clin. Microbiol. 58, e01315–e01319 (2019).
Maguire, M. et al. Precision long-read metagenomics sequencing for food safety by detection and assembly of Shiga toxin-producing Escherichia coli in irrigation water. PLoS ONE 16, e0245172 (2021).
Urban, L. et al. Freshwater monitoring by nanopore sequencing. eLife 10, e61504 (2021).
University of Birmingham, UK. Zymo-EVEN. NCBI SRA https://www.ncbi.nlm.nih.gov/sra/?term=ERR3152364 (University of Birmingham, 2019).
University of Birmingham, UK. Zymo-LOG. NCBI SRA https://www.ncbi.nlm.nih.gov/sra/?term=ERR3152366 (University of Birmingham, 2019).
Nicholls, S. M., Quick, J. C., Tang, S. & Loman, N. J. Ultra-deep, long-read nanopore sequencing of mock microbial community standards. GigaScience 8, giz043 (2019).
Pacific Biosciences. Zymo D6331 PacBio Standard Input Library. NCBI SRA https://www.ncbi.nlm.nih.gov/sra/?term=SRR13128014 (Pacific Biosciences, 2020).
Meyer, F. et al. AMBER: assessment of metagenome BinnERs. GigaScience 7, giy069 (2018).
McLaughlin, H. P. et al. Rapid nanopore whole-genome sequencing for anthrax emergency preparedness. Emerg. Infect. Dis. 26, 358–361 (2020).
Centers for Disease Control and Prevention-Zoonoses and Select Agent Laboratory (CDC-ZSAL). MinION WGS of Bacillus anthracis Ba0914. NCBI SRA https://www.ncbi.nlm.nih.gov/sra/?term=SRR10088696 (CDC-ZSAL, 2020).
CAMI High Complexity Dataset. https://data.cami-challenge.org/ (2015).
NCBI. Nucleotide (nt) Database. (NCBI, 2019).
Fan, J., Huang, S. & Chorlton, S. D. BugSeq: a highly accurate cloud platform for long-read metagenomic analyses. BMC Bioinforma. 22, 160 (2021).
Kim, D., Song, L., Breitwieser, F. P. & Salzberg, S. L. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 26, 1721–1729 (2016).
Kolmogorov, M. et al. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat. Methods 17, 1103–1110 (2020).
Dilthey, A. T., Jain, C., Koren, S. & Phillippy, A. M. Strain-level metagenomic assignment and compositional estimation for long reads with MetaMaps. Nat. Commun. 10, 3066 (2019).
Bui, V. K. & Wei, C. CDKAM: a taxonomic classification tool using discriminative k-mers and approximate matching strategies. BMC Bioinforma. 21, 468 (2020).
Meyer, F. et al. Assessing taxonomic metagenome profilers with OPAL. Genome Biol. 20, 51 (2019).
Zhou, M. et al. Comprehensive pathogen identification, antibiotic resistance, and virulence genes prediction directly from simulated blood samples and positive blood cultures by nanopore metagenomic sequencing. Front. Genet. 12, 244 (2021).
Beijing Applied Biological Technologies Company. Klebsiella pneumoniae (ID 663005). NCBI BioProject https://www.ncbi.nlm.nih.gov/bioproject/PRJNA663005/ (Beijing Applied Biological Technologies Company, 2020).
Russo, T. A. & Marr, C. M. Hypervirulent Klebsiella pneumoniae. Clin. Microbiol. Rev. 32, e00001–e00019 (2019).
Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
Lam, M. M. C. et al. A genomic surveillance framework and genotyping tool for Klebsiella pneumoniae and its related species complex. Nat. Commun. 12, 4188 (2021).
Bortolaia, V. et al. ResFinder 4.0 for predictions of phenotypes from genotypes. J. Antimicrob. Chemother. 75, 3491–3500 (2020).
University of Oxford, Oxford, England, UK. Direct urine sample N. gonorrhoeae Nanopore sequencing. (University of Oxford, 2020).
Street, T. L. et al. Optimizing DNA extraction methods for nanopore sequencing of Neisseria gonorrhoeae directly from urine samples. J. Clin. Microbiol. 58, e01822–19 (2019).
Sanderson, N. D. et al. High precision Neisseria gonorrhoeae variant and antimicrobial resistance calling from metagenomic Nanopore sequencing. Genome Res. 30, 1354–1363 (2020).
qcat. (Oxford Nanopore Technologies, 2021).
Schmieder, R. & Edwards, R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 27, 863–864 (2011).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Vaser, R., Sović, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).
Medaka. (Oxford Nanopore Technologies, 2021).
Huang, Y. T., Liu, P. Y. & Shih, P. W. Homopolish: a method for the removal of systematic errors in nanopore sequencing by homologous polishing. Genome Biol. 22, 95 (2021).
Latorre-Pérez, A., Villalba-Bermell, P., Pascual, J. & Vilanova, C. Assembly methods for nanopore-based metagenomic sequencing: a comparative study. Sci. Rep. 10, 13588 (2020).
Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 132 (2016).
Li, H. What parameters best resmble blastn. minimap2 GitHub https://github.com/lh3/minimap2/issues/54 (2017).
Morgulis, A. et al. Database indexing for production MegaBLAST searches. Bioinformatics 24, 1757–1764 (2008).
Ciufo, S. et al. Using average nucleotide identity to improve taxonomic assignments in prokaryotic genomes at the NCBI. Int. J. Syst. Evol. Microbiol. 68, 2386–2392 (2018).
Kim, M., Oh, H. S., Park, S. C. & Chun, J. Towards a taxonomic coherence between average nucleotide identity and 16S rRNA gene sequence similarity for species demarcation of prokaryotes. Int. J. Syst. Evol. Microbiol. 64, 346–351 (2014).
Richter, M. & Rosselló-Móra, R. Shifting the genomic gold standard for the prokaryotic species definition. Proc. Natl Acad. Sci. USA 106, 19126–19131 (2009).
Barco, R. A. et al. A Genus definition for bacteria and archaea based on a standard genome relatedness index. mBio 11, e02475–19 (2020).
Federhen, S. et al. Toward richer metadata for microbial sequences: replacing strain-level NCBI taxonomy taxids with BioProject, BioSample and Assembly records. Stand. Genom. Sci. 9, 1275 (2014).
Goris, J. et al. DNA–DNA hybridization values and their relationship to whole-genome sequence similarities. Int. J. Syst. Evol. Microbiol. 57, 81–91 (2007).
Konstantinidis, K. T. & Tiedje, J. M. Genomic insights that advance the species definition for prokaryotes. Proc. Natl Acad. Sci. USA 102, 2567–2572 (2005).
Konstantinidis, K. T., Ramette, A. & Tiedje, J. M. Toward a more robust assessment of intraspecies diversity, using fewer genetic markers. Appl. Environ. Microbiol. 72, 7286–7293 (2006).
Hanson, N. W., Konwar, K. M. & Hallam, S. J. LCA*: an entropy-based measure for taxonomic assignment within assembled metagenomes. Bioinformatics 32, 3535–3542 (2016).
Robertson, J., Bessonov, K., Schonfeld, J. & Nash, J. H. E. Y. Universal whole-sequence-based plasmid typing and its utility to prediction of host range and epidemiological surveillance. Microb. Genomics 6, e000435 (2020).
Schoch, C. L. et al. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database 2020, baaa062 (2020).
Carattoli, A. & Hasman, H. PlasmidFinder and in silico pMLST: identification and typing of plasmid replicons in whole-genome sequencing (WGS). Methods Mol. Biol. Clifton NJ 2075, 285–294 (2020).
Bovee, R. Taxonomy. (One Codex, 2021).
Woodcroft, B. J. CoverM. (Centre for Microbiome Research, School of Biomedical Sciences, Faculty of Health, Queensland University of Technology, 2021).
Gauthier, N. P. G. et al. Nanopore metagenomic sequencing for detection and characterization of SARS-CoV-2 in clinical samples. PLoS ONE 16, e0259712 (2021).
Zankari, E. et al. PointFinder: a novel web tool for WGS-based detection of antimicrobial resistance associated with chromosomal point mutations in bacterial pathogens. J. Antimicrob. Chemother. 72, 2764–2768 (2017).
Arumugam, K. et al. Annotated bacterial chromosomes from frame-shift-corrected long-read metagenomic data. Microbiome 7, 61 (2019).
Hackl, T. et al. proovframe: frameshift-correction for long-read (meta)genomics. Preprint at bioRxiv https://doi.org/10.1101/2021.08.23.457338 (2021).
Chandrakumar, I. et al. BugSplit: highly accurate taxonomic binning of metagenomic assemblies enables genome-resolved metagenomics (1.0.0). Zenodo. https://doi.org/10.5281/zenodo.5826348 (2021).

Author information

Authors and Affiliations

BugSeq Bioinformatics Inc, Vancouver, BC, Canada
Induja Chandrakumar & Samuel D. Chorlton
Department of Microbiology and Immunology, University of British Columbia, Vancouver, BC, Canada
Nick P. G. Gauthier
Mathematical Ecology Research Group, Department of Zoology, University of Oxford, Oxford, UK
Cassidy Nelson & Michael B. Bonsall
Division of Medical Microbiology, Vancouver General Hospital, Vancouver, BC, Canada
Kerstin Locher, Marthe Charles & Clayton MacDonald
Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC, Canada
Kerstin Locher, Marthe Charles, Clayton MacDonald, Mel Krajden & Samuel D. Chorlton
British Columbia Centre for Disease Control, Vancouver, BC, Canada
Mel Krajden & Amee R. Manges
School of Population and Public Health, University of British Columbia, Vancouver, BC, Canada
Amee R. Manges

Contributions

S.D.C. developed and implemented the BugSplit concept. S.D.C. and I.C. benchmarked BugSplit. K.L., M.C., and C.M. collected and provided COVID-19 samples for sequencing. N.P.G.G. and A.R.M. sequenced COVID-19 samples. I.C., N.P.G.G., C.N., M.B.B., K.L., M.K., S.D.C., and A.R.M. edited the manuscript. S.D.C. supervised the project. All authors read and approved the manuscript.

Corresponding author

Correspondence to Samuel D. Chorlton.

Ethics declarations

Competing interests

The authors declare the following competing interests: I.C. is an employee of BugSeq Bioinformatics Inc. S.D.C. is a shareholder of BugSeq Bioinformatics Inc. The other authors declare no competing interests.

Peer review

Peer review information

Communications Biology thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Brooke LaFlamme.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplemental Material

Description of Additional Supplementary Files

Supplementary Data 1

Supplementary Data 2

Supplementary Data 3

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Chandrakumar, I., Gauthier, N.P.G., Nelson, C. et al. BugSplit enables genome-resolved metagenomics through highly accurate taxonomic binning of metagenomic assemblies. Commun Biol 5, 151 (2022). https://doi.org/10.1038/s42003-022-03114-4

Received: 03 November 2021
Accepted: 03 February 2022
Published: 22 February 2022
DOI: https://doi.org/10.1038/s42003-022-03114-4
