Introduction

Automated genome-resolved metagenomics, including the identification and characterization of members of a microbial community, has remained challenging despite improvements in sequencing technology and bioinformatic analysis1. A community-initiative for the Critical Assessment of Metagenomics Interpretation (CAMI) has tracked the progress of this goal over time and posited two main challenges for metagenomic next-generation sequencing (mNGS): profiling and binning2. Metagenomic profiling aims to quantify the presence/absence and abundance of organisms in a microbial community, and has seen a marked improvement in the number and performance of tools from the first to the second CAMI challenge3.

Metagenomic binning aims to place sequences (usually assembled contigs) from the same organism in a unique bin, enabling the study of each organism within complex microbial communities. Metagenomic binning can further be divided into supervised and unsupervised approaches, where unsupervised approaches use sequence information such as tetranucleotide repeat counts and sequencing depth to bin sequences, while supervised approaches use reference sequence databases and previously generated information to bin sequences4. While unsupervised binners have improved over recent years, their lack of ability to assign taxonomic labels to bins precludes downstream analyses such as profiling the presence/absence of specific organisms, performing organism-specific analyses, or identifying sequences of concern (e.g., novel pathogens with sequence homology to known pathogens)3. The COVID-19 pandemic has highlighted the need for such improvements, with the aim of ensuring early availability of pathogen-agnostic diagnostics when outbreaks of novel strains emerge5.

Earlier work on taxonomic binning has relied on amino acid alignments of assembled contigs to a universal protein database6,7,8,9. These workflows allow for identification of divergent sequences, but do not leverage the non-coding and synonymous variation within contigs, nor the positional relationship of classifier features (e.g., co-localization of proteins) into taxonomic classification. An alternative approach relies on a search for taxa-specific conserved features; however, this approach is limited by recall, where contigs missing the conserved marker cannot be classified, and to date, no approaches encompass conserved markers spanning archaea, bacterial, viruses and eukaryotes10,11. Finally, k-mer and minimizer-based approaches suffer from lack of positional relationship between k-mers, and lack of ability to resolve uncertainty when using a single k-mer or even base to break lowest common ancestor ties, as demonstrated in previous evaluations9.

In addition to software improvements, advancements in sequencing technology have enabled the production of higher quality metagenomic assemblies through long-read approaches, which have been posited to enable genome-resolved metagenomics12,13. Third-generation sequencers, such as those by Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio), can generate reads up to megabases in length and contigs spanning full microbial genomes14. Traditionally, there has been a tradeoff between read length and sequencing quality; however, with the advent of high fidelity (HiFi) PacBio sequencing and improvements in ONT sequencing kits and basecallers, that gap is quickly disappearing15,16,17. Additionally, the small size and cost-effectiveness of third-generation sequencers is poised to enable broad-scale adoption of this technology for mNGS, which has already been used across clinical, food, defense and environmental applications18,19,20.

Here, we describe BugSplit, an automated, supervised method to bin contigs by taxonomic identity. The main difference between our workflow, BugSplit, and others is that it utilizes local nucleotide alignments of contigs against a universal reference database, such as the NCBI nucleotide (nt) database or Refseq, for taxonomic binning of assemblies. Using full alignments captures the synonymous and non-coding variation, as well as the positional relationship of features, without needing to annotate protein coding regions or translate sequences. Several authors have posited that nucleotide alignment lacks sensitivity to classify divergent taxa6,7; however, we show this assumption to be false when classifying contigs. Second, our approach leverages the absolute nucleotide identity (ANI) of alignments to collapse taxonomic assignments to higher ranks based on accepted ANI thresholds for defining taxa. Third, we incorporate uncertainty between alignments using a voting algorithm to collapse assignments up the taxonomic tree, and fourth, we adjust for known inaccuracies in reference databases by detecting and correcting misannotated contigs. We show that these advancements, especially when paired with long-read sequencing, enable the automated identification and characterization of microbes in complex microbial communities.

Results

Evaluation of taxonomic binning accuracy with mock microbial communities and known organisms

We first evaluate BugSplit using three commonly used benchmarking datasets generated with third-generation sequencers: the ZymoBIOMICS Even21 and Log22 datasets are mock microbial communities of eight bacteria and two yeasts, with varying abundance, sequenced on an ONT GridION23, and the ZymoBIOMICS Gut Microbiome Standard24 containing 19 bacteria (five of which are strains of the same species) and two yeasts sequenced on a PacBio Sequel II. We compare BugSplit with DIAMOND+MEGAN-LR and MMseqs2, two popular tools for taxonomic binning of contigs, and use the official CAMI AMBER tool to assess taxonomic binning performance6,7,22,25. AMBER’s calculation of performance metrics has been previously reported and is summarized in the methods, along with methods for assembly and polishing of each metagenomic community. A graphic overview of the BugSplit workflow is shown in Fig. 1.

Fig. 1: Overview of full BugSplit workflow and example of contig classification algorithm.
figure 1

a Flow of data through the BugSplit workflow. Rectangles represent data points, diamonds represent processes, and circles represent forks in analysis. b Example application of contig classification algorithm. Alignments against the reference database are first collapsed up the taxonomic tree based on absolute nucleotide identity. A base-level vote is then performed across all bases of a contig, determining the final taxonomic assignment of the contig based on rank-specific majority thresholds.

On average, BugSplit binned contigs to species bins with an absolute F1-score 33.0% better than DIAMOND+MEGAN-LR and 54.9% better than MMseqs2. These advantages are maintained across all taxonomic ranks, but decrease in magnitude at higher ranks where the performance of DIAMOND+MEGAN-LR and MMseqs2 improves (Fig. 2 and Supplementary Table 1). BugSplit largely demonstrates superior accuracy by improving classification completeness while maintaining or surpassing the classification purity of alternative tools. At the species level, average classification purity was within 6% of DIAMOND+MEGAN-LR and MMseqs2, while average completeness exceeded DIAMOND+MEGAN-LR by 42.3% and MMseqs2 by 58.4%. Similarly, BugSplit was able to classify, to a species level, 36.6% more contigs than DIAMOND+MEGAN-LR and 50.6% more than MMseqs2, while misclassifying 0.5% fewer contigs than DIAMOND+MEGAN-LR and 3.3% more than MMseqs2. In general, all tools performed better on bins with higher sequencing coverage, as these bins contained less fragmented assemblies, and small bins were more likely to have lower completeness and purity than large bins produced by BugSplit (Supplementary Figs. 13). Finally, BugSplit achieves these improvements in accuracy without sacrificing execution time: BugSplit executed these processes faster than DIAMOND+MEGAN-LR and MMseqs2 on all three datasets by 100 min or more (Supplementary Table 2).

Fig. 2: Performance of contig taxonomic classifiers across four datasets (Zymo Even, Zymo Log, Zymo Gut, and CAMI high complexity).
figure 2

a Average bin completeness across taxonomic ranks. b average purity across taxonomic ranks. Shaded bands show the standard error of the metrics in a and b. c BugSplit produces more complete bins with less contamination compared with alternative taxonomic binners.

To highlight the classification performance of BugSplit on a species with close sequence homology yet very different implications for public safety, we apply BugSplit to a case of human anthrax (Bacillus anthracis strain Ba0914), sequenced on a MinION at the Centers for Disease Control and Prevention26,27. Our taxonomic binning pipeline classified the complete assembled genome in 1 h and 20 min as Bacillus anthracis. MMseqs2 took 3 h and 48 min to identify the bacterial genome as belonging to the family Bacillaceae, and DIAMOND+MEGAN-LR took 6 h 38 min to identify the genome as Bacillus (most specific taxonomic classification presented for all classifiers).

Evaluation of taxonomic binning accuracy with novel organisms

To demonstrate that nucleotide alignment retains and even improves on the ability to bin divergent sequences from those in public databases, we apply BugSplit and the other taxonomic binners to two datasets simulating novel organisms. We first analyze the CAMI High-Complexity dataset28, a well characterized microbial community comprising 2.80 Gb of contigs from 596 novel organisms generated from a simulated Illumina HiSeq sequencing run, which has been used in other benchmarking studies2. We use the CAMI version of Refseq from 2015, predating the publication of organisms in this dataset. BugSplit binned contigs with a genus-level absolute F1-score 16.0% and 15.2% better than DIAMOND+MEGAN-LR and MMseqs2, respectively, which was driven by both superior completeness and purity of bins (7–17.5% better purity, 13.5–14.3% better completeness) (Fig. 2). These advantages are maintained at the family rank, where BugSplit exhibited superior purity and completeness by 1% to 3%, respectively.

To demonstrate the impact of BugSplit’s performance on detecting a novel pathogen, we performed nanopore mNGS using sequence-independent single-primer amplification of two nasopharyngeal swabs from patients with COVID-19 and from one viral culture of SARS-CoV-2. We generated ~5000 to 24,000 reads per sample meeting stringent quality control settings, and assembled 6, 2, and 1 contigs spanning 2000 to 14,000 base pairs from each sample. We analyzed these contigs with BugSplit, DIAMOND+MEGAN-LR and MMseqs2, replacing their databases with an archive of the NCBI nucleotide database29 from 2019, predating the emergence of SARS-CoV-2.

BugSplit successfully classified all contigs to the genus Betacoronavirus, and correctly classified none of them to the species level. DIAMOND+MEGAN-LR overclassified two contigs, classifying them as Bat SARS-like coronavirus (parent species: Severe acute respiratory syndrome-related coronavirus), and failed to classify the other seven contigs. MMseqs2 classified eight of nine contigs as the less-specific Coronaviridae family.

Improved taxonomic binning enables highly accurate taxonomic profiling of third-generation metagenomic data

We hypothesized that accurate taxonomic assignment of contigs would improve compositional estimates derived from metagenome sequencing datasets. We have previously shown that alternative long-read taxonomic profilers vastly overestimate the number of species in a sample when applied to mNGS data: in our previous benchmarking study, BugSeq version 1, which was the top performing tool, estimated five times more species than the number of species truly present in a simple mock community (52 vs. 10 true species present)30. This result was two orders of magnitude better than Centrifuge (>5000 species detected), the tool that currently drives ONT’s official platform EPI2ME30,31. As metaFlye32 produces length and coverage output of each contig in a metagenomic assembly, we can combine this information with BugSplit taxonomic bin labels to compute the relative abundance of each taxon in a sample in <5 min for all third-generation datasets. We additionally include two read-level taxonomic profilers in our comparison: Centrifuge (given its popularity for third-generation metagenomic analysis) and BugSeq version 1 (which has previously been shown to outperform other classifiers including MetaMaps33 and CDKAM34 on ONT data). Comparison of tools was performed with standardized metrics calculated with OPAL35.

Indeed, we find that BugSplit was the top OPAL-ranked among all of the metagenomic profilers assessed, and exhibited the top F1-score (0.81 vs. 0.63 for second-place MMseqs2) when calling species as present or absent across the three benchmark communities (Fig. 3 and Supplementary Table 3). Completeness was also highest after read-based profilers Centrifuge and BugSeq 1; however, Centrifuge called greater than 900 species as present across all datasets (true value: 10–21) and therefore had purity approaching 0, while BugSeq 1 had purity approaching 10%. In comparison, the purity of BugSplit was 89.4%. BugSplit assigned <0.02% of bases in each assembly to Saccharomyces pastorianus, a hybrid organism of Saccharomyces cerevisiae (truly present in each dataset) and Saccharomyces eubayanus; filtering these contigs out results in 100% purity for BugSplit across all three long-read datasets. We also find that BugSplit was most accurate identifying the abundance of each species in two of three datasets, and was second to BugSeq 1 for the third dataset, the Zymo Log community (L1 norm error: 0.109 versus 0.073). Organisms that BugSplit failed to detect had the lowest abundance in each dataset, with the exception of Faecalibacterium prausnitzii, Veillonella rogosae and Prevotella corporis from the Zymo Gut dataset, which had their assembled contigs assigned to the genus level. Difficulty in detecting the lowest abundance contigs reflects difficulty in assembling them rather than classifying their contigs; in the Zymo Log dataset, multiple organisms have sequencing coverage below 0.005X and therefore contribute no assembled contigs.

Fig. 3: Taxonomic profiling accuracy of five tools across three mock microbial communities sequenced with a long-read sequencer.
figure 3

a Greater bin completeness reflects better taxonomic profiling. b Greater bin purity reflects better taxonomic profiling. c Lower Bray-Curtis distance reflects better taxonomic profiling. Shaded bands show the standard error of the metrics in a, b, and c.

Binning by taxonomic identity enables targeted downstream analysis using tools designed for single-organism whole-genome sequencing data

We demonstrate the importance of BugSplit’s species-level taxonomic binning by applying it to a previously reported case of hypervirulent Klebsiella pneumoniae liver abscess and its detection using mock blood cultures36,37. Hypervirulence typing of K. pneumoniae is important for clinical care, as hypervirulent strains are more likely to cause life-threatening infections and are increasingly resistant to antimicrobials38. Two mock blood culture samples from this patient were created by spiking the isolate into blood at 30 CFU/mL, incubating it in an automated blood culture system and sequencing positive blood culture bottles using a nanopore sequencer as previously described36. BugSplit successfully detected the only microorganism as K. pneumoniae in both samples, and binned sequences belonging to K. pneumoniae together, producing bins 97.3% and 94.6% complete with <1% contamination using CheckM39. In contrast, MMseqs2 produced bins 94.3% and 44.8% complete, and DIAMOND + MEGAN-LR produced bins 29.3 and 25.7% complete. To demonstrate the impact of binning completeness, we implement automatic bioinformatic analyses developed for K. pneumoniae whole-genome sequencing data (Kleborate40) on each bin labeled as K. pneumoniae. Kleborate has recently been reported to accurately predict sequence type, serotype and other features of K. pneumoniae isolates from WGS data. Kleborate correctly called each BugSplit K. pneumoniae bin from both blood culture bottles as sequence type 11 and serotype K47, concordant with bacterial culture and traditional serotyping results. In contrast, Kleborate was unable to determine sequence type for both DIAMOND + MEGAN-LR bins and one of two MMseqs2 bins, and was additionally unable to determine serotype for one of two DIAMOND + MEGAN-LR bins.

By binning at the species level, we can also attribute antimicrobial resistance genes and mutations to specific organisms in a sample. We incorporate ResFinder41 into BugSplit and execute it automatically on all taxonomic bins, looking for variants that cause AMR specifically in the taxon assigned to each bin. We additionally implement a modification to ResFinder that automatically adjusts for the higher error rate of nanopore metagenomic assemblies, and corrects indels and erroneous stop codons in resistance conferring loci (see Methods, and publicly available in ResFinder version 4.2). We apply BugSplit and our modified ResFinder tool to a recent dataset42 of 10 urine samples from patients infected with Neisseria gonorrhoeae. Urine underwent mNGS using an ONT GridION, and cultured isolates from the same samples underwent Illumina sequencing for comparison, as previously reported43.

BugSplit successfully identified N. gonorrhoeae in all 10 samples, ranging from 1.1 to 58.1% of sequenced bases (2.0 to 72.8% of microbial sequenced bases). In contrast to the original Centrifuge-based analysis of this data, and in accordance with the authors’ conclusions that reads classified as Neisseria lactamica likely reflect taxonomic misclassification of N. gonorrhoeae reads, we find no N. lactamica in any of the samples43. Interestingly, we find one sample (original identifier: 301_UB_U) with co-infection with Ureaplasma urealyticum, another potential cause of urethritis. The original metagenomic analysis of this data does not discuss this finding, and this sequence was not found in any other samples.

BugSplit successfully assembled and binned together a median of 97.1% (range: 67.0–98.3%) of the N. gonorrhoeae genome, as determined by CheckM on the N. gonorrhoeae bins, with a maximum of 0.9% contamination. DIAMOND+MEGAN-LR and MMseqs2 binned a median of 11.4% (range: 0% to 26.7%) and 22.4% (range: 0% to 65.5%) of the N. gonorrhoeae genome, respectively, with a maximum of 0.2% contamination. When our modified ResFinder is run on the N. gonorrhoeae bins created with BugSplit, we identify variants conferring AMR to cephalosporins, quinolones, penicillins, macrolides and tetracyclines with 96.6% sensitivity (28/29 variants detected) and 100% specificity as compared with Illumina sequencing of cultured isolates from the same patients, and the authors’ original pathogen-specific analysis44 (Table 1). In comparison, DIAMOND+MEGAN-LR detected 13.8% of variants (4/29) and MMseqs2 detected 24.1% (7/29) variants conferring AMR.

Table 1 Antimicrobial resistance prediction of BugSplit applied to nanopore mNGS of urine, compared with Illumina isolate sequencing, of Neisseria gonorrhoeae infections.

Discussion

As mNGS becomes more ubiquitous and is applied to clinical, environmental, biodefense and other samples encompassing incredible microbial diversity, methods to characterize the genome and functional capacity of each organism in a sample become increasingly important. Here, we demonstrate that nucleotide alignment of contigs against a reference database enables significantly improved taxonomic binning of metagenomic assemblies when compared to tools that rely on protein alignments. To evaluate the performance of this approach, we performed taxonomic binning of simulated and real sequencing data from three sequencing technologies and microbial communities containing one to over 500 organisms. BugSplit can classify organisms with available reference genomes to the species level with significantly greater completeness than alternative approaches while preserving bin purity. Our results on simulated novel organisms demonstrate that nucleotide alignment retains sensitivity to classify divergent organisms, with precision to place them on the taxonomic tree at the appropriate rank.

We demonstrate through several use cases how these improvements in taxonomic binning unlock downstream analyses not feasible with current taxonomic binners. Our application of mNGS and BugSplit to the detection of SARS-CoV-2 before any reference sequences were available highlights the power of broadly deploying mNGS with optimal taxonomic binning for pathogen surveillance and pandemic prevention by detecting pathogens earlier. The robust binning provided by BugSplit also allows for automated analysis of taxonomic bins using tools designed for single-organism whole-genome sequencing data. We demonstrate that it is possible to accurately predict the sequence type, serotype and antimicrobial resistance of organisms directly from clinical samples such as blood cultures and urine. We additionally demonstrate how taxonomic binning may be used to identify unknown organisms of bioterrorism potential, such as Bacillus anthracis. As diagnostic microbiology laboratories adopt NGS for bacterial isolate and metagenomic sequencing, automated tools to detect pathogens and leverage the existing body of pathogen-specific bioinformatic analyses will enable a faster, easier transition to NGS. We anticipate that this technology will be broadly useful for the detection and characterization of organisms from diverse samples, and can be greatly expanded upon to support analyses that are application and domain specific.

We anticipate several improvements that will further refine taxonomic bins produced by BugSplit. The most substantial impact is likely to be the inclusion of assembly graph topology into the binning algorithm to improve strain-level resolution. Currently, metagenomic assemblers only output a single contig for conserved regions between different strains of a single species; using BugSplit, these contigs are assigned to the common species and placed in a species-level bin. By incorporating graph topology and linkage of contigs, we will be able to mitigate this limitation and place the contig in multiple strain-level taxonomic bins. Further exploration of the parameter space of BugSplit may also result in improved binning. For example, minimap2 could be tuned for greater alignment recall while preserving precision than its default “map-ont” setting, and voting coverage thresholds may be able to be tuned for improved classification of contigs across the taxonomic hierarchy. Ultimately, we expect to adopt a strategy that will allow optimal values for key parameters to be determined by the taxonomic lineage of alignments.

BugSplit is a highly accurate tool for taxonomic binning and profiling of third-generation metagenomic data with computing speeds faster than comparable workflows. We show that using BugSplit to bin metagenomic assemblies has several substantial downstream effects, including enabling highly similar species discrimination and identification, novel species identification and universal, pathogen-agnostic taxonomic profiling. When combined with automated assembly, polishing and post-processing of bins, we demonstrate that detecting pathogens, strain-typing them and accurately predicting their antimicrobial resistance directly from complex samples with mNGS becomes feasible.

Methods

BugSplit preprocessing

BugSplit uses Nextflow running on AWS Batch to orchestrate processing of sequencing data in the cloud. In brief, nanopore reads undergo demultiplexing and adapter trimming with qcat45. Nanopore and PacBio (now added into the pipeline) next undergo quality control with prinseq-lite46, filtering reads with a mean Phred score <7, a DUST complexity score <7 or a read length <100 base pairs. Finally, reads are aligned with minimap247 using default parameters against a database containing common non-microbial host genomes, including human, mouse, rat, pig, cow and chicken, to focus assembly on microorganisms. Reads unaligned to host genomes are retained and progress to assembly.

After preprocessing, reads are assembled with metaFlye32, preserving strain heterogeneity. Assemblies built from ONT R9.4.1 or R10.3 reads undergo four rounds of Racon48 polishing, one round of Medaka49 and one round of Homopolish50, in accordance with recent assembly benchmarking51. A mash database52, published by the mash authors and comprising all genomes and plasmid sequences in Refseq (https://gembox.cbcb.umd.edu/mash/refseq.genomes%2Bplasmid.k21s1000.msh) is used for homology search with Homopolish. Racon and Medaka are executed on g4dn-class instances via AWS Batch. PacBio HiFi and ONT Q20 + assemblies do not undergo polishing beyond that included in metaFlye. An entire GridION flowcell using R9.4.1 pores can be assembled and polished to Q40 in <6.5 h, ~5 h faster than using CPUs alone (Supplementary Table 4).

Taxonomic binning of contigs

Contigs are first aligned to a reference database such as RefSeq or NCBI nucleotide database (nt) with minimap2. We use the default ‘map-ont’ preset of minimap2, as it provides the greatest sensitivity for nucleotide alignment out of all minimap2 presets and performs comparably to nucleotide BLAST53. We evaluated replacing minimap2 with an alternative local nucleotide aligner, discontiguous nucleotide megaBLASTN54, however this approach was too slow for practical purposes. As alignments are made to individual genomes representing a single strain of an organism, the taxonomic identification of each retained alignment is reassigned to internal nodes on the taxonomic tree based on absolute nucleotide identity. Based on the previous identification of ANI thresholds to define a species and genus, as well as the current error rates for metagenomic assembly and the lack of strain representation in public reference databases, we reassign any alignment to the reference database with 95–99% ANI to the species level, with 62–94.9% ANI to the genus level, and with <62% ANI to the superkingdom (highest rank before root) level55,56,57,58,59. Alignments with greater than 99% ANI are retained at the strain level.

As minimap2 randomly picks a primary alignment if there are multiple alignments with equal top score, we collapse equally good top hits to their lowest common ancestor. Alignments to collapse are identified as secondary alignments with equal dynamic programming score of the max scoring segment in the alignment (“ms” minimap2 SAM tag) to a non-secondary alignment, covering the exact same region of a query contig as the non-secondary alignment.

Next, we implement a voting algorithm to assign contigs to the taxonomic node encompassing a certain percentage of all bases in the contig, aggregating the alignments from above. We again parameterize this vote using accepted definitions of species and previous studies utilizing ANI, requiring 95% and 70% of bases in a contig to map to a strain or species for the contig to be assigned to that strain or species60,61,62, respectively. For ranks above species (e.g., genus), we use a majority vote, assigning the contig to the deepest taxon encompassing at least 50% of all bases in a contig, as this approach has previously been reported to perform well63. In summary, for a contig to be assigned to a species, it must have at least 70% of its bases with 95% or more ANI mapped to a reference sequence.

NCBI assigns plasmids to the taxonomy identity of their host bacteria in which they were first sequenced64,65. This can be misleading due to plasmid conjugation and the ability of plasmids to replicate in organisms across phylogenetic subgroups. We implement a mechanism to recover and correct taxon labels of plasmid sequences. In brief, plasmid sequences are identified with PlasmidFinder66, and their taxonomic identities are overridden to that of “plasmid sequences” (NCBI taxon 36549). Full commands and versions for each program are available in Supplementary Note 1.

Evaluation of alternative alignment

We attempted to align the CAMI high-complexity assembly against a DUST-masked CAMI Refseq database from 2015 using the following command:

“blastn -task dc-megablast -template_type optimal -template_length 18 -best_hit_overhang 0.1 -best_hit_score_edge 0.1 -num_threads 32 -db dust/refseq_2015_db -out CAMI_blast.txt -evalue 0.01 -db_soft_mask 11 -query CAMI.fna”. Execution was not complete at seven days of runtime, and was therefore terminated.

Abundance calculation

The relative abundance of a taxon t, in terms of percent of total nucleic acid in a sample, can be approximated using the sequencing depth and length of all contigs c assigned to t (ct), divided by the total size of sequencing data (Equation 1).

$${{{{{{\mathrm{Relative}}}}}}}\;{{{{{{\mathrm{abundance}}}}}}}\left(t\right)=\frac{\sum \left[{{{{{{\mathrm{depth}}}}}}}\left({c}_{t}\right)\times {{{{{{\mathrm{leng}}}}}}}{{{{{{\mathrm{th}}}}}}}\left({c}_{t}\right)\right]}{\sum {{{{{{\mathrm{read}}}}}}}\;{{{{{{\mathrm{lengths}}}}}}}}$$

Equation 1: Calculation of taxon abundance based on contig classification, length and depth of sequencing.

Abundance in bases can be summed up the taxonomic tree to calculate cumulative bases assigned to each taxon, yielding relative abundance at all ranks, in an approach similar to Sczyrba et al.2.

Accuracy assessment and comparison to alternative tools

We use AMBER25 version 2.0.2 to assess the performance of each tool binning contigs to taxa. Bin completeness was calculated as the average fraction of true-positive base pairs in each predicted bin from the true bin size. Bin purity was calculated as the average fraction of true-positive base pairs in each predicted bin. We use OPAL35 version 1.0.10 to assess the taxonomic profiling performance of each tool. The default OPAL ranking scheme was used to identify the top taxonomic profiler. OPAL’s purity is calculated as the number of taxa correctly predicted as present in a sample divided by all predicted taxa at that rank. OPAL’s completeness is calculated as the number of taxa correctly predicted as present in a sample divided by all taxa present at that rank. Completeness and purity for both AMBER25 and OPAL35 range from 0 (worst) to 1 (best). For both tools, completeness is analogous to recall and purity is analogous to precision. A related metric, contamination, can be regarded as the opposite of purity and reflects the fraction of incorrect sequence data assigned to a bin. Further calculation details are available in their respective original publications.

MMseqs2 and DIAMOND were run with the NCBI non-redundant amino acid database as suggested by their authors (Supplementary Note 1). All databases were downloaded on May 15, 2021, and all tools were run with 96 threads and 768 Gb of RAM available to them. The CAMI comparison used the 2015 Refseq database and NCBI taxonomy as provided by the CAMI authors.2 The built-in taxonomy of MEGAN-LR was replaced by placing ncbi.tre and ncbi.map, a Newick formatted NCBI taxonomy, in the working directory. These files were generated by converting the NCBI taxonomy files (names.dmp and nodes.dmp) provided with the CAMI datasets into Newick format with the Python taxonomy package67.

Ground truth classifications were generated for all datasets except for CAMI, which used the gold standard contig classifications provided by CAMI. Ground truths were generated by comparing each contig in our metagenomic assembly to the reference genome of each organism contained within the mock microbial community using MegaBLASTN. The taxonomic identification of the top BLAST hit for each contig was determined to be its gold standard assignment. Sequencing depth of each bin for the ZymoBIOMICS Even and Log datasets was extracted from data presented by Nicholls et al.21, and sequencing depth for the ZymoBIOMICS Gut dataset was calculated with CoverM using the “-x map-hifi –secondary=no” minimap2 preset68.

Application to detection of an emerging coronavirus, hypervirulent Klebsiella pneumoniae, and Neisseria gonorrhoeae infections

For the detection of an emerging coronavirus, nasopharyngeal swabs (n = 2) were collected as part of routine testing at Vancouver General Hospital during Fall 2020 (ORF1ab Ct values = 14.7, 20.6) and cultured SARS-CoV-2 viral particles (RdRp Ct value = 18.3) were obtained from the BC Centre for Disease Control Public Health Laboratory. Both clinical samples and cultured virions were extracted and randomly amplified through sequence-independent single-primer amplification as previously described69. Samples were sequenced on Oxford Nanopore MinION devices and basecalling was performed with Guppy (Oxford Nanopore Technologies). Ethics approval for collection of nasopharyngeal swabs was obtained from the University of British Columbia (H20-02152).

For the application to human anthrax, hypervirulent K. pneumoniae, and N. gonorrhoeae infections, raw data was downloaded from the NCBI accessions listed below in Data Availability and submitted to BugSplit. In brief, reads were preprocessed, assembled and polished as detailed above in BugSplit preprocessing. Binning completion and contamination were assessed with CheckM using the default CheckM database. The NCBI nucleotide database from 2019 was downloaded from the second CAMI challenge (https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_2_DATABASES/ncbi_blast/nt.gz) and used in place of BugSplit’s default database for the emerging coronavirus application.

Modifications to ResFinder to accommodate for insertions, deletions, and stop codons in assemblies with high error rates

By default, ResFinder41 (which includes the PointFinder70 module) performs a BLASTN54 alignment of the query assembly against a database of resistance loci. PointFinder scans each alignment and identifies all differences, including insertions and deletions (indels), between the query assembly and reference locus for further annotation. In the event of a stop codon (nucleotides TAG, TAA or TGA) within a locus, PointFinder terminates its search for variants in the region upstream to the stop codon. Full details are available in the original PointFinder70 methods. We base our modifications to PointFinder on the previously demonstrated observation that frameshifts and stop codons in third-generation assemblies are more likely to reflect sequencing and assembly errors than true sequence variation71,72. We modify PointFinder to not halt its search for variants along a resistance locus if it encounters a stop codon. We additionally modify PointFinder to shift alignments around indels, maintaining the reading frame, in an approach similar to more general frameshift correction tools71,72. Our modified PointFinder has been incorporated into ResFinder version 4.2 and can be activated with the “-ii” (Ignore Indels) and “-ic” (Ignore stop Codons) flags.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.