Oxford Nanopore R10.4 long-read sequencing enables the generation of near-finished bacterial genomes from pure cultures and metagenomes without short-read or reference polishing

Long-read Oxford Nanopore sequencing has democratized microbial genome sequencing and enables the recovery of highly contiguous microbial genomes from isolates or metagenomes. However, to obtain near-finished genomes it has been necessary to include short-read polishing to correct insertions and deletions derived from homopolymer regions. Here, we show that Oxford Nanopore R10.4 can be used to generate near-finished microbial genomes from isolates or metagenomes without short-read or reference polishing.

gaps or ambiguities" and "a consensus error rate equivalent to Q50 or better". This is difficult to achieve even with multiple sequencing technologies on pure cultures 19 and metagenome-assembled genomes (MAGs) 27. However, the second-highest quality tier, high quality, can be achieved despite a large number of frameshift errors, which can have major implications for downstream analysis 20. Hence, we here introduce the term 'near-finished' genome and define it as a high-quality genome for which short-read polishing is not expected to significantly improve the consensus sequence.
We first evaluated the ability to obtain near-finished microbial genomes from Oxford Nanopore R9. 4 Table 1). In contrast to the R9.4.1 data, we do not see any significant improvement in the assembly quality for R10.4 by the addition of Illumina polishing (Fig. 1c and Supplementary Fig. 1). This indicates that near-finished microbial reference genomes can be obtained from R10.4 data alone at a coverage of approximately 40-fold (Supplementary Table 2). The improvement in assembly accuracy from R9.4.1 to R10.4 is largely due to an improved ability to call homopolymers (Fig. 1b and Supplementary Figs. 2 and 3). Even though there is some nucleotide-specific variation in homopolymer calling accuracy at lengths 8 and 9 on a read level (especially with cytosines), on a genome consensus level the vast majority of homopolymers are correctly resolved up to a length of <11 bp in R10.4 data (Supplementary Fig. 4). In general, long homopolymers are very rare in bacteria 21, and by analyzing complete genomes from 1,598 different genera (Supplementary Fig. 5) we found only 18 genomes (1%) with long homopolymers (>10) at a rate of more than 1 per 100,000 bp (theoretical Q50 limit).
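The link between "1 per 100,000 bp" and Q50 follows from the Phred scale, Q = -10 log10(error rate). The sketch below is illustrative only and is not the authors' code: the function names are hypothetical, and it assumes that every long homopolymer (more than 10 identical bases in a row) contributes one consensus error, which is a worst-case simplification.

```python
import math
from itertools import groupby


def phred_quality(errors: int, bases: int) -> float:
    """Phred-scaled quality, Q = -10 * log10(errors / bases)."""
    if errors <= 0 or bases <= 0:
        raise ValueError("errors and bases must be positive")
    return -10.0 * math.log10(errors / bases)


def long_homopolymers_per_100kbp(sequence: str, longer_than: int = 10) -> float:
    """Number of homopolymer runs longer than `longer_than` bases per 100,000 bp.

    Hypothetical helper: runs are counted on the given strand only and
    the sequence is treated as a single linear contig.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    # groupby yields one group per run of identical consecutive characters.
    run_lengths = (sum(1 for _ in run) for _, run in groupby(sequence.upper()))
    long_runs = sum(1 for length in run_lengths if length > longer_than)
    return long_runs / len(sequence) * 100_000


# One error per 100,000 bp corresponds to Q50 on the Phred scale.
q50 = phred_quality(1, 100_000)
```

Under that worst-case assumption, a genome with more than one long homopolymer per 100,000 bp could not reach Q50 even if every other base were called correctly.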
To assess the performance of state-of-the-art sequencing technologies in recovering near-finished microbial genomes from metagenomes we sequenced activated sludge from an anaerobic digester using single runs of Illumina MiSeq 2 × 300 bp, PacBio HiFi, and Oxford Nanopore R9. 4 Table 3).

d, IDEEL 28 score, calculated as the proportion of predicted proteins that are ≥95% the length of their best-matching known protein in a database 19. The dotted line represents the IDEEL score for the reference genome, while the dashed lines mark a 40-fold coverage cut-off.
(Supplementary Fig. 6) and altered the relative abundance of the species in the sample (Supplementary Fig. 7). Furthermore, Nanopore R9.4.1 produced more than twice the amount of data compared with the other datasets, while the Illumina data featured variations in relative abundances presumably due to guanine and cytosine bias (Supplementary Fig. 7). To facilitate automated contig binning, we performed Illumina sequencing of nine additional samples from the same anaerobic digester spread over 9 years (Supplementary Table 4) and used the coverage profiles as input for binning using multiple different approaches. Furthermore, to evaluate the impact of microdiversity on MAG quality, we calculated the polymorphic site rates for each MAG as a simple proxy for the presence of microdiversity 6. After performing automated contig binning it is evident that microdiversity has a large impact on MAG fragmentation, but that long-read sequencing data results in much less fragmentation of bins at higher amounts of microdiversity (Supplementary Fig. 8). Despite large differences in read length for Nanopore and PacBio HiFi data (N50 read length 6 kbp versus 15 kbp) only small differences in bin fragmentation were observed, as compared with the Illumina-based results (Table 1 and Supplementary Fig. 8).
All long-read methods produce high numbers of high-quality MAGs, which capture 39-49% of all reads (Table 1). Nanopore R9.4.1 is able to produce high-quality MAGs as a standalone technology, but Illumina polishing increases the number of high-quality MAGs from 64 to 86. For Nanopore R10.4, Illumina polishing increases the number of high-quality MAGs from 34 to 36. Using the IDEEL score 19 (Supplementary Fig. 9) as a relative measurement for improvement in genome consensus quality, Illumina polishing results in only minor improvements for Nanopore R10.4 MAGs above a coverage of 40-fold, and the Nanopore R10.4 MAGs are in the same IDEEL range as the PacBio HiFi MAGs. As with sequencing of the Zymo mock, the difference from R9.4.1 to R10.4 is largely due to the significantly better accuracy in homopolymers for lengths up to 10 (Supplementary Fig. 4).
Since its introduction as an early access program in 2014, Oxford Nanopore sequencing technology has democratized sequencing and enabled more laboratories and classrooms to engage in microbial genome sequencing. However, for the generation of high-quality genomes, additional short-read polishing has been essential, given that indels in homopolymer regions cause fragmented gene calls. The additional sequencing requirements have been one of the barriers to widespread uptake. Here, we show that Oxford Nanopore R10.4 enables the generation of near-finished microbial genomes from pure cultures or metagenomes at coverages of 40-fold without short-read polishing. Although homopolymers of 10 or more bases will probably still be problematic, they constitute a minor part of microbial genomes (Supplementary Fig. 5).
For genome recovery from metagenomes, low-coverage bins (<40-fold) do need Illumina polishing to achieve a quality comparable to PacBio HiFi. Hence, in some cases the most economic option could be Nanopore R9.4.1 supplemented with short-read sequencing, given that the throughput is currently at least twofold higher on R9.4.1 compared with R10.4 and no difference is seen between the methods after Illumina short-read polishing.
Automated binning was carried out using three binners: MetaBAT2 v. 2.12.1 (ref. 37) with the '-s 500000' setting, MaxBin2 v. 2.2.7 (ref. 38), and Vamb v. 3.0.2 (ref. 39) with the '-o C --minfasta 500000' setting. To aid with the binning process, contig coverage profiles from the different sequencer datasets (Supplementary Table 1) as well as contig coverage from nine additional time-series Illumina datasets of the same anaerobic digester (Supplementary Table 4) were provided as input to the three binners. The binning output of the different tools was then integrated and refined using DAS Tool v. 1.1.2 (ref. 40). CoverM v. 0.6.1 (https://github.com/wwood/CoverM) was applied to calculate the bin coverage (using the '-m mean' setting) and the relative abundance ('-m relative_abundance'). A general overview of the processing of the sludge metagenomic data is presented in Supplementary Fig. 10.
Assembly processing. The completeness and contamination of the genome bins were estimated using CheckM v. 1.1.2 (ref. 41). The bins were classified using GTDB-Tk v. 1.5.0 (ref. 42) and the R202 database. Protein sequences were predicted using Prodigal v. 2.6.3 (ref. 43) with the '-p meta' setting, while the ribosomal RNA genes were predicted using Barrnap v. 0.9 (https://github.com/tseemann/barrnap) and the transfer RNA predictions were made using tRNAscan-SE v. 2.0.5 (ref. 44). Bin quality was determined following the Genomic Standards Consortium guidelines, in which a MAG of high quality has genome completeness of more than 90%, contamination of less than 5%, at least 18 distinct tRNA genes, and at least one copy each of the 5S, 16S and 23S rRNA genes 26. MAGs with completeness above 50% and contamination below 10% were classified as medium quality, while low-quality MAGs featured completeness below 50% and contamination below 10%. MAGs with contamination estimates higher than 10% were classified as contaminated.
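The quality tiers described above can be written as a small decision function. This is a minimal sketch rather than the authors' pipeline: the `MagStats` container and `classify_mag` function are hypothetical names, and the handling of values that fall exactly on a boundary (completeness of exactly 50%, contamination of exactly 10%) is an assumption, because the text does not specify it. Bins that exceed 90% completeness but miss the tRNA or rRNA criteria are assigned to the medium tier here, which follows from the medium-quality thresholds as stated.

```python
from dataclasses import dataclass


@dataclass
class MagStats:
    """Per-bin summary values (hypothetical container for illustration)."""
    completeness: float    # percent, e.g. from CheckM
    contamination: float   # percent, e.g. from CheckM
    distinct_trnas: int    # number of distinct tRNA genes
    has_5s: bool
    has_16s: bool
    has_23s: bool


def classify_mag(mag: MagStats) -> str:
    """Assign a quality tier using the thresholds stated in the text."""
    # Assumption: exactly 10% contamination is treated as contaminated.
    if mag.contamination >= 10:
        return "contaminated"
    has_all_rrna = mag.has_5s and mag.has_16s and mag.has_23s
    if (mag.completeness > 90 and mag.contamination < 5
            and mag.distinct_trnas >= 18 and has_all_rrna):
        return "high"
    # Assumption: exactly 50% completeness is treated as medium quality.
    if mag.completeness >= 50:
        return "medium"
    return "low"


example = MagStats(96.5, 1.2, 20, True, True, True)
tier = classify_mag(example)  # "high"
```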
Illumina reads were mapped to the assemblies using Bowtie2 v. 2.4.2 (ref. 45) with the '--very-sensitive-local' setting. The mapping was converted to BAM and sorted using SAMtools v. 1.9 (ref. 46). The single-nucleotide polymorphism rate was then calculated from the mapping using the poly.py script of CMseq v. 1.0.3 (ref. 6) with the '--mincov 10 --minqual 30' setting.
Bins were clustered using dRep v. 2.6.2 (ref. 47) with the '-comp 50 -con 10 -sa 0.95' setting. Only the bins that featured a coverage higher than 10 in their respective sequencing platform and, for bins from the hybrid approach, an Illumina read coverage higher than 5 were included in downstream analysis. The IDEEL test was used to infer the level of protein truncations in the bins and was applied to provide a relative measurement of improvement in genome consensus quality via short-read polishing 20,28. In brief, the predicted protein sequences from clustered bins and Zymo assemblies were searched against the UniProt TrEMBL 48 database (release 2021_01) using Diamond v. 2.0.6 (ref. 49). Query matches that were not present in all datasets were omitted to reduce noise. The IDEEL scores (estimated fraction of full-length protein sequences) were assigned as described previously 19, where query-to-reference length ratios of more than 0.95 were counted as full-length protein sequences.
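The IDEEL score described above reduces to a simple proportion once each predicted protein has been paired with the length of its best database match. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name and the input format (pairs of query and reference lengths in amino acids) are hypothetical, and the Diamond search and the filtering of matches shared across datasets are assumed to have been done beforehand.

```python
from typing import Iterable, Tuple


def ideel_score(length_pairs: Iterable[Tuple[int, int]],
                threshold: float = 0.95) -> float:
    """Fraction of proteins whose query-to-reference length ratio exceeds `threshold`.

    `length_pairs` holds (query_length, reference_length) for each predicted
    protein and its best-matching database protein.
    """
    pairs = list(length_pairs)
    if not pairs:
        raise ValueError("at least one protein match is required")
    # The Methods text counts ratios of "more than 0.95" as full length,
    # so a strict comparison is used here.
    full_length = sum(
        1 for query_len, ref_len in pairs
        if ref_len > 0 and query_len / ref_len > threshold
    )
    return full_length / len(pairs)


# Four hypothetical matches: two full length, one truncated, one on the boundary.
score = ideel_score([(100, 100), (96, 100), (50, 100), (95, 100)])  # 0.5
```

A frameshift that splits a gene typically yields predicted proteins far shorter than their best match, so a lower score indicates more indel errors in the consensus.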
QUAST v. 4.6.3 (ref. 50) was applied to the Zymo assemblies and the clustered bins that had a single-nucleotide polymorphism rate of less than 0.5% to determine the mismatch and indel metrics. Cases with a QUAST genome fraction of less than 75% and an unaligned length of more than 250 kbp were omitted to reduce noise. For homopolymer analysis, the clustered bins were mapped to each other using the asm5 mode of Minimap2, and Counterr was used on the mapping files to determine the homopolymer calling errors. For QUAST and Counterr, Illumina-polished PacBio HiFi bins were used as reference sequences. FastANI v. 1.33 (ref. 51) was used to calculate identity scores between Zymo assemblies and the Zymo reference sequences. The Zymo mock reference genome sequences, which were used as a substitute for PacBio HiFi, were obtained from a link in the accompanying instruction manual to the ZymoBIOMICS HMW DNA Standard Catalog No. D6322 (https://s3.amazonaws.com/zymo-files/BioPool/D6322.refseq.zip).
Genome database analysis. Archaeal and bacterial genomes from the National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) genome database were downloaded using ncbi-genome-download v. 0.3.0 (https://github.com/kblin/ncbi-genome-download, downloaded on 24 November 2021) with the '--assembly-levels complete' option. Genomes were subsampled to include one genome per genus. Downloaded genome phylum taxonomy was determined by cross-referencing the RefSeq genome ID with the GTDB-Tk (R202 database) metadata.
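The per-genus subsampling step can be sketched as follows. The text does not state how the representative genome for each genus was chosen, so keeping the first genome encountered is purely an assumption made for illustration. The function name and the accession strings are also hypothetical.

```python
from typing import Dict, Iterable, Tuple


def one_genome_per_genus(records: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Keep one genome ID per genus from (genome_id, genus) records.

    Assumption: the first genome listed for a genus is kept. The source
    does not describe the selection rule that was actually used.
    """
    chosen: Dict[str, str] = {}
    for genome_id, genus in records:
        # setdefault only stores the ID if the genus has not been seen yet.
        chosen.setdefault(genus, genome_id)
    return chosen


# Placeholder IDs, not real RefSeq accessions.
records = [
    ("genome_A", "Escherichia"),
    ("genome_B", "Escherichia"),
    ("genome_C", "Bacillus"),
]
subsample = one_genome_per_genus(records)
```

Subsampling to one genome per genus keeps heavily sequenced genera from dominating the homopolymer statistics.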
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The raw anaerobic digester sequencing data are available at the ENA with the bio project ID PRJEB48021, while the Zymo mock community raw sequencing data are available at PRJEB48692 (Supplementary Table 4). The UniProt TrEMBL database used in the study is available at https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2021_01/knowledgebase. The GTDB-Tk database used in the study is available at https://data.ace.uq.edu.au/public/gtdb/data/releases/release202. Links for accessing the genome assemblies, MAGs and summary data are available at https://github.com/Serka-M/Digester-MultiSequencing. Zymo mock community reference sequences are available at https://s3.amazonaws.com/zymo-files/BioPool/D6322.refseq.zip. The NCBI RefSeq genome database is available at https://ftp.ncbi.nlm.nih.gov/genomes/refseq.

Code availability
Links for accessing code used to generate figures as well as supplementary resources are available at https://github.com/Serka-M/Digester-MultiSequencing. Software tools used in the study are either referenced or are provided as links in the Methods section.