Benchmarking the MinION: Evaluating long reads for microbial profiling

Nanopore-based DNA sequencing delivers long reads, thereby simplifying the decipherment of bacterial communities. Since its commercial appearance, this technology has been assigned several attributes, such as its error proneness, comparatively low cost, ease of use and, most notably, its long reads. The technology as a whole is under continued development. As such, benchmarks are required to conceive, test and improve analysis protocols, including those related to the understanding of the composition of microbial communities. Here we present a dataset composed of twelve different prokaryotic species split into four samples differing by nucleic acid quantification technique to assess the specificity and sensitivity of the MinION nanopore sequencer in a blind study design. Taxonomic classification was performed directly on reads with standard taxonomic sequence classification tools, namely Kraken, Kraken2 and Centrifuge. This allowed taxonomic assignments of up to 99.27% on genus level and 92.78% on species level, enabling true-positive classification of strains down to 25,000 genomes per sample. Full genomic coverage was achieved for strains with abundances as low as 250,000 genomes per sample under our experimental settings. In summary, we present an evaluation of nanopore sequence processing analysis with respect to microbial community composition. It provides an open protocol, and the data may serve as a basis for the development and benchmarking of future data processing pipelines.

Yield (reads and bases), read length and mean quality presenting the output of the 36 h MinION sequencing run, after basecalling (Albacore) and adapter removal (Porechop). A clear drop in quality for un- and misclassified reads is observable as compared to correct assignment. Assigned barcodes 1 to 4 match samples 1 to 4 (heterogeneous and equimolar adjusted by either ddPCR or Qubit). Statistics generated with NanoPlot, based on the sequencing_summary (Basecalled) and the individual fastq bins after porechopping.
99.27% (Centrifuge) between all samples, whereas read classification matching the ground truth on species level was up to 92.78% (Centrifuge) across all samples (Table 2). Generally, accuracy and deviation metrics (root mean squared deviation (RMSD) and mean absolute error (MAE)) on genus level were better than on species level. Comparing Centrifuge, Kraken and Kraken2 running their precompiled databases/indices, Centrifuge was able to assign the highest fraction of reads to the theoretically expected genera and species across all samples. Centrifuge also performed best with respect to both measures of deviation (RMSD, MAE), whereas Kraken2 was superior to Kraken. However, beyond the accuracy of each classifier, computational aspects need to be considered. Especially when limited computational resources are available, such as in field applications, Kraken2 offers superior processing speed and lower memory consumption compared to Centrifuge and Kraken [28]. Precision and recall per species and genus generally reached high values on read level (see Supplementary Tables S3, S4). For genera with very low abundance, drops in precision could be observed (see Supplementary Table S3). Reads wrongly classified on species level were, e.g., attributable to close relatives, such as Bacillus species to Bacillus licheniformis, Enterobacter cloacae to Enterobacter hormaechei, et cetera, or exhibited differences in read abundance as compared to true positive hits, which is similar to findings reported by Deshpande et al. [19] despite a different sequencing and analysis approach. This is also reflected by the lower values of recall for these species on read level (see Supplementary Table S4). The necessity for accurate databases and unified nomenclature is discussed elsewhere [29][30][31][32] and has been shown to affect classification of nanopore data [18]. These results indicate that classification is, as of yet, more reliable on genus level than on species level.
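The two deviation measures referred to above can be illustrated with a minimal sketch. It assumes that both are computed over per-taxon fractions (classified versus theoretically expected); the function name and the example fractions are hypothetical and do not reproduce values from Table 2.

```python
import math


def deviation_metrics(observed, expected):
    """Return (MAE, RMSD) between observed and expected taxon fractions.

    Both arguments map a taxon name to its fraction of classified reads.
    Taxa missing from one of the two mappings are treated as fraction 0.
    """
    taxa = sorted(set(observed) | set(expected))
    diffs = [observed.get(t, 0.0) - expected.get(t, 0.0) for t in taxa]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmsd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmsd


# Hypothetical fractions for three placeholder genera (not study data).
expected = {"genus_A": 0.5, "genus_B": 0.3, "genus_C": 0.2}
observed = {"genus_A": 0.6, "genus_B": 0.3, "genus_C": 0.1}
mae, rmsd = deviation_metrics(observed, expected)
print(f"MAE = {mae:.4f}, RMSD = {rmsd:.4f}")
```

Because the RMSD squares each deviation before averaging, it penalizes a single strongly misestimated taxon more heavily than the MAE does, which is one reason to report both.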
Serendipitously, rerunning the classification process after the removal of the four most abundant initially selected strains from the read data allowed the additional selection and thus classification of four strains down to approximately 25,000 to 500,000 genomes per sample, using Krona plots. The remaining three strains adjusted to the range of 500 to 5,000 genomes per sample could not be reliably retrieved from the two samples with heterogeneous genomic concentrations (Fig. 2). Their presence was obfuscated by the filter process, i.e. they were as abundant as falsely classified reads and, consequently, a clear discrimination allowing selection and classification was impossible. With the experimental settings and procedure described here, this suggests a dynamic range of detection and viable classification between 250 and 500,000 genomes/µl of initial DNA input, corresponding to a range of 25,000 to 50 million genomes from material obtained from microbial communities of low diversity from the MinION. The range reported here is similar to the findings of Nicholls et al. [21].
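The relation between the two ranges quoted above can be checked with simple arithmetic. Both bounds differ by a factor of 100, which would be consistent with an input volume of 100 µl; this volume is inferred here from the reported numbers and is an assumption, as is the helper name.

```python
# Assumption: 100 µl input volume, inferred from the ratio of the reported ranges.
ASSUMED_INPUT_VOLUME_UL = 100


def genomes_per_sample(genomes_per_ul, input_volume_ul=ASSUMED_INPUT_VOLUME_UL):
    """Convert a concentration in genomes/µl to total genomes in the input."""
    return genomes_per_ul * input_volume_ul


lower = genomes_per_sample(250)       # lower bound of the reported range
upper = genomes_per_sample(500_000)   # upper bound of the reported range
print(lower, upper)
```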
These results showed good consistency with a) the output from the NanoOK analysis by direct comparison (Table 3, see Supplementary Table S5), where at least 99.21% of all available reads could be aligned to selected references and b) the theoretical expectation. Moreover, mean coverages reported by NanoOK indicate potential for de novo genome assemblies (Fig. 3). Full genomic coverage realistically permitting de novo assembly was achieved for strains down to a concentration of 250,000 genomes per sample (see Supplementary Table S5). At comparable sequencing times, we anticipate the concentration level required to achieve full genomic coverage to be even lower for libraries that are not multiplexed.
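To make the link between mean coverage and full genomic coverage more tangible, the following sketch computes mean coverage as aligned bases divided by genome length and, under an idealized model of uniformly and randomly placed reads (Lander-Waterman), the expected fraction of the genome covered at least once. This is not the calculation performed by NanoOK; the function names and example values are hypothetical.

```python
import math


def mean_coverage(aligned_read_lengths, genome_length):
    """Mean depth of coverage: total aligned bases divided by genome length."""
    return sum(aligned_read_lengths) / genome_length


def expected_breadth(coverage):
    """Expected fraction of the genome covered at least once.

    Idealized assumption: reads are placed uniformly at random, so the
    per-base depth is approximately Poisson distributed with mean `coverage`.
    """
    return 1.0 - math.exp(-coverage)


# Hypothetical example: 1,000 reads of 5 kb aligned to a 5 Mb genome.
depth = mean_coverage([5_000] * 1_000, 5_000_000)
print(f"mean coverage = {depth:.1f}x, expected breadth = {expected_breadth(depth):.3f}")
```

Under this model a mean coverage of 1x leaves roughly a third of the genome uncovered, so several-fold mean coverage is needed before the whole genome is represented, and more still before de novo assembly becomes realistic.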
Despite the error rates currently accompanying MinION sequencing, these results clearly illustrate the viability and possibilities of long reads for direct taxonomic classification and abundance estimation with currently available bioinformatics pipelines.

Conclusion
We present a MinION DNA sequence read dataset to help the nanopore community improve existing and develop new bioinformatics pipelines aimed at the understanding of microbial diversity. Continual benchmarking using updated sequencing methods and chemistries in metagenome analyses is required [32]. With the presented detailed methodology, as a whole, this study follows the FAIR Guiding Principles [33] for scientific data management and stewardship by contributing (F)indable and (A)ccessible data under bioproject accession PRJNA545964 and corresponding signal level data [34] that is (R)eusable for the fast-paced development of third generation sequencing and downstream bioinformatics in a metagenomics context.
Based on the dataset, we present a simple and straightforward analysis pipeline to investigate the composition of microbial communities. Given our experimental approach, we were able to achieve highly accurate taxonomic classification of low-abundance (25,000 genomes/sample) organisms to at least genus level. Full genomic coverage was achieved for species with an abundance of 250,000 genomes per sample, and sufficient coverage for de novo assembly could be obtained.
While there is no standardized approach for the characterization of bacterial communities, molecular tools are considered powerful means to gain knowledge and insight into these [35,36], and nanopore sequencing is no exception. In summary, the presented benchmark provides insight into nanopore data and data processing for the taxonomic classification of microbial communities. Hence, this study contributes to the toolsets and development of processing pipelines available to elucidate microbial diversity.

Material and methods
The overall experimental design was set up as follows: Bacterial cultivation, DNA extraction, quantification and creation of mock samples were performed by the Unit for Biological Agents, Federal Institute for Occupational Safety and Health (BAuA). Samples were shipped to the sequencing team (Mittweida UAS). The sequencing team performed library preparation, sequencing and downstream processing unaware of the samples' actual respective compositions (Fig. 4). [...] (Table 4) [...] was conducted with fewer than approximately 40,000 target genes according to the manufacturer's instructions (Bio-Rad) using the ddPCR Supermix for Probes (no dUTP). Final concentrations of oligonucleotides were 0.4 pmol/µL 1055Falt (ATGGRTGTCGTCAGCT), 0.2 pmol/µL 1392R (ACGGGCGGTGTGTAC) and 0.1 pmol/µL 1115IB (FAM-CAACGAGCG-ZEN-CAACCC-3IABkFQ) adopted from Rothrock et al. [38]. Droplet generation was conducted according to the manufacturer's instructions in a QX200 Droplet Generator and amplified in a T100 Thermal Cycler. PCR conditions were initial denaturation at 95 °C for 10 min, and 30 cycles of denaturation at 95 °C for 30 s, annealing at 57 °C for 45 s, extension at 72 °C for 45 s with a ramp rate of 1 °C/s, followed by a final extension at 98 °C for 10 min and cooling to 12 °C. Droplet evaluation was performed in a QX200 Droplet Reader with QuantaSoft software.
Samples were shipped on ice by public postal services.
Library preparation and sequencing. A sequencing library was prepared according to the manufacturer's instructions. The Ligation Sequencing Kit (SQK-LSK108, Oxford Nanopore Technologies (ONT)) and the Native Barcoding Expansion 1-12 kit (EXP-NBD103, ONT), barcoding each of the samples (barcodes #1, #2, #3, #4), were used with the following exceptions: Shearing times were prolonged and an optional FFPE DNA repair step (M6630, New England Biolabs (NEB)) was included. The incubation times during the end-repair/dA-tailing (E7546, NEB) were extended from five to 20 minutes for both the 20 °C and 65 °C incubation steps. Qubit checkpoint measurements were performed according to the library preparation protocol (see Supplementary Table S1). Pooling of the barcoded samples was performed 'as is' instead of the protocol-given 'equimolar'. Sequencing was then performed on an R9.4 flowcell (FLO-MIN106, ONT, >1200 pores, see Supplementary [...]
Data classification and validation. Taxonomic classification was performed with standard parameters (Centrifuge "-k 1") on native reads using Centrifuge (precompiled index: "Bacteria, Archaea (compressed), 2018-4-15") [22], as well as Kraken (precompiled database: "DustMasked MiniKraken DB 8GB") [40] and Kraken2 (precompiled database: MiniKraken2_v1_8GB) [28] [...] known input organism at the genus and species level out of the total number of reads given any assignment at that rank [18]. To calculate a corresponding estimate of the accompanying error, the mean absolute error as well as the root mean squared deviation of classified to theoretically present fractions on genus and species level were computed. On read level, precision and recall for genus and species identification were computed [32] for Centrifuge, Kraken and Kraken2 vs. the results obtained from the NanoOK analysis, with precision being the proportion of reads classified correctly to reads classified and recall being the proportion of reads classified correctly to the reads from the NanoOK dataset, which was used as "ground truth". All additional bioinformatics processing was performed in the Linux Bourne Again Shell (bash), using Samtools (version 1.9) [51] and seqtk (version 1.3-r106, https://github.com/lh3/seqtk).
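The read-level precision and recall defined above can be sketched as follows. The sketch assumes that classifier output and the NanoOK-derived ground truth are both available as mappings from read identifier to taxon name; this data layout, the function name and the toy reads are hypothetical and do not reflect the actual file formats of the tools.

```python
from collections import Counter


def precision_recall(classified, truth):
    """Per-taxon (precision, recall) on read level.

    classified: read ID -> taxon assigned by the classifier
                (unclassified reads are simply absent).
    truth:      read ID -> taxon according to the reference alignment,
                used here as "ground truth".
    """
    n_classified = Counter(classified.values())
    n_truth = Counter(truth.values())
    true_pos = Counter(
        taxon for read_id, taxon in classified.items() if truth.get(read_id) == taxon
    )
    result = {}
    for taxon, n_true in n_truth.items():
        n_cls = n_classified[taxon]
        precision = true_pos[taxon] / n_cls if n_cls else 0.0
        recall = true_pos[taxon] / n_true
        result[taxon] = (precision, recall)
    return result


# Toy example with placeholder taxa: read r4 remains unclassified.
truth = {"r1": "taxon_A", "r2": "taxon_A", "r3": "taxon_B", "r4": "taxon_B"}
classified = {"r1": "taxon_A", "r2": "taxon_B", "r3": "taxon_B"}
metrics = precision_recall(classified, truth)
print(metrics)
```

In the toy example the misassigned read r2 lowers the precision of taxon_B and the recall of taxon_A, which mirrors the effect described for reads assigned to close relatives.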

Table 4. Strain community overview: Overview of the strains selected to compose the microbial community, with accessions and genomic specifications shown. Column headers include Species and Strain ID.