Evaluation of microbiome enrichment and host DNA depletion in human vaginal samples using Oxford Nanopore’s adaptive sequencing

Metagenomic sequencing is promising for clinical applications that study microbial composition in relation to disease or patient outcomes. Alterations of the vaginal microbiome are associated with adverse pregnancy outcomes, such as preterm premature rupture of membranes and preterm birth. Methodologically, these samples often contain low relative amounts of prokaryotic DNA and high amounts of host DNA (> 90%), which decreases the overall microbial resolution. Nanopore's adaptive sampling method offers selective DNA depletion or target enrichment by directly rejecting or accepting DNA molecules during sequencing, without specialized sample preparation. Here, we demonstrate how selective ‘human host depletion’ resulted in a 1.70 fold (± 0.27 fold) increase in total sequencing depth, providing higher taxonomic profiling sensitivity. At the same time, the microbial composition remained consistent with the control experiments. The complete removal of all human host sequences is not yet possible, which should be kept in mind because an ethical approval statement might still be necessary. Adaptive sampling increased microbial sequencing yield in all 15 sequenced clinical routine vaginal samples, making it a valuable tool for clinical surveillance and medical research that can be used in addition to other host depletion methods before sequencing.


Influence of nonspecific amplification-based library preparation for the determination of microbial communities.
The nanopore amplification-based library preparation kit (RPB004) uses transposase-mediated cleaving of DNA molecules to attach the primer binding sites for PCR amplification, which should reduce PCR amplification bias.
We initially assessed this bias by determining the microbial composition of the ZymoBIOMICS Microbial Community Standard 23 (control) by sequencing using the RPB004 PCR-based library preparation kit and compared the abundance of the different species to the native PCR-free library preparation kit (LSK109). DNA of the mock community cells was isolated simultaneously in three replicates to address experimental variations. Each replicate was sequenced with the LSK109 and the RPB004 kit. Accordingly, all samples have the same "lysis and DNA isolation" bias; therefore, the library preparation kits are the only parameter that differentiates the sequenced samples.
The reads were mapped against the microbial genomes via minimap2 v.2.19 24, counted via samtools depth v1.11 25 (bases sequenced per organism), and summarized via ggplot2 (Fig. 1). All reads could be mapped to the reference genomes of the mock community.
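The counting step above can be illustrated with a minimal sketch. It assumes the three-column, tab-separated text output of samtools depth for a single alignment file (reference name, position, depth) and a hypothetical lookup table from contig name to organism; the contig names and depth values below are invented for illustration and are not data from this study.

```python
from collections import defaultdict
import io


def bases_per_organism(depth_lines, contig_to_organism):
    """Sum the per-position depth (i.e. bases sequenced) for each organism.

    depth_lines: iterable of 'contig<TAB>position<TAB>depth' lines.
    contig_to_organism: hypothetical mapping from contig name to organism.
    """
    totals = defaultdict(int)
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        contig, _position, depth = line.split("\t")
        organism = contig_to_organism.get(contig, "unassigned")
        totals[organism] += int(depth)
    return dict(totals)


def fractions(totals):
    """Convert absolute base counts into fractions of all sequenced bases."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {organism: 0.0 for organism in totals}
    return {organism: n / grand_total for organism, n in totals.items()}


# Invented example input standing in for real samtools depth output.
example = io.StringIO(
    "contig_A\t1\t3\n"
    "contig_A\t2\t5\n"
    "contig_B\t1\t2\n"
)
mapping = {"contig_A": "Escherichia coli", "contig_B": "Bacillus subtilis"}

totals = bases_per_organism(example, mapping)
shares = fractions(totals)
print(totals)
print(shares)
```

The resulting fractions are the kind of per-organism values that could then be plotted against the expected composition of the standard.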
Both sequencing kits detected all ten organisms of the ZymoBIOMICS Microbial Community Standard and the results of the sample's replicates exhibited only negligible deviation.
Compared to the expected abundance in the control, Gram-positive bacteria and yeast were underrepresented in the sequencing data obtained by both library preparation methods (amplification free: average 0.60 fold, min 0.34 fold, max 0.79 fold; amplification-based: average 0.70 fold, min 0.35 fold, max 0.97 fold), while Gram-negative bacteria were overrepresented (amplification free: average 1.84 fold, min 1.80 fold, max 1.88 fold; amplification-based: average 1.64 fold, min 1.01 fold, max 1.99 fold). The amplification-based library preparation approach differed considerably from the PCR-free library preparation method for Pseudomonas aeruginosa (0.55 fold of PCR-based), Lactobacillus fermentum (0.47 fold of PCR-based), Staphylococcus aureus (2.21 fold of PCR-based), and Cryptococcus neoformans (1.57 fold of PCR-based). The remaining six organisms showed only minor differences to the PCR-free library preparation (Bacillus subtilis, Enterococcus faecalis, Escherichia coli, Listeria monocytogenes, Saccharomyces cerevisiae, Salmonella enterica).
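The fold values reported above are ratios of observed to expected abundance, summarized per organism group. A minimal sketch of that calculation follows; the organism labels and all abundance values are hypothetical placeholders chosen for illustration, not the measured fractions from Fig. 1.

```python
from statistics import mean


def fold_changes(observed, expected):
    """Observed fraction divided by expected fraction, per organism."""
    return {organism: observed[organism] / expected[organism] for organism in expected}


def summarize(folds, organisms):
    """Average, minimum and maximum fold value for a group of organisms."""
    values = [folds[organism] for organism in organisms]
    return {"average": mean(values), "min": min(values), "max": max(values)}


# Hypothetical four-member community with equal expected abundance.
expected = {"gram_neg_1": 0.25, "gram_neg_2": 0.25, "gram_pos_1": 0.25, "gram_pos_2": 0.25}
observed = {"gram_neg_1": 0.40, "gram_neg_2": 0.35, "gram_pos_1": 0.15, "gram_pos_2": 0.10}

folds = fold_changes(observed, expected)
gram_negative = summarize(folds, ["gram_neg_1", "gram_neg_2"])
gram_positive = summarize(folds, ["gram_pos_1", "gram_pos_2"])
print(gram_negative)
print(gram_positive)
```

A fold value above 1 indicates overrepresentation relative to the expected composition, and a value below 1 indicates underrepresentation.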
We expected the PCR-free library preparation approach to represent the microbial community standard more accurately since it was previously validated by other groups 26. However, it showed clear variation in abundances compared to the control, which might be attributed to the different cell disruption device used in this work. The description of the ZymoBIOMICS Microbial Community Standard states that it mimics a mixed microbial community of well-defined composition, containing three easy-to-lyse Gram-negative bacteria, five tough-to-lyse Gram-positive bacteria, and two tough-to-lyse yeasts. Thus, Gram-positive bacteria and yeast were most likely underrepresented due to differential lysis rather than differences in the library preparation protocols. We did not observe a clear advantage or disadvantage in choosing the amplification-based library preparation method over the PCR-free library preparation method to assess the microbial composition, as both similarly overrepresent Gram-negative bacteria (Fig. 1). Interestingly, the PCR amplification-based library preparation seemed to represent the control slightly better than the PCR-free library preparation.
First, we sequenced a human vaginal metagenome (87.93% human host contamination) from a pregnant woman to derive species information for the enrichment process. In a second step, we performed a depletion experiment using a human genome as reference (GCF_000001405.39). Finally, we performed an enrichment experiment using nine bacterial genomes downloaded from NCBI as reference based on the most abundant identified species from the first control sequencing experiment (see "Methods" section: "Nanopore sequencing").
Each read passing the nanopore during adaptive sampling was mapped against a single or multiple reference genome(s) (e.g., human reference genome or multiple bacterial genomes) while sequencing. The mapping occurred in intervals of several bases, and three types of decisions were made: (1) 'no_decision'-the read has been continued and mapped against the reference(s) after several bases again ('no decision'), (2) 'stop_receiving'-the read was accepted and fully sequenced ('accepted'), (3) 'unblock'-the sequencing was immediately stopped and the read was rejected by reversing of the voltage ('rejected'). The base pairs required until a decision has been made were summarised in Fig. 2 B for all reads. For both methods, read rejections occurred within approx. 400-800 bp. Accepting reads started at approx. 400 bp or 4000 bp for the enrichment or depletion protocols, respectively. More generally, read lengths of at least 400 bp were required for both adaptive sampling methods to start the individual reads' decision-making process.
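The three decision types can be illustrated with a toy model. This is a simplified sketch and not the logic of ONT's software, which is not described here: the function name, its parameters, and the way thresholds are applied are assumptions, with the default values of 400 bp and 4000 bp taken from the approximate figures above.

```python
def decide(mode, mapped_to_reference, bases_seen, min_bases=400, accept_after=4000):
    """Toy decision rule for one read during adaptive sampling.

    mode: 'depletion' (reference = unwanted DNA, e.g. human) or
          'enrichment' (reference = wanted DNA, e.g. bacterial genomes).
    mapped_to_reference: whether the bases seen so far map to the reference.
    bases_seen: number of bases sequenced for this read so far.
    min_bases / accept_after: assumed thresholds, loosely based on the
        approximate values reported in the text.
    Returns 'unblock' (rejected), 'stop_receiving' (accepted) or 'no_decision'.
    """
    if mode not in ("depletion", "enrichment"):
        raise ValueError("mode must be 'depletion' or 'enrichment'")

    # Too little sequence to attempt a decision yet.
    if bases_seen < min_bases:
        return "no_decision"

    if mode == "depletion":
        if mapped_to_reference:
            return "unblock"          # read looks like host DNA: reject
        if bases_seen >= accept_after:
            return "stop_receiving"   # still unmapped after many bases: accept
        return "no_decision"          # keep sequencing and map again later

    # Enrichment: reads hitting a target genome are kept, others rejected.
    if mapped_to_reference:
        return "stop_receiving"
    return "unblock"


print(decide("depletion", True, 500))
print(decide("depletion", False, 500))
print(decide("enrichment", True, 450))
```

Under these assumptions, a non-human read in depletion mode stays in the 'no decision' state for a long time, whereas enrichment can accept or reject a read as soon as enough bases are available.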
The enrichment experiment yielded the highest total number of reads (5.67 million, 1.50 fold more than depletion, of which 5.44 million were rejected reads), followed by the depletion experiment (3.79 million reads, 1.39 fold more than the control, of which 3.07 million reads were rejected). The control yielded 2.73 million reads. Note that experimental variations affect the total sequencing performance, but the yield increase via depletion of human sequences was further validated (see "Performance of human host depletion via adaptive sampling in human vaginal metagenomic samples").
Due to short read lengths, which result from the high rejection rate and the fast decision process (Fig. 2B.2), the enrichment experiment yielded the lowest number of total bases and microbial bases, while 'human depletion' yielded the most microbial bases (Table 1). Without adaptive sampling, the proportion of sequenced human reads was unsurprisingly highest (87.93%) but could be strongly reduced by the depletion approach to 34.73% and by the bacterial enrichment down to 8.29% (Fig. 2A). The 'human depletion' method rejected 81.01% of all reads, which was lower than the total abundance of human DNA in the control experiment (87.93%), suggesting that the chosen human genome might be insufficient for a complete depletion of all human reads or that the adaptive sampling process itself is prone to error. The bacterial enrichment method rejected 95.93% of all reads, which exceeds the proportion of human reads in the control and thus indicates that some bacterial reads were also rejected. We identified 5.48% of Gardnerella reads, 2.41% of Lactobacillus reads, and 2.20% of other microbial reads in the 'rejected' fraction of the bacterial enrichment experiment. At the same time, the proportions of essential vaginal microorganisms, like Lactobacillus and Gardnerella, were increased by both methods, and more strongly by the enrichment protocol.
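The fold changes and rejection rates quoted in the two paragraphs above can be recomputed from the reported read counts. The short calculation below uses only numbers stated in the text; the variable and function names are our own. Because the read counts are given in millions with two decimals, recomputed percentages can differ from the reported ones in the last digit.

```python
# Read counts in millions, as reported in the text.
total_reads = {"control": 2.73, "depletion": 3.79, "enrichment": 5.67}
rejected_reads = {"depletion": 3.07, "enrichment": 5.44}
# Percentage of human reads per experiment, as reported in the text.
human_percent = {"control": 87.93, "depletion": 34.73, "enrichment": 8.29}


def fold(numerator, denominator):
    """Ratio of two read counts."""
    return numerator / denominator


def percent(part, whole):
    """Share of 'part' in 'whole', in percent."""
    return 100.0 * part / whole


depletion_vs_control = fold(total_reads["depletion"], total_reads["control"])
enrichment_vs_depletion = fold(total_reads["enrichment"], total_reads["depletion"])
rejected_depletion = percent(rejected_reads["depletion"], total_reads["depletion"])
rejected_enrichment = percent(rejected_reads["enrichment"], total_reads["enrichment"])
# Reduction of the human read share in percentage points.
human_reduction_points = human_percent["control"] - human_percent["depletion"]

print(round(depletion_vs_control, 2), round(enrichment_vs_depletion, 2))
print(round(rejected_depletion, 2), round(rejected_enrichment, 2))
print(round(human_reduction_points, 2))
```

Note that the last value is a difference in percentage points (87.93% minus 34.73%), not a relative reduction.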
We compared the proportions of bacterial genera of the experiments identified from the reads of the 'accepted' and 'no decision' categories to validate whether the overall microbial composition was retained despite adaptive sampling (Fig. 2C). The human depletion method showed very similar proportions to the control for 34 of 36 microorganisms, the exceptions being Escherichia and Luteimonas. The difference in the proportions of these two organisms could be attributed to experimental variations, especially since their frequency in the control experiment was only 0.05% (Escherichia) and 0.03% (Luteimonas). Conversely, the enrichment method showed marked differences in most genera, including important vaginal microorganisms like Gardnerella, Lactobacillus, and Ureaplasma. Therefore, we assume that an enrichment approach might be unsuitable for comparing the microbial composition between metagenomic samples if not all species can be reliably provided as target sequences during the enrichment. On the other hand, the 'human depletion' experiment maintained a microorganism composition comparable to the control experiment and considerably reduced the proportion of human reads (by 53.20 percentage points), making it a robust choice for clinical metagenomic samples with high amounts of human host DNA.
Figure 1. Abundance of the ten sequenced organisms of the ZymoBIOMICS Microbial Community Standard for the native PCR-free library preparation (LSK109) and the nanopore amplification-based library preparation (RPB004). The expected fraction for the microbial standard is shown on the left (control). The fraction of sequenced bases was determined by mapping the sequenced reads against the ten organisms via minimap2.

Performance of human host depletion via adaptive sampling in human vaginal metagenomic samples
First, each of the 15 samples was sequenced without adaptive sampling, serving as a control experiment and ground truth of its metagenomic composition to track possible changes introduced via adaptive sampling.
We then sequenced the same isolated DNA from the control experiments while using adaptive sampling (human DNA depletion) and compared the overall sequencing performance to the previously sequenced controls (Fig. 3). Additionally, a negative control of a swab without patient material, following the same sample gathering and sequencing approach, yielded 126 reads (ranging from 10 to 4000 bp); none of these reads were classifiable, and they might be attributed to PCR-primer and sequencing adapter constructs.
All reads were taxonomically classified via centrifuge v1.0.4 27 to investigate their taxonomic composition (centrifuge database: Human-Virus-Bacteria-Archaea 01.2021) 28 . 99.43-99.85% of the reads generated in the control experiments could be taxonomically classified, while 97.44-99.83% of the reads in the depletion experiments could be taxonomically assigned.
On average, depletion experiments yielded 1.71 fold (± 0.27 fold) more reads (including rejected reads) than the corresponding control experiments (Fig. 3A). This corresponds to a yield of ~ 1.7 flow cells from a standard Nanopore sequencing experiment, with the only difference being that unwanted DNA molecules (human) were only partially sequenced. Reads of the categories 'no decision' (average: 5.38%, ± 7.24%) and 'accepted' (average: 0.23%, ± 0.35%) contributed a small overall proportion of all sequenced reads due to the high amount of human DNA. On average, adaptive sampling categorized 92.05% (± 7.42%) of reads as 'rejected'. 'Accepted' reads were comparably long (Fig. 3C) with a median read length of ~ 4000 bp, which is in line with previous results (Fig. 2B).
In summary, most of the reads that were not 'rejected' were placed into the 'no decision' category, as the 'accepted' decision was rarely made. The 'no decision' category also included many human reads and most of the bacterial reads. Combining the 'accepted' and 'no decision' fractions of the bacterial reads generally yielded more microbial reads compared to the control experiment, underlining the capability of adaptive sampling to increase the sequencing depth (Fig. 3C). Furthermore, the decision made for the 'rejected' category was remarkably accurate, as it contained almost exclusively human reads. However, human reads were still identified in the 'accepted' and 'no decision' fractions, as observed in previous experiments. Thus, reliably removing all human reads during sequencing remains elusive.
Depletion does not alter the species distribution in samples. Adaptive sampling selectively depletes human reads while simultaneously enriching microbial reads due to the increased sequencing depth but may alter species representation and metagenomic composition. We therefore examined the taxonomically classified reads of the 15 vaginal metagenomic samples in detail to compare the individual abundances between the control and the host depletion experiments (Fig. 4, Supplementary Figures S1, S2). The proportion of each genus was calculated in relation to the total amount of reads generated for each sample, including 'rejected' reads for the depletion experiment. We only included genera with at least 30 reads to avoid over-interpreting uncertain taxonomic classifications. This resulted on average in five bacterial genera (min 1, max 20) and in six bacterial species (min 1, max 26; Supplementary Figure S2), which correspond well with the expected vaginal microbiome 29. Across all 15 samples, the bacterial proportion varied from the corresponding control experiments on average by 1.03 fold, indicating a similar representation of genus abundance levels by the depletion experiment (Fig. 4). Low-abundance genera with read counts between 30 and 100 showed higher discrepancies (± 0.23 fold). Higher read counts showed higher reliability (e.g., 500 to 1000 reads: ± 0.03 fold; > 1000 reads: ± 0.04 fold). This higher discrepancy in genera with a small number of reads is expected, as the influence of experimental variability is more prominent. Using the applied cutoff, we did not detect any organisms that were found by only one method. Overall, the control and depletion experiments yielded highly similar genus abundance levels.
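The genus-level comparison described above can be sketched as follows. The read counts are invented for illustration, and requiring at least 30 reads in both the control and the depletion experiment is our assumption about how the cutoff might be applied.

```python
def genus_fold_changes(control_reads, control_total,
                       depletion_reads, depletion_total, min_reads=30):
    """Ratio of genus proportions (depletion / control) for one sample.

    control_reads / depletion_reads: dict genus -> classified read count.
    control_total / depletion_total: all reads of the run; for the depletion
        run this includes 'rejected' reads, as described in the text.
    min_reads: cutoff; assumed here to apply to both experiments.
    """
    folds = {}
    for genus, control_count in control_reads.items():
        depletion_count = depletion_reads.get(genus, 0)
        if control_count < min_reads or depletion_count < min_reads:
            continue  # too few reads for a reliable comparison
        control_share = control_count / control_total
        depletion_share = depletion_count / depletion_total
        folds[genus] = depletion_share / control_share
    return folds


# Invented counts: the depletion run produced 1.7 times more reads overall.
control = {"Lactobacillus": 2000, "Gardnerella": 20}
depletion = {"Lactobacillus": 3400, "Gardnerella": 45}

folds = genus_fold_changes(control, 100_000, depletion, 170_000)
print(folds)
```

In this example Gardnerella is excluded because it falls below the cutoff in the control, while Lactobacillus keeps the same proportion in both runs.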
In addition to the abundance comparison, we assessed the Bray-Curtis dissimilarity and the Spearman correlation for a more robust statistical analysis. Low Bray-Curtis dissimilarity values for all samples (0.005-0.077) indicate high similarity between each pair of metagenomes (control and depletion), supporting the previous findings based on the abundance comparison alone (Table 2). The Spearman correlation test resulted in statistical significance (P values: 2.02E−05 and 1.92E−06) and positive coefficients (rho: 0.98 and 0.93) for samples 1 and 9 (Table 2). In other words: if, e.g., Lactobacillus is highly abundant in the control of sample 1, it is also highly abundant in the corresponding depletion experiment. Samples 3, 10 and 15 were not included in the Spearman correlation calculation since only one pairwise case (organism) occurs in these samples. The P value indicates no statistical significance for the remaining samples, probably due to the low number of pairwise cases (< 10).
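Both statistics can be computed directly from paired genus proportions. The following is a minimal pure-Python sketch using the standard definitions (Bray-Curtis dissimilarity as the sum of absolute differences divided by the sum of all values, Spearman's rho as the Pearson correlation of average ranks); the abundance values are invented, and the P value of the correlation test is not computed here.

```python
from math import sqrt


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance profiles (dicts)."""
    genera = set(a) | set(b)
    numerator = sum(abs(a.get(g, 0.0) - b.get(g, 0.0)) for g in genera)
    denominator = sum(a.get(g, 0.0) + b.get(g, 0.0) for g in genera)
    return numerator / denominator if denominator else 0.0


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation coefficient for paired observations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired observations")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    covariance = sum((p - mean_x) * (q - mean_y) for p, q in zip(rx, ry))
    spread_x = sqrt(sum((p - mean_x) ** 2 for p in rx))
    spread_y = sqrt(sum((q - mean_y) ** 2 for q in ry))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("rho is undefined when all values are tied")
    return covariance / (spread_x * spread_y)


# Invented genus proportions for one sample.
control = {"Lactobacillus": 0.60, "Gardnerella": 0.30, "Ureaplasma": 0.10}
depletion = {"Lactobacillus": 0.58, "Gardnerella": 0.31, "Ureaplasma": 0.11}

genera = sorted(control)
dissimilarity = bray_curtis(control, depletion)
rho = spearman_rho([control[g] for g in genera], [depletion[g] for g in genera])
print(dissimilarity, rho)
```

The sketch refuses to compute rho for fewer than two pairwise cases, which mirrors why samples with a single shared organism cannot be included in the correlation analysis.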

Conclusion
The enrichment of targets in human cells 30-34 or species in mock communities 31,35 or fecal samples of lions 36 via adaptive sampling was previously demonstrated. In the present study, we used clinical metagenomic samples obtained from vaginal swabs of pregnant women to evaluate how well adaptive sampling depletes the high content of human DNA that heavily impairs downstream microbiome analyses. Our results demonstrated that ONT's adaptive sequencing feature reliably increased the overall sequencing depth of bacterial sequences in clinical metagenomic samples without changing the microbial composition when a human reference genome was provided to deplete the human DNA during sequencing. In contrast, the enrichment experiment removed a considerably larger share of human reads but changed the overall identified bacterial composition by also depleting other microbial sequences, illustrating that the enrichment method may be poorly suited for some microbiome studies.
Currently, to increase the sequencing depth for metagenomic samples with low microbial material, several sequencing runs per sample or sequencers with higher throughput (e.g., PromethION in case of ONT) are necessary, besides wet laboratory methods to deplete the host DNA. In our work, ONT's adaptive sampling method achieved a 1.7 fold increase in sequencing depth for samples with high human DNA contamination, which increases the sensitivity of taxonomic profiling by providing more sequencing data 2. Moreover, adaptive sampling can be used in addition to wet lab procedures to increase sensitivity further 37. This makes molecular monitoring of human reservoirs with low microbial concentrations and high host DNA loads (e.g., nasal swabs, sputum, or skin swabs) more feasible. In this manner, adaptive sampling improved the detection of organisms beneficial to the vaginal microbiota, such as Lactobacillus, and of potentially pathogenic organisms, such as Streptococcus, Gardnerella, or Ureaplasma, which may be harmful in pregnancy. In the future, therapeutic measures could be derived from the presence or ratio of certain species, improving vaginal microecological diagnostics and enabling clinically relevant insights into eubiosis or dysbiosis during pregnancy. On a side note, due to the increased sequencing depth via adaptive sampling, more samples can be barcoded and sequenced simultaneously, reducing the total cost per sample in a diagnostic laboratory. However, human DNA could not be completely removed, so the raw data always contained human sequences, which poses an ethical problem. Although ONT adaptive sequencing could still be used for diagnostic purposes, patient consent is required for scientific purposes or data upload to public repositories like the National Center for Biotechnology Information (NCBI) or the European Nucleotide Archive (ENA).
In turn, removing human sequences by different bioinformatics approaches poses hurdles for institutions and hospitals without adequate bioinformatics support. However, the research field of nanopore sequencing is evolving rapidly and dynamically, and ONT may address current limitations in the foreseeable future. In addition, continuous improvements in raw data accuracy and new chemistries mean that less sequencing depth is needed for reliable results 38 .
The present work can help to decide which adaptive sampling approach is best suited for analyzing specific clinical samples and questions. For example, when sequencing metagenomes with unknown microbial composition and host contamination, the depletion method is a better choice. On the other hand, the enrichment method might be helpful for metagenomics if only certain bacterial species are of interest as the higher rejection rates increase the sequencing depth further. Combining the real-time sequencing data stream of ONT with automated analysis pipelines, the turnaround time from sample collection to analysis and an appropriate treatment strategy can be reduced.
A few limitations must be noted. Our results were based on the library preparation with PCR amplification, for which we did not observe a noticeable bias when sequencing the microbial community standard. Still, these results might be different in other specimens. Due to the DNA isolation via bead beating and the PCR-based library preparation, we sequenced short DNA fragments of approximately 2500 bp only. Longer DNA fragments might improve the adaptive sampling's decision-making, further increasing overall sequencing depth. Furthermore, the user should carefully select the provided reference genome(s) during adaptive sampling as, e.g., another human reference might slightly improve or worsen the overall depletion performance. Finally, raw read accuracy is currently at around 97.5% and might impact the read to reference mapping during sequencing and thus the adaptive sampling accuracy.
We strongly believe that adaptive sampling will prove exceptionally useful within clinical research and the individual microbiological diagnostic approach in routine diagnostics. The increased information depth for compartment-specific human microbiomes in a physiologic and pathophysiologic context may change paradigms of anti-infective therapies in a personalized, risk-stratifying manner.

Methods
All methods were performed in accordance with the relevant guidelines and regulations.
Library preparation. DNA quantification steps were performed using the dsDNA HS assay for Qubit (Invitrogen, US). DNA of the microbial community standard was size-selected by cleaning up with 0.45 × volume of Ampure XP buffer (Beckman Coulter, Brea, CA, USA) and eluted in 50 μl EB buffer (Qiagen, Hilden, Germany). The library was prepared from 1 µg input DNA using the SQK-LSK109 kit (Oxford Nanopore Technologies, Oxford, UK) and 5 ng using the SQK-RPB004 kit (Oxford Nanopore Technologies, Oxford, UK), according to the manufacturer's protocol. The sequencing library of clinical vaginal samples was prepared from 5 ng input DNA using the SQK-RPB004 kit (Oxford Nanopore Technologies, Oxford, UK), according to the manufacturer's protocol.
Nanopore sequencing. The microbial community standard was sequenced on the GridION using FLO-MIN106D flow cells and the minknow-core-gridion:4.1.2 software (all Oxford Nanopore Technologies). The standard 48-h script with active channel selection was applied.

Bioinformatics analysis.
Sequencing data of the microbial community standard were mapped using minimap2 v.2.19 against the reference organisms' sequences provided by ZymoBIOMICS 23. Subsequently, the base pairs sequenced per organism and per sequencing method were compared to each other.
Ethical approval and patient consent. This study was approved by ethical committees at University Hospital Jena (No. 2018-1183), University Hospital Halle/Saale (No. 2019-012), and University Hospital Rostock (No. A 2019-0055). Eligible women, who participated in the study, were informed about the study, applied procedures, and any risks due to sampling by a physician and gave their written consent. Participants were also informed that they could withdraw from the study at any time.

Data availability
The centrifuge database is available at: https://osf.io/5zv8t/. The read data without human reads are available under the following BioProject accession number: PRJNA799199 (SRA accession numbers: SRR17688764-SRR17688740).