Introduction

The Oxford Nanopore Technologies (ONT) MinION sequencer represents a significant paradigm shift in the reach, applicability, and capability of nucleic acid sequencing technology1. Combining a portable form factor, simple library prep, long-read capability (kb to Mb)2, direct RNA sequencing3, and real-time data output, the MinION has been variously applied to forensic genotyping4, bacterial typing5, plant biology6, food safety7, environmental metagenomics8,9, cancer research10,11, antibiotic resistance studies12,13 and de novo genome assembly14,15,16. The small operational and logistical footprint of the MinION, combined with its real-time capabilities17, makes it uniquely suited to diagnostics and surveillance in clinical and field-forward settings, where the MinION has already been applied to assay Ebola18,19, Zika20, tuberculosis21, and other pathogens22,23,24,25.

Despite these successes, nanopore sequencing-based diagnostics still face the “needle in a haystack” problem of obtaining sufficient coverage of low-abundance target from a high-abundance background (e.g., pathogen/host, cancer/nontumor) sample26. While bacterial culture provides enriched quantities of genetic material in some applications27, culture-independent molecular biology-based target enrichment and background depletion methods28 including amplification29 and hybridization capture approaches30 are increasingly being adapted for use in library preparation to yield “targeted” or “selective” sequencing31,32. Nearly all such methods require a priori knowledge to guide the design of the target-sequence-specific primers, baits, or probes required for selection.

Unique to the Oxford MinION, real-time selective sequencing was first introduced by Loose and colleagues in 201633, offering a promising alternative to these molecular biology-based enrichment approaches. Dubbed “Read Until”, the method capitalizes on the real-time data output and discretely addressable nanopore architecture of the MinION to enable selection of individual DNA molecules. Read Until makes it possible to preview the real-time data associated with DNA traversing a given nanopore, and if it fails to meet some user-defined selection criteria, reject that read by reversing the pore bias and physically ejecting the DNA (i.e., “unblocking” the pore). DNA meeting the criteria sequences to completion as usual, with selection producing a net enrichment of target versus non-target reads in the final sequence pool. Read Until sequence-based selection has no clear precedent in the literature, the closest analogs being size-based34 and methylation-based35 DNA sorting in nanochannels, while most “single-molecule sorting” methods principally consist of surface immobilization coupled with molecular-resolution fluorescence imaging36.

In the original Read Until implementation, Loose applied a dynamic time warping (DTW) algorithm to pattern-match the live current trace “squiggle” output by the MinKNOW sequencing software against a reference squiggle synthesized from the (ACGT) target sequence of interest33. The method was successfully executed at a time when the MinION sequencing rate was 70 bases/s (it is now 450 bases/s) using a 22-core server to select for 5 kb portions of lambda DNA and to normalize coverage among 2 kb amplicons. Subsequent work developed a statistical model for optimizing DTW selection37. Here we introduce a new implementation of real-time selective sequencing based on Loose’s original framework: Read-Until with Basecall and Reference-Informed Criteria (RUBRIC). Rather than pattern-matching event traces, RUBRIC relies on real-time basecalling and alignment to conventional ACGT-type reference sequences, providing significant benefits to speed, scalability, and operational flexibility. Moreover, RUBRIC is specifically designed to function with the more modest computing resources typical of portable or point-of-need MinION-based activities rather than high-end multiprocessor workstations or cluster computing platforms. In addition to characterizing the operation of the RUBRIC architecture for a series of proof-of-concept experiments, we also propose a predictive model evaluating the likely limits of real-time selection performance generally across a range of potential sample types and use cases.

Methods

RUBRIC implementation and operation

Figure 1 shows the RUBRIC real-time selection architecture, implemented with off-the-shelf, ethernet-linked laptop and desktop PCs, while Table 1 summarizes all RUBRIC experiments discussed here. Built upon the original Read Until sample code provided by Loose33, RUBRIC integrates ONT’s Nanonet basecaller (v2.0.0, included with the RUBRIC code as noted below) and replaces DTW-based target pattern-matching with sequence-based alignment using LAST (rev 759)38. For each sequencing experiment, initial MinKNOW calibration and multiplex scans were performed, MinKNOW sequencing was initiated, and RUBRIC scripts were then started on the desktop PC. Depicted in Fig. 1, the general RUBRIC control flow consisted of receiving batches of read events from the Read Until Event Sampler, formatting those events for basecalling by Nanonet, aligning the results against a desired target reference sequence with LAST, and parsing its output to make skip/sequence determinations which were then communicated to MinKNOW via the Read Until API. LAST arguments used in the RUBRIC selection process are shown in Table 1. For all experiments, the Event Sampler was set to ignore the first 100 (typically lower fidelity39) events of each processed read and then transmit an “evaluation window” comprising the next 300 events (600 for run G, see Table 1) as the input to the RUBRIC selection process. During all experiments, the RUBRIC scripts logged relevant Event Sampler read information for method improvement and downstream reconciliation with offline Albacore basecall and BWA alignment results.
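The control flow described above can be illustrated with a minimal Python sketch. This is not the RUBRIC source code: the function names `evaluation_window` and `decide`, the representation of events as a plain sequence, and the `basecall` and `hits_target` callables (stand-ins for Nanonet basecalling and LAST alignment parsing) are all assumptions made for illustration. Only the 100-event skip and the 300-event window (600 for run G) are taken from the text.

```python
from typing import Callable, List, Optional, Sequence

SKIP_EVENTS = 100    # leading, typically lower-fidelity events ignored (from the text)
WINDOW_EVENTS = 300  # evaluation window size (600 was used for run G)


def evaluation_window(events: Sequence[float],
                      skip: int = SKIP_EVENTS,
                      window: int = WINDOW_EVENTS) -> Optional[List[float]]:
    """Return events [skip, skip + window), or None if too few events exist yet."""
    if len(events) < skip + window:
        return None
    return list(events[skip:skip + window])


def decide(events: Sequence[float],
           basecall: Callable[[List[float]], str],
           hits_target: Callable[[str], bool],
           window: int = WINDOW_EVENTS) -> str:
    """Hypothetical skip/sequence decision for one read.

    `basecall` and `hits_target` are caller-supplied placeholders; they do not
    reproduce the Nanonet or LAST interfaces.
    """
    win = evaluation_window(events, window=window)
    if win is None:
        return "undecided"
    bases = basecall(win)
    return "sequence" if hits_target(bases) else "skip"
```

In such a sketch, a "skip" result would be relayed to MinKNOW as an unblock request, while "sequence" and "undecided" reads would simply continue sequencing.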

Figure 1

Schematic of the RUBRIC workflow illustrating the division of computational effort between two garden-variety PCs: a laptop that runs the MinION sequencer and its MinKNOW software interfaced through the Read Until API (via ethernet) to a desktop system that performs the key RUBRIC operations of pre-screening reads for admission to the decision process, basecalling and aligning reads to nucleic acid target reference(s) in real-time, and communicating any resulting skip/reject decisions back to MinKNOW.

Table 1 Summary of RUBRIC experiments and parametric variations for preliminary lambda DNA experiments A1-B1, mainline EagI-digested Lambda DNA experiments B2-E2, and example use case experiments F and G in which Cas9-cut rDNA was selected from E. coli gDNA and E. coli gDNA was selected from human gDNA, respectively.

Despite processing only a short initial portion of each read (~150 bases from 300 events), successfully implementing RUBRIC with garden-variety PCs necessitated careful conservation of limited computing resources. In addition to running RUBRIC on a dedicated desktop machine, Fig. 1 illustrates the additional steps that were taken to control the volume and optimize the relevance of reads admitted to the RUBRIC decision process. First, in all experiments detailed here, RUBRIC selection was applied only to even-numbered pores, while odd pores were allowed to sequence normally, providing an internal control. Second, a threshold filter was implemented by quickly computing the mean or standard deviation (Supplementary Section S2) of pore current for the evaluation window, and on that basis, excluding from selection reads that were empirically determined to be unlikely to yield mappable fast5 sequence files. Lastly, a queue was implemented to: 1) constrain the number of event traces passed to RUBRIC at a given time to avoid overwhelming available computing resources and 2) screen reads that spent too long in the queue from entering the decision process. Queue size varied between 12 and 24 reads (Table 1), but in all experiments, reads spending more than 2 seconds in the queue were deemed too old for a timely decision to be rendered, and therefore bypassed selection. As Fig. 1 indicates, during the RUBRIC development and characterization process, the default for any reads not admitted to the selection process (i.e., odd, out-of-threshold, timeout, and otherwise “undecided” reads) and for reads receiving an affirmative “sequence” decision was to sequence as usual. Only reads receiving a “skip” decision resulting in ejection by pore polarity reversal (unblocking) were not sequenced by default.
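The three pre-screening steps described above (even-pore selection, a current-based threshold filter, and a bounded queue with a 2 second timeout) can be sketched as follows. This is an illustrative outline rather than the RUBRIC implementation: the names `is_selected_pore`, `in_threshold`, and `ReadQueue` are hypothetical, the two-sided `low`/`high` threshold bounds are an assumption (the actual settings are in Table 1 and Supplementary Section S2), and refusing new reads when the queue is full is one possible policy that the text does not specify.

```python
import statistics
import time
from collections import deque
from typing import Callable, Deque, Optional, Sequence, Tuple

QUEUE_TIMEOUT_S = 2.0  # reads waiting longer than this bypass selection (from the text)


def is_selected_pore(channel: int) -> bool:
    """Selection is applied to even-numbered pores; odd pores act as the control."""
    return channel % 2 == 0


def in_threshold(currents: Sequence[float], low: float, high: float,
                 use_std: bool = True) -> bool:
    """Check the window's current statistic against assumed lower/upper bounds."""
    stat = statistics.pstdev(currents) if use_std else statistics.mean(currents)
    return low <= stat <= high


class ReadQueue:
    """Bounded FIFO that drops reads which waited longer than the timeout."""

    def __init__(self, max_size: int = 24, timeout_s: float = QUEUE_TIMEOUT_S,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_size = max_size
        self.timeout_s = timeout_s
        self._clock = clock
        self._items: Deque[Tuple[float, object]] = deque()

    def offer(self, read: object) -> bool:
        """Enqueue a read; return False if the queue is full (assumed policy)."""
        if len(self._items) >= self.max_size:
            return False
        self._items.append((self._clock(), read))
        return True

    def take(self) -> Optional[object]:
        """Return the oldest read still within the timeout, discarding stale ones."""
        while self._items:
            queued_at, read = self._items.popleft()
            if self._clock() - queued_at <= self.timeout_s:
                return read
        return None
```

Any read rejected by these checks would, per the default described above, simply sequence as usual rather than being unblocked.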

Software and computing architecture

After a preliminary experimental iteration using two laptop PCs (Table 1, runs A1-A2), the final and preferred RUBRIC sequencing setup (Fig. 1) consisted of an off-the-shelf HP Elitebook 820 G3 laptop with 4 cores (Intel® Core™ i7-6500U CPU @ 2.5 GHz, 16 GB RAM, Samsung MZNLN512HCJH-000H1 477GB SCSI SSD) connected by USB to a MinION Mk1B sequencer and by 2-foot Cat-5e Ethernet cable to a Dell Optiplex 9020 desktop with 8 cores (Intel® Core™ i7-4790 CPU @ 3.6 GHz, 16 GB RAM, Samsung 850 2TB SCSI SSD). Oxford MinKNOW version 1.6.11 sequencing software was run on the laptop for all experiments other than run G (v1.11.5), while the desktop system provided the additional computing power needed to implement RUBRIC real-time basecalling, alignment, and selection functions concurrently with sequencing. No other computing resources were used within the RUBRIC control loop. RUBRIC software communicated with MinKNOW’s Event Sampler via the Read Until API (v1) to acquire event data and provide rejection instructions in real time. Both computers operated in Windows 10, and the desktop was placed into Safe Mode during runs to prevent CPU usage by background processes and services. After sequencing, all data were basecalled offline using Albacore v1.2.6 (v2.2.4 for run G) and post-run alignment was performed using BWA v0.7.12-r1039 (with ‘mem -x pacbio’ arguments) on Sandia’s Biota computing cluster. While BWA was used for offline alignment and classification of output MinION reads, LAST was selected for use inside the RUBRIC control loop due to its speed and the comparative ease of integrating it into the real-time workflow. Downstream data analysis and visualization were performed using custom Python scripts (pandas, numpy, matplotlib, seaborn), custom R scripts, and Microsoft Excel.

Sample preparation and experimental variations

Lambda DNA Experiments

To provide a test case for RUBRIC selection, lambda-phage DNA (cat # N3011S, New England Biolabs (NEB), Ipswich, MA) was digested using the EagI enzyme (NEB, cat # R3505S) to produce three large DNA fragments of roughly similar size (20 kb, 17 kb, and 12 kb). Digestion was performed per NEB protocol in a 50 μL reaction, and the product was purified using phenol:chloroform. The 17 kb fragment was chosen as the target for RUBRIC selection, while reads not matching its sequence were skipped. For all lambda DNA experiments (A1-E2 in Table 1), digested samples were prepared using ONT’s 1D ligation kit (SQK-LSK108) and loaded into SpotON flow cells (FLO-MIN107, used for all experiments in this article) using methods described in the kit’s accompanying protocol. DNA concentrations were measured using a Qubit Fluorimeter (Thermo Fisher, Waltham, MA).

Table 1 summarizes the progression of experimental parameter variations through sequential RUBRIC experiments, with letters differentiating experiments performed on different days and numbers indicating successive RUBRIC runs with the same loaded sample (but different RUBRIC settings) on a given day. Datasets indicated with an asterisk (*) have been time-filtered as explained in Supplementary Section S3 to eliminate data from periods during which skip decisions failed to properly reject DNA. Experiments A1, A2, and B1 are included primarily for comparison, reflecting the earliest parametric iterations and system configurations, and are therefore not representative of typical RUBRIC performance. Accordingly, aggregate results distinguish between “mainline” results associated with the preferred RUBRIC system configuration (N = 5, runs B2-E2), and the set of all lambda experiments (N = 8, A1-E2). Non-lambda DNA runs F and G, described below, are preliminary proof-of-concept examples applying RUBRIC in use cases potentially relevant to pathogen diagnostics.

To summarize the variations tested for lambda DNA, runs A1 and A2, performed using two equivalent, Ethernet-coupled laptops, tested the effect of changing the settings of the LAST aligner used in the RUBRIC control loop. Experiment B1 used the same settings but implemented RUBRIC on ethernet-linked laptop and desktop machines, while B2 revealed the benefit of operating the RUBRIC-running desktop in Safe Mode. Experiment C used a previously prepared frozen library and reduced the queue size from 24 to 12. Experiment D increased the queue to 16 and adjusted the mean current-based threshold with a fresh digest and library prep. Experiment E1 implemented a standard deviation-based threshold for a frozen library, and experiment E2 further adjusted that threshold.

E. coli Ribosomal DNA Experiment

While long-fragment lambda DNA proof of concept experiments facilitated early RUBRIC optimization and troubleshooting efforts, we also performed preliminary experiments to assess the potential of RUBRIC selection in more realistic applications, specifically with an eye toward bacterial pathogen diagnostics. In experiment F, inspired by conventional bacterial ribotyping, guide RNAs for CRISPR/Cas9 cutting were designed to target the 5′ end of the 16S and the 3′ end of the 23S ribosomal DNA (rDNA) loci of E. coli (Accession number: NC_000913) to excise the ~5 kb 16S-23S region of the rDNA locus. Single-molecule guide RNA (sgRNA) templates were generated by polymerase chain reaction (PCR) (16S primer 5′-M-TGGCTCAGATTGAACGCTGG-N-3′ and 23S primer 5′-M-CGCCCAAGAGTTCATATCGA-N-3′, where M = 5′-GGATCCTAATACGACTCACTATAG-3′ and N = 5′-GTTTTAGAGCTAGAA-3′) to yield a single chimeric template containing the crRNA, tracrRNA, and a T7 promoter sequence as described by Anders40. sgRNAs were transcribed in vitro using the TranscriptAid T7 High Yield Transcription Kit (Thermo Fisher, cat # K0441) according to manufacturer’s protocol. Guide RNAs were purified using MEGAclear Transcription Clean-Up Kit (Thermo Fisher/Ambion, cat # AM1908) according to manufacturer’s protocol and diluted to 300 nM.
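The primer notation above (5′-M-target-N-3′) expands to full oligonucleotide sequences by simple concatenation, as the short Python sketch below shows. The sequences are copied from the text; the variable and function names (`M`, `N`, `TARGETS`, `full_primer`) are introduced here only for illustration.

```python
# Flanking sequences and target-specific portions as given in the text.
M = "GGATCCTAATACGACTCACTATAG"  # 5' extension carrying the T7 promoter
N = "GTTTTAGAGCTAGAA"           # 3' extension shared by both primers

TARGETS = {
    "16S": "TGGCTCAGATTGAACGCTGG",
    "23S": "CGCCCAAGAGTTCATATCGA",
}


def full_primer(target: str) -> str:
    """Assemble the 5'-M-target-N-3' primer as one 5'->3' string."""
    return M + target + N


if __name__ == "__main__":
    for locus, target in TARGETS.items():
        primer = full_primer(target)
        print(f"{locus}: 5'-{primer}-3' ({len(primer)} nt)")
```

Each assembled primer is 59 nt long (24 + 20 + 15).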

For the CRISPR/Cas9 digest, a 90 μL reaction was prepared by mixing 9 μL of 10X Cas9 Nuclease Reaction Buffer (NEB), 30 nM gRNA1 (targeting 16S region), 30 nM gRNA2 (targeting 23S region) and 30 nM SpyCas9 Nuclease (NEB, cat # M0386S). After a 15 min incubation to form the ribonucleoprotein complex, 10 μg of bacterial genomic DNA was added and the reaction incubated at 37 °C for 4 hours. 1 μL of proteinase K (Thermo Fisher, AM2548) was added and the reaction incubated at 65 °C for 15 minutes. DNA was purified using Agencourt AMPure XP beads (cat # A63881, Beckman-Coulter, Brea, CA) according to manufacturer’s protocol. Library preparation was performed per ONT protocol using the 1D² ligation kit (SQK-LSK308), and RUBRIC targets were set to select for the 16S-23S rDNA sequences (NCBI).

Mixed Human/E. coli Experiment

The second example use case, experiment G, sought to select for 1% E. coli genomic DNA against a background of 99% human DNA (HeLa, NEB, cat# N4006S) in a sample mixed prior to library preparation. Escherichia coli K12 MG1655 (ATCC, Manassas, VA) culture was grown overnight in LB media at 37 °C with shaking at 250 rpm. 1 mL aliquots were spun down to make the bacterial pellet, and cells were lysed using Qiagen lysis buffer (Qiagen, Redwood City, CA) with added Proteinase K and RNase A (Thermo Fisher). The lysate mixture was incubated for 15–30 min at 50 °C. Pure genomic DNA was extracted using the phenol:chloroform extraction method. Briefly, one volume of phenol:chloroform:isoamyl alcohol (25:24:1) (Sigma-Aldrich, St. Louis, MO) was added to the lysate mixture and the samples were centrifuged at room temperature for 10 minutes at 16,000 × g. The upper aqueous phase was transferred to a fresh tube and the DNA was precipitated by the addition of 0.1 volumes 3 M sodium acetate (pH 5.0) and 2.5 volumes of 100% ethanol. The samples were stored at −20 °C overnight to precipitate the DNA. The DNA was pelleted at 4 °C for 15–30 minutes at 16,000 × g and the DNA pellets were washed twice with 500 μL of 70% ethanol. The DNA pellets were dried at room temperature for 5–10 minutes and resuspended in nuclease free water, and library preparation was accomplished using a RAD004 rapid kit per ONT protocol. During RUBRIC operation, reads were LAST-aligned in real-time against the entire 4.6 Mb E. coli K12 genome (NCBI) as the selection target. As noted in Table 1, for experiment G the evaluation window was increased from 300 to 600 events to enable greater discrimination between bacterial and human sequence, and LAST stringency was reduced to capture as many rare target reads as possible.

Results

Data flow analysis and lambda DNA results

Figure 2 illustrates the detailed data flow analysis approach used to evaluate even pore RUBRIC selective sequencing performance in comparison to the internal control provided by non-selecting odd channels for representative lambda DNA experiment B2. Equivalent Sankey diagrams for all other experiments (and filtered datasets) are provided in Supplementary Fig. S9 with results summarized in Supplementary Fig. S1. Table 2 compares performance metrics for the runs.

Figure 2

Sankey chart depicting read and fast5 sequence file data flow analysis for Experiment B2. Because the target lambda DNA fragment was a subset of the overall lambda (background) sequence, no reads mapped exclusively to the target, and therefore all correctly mapped target reads appear in the “both” category at the 3-pronged terminal ends of each chart branch. Undecided read counts shown here include both reads that timed-out of the decision process (>2 seconds in the queue) and those that did not otherwise receive a decision.

Table 2 Performance metrics for RUBRIC selective sequencing experiments including preliminary lambda DNA runs A1 through B1, mainline lambda experiments B2 through E2, and application examples F and G.

Figure 2 underscores the importance of such detailed analysis, as simply comparing target- and background-mapping fast5 ratios for odd (10,881:20,761) and even pores (14,312:23,865) can be misleading. Despite an apparent 32% increase in RUBRIC target reads, only 68% of those reads (fewer than the count of odd target reads) resulted from sequence decisions, while 17% were actively skipped or diverted from the decision process by the threshold filter. The remaining 15% never received a decision, most because they were not reported to RUBRIC by the Event Sampler. We now discuss the read fractions represented in Fig. 2, referencing individual results of experiment B2 (Figs 2–4(a)) and aggregate results of the other lambda DNA experiments (Table 2, Supplementary Figs S1–S3, S7 and S9–S10).
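The arithmetic behind the naive comparison for experiment B2 can be reproduced in a few lines of Python. The read counts and the 68% figure are those quoted above; the variable names are illustrative, and the product below is only an approximate count because 68% is itself a rounded value.

```python
# Target:background fast5 counts for experiment B2, as quoted in the text.
odd_target, odd_background = 10_881, 20_761     # control (odd) pores
even_target, even_background = 14_312, 23_865   # RUBRIC (even) pores

# Naive comparison: apparent gain in target-mapping reads on even pores.
apparent_increase = even_target / odd_target - 1

# Approximate number of even target reads that came from "sequence" decisions.
from_sequence_decisions = 0.68 * even_target

if __name__ == "__main__":
    print(f"apparent increase in target reads: {apparent_increase:.1%}")
    print(f"target reads from sequence decisions: ~{from_sequence_decisions:.0f}")
    print(f"fewer than odd target reads: {from_sequence_decisions < odd_target}")
```

The apparent gain therefore comes from reads that bypassed or escaped selection rather than from the sequence-decision pool itself.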

Figure 3

Lambda DNA sequence coverage plot for experiment B2 showing the effect of RUBRIC selection applied to even pore reads in contrast to unselected odd pore reads. Even and odd coverage numbers are normalized by total even and odd active pore times, respectively.

Figure 4

Read length histograms for RUBRIC selection experiments illustrating the distribution of different read types (target, non-target, unmapped) and their fate as a function of RUBRIC selection applied to even numbered pores. Here, reads excluded by the selection process (i.e. not receiving an affirmative sequence decision) include skipped, out-of-threshold, and undecided reads, while reads not mapped to target include those mapped to background/non-target sequence as well as unmappable reads. (a) Lambda DNA experiment B2 showing selection for the middle (nominally ~17 kb) fragment. (b) Example use case dataset F* showing selection for Cas9-excised rDNA from E. coli gDNA. (c,d) Example use case dataset G* showing selection of 1% E. coli gDNA from a background of 99% human gDNA. Supplementary Fig. S10 provides more detailed distributions of all read types and categories.

Sampled Reads

The character of reads communicated to RUBRIC by the Read Until Event Sampler is best represented by odd pore (control) reads, which exhibited average fragment lengths of 8007 ± 5882 nucleotides (nt) and Albacore quality scores (sequencing_summary.txt-derived “mean_qscore_template”) of 9.52 ± 2.00 for n = 214,445 fast5s from N = 8 lambda experiments (Supplementary Fig. S2).

Unsampled Reads

A small percentage (0.62% ± 0.42%, N = 8 runs) of reads had fast5 files but lacked Event Sampler entries in the RUBRIC log and were therefore unavailable for selection. These “unsampled” reads typically had quality scores (9.13 ± 2.26, n = 34,455 fast5s, N = 8 runs) and proportions of target, non-target, and unmappable reads comparable to the sampled control population (Fig. 2, Supplementary Figs S2 and S9). The short length (583 ± 206 nt, n = 34,455 fast5s, N = 8 runs) of most unsampled reads (Supplementary Fig. S10) suggests that they may result from DNA transiting the pore within the sampling period of the Event Sampler.

Non-Sequence Reads

As in Fig. 2, a consistently large proportion of control (odd) sampled reads (89.5% ± 1.89%, N = 8 lambda runs) never yielded fast5 sequence files. Pore activity timelines (data not shown) reveal that these “non-sequence” reads typically appear as serial, discretely reported events occurring between identifiable sequence-producing reads. The hypothesis that these non-sequence reads primarily indicate sub-sampling of open pore time (versus degraded DNA, pore fouling, etc.) is reinforced by our observation (data not shown) that setting RUBRIC to unblock all out-of-threshold (predominantly non-sequence) reads produced no apparent change in even pore throughput. A related internal sampling artifact may cause the observed subdivision of long DNA reads2.

Uncalled Reads

The total number of fast5s that could not be basecalled offline by Albacore was essentially negligible, ranging from 0.0384% (A2) to 0.621% (D) with an average of 0.280% ± 0.246% (N = 8 lambda runs) and zero (0) sequence decision fast5s failing to basecall.

Mapped and Unmapped Reads

Supplementary Fig. S2 shows that odd unmapped reads exhibited significantly lower average quality scores (6.07 ± 1.26, n = 35,083 fast5s, N = 8 lambda runs) than reads mapping to target or background references (10.21 ± 1.23, n = 179,057 fast5s, N = 8 lambda runs) and were shorter on average (4082 ± 5556 nt vs. 8789 ± 5625 nt) than corresponding mappable reads.

Out-of-Threshold (OOT) Reads

Threshold filter settings (Table 1) were determined empirically from prior run data, requiring updates after any significant sample composition, flowcell batch, library prep, or ONT software changes. Generally, out-of-threshold fast5 quality score averages were about 15% lower than corresponding odd scores (Supplementary Fig. S2) and OOT reads about 30% shorter on average. While retrospectively-set thresholds for most mainline experiments successfully excluded 90–97% of ultimately unmappable (especially non-sequence) reads from the decision process, typically diverting >80% of even sampled reads, experiment C showed a lower out-of-threshold proportion (53.7%), rejecting only 56.6% of unmappable reads (Supplementary Fig. S9(f)). This poor threshold selectivity likely accounted for the unusually high in-threshold read/min rate of experiment C (43% higher than B2, Table 2), which in combination with its small queue, may explain its high proportion of undecided reads. Based on C, threshold adjustments in experiment D (Supplementary Fig. S9(g)) produced much improved threshold specificity, precision, and accuracy at the expense of reduced sensitivity (Table 2). Though not optimized when introduced in experiments E1 and E2 (Table 2, Supplementary Fig. S9(h–j)), thresholds based on pore current standard deviation proved superior to those based on mean current because the former helped to mitigate errors associated with current drift and other offsets (Supplementary Section S2).

Undecided and Timeout Reads

The presence of in-threshold reads not receiving skip/sequence decisions typically reflected a computational resource limitation affecting the MinKNOW or RUBRIC PCs. Table 2 indicates the fraction of undecided reads exceeding the 2 second RUBRIC queue timeout period. Excepting outlier experiment C, about 99% of in-threshold reads for mainline lambda experiments received decisions (Table 2). The high in-threshold read rate and poor decision efficiency of experiment C may indicate that as configured the RUBRIC system could effectively process 400–500 decisions/min, beyond which computing resource limitations became significant. Threshold filtering caused undecided reads to differ from control reads mainly in their lower, but variable proportion of non-sequence reads. Because undecided and timeout reads often appeared in localized clusters on the read timeline (see especially Supplementary Fig. S5(d)), this variability may reflect periods of unusually high read throughput that also affected whether fast5s were created by the MinKNOW PC.

Sequence Decision Reads

Table 2 details the performance of the RUBRIC decision process in rendering sequence decisions for target mapping reads and skip decisions for non-target reads. For experiment B2, Fig. 3 indicates the coverage of lambda (target and non-target) sequence with and without selection, while Fig. 4(a) illustrates selection as a function of DNA fragment length. On average for mainline lambda experiments, the decision process correctly excluded 97.7% ± 1.9% (N = 5) of non-target reads while capturing 91.4% ± 5.1% (N = 5 runs) of available targets, proportions that reflect both basecalling accuracy and the stringency of LAST aligner settings used within the RUBRIC control loop. On average, 98.5% ± 0.6% (N = 5) of sequence decision fast5s mapped to target, and even including the typically small proportion of unmapped fast5s (1.5% ± 0.6%), sequence decision quality scores (Supplementary Figs S2–S3) were better on average (10.46 ± 1.36, n = 42,191 fast5s) than the control sampled read population (9.51 ± 2.17, n = 1,690,891 fast5s). These results suggest that for diagnostic applications, data analysis should focus on sequence decision fast5s and consider other categories (i.e., undecided, unsampled, out-of-threshold, and skipped reads, in that order) only if coverage is lacking.
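For readers less familiar with the classification metrics reported in Table 2, the sketch below computes sensitivity, specificity, precision, and accuracy from decision counts using their standard definitions. It assumes that a "sequence" decision for a target-mapping read counts as a true positive and a "skip" decision for a non-target read as a true negative; how Table 2 treats unmapped reads is not reproduced here, and the function name `decision_metrics` and the counts in the usage example are hypothetical.

```python
from typing import Dict


def decision_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary classification metrics for skip/sequence decisions.

    tp: target reads given a "sequence" decision
    fp: non-target reads given a "sequence" decision
    tn: non-target reads given a "skip" decision
    fn: target reads given a "skip" decision
    """
    return {
        "sensitivity": tp / (tp + fn),   # fraction of targets captured
        "specificity": tn / (tn + fp),   # fraction of non-targets excluded
        "precision": tp / (tp + fp),     # fraction of sequence decisions on target
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


if __name__ == "__main__":
    # Hypothetical counts, not taken from any experiment in this article.
    for name, value in decision_metrics(tp=90, fp=20, tn=980, fn=10).items():
        print(f"{name}: {value:.3f}")
```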

Skip Decision Reads

While skipping ostensibly ejects DNA from the nanopore, on average 46.7% ± 6.1% of mainline experiment skip decisions nevertheless produced fast5s (N = 4, excluding outlier C, where the ill-set threshold admitted many non-sequence reads). Skipped-read fast5s occur for two primary reasons. First, when a skip instruction is received, MinKNOW assesses whatever read data has already been acquired and writes it to fast5 if it represents viable sequence (personal communication with ONT staff, 1-9-2018). When skipping is operating correctly with decision times substantially shorter than DNA pore-transit times, this data handling convention produces characteristic truncation of skipped reads visible in the even pore results of Fig. 4 and Supplementary Fig. S10 as a prominent mound of skipped reads typically centered in the 1500–2500 nt size range. Figure 3 also shows these skip-truncated reads as the higher-coverage “rabbit ear” features (also observed by Loose33) at the ends of the non-target lambda fragments. The absence of skip-truncation is an important indication that Read Until DNA rejection is not operating correctly, as discussed in Supplementary Section S3. Skip decision fast5s may also result when reads transit the pore before a RUBRIC decision can be rendered, whether due to relatively short DNA fragments or long decision times (see Supplementary Section S6). Unlike skip-truncated reads, which appear only in the even pore results of Fig. 4 and the like, reads short enough to escape the decision process in this manner are visible in both odd and even distributions, typically below 1000 nt. In combination, fugitive reads and skip-truncation yielded short average skipped-read lengths of 1373 nt ± 606 nt (n = 424,857 fast5s, N = 8 lambda runs), while average quality scores were 8.76 ± 2.50 (Supplementary Fig. S2).

Overall RUBRIC Performance

Table 2 reports absolute target enrichment on both a sequence- and read-basis. Overall, absolute enrichment results were not particularly encouraging, as only mixed sample run G realized both read and sequence enrichment (+15% sequence based on 66 reads for filtered dataset G*, Supplementary Fig. S9(n)), while lambda run B2 showed a nominal gain in read count (2.1%) but slight depletion (1.3%) of target sequence. Other runs saw net reductions in target sequence as great as 24.4% for lambda run E1 and 57.8% for filtered rDNA dataset F* (Supplementary Fig. S9(h,l), respectively).

To help understand these results, Supplementary Section S7 derives a model predicting the likely best-case performance of RUBRIC-style real-time selection for different libraries and computing configurations. In short, because selection only rejects non-target reads, absolute target enrichment is only realized by increasing the total throughput of (even) RUBRIC reads vs. (odd) control reads. Equation 6 in the supplement expresses the maximum absolute enrichment (and throughput enhancement) ratio

$$\frac{{N}_{sel}}{{N}_{0}}=\frac{{f}_{t}{t}_{t\_seq}+{f}_{bg}{t}_{bg\_seq}+{f}_{ns}{t}_{ns}}{{f}_{t}{t}_{t\_seq}+{f}_{bg}{t}_{skip}+{f}_{ns}{t}_{ns}}$$

as a function of read fractions ($f$) for target ($t$), background/non-target ($bg$), and non-sequence ($ns$) reads and the characteristic times required to sequence target reads ($t_{t\_seq}$) and background reads without selection ($t_{bg\_seq}$), skip background reads with selection ($t_{skip}$), and pass non-sequence reads independent of selection ($t_{ns}$). As the formula indicates, absolute enrichment is purely a consequence of the time saved by skipping versus sequencing background reads, scaled by their relative prevalence. Furthermore, low pore occupancy (large $f_{ns}t_{ns}$), as in the experiments described here (Table 2 and Supplementary Table S1), significantly diminishes the benefits of selection. Discrepancies between the empirically observed throughput and absolute enrichment ratios in Table 2 mainly reflect inefficiencies and imperfections in the RUBRIC selection process.
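To make the behavior of this ratio concrete, the following Python sketch evaluates the expression above for two hypothetical libraries. The function name `max_enrichment_ratio` and all numerical inputs are illustrative assumptions (times in arbitrary but consistent units); none are measured values from this study.

```python
def max_enrichment_ratio(f_t: float, f_bg: float, f_ns: float,
                         t_t_seq: float, t_bg_seq: float,
                         t_skip: float, t_ns: float) -> float:
    """Best-case N_sel / N_0 from the expression above.

    Numerator: mean pore time per read without selection.
    Denominator: mean pore time per read when background reads are skipped.
    """
    unselected = f_t * t_t_seq + f_bg * t_bg_seq + f_ns * t_ns
    selected = f_t * t_t_seq + f_bg * t_skip + f_ns * t_ns
    return unselected / selected


# Hypothetical case 1: fully occupied pores, 1% target, skipping 10x faster
# than sequencing a background read.
high_occupancy = max_enrichment_ratio(
    f_t=0.01, f_bg=0.99, f_ns=0.0,
    t_t_seq=10.0, t_bg_seq=10.0, t_skip=1.0, t_ns=0.0)

# Hypothetical case 2: same target:background mix, but 90% of sampled reads
# are non-sequence events that occupy pore time regardless of selection.
low_occupancy = max_enrichment_ratio(
    f_t=0.001, f_bg=0.099, f_ns=0.9,
    t_t_seq=10.0, t_bg_seq=10.0, t_skip=1.0, t_ns=10.0)

if __name__ == "__main__":
    print(f"high occupancy: {high_occupancy:.2f}x")
    print(f"low occupancy:  {low_occupancy:.2f}x")
```

Under these assumed inputs the ratio falls from roughly 9x to barely above 1x once non-sequence time dominates, which is the qualitative effect of low pore occupancy noted above.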

Beyond absolute enrichment, relative enrichment (Table 2) also provides a practical indication of how depleting non-target reads improves the final sequence pool. Computed as the ratio of sequence decision target reads per non-target read divided by the ratio of odd target reads per non-target read, relative enrichment ranges from ~130x to ~330x for mainline lambda experiments. This metric underscores the idea that sequence decisions yield such highly purified target-mapping sequence that in most use cases, significant time savings can be realized by analyzing only these reads.
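The relative enrichment definition given above translates directly into a ratio of ratios. The sketch below uses a hypothetical function name and hypothetical read counts chosen only to show the calculation; it does not reproduce any value in Table 2.

```python
def relative_enrichment(sel_target: int, sel_nontarget: int,
                        ctrl_target: int, ctrl_nontarget: int) -> float:
    """(sequence-decision target per non-target read) / (odd-pore target per non-target read)."""
    selected_ratio = sel_target / sel_nontarget
    control_ratio = ctrl_target / ctrl_nontarget
    return selected_ratio / control_ratio


if __name__ == "__main__":
    # Hypothetical counts: control pores see 1 target per 2 non-target reads,
    # while sequence decisions yield 49 target reads per non-target read.
    value = relative_enrichment(sel_target=98, sel_nontarget=2,
                                ctrl_target=100, ctrl_nontarget=200)
    print(f"relative enrichment: {value:.0f}x")
```

Note that the metric is undefined when no non-target reads receive a sequence decision, so a real analysis would need to handle a zero denominator.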

Example use cases

Figure 4(b) and Supplementary Figs S9(l) and S10(l) show the result of RUBRIC selection applied to Cas9-cut E. coli gDNA (dataset F*). The target-mapping peak associated with cut rDNA fragments is particularly prominent because 1) E. coli has seven copies of the rDNA locus and 2) the AMPure XP beads used in the 1D² library prep provide some positive size selection in the relevant 4–5 kb range. While RUBRIC rDNA-mapping reads were reduced 54% versus control, only 3.2% of mappable sequence decision reads mapped to background gDNA versus 89.3% in the control case, yielding relative enrichment of ~290x. Table 2 reveals suboptimal threshold settings for this run, which realized high specificity but low sensitivity with 38% of the relatively rare target reads falling out-of-threshold. Despite overly aggressive threshold filtering, skip/sequence decisions performed well and had the lowest average decision time (0.23 sec) of any experiment here (Supplementary Fig. S7 and Table S1), likely due to the shorter rDNA target reference and low read rates (Table 2) attributable to the relatively dilute library (Table 1).

Figure 4(c,d) and Supplementary Figs S9(n) and S10(o–p) show the result of E. coli selection in the mixed human/E. coli experiment (dataset G*). Despite LAST-aligning the RUBRIC evaluation window to the entire 4.6 Mb E. coli genome for selection, decision times still averaged only 0.91 sec (Supplementary Fig. S7 and Table S1). Significantly for this application, aligner stringency was reduced to maximize the number of rare bacterial reads that would be captured, while the evaluation window was doubled to provide additional discrimination between human and bacterial sequence. Consequently, while more sequence decision reads mapped to target (66 vs. 63 control), 42.1% of sequence decision fast5s did not map to target. Moreover, of 84 total even target reads, two were lost to the threshold filter and 17 to skip decisions, as indicated by the comparatively low decision sensitivity, precision, and accuracy for this run. Specificity, however, was comparable to the best seen here, reflecting the comparatively large number of correctly skipped non-target reads. Threshold settings for run G also performed better overall than for any other experiment. Beyond providing nominal absolute target enrichment, the run achieved ~290x improvement in sequence decision target:non-target ratio due to background depletion of the original 1:99 library.
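
For readers less familiar with the decision metrics cited here, the following sketch shows how sensitivity, specificity, precision, and accuracy relate to the four possible skip/sequence outcomes. It assumes the standard confusion-matrix definitions with "sequence" as the positive call; the exact definitions used for Table 2 are not restated in this section and may differ. The function name and the counts are hypothetical.

```python
def decision_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics for skip/sequence decisions.

    tp: target reads sequenced       fp: non-target reads sequenced
    tn: non-target reads skipped     fn: target reads skipped
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts for a run in which target reads are rare.
metrics = decision_metrics(tp=60, fp=40, tn=9880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In a rare-target setting such as this, specificity is driven by the large number of correctly skipped non-target reads, whereas sensitivity and precision depend on the much smaller pool of target reads and are therefore more sensitive to a handful of wrong calls.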

Discussion

In this article, we have introduced RUBRIC, a new adaptation of real-time selective sequencing for the Oxford MinION. Unlike the earlier pattern-matching approach33, RUBRIC operates in sequence-space, making it possible to leverage the speed, flexibility, and scalability of bioinformatic tools like LAST for selection. Significantly, RUBRIC pre-screening features seek to admit only informative and timely reads to the decision process, reducing computational requirements and enabling real-time basecalling, alignment, and selection of MinION reads without specialized, high-performance computing platforms. While real-time selective sequencing generally provides a means to enrich rare target sequence vs. background without target-specific reagents, primers, or baits, working in sequence-space simplifies the process of choosing, optimizing, and modifying RUBRIC selection targets, all of which can be done on-the-fly based on conventional nucleic acid reference sequences.

We have characterized RUBRIC operation through a series of lambda DNA digest experiments, obtaining limited absolute enrichment of target reads (<2%) but achieving very effective background depletion yielding as much as 330x relative enrichment versus control. The high degree of customization offered by RUBRIC (choice of basecaller/aligner, ratio of RUBRIC to control pores, threshold filter settings, queue size, queue timeout, evaluation window size/offset, aligner settings, etc.) makes it readily adaptable to different sample types, libraries, and computing configurations. Preliminary demonstration experiments have applied RUBRIC to select for CRISPR/Cas9-excised rDNA against a background of E. coli gDNA and to select for 1% E. coli gDNA against a background of 99% human DNA, achieving absolute target sequence enrichment of 15% in the latter case. To better understand these seemingly modest outcomes, we have proposed a model estimating the likely upper bounds on real-time selection performance and have found our results to be largely consistent with its predictions. This analysis suggests that the limited target enrichment we have seen to date is less a consequence of the speed or fidelity of our method than the relatively high rate of MinION pore vacancy, which critically limits the gains that can be realized by real-time selection.

Future work will focus on optimizing RUBRIC performance and applying the method to clinically and diagnostically relevant sample types (e.g., host/pathogen mixtures), where selection can provide the greatest benefits. In such applications, accumulating RUBRIC sequence decision reads could itself provide a rapid, presumptive diagnostic result, given sufficient specificity. These reads could also be used to prioritize which fast5s should receive concurrent full strand basecalling and analysis during sequencing, potentially shortening time to identification. With these goals in mind, we will seek to improve our library preparations to increase pore occupancy and DNA fragment length, both of which should substantially improve RUBRIC performance based on our model predictions. To avoid the pitfalls of retrospectively setting the RUBRIC threshold filter, we plan to automate this process, perhaps using real-time RUBRIC decision and mapping results to iteratively adjust the filter throughout each run. We also expect to migrate RUBRIC to the latest release of the Read Until developer API (v2), adapt the method for raw data or GPU basecalling (e.g., with ONT’s Scrappie or Guppy callers, respectively), and explore its application to MinION direct RNA sequencing.