Real-Time Selective Sequencing with RUBRIC: Read Until with Basecall and Reference-Informed Criteria

The Oxford MinION, the first commercial nanopore sequencer, is also the first to implement molecule-by-molecule real-time selective sequencing or “Read Until”. As DNA transits a MinION nanopore, real-time pore current data can be accessed and analyzed to provide active feedback to that pore. Fragments of interest are sequenced by default, while DNA deemed non-informative is rejected by reversing the pore bias to eject the strand, providing a novel means of background depletion and/or target enrichment. In contrast to the previously published pattern-matching Read Until approach, our RUBRIC method is the first example of real-time selective sequencing where on-line basecalling enables alignment against conventional nucleic acid references to provide the basis for sequence/reject decisions. We evaluate RUBRIC performance across a range of optimizable parameters, apply it to mixed human/bacteria and CRISPR/Cas9-cut samples, and present a generalized model for estimating real-time selection performance as a function of sample composition and computing configuration.

particularly in experiments with dramatically different positive and negative counts. Accordingly, we compare overall accuracies instead with the Matthews Correlation Coefficient (MCC) given by: Which ranges between -1 (completely inaccurate selection) and 1 (perfect selection). Note that all binary classifier performance metrics in Table 2 implicitly assume that skip decisions do not affect whether reads become fast5 files, potentially exaggerating the number of false positive threshold results, true negative decision results, and true negative RUBRIC results to an unknown degree if this assumption proves to be incorrect.  Figure S3: Overall quality score distributions for even pore sequence decision reads and odd pore (control) reads by experiment for lambda DNA, Cas9-excised E. coli rDNA, and mixed human/E. coli gDNA selection experiments. Datasets B1*, E2*, F*, and G* are filtered to remove periods of failed skipping as described in Section S-3, though the resulting difference to these distributions is minimal. Run G was basecalled with a later version of Albacore, so its scores may not be directly comparable to the other runs.

S-2. Threshold Filter Settings
Corresponding to mixed sample dataset G*, Supplementary Figure S4 provides the best available illustration of threshold filter operation (see also Supplementary Figure S9 (n)), highlighting in particular the tradeoffs between admitting as many mappable reads as possible while excluding non-sequence, uncallable, and unmappable reads. Pore current metrics were only logged for out-of-threshold reads prior to experiment G. As noted in the main article, threshold filter settings for a given run were determined retrospectively based on prior datasets, which is why they were not always well-optimized. Initial experiments utilized a threshold based on calculating the mean pore current for the RUBRIC evaluation window, but in later iterations a regression model showed that using standard deviation alone-as opposed to mean alone or a combination of mean and standard deviation-was the best predictor of whether a read would ultimately be mappable (data not shown). In addition to compensating for drift and flowcell-to-flowcell or run-to-run variations in pore current, the use of the standard deviation-based (rather than mean current-based) threshold should also help minimize the effect of programmed 5 mV sequencer voltage adjustments 2 on threshold efficiency over the course of longer runs. Figure S4: Distribution of measured even pore current standard deviations (in picoamps) for all mappable and unmappable fast5s as well as non-sequence (no fast5) reads for mixed dataset G* (skip-fail filtered). Dotted lines indicate the position of upper and lower threshold limits, with reads falling above or below this band classified as out-of-threshold. White outer violin profiles show the shape of the distribution independent of read count (fixed width), while inner gray profiles indicate relative counts across the seven categories for the run.

S-3. Read Until Event Sampler and Skip Failures
Two significant problems were observed in the course of working with the beta version of the Read Until API and Event Sampler. First, in many experiments the Event Sampler would simply stop communicating with RUBRIC for no apparent reason, sometimes hours before the end of the sequencing run, as reflected in Table 1 by the mismatch between MinION run and Event Sampler times for some experiments. Accordingly, all results discussed in this article reflect only data collected while the Event Sampler was communicating with RUBRIC, though complete datasets are available as noted in the Methods.
The second significant pathology observed in some RUBRIC experiments was the apparent intermittent failure of the DNA rejection / pore unblocking process to respond appropriately to RUBRIC skip decisions. While these skipfailures affected some runs in their entirety (data not shown), closer examination of other runs revealed discrete and readily identifiable time intervals during which the Read Until skipping mechanism had simply failed to reject any DNA, as shown in Supplementary Figure S5 for experiments B1, E2, F, and G. Because these periods of failed skipping potentially introduce uncharacteristic anomalies to our analyses, we distinguish full run data (B1, E2, F, and G) from datasets B1*, E2*, F*, and G* that have been time-filtered to eliminate all reads from the intervals during which skipping did not result in read rejection/truncation. Unless otherwise noted, aggregate results generally refer to these filtered datasets. Differences between filtered and unfiltered datasets are indicated in Tables 1 and 2, Supplementary Table S1, and Supplementary Figures S1, S9, and S10. In the case of pre-Safe Mode run B1 (Supplementary Figure S5 (a)), only the first 11 minutes of the run exhibited effective skipping, after which a catastrophic resource limitation or communication failure caused the vast majority of even pore reads to time-out of the decision process while decision times spiked as high as 40 seconds. Run E2 (Supplementary Figure S5 (b)) saw a skipping failure around the 200 minute mark that appeared to be uncorrelated to any significant change in either decision times or the incidence of undecided, timeout, or unsampled reads, though the Event Sampler failed altogether six minutes later. Run F (Supplementary Figure S5 (c)) showed two short intervals lacking skip-truncation around the 7.7 hr and 13.3 hr marks, the first lasting about 5 minutes and the second preceding final Event Sampler failure about 30 minutes later, but neither appeared to be associated with unusual decision times or undecided/unsampled reads. Lastly, experiment G (Supplementary Figure S5 (d)) exhibited perhaps the most complex behavior related to skip failures. In this run, three distinctly recognizable intervals of non-truncated skipped reads are present, commencing at the 31 min mark (~5 min duration), the 60 min mark (~1 min duration), and the 63 min mark, with the last interval again immediately preceding the failure of the Event Sampler 11 minutes later. The first and last periods of skipping failure are similar and both encompass a brief introductory period of decision times skewing shorter with reduced unsampled read counts followed by a return to the typical decision time distributions and increased unsampled read density. The second, shortest, period without skipping correlates to both much longer decision times (exceeding 8 s) and a high incidence of reads that time-out of the RUBRIC decision process. While some of these failure modes may be precipitated by reaching the computing resource limits of either the MinKNOW laptop or RUBRIC desktop, others appear to be attributable to errors or instabilities in the beta implementation of the Read Until API (v1) and Event Sampler themselves.

S-4. Pore Lifetime
One significant question regarding the viability of RUBRIC-style real-time selection is whether the act of repeatedly reversing pore polarity (unblocking) to reject non-target DNA degrades pore performance over time or leads to accelerated pore attrition. Supplementary Figure S6 addresses this question, showing even and odd active pore counts for each experiment as a function of time. While nearly all runs end with fewer even pores than odd pores, the rate of pore attrition (slope of the line) is remarkably similar for both skipping and non-skipping pores. Interestingly, large step reductions in even pore counts occur every 2 hours, perhaps coinciding with the 2-hour, 5 mV voltage adjustment interval of the sequencer 2 or the multiplex (mux) group changeover, although it was our understanding that the latter occurs on a significantly longer interval (~24 hours). If a mux changeover is to blame, this may indicate that the act of unblocking the active pore in each bank of four degrades the other inactive pores in that set, a fact which is only revealed when the next group of pores is activated. Another interesting observation: the period of skip failure visible near the 28,000 sec mark in Supplementary Figure S5 (c) appears to coincide with the large decrease in experiment F active pore count shown in Supplementary Figure S6. It is also worth noting that in our experience with the current generation of MinION flowcells, pore attrition-and not library depletion-appears to be the main factor contributing to reduced read counts over time, at least for runs less than 24 hr in duration.

S-5. Frozen Libraries
As noted above, the library for experiment C was made at the same time as the B1/B2 library and frozen at -20 °C for 1 day before use. Similarly, E1/E2 was prepared with D and frozen for 2 days before use. Supplementary Figure S3 shows no significant differences in the odd read quality score distributions of these two fresh/frozen pairs. As Table 2 shows, the frozen libraries do appear to yield substantially fewer odd reads per pore per minute, a pattern that also persists on a likely more relevant read/pore-min/(ng of input DNA) basis, showing a 45% decrease between runs B2 and C and a 53% decrease for E1 versus run D. Despite these differences in read rate/yield, comparing Supplementary Figure S10 (e) to (f) and Supplementary Figure S10 (g) to (h) shows no evidence of significant DNA fragmentation or read length bias obviously attributable to freezing. In fact, the run C size distribution shows a greater proportion of longer reads than B2. These results suggest that preparing and freezing libraries for later sequencing can be done without dramatically altering library quality or content, though sequence coverage and overall throughput may suffer, perhaps due to damage/degradation of the DNA-tethermotor complex. Figure S7: RUBRIC skip decision time and skip decision latency distributions for real-time selective sequencing experiments. Skip latency is calculated as the difference between the duration of a skipped read and its RUBRIC decision time, corresponding to the sum of any delays in the read skipping timeline that occur either before or after the RUBRIC process itself. Negative "latency" numbers indicate reads that likely received skip decisions after they had already departed the nanopore.

S-6. RUBRIC Decision Times & Read Until Latency
Supplementary Figure S7 shows the RUBRIC decision time distributions and calculated skip decision latency times for each experiment described in the article. Decision times are primarily dependent on computing resource availability, the size of the evaluation window that is to be basecalled in real-time, and the size of the target reference sequence used for LAST alignment. Because all reads receiving a decision go through the same basecall and alignment process, there is no difference between skip and sequence decision times. Skip latency, as shown in the figure, is the difference between the Albacore-reported read duration for a given skip decision read and its corresponding RUBRIC decision time. This latency likely represents the effect of 1) any delay between DNA docking at a pore and the Event Sampler reporting that event to RUBRIC and/or 2) any delay between RUBRIC communicating a skip decision and MinKNOW reversing pore polarity. Reads with a duration shorter than the RUBRIC skip decision time (appearing as negative "latencies" on the chart) most likely received decisions after they had already departed the nanopore. Depending on the balance between fragment size and decision time, shortening the RUBRIC queue timeout period may reduce the incidence of such post hoc decisions, conserving decision process computing resources at the expense of increasing the number of undecided (timeout) reads. As Supplementary Figure S7 indicates, with the exception of run C, which had a number of unusually long decision times, skip latency times were both substantially longer than RUBRIC decision times and fairly narrowly distributed around a mean of about 3 sec. Unclear is how these latency times relate to the observed 2 second duration of pore unblocking 2 . For a sequencer that is nominally operating at a DNA translocation rate of 450 bases/s, these internal latencies impose a lower limit on the size of DNA fragments that can be effectively selected, regardless of RUBRIC decision time performance or additional optimization.

S-7. Estimating the Limits of Real-Time Selective Sequencing Performance
Based on our work characterizing the performance of the RUBRIC method, we now propose a simple model providing a means to estimate the generalized performance limits of real-time selection processes like RUBRIC. Central to our analysis is the recognition that on average, the amount of time that a sequencing DNA strand occupies a given nanopore is equal to its length divided by the translocation rate, nominally 450 bases/s with current technology. Accordingly, we posit that the primary benefit of selective sequencing is that it allows some read types to be processed more quickly (i.e., when skipped) than they would be in the absence of selection.
For simplicity, we address three primary read types encountered by the nanopore: 1) reads corresponding to target sequence that are to be sequenced, 2) non-target or background-mapping reads that are to be skipped, and 3) non-sequence or open pore reads that, despite being registered by the Event Sampler, never translate into fast5 sequence files. As noted in the main article, non-sequence reads are believed to reflect the amount of time that a pore is unoccupied, sampled at somewhat regular intervals. Based on our observations, these nonsequence reads can account for 85-99% of the total reported read population and 65-98% of all active pore time ( Table 2, Supplementary Table S1), with the remainder of active pore time split between target and non-target reads in essentially the same ratio that their sequences exist in the library. It should be noted that while unmapped reads likely originate as either low quality target or background sequence, they are counted as nontarget reads for modeling and performance assessment purposes.
Within a given run, the observed proportions of read types in the read pool reflect the underlying probabilities that an unoccupied nanopore will encounter a particular kind of read, probabilities that themselves reflect the composition of the library (target:background ratio, DNA concentration) and are therefore constant and independent of any selection method. With these assumptions in mind, considering first the case of selective sequencing (sel), the total number of target (t), background (bg), and non-sequence (ns) reads obtained in an experiment is given by Dividing by Nsel we express each read category as its fraction (f) of the total read population upstream of the decision process 1 = _ + _ + _ = + + Importantly, we reiterate that while fast5 counts and proportions downstream may be affected by selection in ways that are hard to predict a priori, selection has no effect on these purely library-dependent upstream read fractions. In the bounding case of perfect selection, all target reads will be sequenced and all non-target reads skipped, each requiring on average tt_seq and tskip seconds of pore-time, respectively, per read. Generally, tt_seq approximates the average target fragment length divided by the average pore translocation rate (e.g., 450 bases/s), while tskip reflects the total time required for the selection method to assess and implement the skip decision (e.g., RUBRIC decision time plus MinKNOW latencies). We likewise define a characteristic average time tns associated with non-sequence reads, which are reported by the Event Sampler as discrete events even though they frequently represent subdivisions of continuous intervals of pore vacancy. The cumulative pore-time (T) needed to obtain all reads in the selective sequencing scenario is therefore Note that this quantity represents the aggregate sum of all active pore time (not run time), effectively representing the time required for a given population of active pores to process a given population of reads at the pore occupancy/vacancy rate implied by the proportions of target, background, and non-sequence reads. For the case of no selection (subscript 0), e.g., odd control channels operating in parallel with even RUBRIC channels but sequencing everything, the total read count of the non-selecting experiment is given as in equation (1) by We can likewise express the cumulative pore time for the non-selection case as: For randomly fragmented target and non-target DNA from the same source, tbg_seq = tt_seq, whereas the average sequence time for target and background fractions could be different if their size distributions were substantially different (e.g., amplicons of different lengths, fragments produced by targeted cuts, etc.). Note that the typical/average duration of non-sequence reads tns within a run is presumed to be independent of selection, an approximation that is substantially supported by the empirical data and that may reflect the regular sampling interval of the Event Sampler. In practice, tns is inferred from the total (odd) active pore time not occupied by fast5-producing reads divided by the count of reads without fast5s. 1 Integrated active pore-time for odd numbered pores 2 Integrated active pore-time for even numbered pores 3 Total (odd channel) active pore time not occupied by fast5-producing reads (i.e., empty/vacant pore time) 4 Absolute enrichment ratio predicted by equation (6) or equivalently equation (13) 5 Empirically observed enrichment (throughput) ratio = even sampled read count / odd sampled read count normalized by total even and odd active pore times, respectively.
* Dataset time-filtered to eliminate reads from periods of failed skipping, see Supplemental Section S-3.
Assuming equivalent pore-time utilized by both selecting (e.g., RUBRIC) and non-selecting (control) pores, as in the case of a substantially equal number of even and odd channels operating concurrently for an equal amount of sequencing run time, we equate Tsel in equation (3) and T0 in equation (5) to solve for the enrichment ratio: This ratio represents the maximum increase in read throughput theoretically obtainable by real-time selection for a particular library (determining ft, fbg, fns, tt_seq, tbg_seq, and tns) and computing setup (in part determining tskip).
Because read fractions/proportions are fixed as noted above, this ratio therefore also represents the upper bound on the absolute (numerical) enrichment of target reads achievable via selection. Note that equation (6) only applies where tbg_seq ≥ tskip. If fragment-length dependent tbg_seq is less than tskip, then skipping will have no effect and equation (6) behaves as though tskip = tbg_seq, producing a ratio of unity. For the idealized case of 100% pore occupancy, the non-sequence terms drop out, and for target and background reads with equivalent size distributions (tt_seq = tbg_seq), equation (6) simplifies to Which, in the limiting case of instantaneous skipping (tskip = 0), further simplifies to 1 + fbg/ft.
As equation (6) illustrates, deviation of the enrichment ratio away from unity is solely a consequence of the difference between tbg_seq and tskip scaled by the non-target read fraction fbg. Accordingly, the greatest benefit from selection is obtained in cases where the fraction of background reads is comparatively large and/or tbg_seq >> tskip, corresponding to long background reads with pore transit times substantially longer than the time required for the selection process to produce and execute a skip decision. For cases involving high non-sequence read fractions (~90%) as seen in the experiments detailed here, the benefit of selection is significantly muted because the time saved by skipping only affects roughly 2/3 of 10% of all reads, yielding a predicted maximum enrichment ratio of about 1.18 for experiment B2, assuming perfect selection. Figure S8 plots the value of the enrichment ratio derived from equation (6) across a range of parameters assuming equivalent target and background size distributions, while Table S1 provides model parameters determined for the runs in this article. Figure S8: Bounding limits on enrichment / throughput enhancement performance for selected library and computing parameters evaluated as in equation (6) While equation (6) is useful in analyzing existing Read Until run data, it is not ideal for predicting selection performance a priori or from standard MinKNOW sequence data, e.g., without the benefit having already obtained Event Sampler-reported non-sequence read counts via RUBRIC. Accordingly, we develop a modified formulation of the enrichment ratio that can be derived entirely from the output of a typical MinION sequencing run (nonselecting) performed with a representative library of interest. In the preceding analysis, we addressed Event Sampler-derived read counts (N, n) including those that did not yield fast5s. Here, we focus on MinKNOW-output fast5 counts exclusively (M, n). Analogous to equations (1) and (2) above, the total target and background counts for non-selecting and selecting runs are given by Like the read fractions (f) described above, sequence file fractions (g) are likewise presumed to be constant and independent of selection, conforming to gt/gbg = ft/fbg. Note that while the relevant counts of sequence-producing reads in the non-selection (odd pore) case can be derived directly from MinKNOW data, the true number of skip decision fast5s resulting from selection will not necessarily be knowable in advance due to the uncertainty about when/whether MinKNOW creates fast5s for skipped reads. As a consequence, here nbg_sel is understood as the count of skip decision reads upstream of the fast5/no fast5 determination rather than the final observed count of skip decision fast5s.
Next, we express the total pore time of a non-selecting run as the sum of all target read, background read, and non-sequence or open pore time, and ultimately the average durations of target and background reads as in equation (5) above: Where Tns_0 is the total non-sequence or open pore time. Note that all these parameters can be determined empirically from a given non-selecting model sequencing experiment: target and background fast5 counts, average fragment lengths/durations, total active pore time, and aggregate pore vacancy time. For the case of selective sequencing, equation (10) becomes Unlike equation (10), here the non-sequence term is not obviously determined a priori from quantities in a nonselecting model sequencing experiment. Equating T0 and Tsel from equations (10) and (11) as in the preceding analysis and rearranging, we obtain the enrichment ratio Which is very similar in form to equation (6). The term, Tns_0/M0, in the numerator is a constant readily obtained empirically from a model MinION run, reflecting the average amount of open pore time per fast5-producing read. While read counts (n) and associated time fractions will change as a consequence of selection, read and fast5 proportions (f and g) are not affected by selection. Analogous to our treatment of the non-sequence read fraction product (fnstns) in the derivation of equation (6), which was found to be largely consistent regardless of selection, we argue that the same should be true of the average open pore time per fast5, allowing us to equate Tns_0/Mo = Tns_sel/Msel and revise equation (12) Here, the enrichment ratio is expressed solely in terms of parameters that can be derived from a sample nonselecting sequencing experiment, with the exception of tskip, which must be assumed. In practice, empirically determined Tns_sel/Msel can differ significantly from Tns_0/Mo due to the unpredictable effect of skipping on fast5 creation, so the assumed equality of these terms again reflects an idealized condition immediately upstream of the decision process. As in equation (6), equation (13) For the further idealized case of instantaneous skipping with no latency (tskip = 0), the expression collapses to 1+ gbg/gt.     (target, non-target, unmapped) and their fate as a function of RUBRIC selection applied to even numbered pores. Figure S10 (i-p): Read length histograms for lambda DNA experiment E2 and filtered dataset E2*, Cas9-exised rDNA experiment F and filtered dataset F*, and human/E. coli experiment G and filtered dataset G*, illustrating the distribution of different read types (target, nontarget, unmapped) and their fate as a function of RUBRIC selection applied to even numbered pores.