Dynamic, adaptive sampling during nanopore sequencing using Bayesian experimental design

Nanopore sequencers can select which DNA molecules to sequence, rejecting a molecule after analysis of a small initial part. Currently, selection is based on predetermined regions of interest that remain constant throughout an experiment. Sequencing efforts, thus, cannot be re-focused on molecules likely contributing most to experimental success. Here we present BOSS-RUNS, an algorithmic framework and software to generate dynamically updated decision strategies. We quantify uncertainty at each genome position with real-time updates from data already observed. For each DNA fragment, we decide whether the expected decrease in uncertainty that it would provide warrants fully sequencing it, thus optimizing information gain. BOSS-RUNS mitigates coverage bias between and within members of a microbial community, leading to improved variant calling; for example, low-coverage sites of a species at 1% abundance were reduced by 87.5%, with 12.5% more single-nucleotide polymorphisms detected. Such data-driven updates to molecule selection are applicable to many sequencing scenarios, such as enriching for regions with increased divergence or low coverage, reducing time-to-answer.

1 Supplementary methods

Defining priors of genotype probabilities and updating posteriors after observing sequencing reads
Priors of genotype probabilities. In the simplest case of a haploid genome without indels, we define the prior of reference nucleotide $b_R$, with $b_R \in B$ and $B = \{A, C, G, T\}$, at position $i$ as $\pi_i(b_R) = 1 - \theta$, with $\theta$ the genetic diversity of the considered population. Conversely, $\pi_i(g) = \theta/3$ if $g \neq b_R$, with $g \in G$ and $G = B$ in this case.
When considering diploid sequenced genomes, we still assume a haploid reference genome, with the reference nucleotide at a given position denoted $b_R$. Equivalently to the main text, the set of possible genotypes $G$ instead consists of the unordered pairs $g = \{b_1, b_2\}$, with $b_1, b_2 \in B$.
For an unphased genome without indels, we define $\pi_i(\{b_R, b_R\}) = 1 - \theta$, and $\pi_i(\{g, g\}) = p_{\mathrm{homo}}\,\theta/3$ if $g \neq b_R$, with $p_{\mathrm{homo}}$ being the proportion of site differences from a reference that are expected to be homozygous, and $\pi_i(\{g, b_R\}) = (1 - p_{\mathrm{homo}})\,\theta/3$ for $g \neq b_R$.
Updating posteriors. In the main text we showed how to calculate the posterior probability $f_i(g|D)$ of genotype $g$ at position $i$, conditional on data $D$ (Eq. 1). For completeness, here we provide details on the observation probabilities $\phi$, and further details on how to update the posterior distribution after observing additional data.
First, if observed data $D$ contains $n$ reads covering position $i$ with bases $d_{j,i}$ for $j = 1 \ldots n$, we denote the base from a new hypothetical read at position $i$ by $d_{n+1,i}$. We then represent $D'$ as the union of $D$ with the new hypothetical read, so that $D'$ contains $n+1$ reads covering position $i$, with bases $d_{j,i}$ for $j = 1 \ldots n+1$. After observing the new read we update the prior genotype probabilities $\pi_i(g)$ to get the posterior probabilities:

$$f_i(g|D') = \frac{\pi_i(g) \prod_{j=1}^{n+1} \phi(d_{j,i}|g)}{Z_i(D')} .$$

As in the main text, $Z_i(D)$ represents a normalising constant ensuring that the posterior probabilities at site $i$ sum to 1. $\phi(d_{j,i}|g)$ is the probability of calling base $d_{j,i}$ assuming genotype $g$ at position $i$, and will depend on assumptions about the probabilities of observing errors. For example, for a haploid genome without indels we define

$$\phi(d|g) = \begin{cases} 1 - e & \text{if } d = g \\ e/3 & \text{if } d \neq g , \end{cases}$$

where $e$ denotes the per-base substitution error probability, i.e. any position in a read has probability $e$ of representing a wrong nucleotide. In the scenario of an unphased diploid genome without indels we instead consider

$$\phi(d|\{b_1, b_2\}) = \frac{\phi(d|b_1) + \phi(d|b_2)}{2} .$$
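To make this concrete, here is a minimal sketch of the haploid posterior update (our own code and naming, not the BOSS-RUNS implementation):

import numpy as np

BASES = "ACGT"

def phi_haploid(d, g, e=0.06):
    # Probability of calling base d given true genotype g,
    # with per-base substitution error probability e.
    return 1.0 - e if d == g else e / 3.0

def site_posterior(ref_base, observed_bases, theta=0.01, e=0.06):
    # Prior pi_i(g): 1 - theta for the reference base, theta/3 otherwise.
    prior = np.array([1.0 - theta if g == ref_base else theta / 3.0
                      for g in BASES])
    # Likelihood: product of phi over all observed bases at this site.
    lik = np.array([np.prod([phi_haploid(d, g, e) for d in observed_bases])
                    for g in BASES])
    unnorm = prior * lik
    return unnorm / unnorm.sum()   # division by Z_i(D)

# Example: two reads agree with the reference, one differs.
post = site_posterior("A", ["A", "A", "C"])

Updating with one additional base $d_{n+1,i}$ then amounts to multiplying the unnormalised posterior by $\phi(d_{n+1,i}|g)$ and renormalising.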

Incorporating deletions in the model
In this section we discuss how deletions, appearing either as mutational events or as sequencing errors, are incorporated into our framework. Insertions and rearrangements are not considered due to their increased complexity. If we include deletions, the set of possible observed bases for haploid genomes becomes $B = \{A, C, G, T, -\}$, with the genotypes $G = B$. For an unphased diploid genome, $g \in G$ instead becomes one of the 15 unordered pairs $g = \{b_1, b_2\}$, with $b_1, b_2 \in B$.
In order to define priors on these genotypes, we use a parameter $r$ to express how often variation in the form of a deletion is observed relative to SNPs. Across a haploid sequenced genome of length $N$ we expect $N\theta$ substitutions from the reference and $rN\theta$ deleted bases. In our experiments we use values $\theta = 0.01$ and $r = 0.4$, in line with values reported for human populations and microbiomes [6,7].
Taking deletions into account requires modification of the prior genotype probabilities $\pi$ and the sequencing probabilities $\phi(d_{j,i}|g)$. For a haploid genome the prior at position $i$ with reference nucleotide $b_R$ becomes

$$\pi_i(g) = \begin{cases} 1 - (1+r)\theta & \text{if } g = b_R \\ \theta/3 & \text{if } g \in B \setminus \{b_R, -\} \\ r\theta & \text{if } g = - . \end{cases}$$

For a diploid unphased genome, we define

$$\pi_i(\{g_1, g_2\}) = \begin{cases} 1 - (1+r)\theta & \text{if } g_1 = g_2 = b_R \\ p_{\mathrm{homo}}\,\theta/3 & \text{if } g_1 = g_2 \notin \{b_R, -\} \\ (1 - p_{\mathrm{homo}})\,\theta/3 & \text{if } g_2 = b_R \text{ and } g_1 \notin \{b_R, -\} \\ p_{\mathrm{homo}}\, r\theta & \text{if } g_1 = g_2 = - \\ (1 - p_{\mathrm{homo}})\, r\theta & \text{if } g_2 = b_R \text{ and } g_1 = - . \end{cases}$$

Sequencing probabilities $\phi$ are modified in the following way. As before, $e$ represents the error probability for substitutions, and in all applications we use $e = 0.06$. We also define $e_-$ as the probability of a deletion sequencing error, i.e. observing a deletion instead of a base, set to $e_- = 0.05$ in our applications; finally, $e_+$ represents the probability that an actual deleted base in the sequenced genome is misread as a nucleotide, set to $e_+ = 0.1$ in our applications. For a haploid genome the sequencing probabilities then become

$$\phi(d|g) = \begin{cases} 1 - e - e_- & \text{if } d = g \text{ and } g \neq - \\ e/3 & \text{if } d \neq g,\ d \neq -,\ g \neq - \\ e_- & \text{if } d = - \text{ and } g \neq - \\ 1 - e_+ & \text{if } d = g = - \\ e_+/4 & \text{if } d \neq - \text{ and } g = - . \end{cases}$$

In the scenario of an unphased diploid genome we instead have:

$$\phi(d|\{b_1, b_2\}) = \frac{\phi(d|b_1) + \phi(d|b_2)}{2} . \tag{S.6}$$
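A corresponding sketch of the deletion-aware haploid terms (again our own naming; e_del and e_ins stand in for $e_-$ and $e_+$):

def prior_haploid_del(g, ref_base, theta=0.01, r=0.4):
    # Prior over B = {A, C, G, T, -}: reference, substitution, or deletion.
    if g == ref_base:
        return 1.0 - (1.0 + r) * theta
    if g == "-":
        return r * theta
    return theta / 3.0

def phi_haploid_del(d, g, e=0.06, e_del=0.05, e_ins=0.1):
    # e: substitution error; e_del: true base read as a deletion;
    # e_ins: true deletion misread as one of the four nucleotides.
    if g != "-":
        if d == g:
            return 1.0 - e - e_del
        if d == "-":
            return e_del
        return e / 3.0
    return 1.0 - e_ins if d == "-" else e_ins / 4.0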

Calculating KL-divergence to express positional site-wise score
In the main text we use the Kullback-Leibler divergence between the posterior probability distributions before and after observing a new read ($f_i(g|D)$ and $f_i(g|D')$, respectively) at site $i$ as a measure of potential information gain, where $D$ denotes already observed data and $D'$ represents augmentation of the data by one sequencing read, $d_{n+1,i}$. Additionally, we take the probability of observing any base in that read into account (main text, Eqs. 2 and 3). Here we present a derivation and practical form for calculating the KL divergence between the two aforementioned posterior probability distributions. We give some examples of positional site-wise scores $S_i$ resulting from different coverage patterns in Suppl. Fig. 6.
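Consistent with Eqs. 2 and 3 of the main text, the resulting score can be written as the KL divergence averaged over the posterior predictive distribution of the next base (our rendering of the quantity, in the notation defined above):

$$S_i = \sum_{d_{n+1,i} \in B} P(d_{n+1,i}|D) \sum_{g \in G} f_i(g|D') \log \frac{f_i(g|D')}{f_i(g|D)} , \qquad P(d_{n+1,i}|D) = \sum_{g \in G} \phi(d_{n+1,i}|g)\, f_i(g|D) .$$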

Practical calculation of the expected benefit of sequencing reads
In the main text we defined a positional benefit score $S_i$ for each position $i$ of a genome, and combined the scores of multiple positions and the distribution of previously observed read lengths into an expected benefit of reads $U_i$, assuming that each read maps to a series of contiguous bases in the reference genome (see main text, Eq. 5).
As part of Eq. 5, we define $S^l_{i,1}$ to be the sum of $l$ consecutive $S_j$ score values starting at position $i$, that is, the score of a forward-oriented read of length $l$ starting at position $i$:

$$S^l_{i,1} = \sum_{j=i}^{i+l-1} S_j . \tag{S.8}$$

Similarly, for a reverse-oriented read:

$$S^l_{i,0} = \sum_{j=i-l+1}^{i} S_j . \tag{S.9}$$

Since we do not know the length of a sequencing read in advance, we account for the uncertainty in $l$. For this, we assume a single distribution of fragment lengths that applies to all fragments irrespective of genomic origin or orientation. We denote the fragment length distribution by $L(l)$ for lengths $l = 1 \ldots N$, with mean $\lambda = \sum_{l=1}^{N} L(l)\, l$. In our real-time applications we use a truncated normal distribution with parameters $\lambda = 6{,}000$ and $sd = 4{,}000$ as a prior. Throughout the sequencing experiment this prior distribution is updated with the observed read lengths of full-sized, accepted reads to guarantee accurate calculation of the expected benefit of sequencing reads. This is especially important when targeting sparsely distributed variant sites, since the expected information gain of a read depends on whether it is long enough to cover them. We observed that this adaptive, empirical read length distribution is learned within the first minutes of sequencing.
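One simple way to realise such an adaptive length distribution (a sketch under our own naming, not the BOSS-RUNS implementation; the pseudo-count weight of 1,000 is an arbitrary illustration):

import numpy as np
from scipy.stats import truncnorm

def length_prior(max_len, mean=6000.0, sd=4000.0):
    # Truncated normal prior over read lengths 1..max_len.
    a, b = (1.0 - mean) / sd, (max_len - mean) / sd
    lengths = np.arange(1, max_len + 1)
    pdf = truncnorm.pdf(lengths, a, b, loc=mean, scale=sd)
    return pdf / pdf.sum()

def update_length_dist(pseudo_counts, observed_lengths):
    # Blend the prior (as pseudo-counts) with observed full-length reads.
    counts = pseudo_counts.copy()
    for l in observed_lengths:
        if 1 <= l <= len(counts):
            counts[l - 1] += 1.0
    return counts / counts.sum()

# Example: seed with the prior, then update with two accepted read lengths.
max_len = 50_000
pseudo = length_prior(max_len) * 1000.0
L = update_length_dist(pseudo, [5200, 7100])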
Since there will be lower and upper limits on the length of fragments, it is computationally convenient to define $D_L$ to be the domain of $L$, i.e. the set of values of $l$ with $L(l) > 0$. The expected benefit of a forward read starting at position $i$ is then

$$U_{i,1} = \sum_{l \in D_L} L(l)\, S^l_{i,1} .$$

Analogously, for reverse reads:

$$U_{i,0} = \sum_{l \in D_L} L(l)\, S^l_{i,0} .$$

Calculating $U_{i,1}$ and $U_{i,0}$ for all genome positions with a naive algorithm would require time proportional to $|D_L|$ per position. As $U_{i,1}$ needs to be calculated for each $i$, the total cost for the whole genome would be of the order $O(N \times |D_L|)$, which would be excessively slow. We therefore efficiently and accurately approximate $U_i$, with total computational cost linear in genome size, using an approach based on approximating the complementary cumulative length distribution $\widetilde{CL}(l) = \sum_{l' \geq l} L(l')$ with a piece-wise constant function.
Assuming that $\widetilde{CL}(l)$ is a piece-wise constant function means that there are $\eta$ values $1 = l_1 < l_2 < \ldots < l_\eta$ at which $\widetilde{CL}$ changes, taking the constant value $c_k = \widetilde{CL}(l_k)$ on each interval $[l_k, l_{k+1})$, with the convention that $c_{\eta+1} = 0$ and $l_{\eta+1}$ is the first length beyond the support of $L$. Equivalently to the case without approximation, we calculate the benefit of the first position:

$$U_{1,1} = \sum_{k=1}^{\eta} c_k \sum_{j=l_k}^{l_{k+1}-1} S_j .$$

In general, if we know $U_{i,1}$ we can calculate $U_{i+1,1}$ as

$$U_{i+1,1} = U_{i,1} - c_1 S_i + \sum_{k=2}^{\eta+1} (c_{k-1} - c_k)\, S_{i + l_k - 1} ,$$

so that each subsequent position requires only $O(\eta)$ operations. The same approach can be used for efficiently calculating the expected benefit of reverse reads ($U_{i,0}$). In our experiments we use a piece-wise constant function with $\eta = 11$ different values.
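A compact sketch of this linear-time scheme (our own implementation, assuming the piece-wise constant $\widetilde{CL}$ is supplied as breakpoints, including the end of the support, and their values):

import numpy as np

def expected_benefit_forward(S, breakpoints, values):
    """U_{i,1} for all i via the sliding recurrence, in O(N * eta).

    breakpoints: [l_1=1, l_2, ..., l_eta, l_end]; the weight of covering
    the j-th base of a read (1-based) is values[k] for l_k <= j < l_{k+1},
    and 0 for j >= l_end.
    """
    N = len(S)
    Sp = np.concatenate([np.asarray(S, float), np.zeros(breakpoints[-1])])
    c = list(values) + [0.0]                 # c_{eta+1} = 0 beyond the support
    U = np.zeros(N)
    # Benefit of the first position, computed directly.
    U[0] = sum(c[k] * Sp[breakpoints[k] - 1 : breakpoints[k + 1] - 1].sum()
               for k in range(len(values)))
    # Recurrence: U_{i+1} = U_i - c_1 S_i + sum_k (c_{k-1} - c_k) S_{i + l_k - 1}.
    for i in range(N - 1):
        U[i + 1] = U[i] - c[0] * Sp[i]
        for k in range(1, len(breakpoints)):
            U[i + 1] += (c[k - 1] - c[k]) * Sp[i + breakpoints[k] - 1]
    return U

# Example: reads of length exactly 2, so U_i = S_i + S_{i+1}.
print(expected_benefit_forward([1.0, 2.0, 3.0], [1, 3], [1.0]))  # [3., 5., 3.]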

Details on deriving the decision framework
In the main text we defined a strategy $S$ to be a function $I^S_{i,o}$ returning 0 or 1 for reads from any position $i$ in a genome with orientation $o$. Thus, $I^S_{i,1} = 0$ indicates that a forward fragment starting at position $i$ should be rejected, while $I^S_{i,0} = 1$ indicates that a reverse fragment starting at position $i$ should be read to its end, and so on. Here, we present some more details on the calculation of the decision strategy and some generalisations. We say that $S$ includes $(i, o)$ if $I^S_{i,o} = 1$. Since we do not know $S$ a priori, our aim is to determine an optimal strategy $S$ given the current data $D$.
Given the definitions in the main text and above, the expected benefit of a DNA fragment of orientation $o$ starting at position $i$ is

$$U^S_{i,o} = \begin{cases} U_{i,o} & \text{if } S \text{ includes } (i,o) \\ S^\mu_{i,o} & \text{otherwise,} \end{cases}$$

with $S^\mu_{i,o}$ denoting the expected benefit of the initial $\mu$ bases of a read, calculated according to Suppl. Eqs. S.8 and S.9. This accumulation of benefit is achieved in time

$$t^S_{i,o} = \begin{cases} \alpha + \lambda_{i,o} & \text{if } S \text{ includes } (i,o) \\ \alpha + \mu + \rho & \text{otherwise,} \end{cases}$$

with $\alpha$ the time to acquire a new fragment, $\mu$ the length of the initial part of a read used for making a decision, $\rho$ the time cost of rejecting a fragment, and $\lambda_{i,o}$ the expected time to fully sequence a read from $(i, o)$.

Accounting for bias in the origin of sequencing reads. In the main text we present simplified equations assuming a uniform probability for the origin of DNA fragments. In reality, however, calculating the strategy-wise average time cost $\bar{t}_S$ and benefit $\bar{U}_S$ requires knowledge about how often fragments from certain positions $i$ and orientations $o$ are sequenced by pores.
In many cases, variation in this distribution could be ignored without much impact on the computed strategies, especially if there is little coverage bias, e.g. when sequencing input DNA from a single species without prior amplification. Sometimes, however, ignoring these probabilities could negatively influence the optimality of the decision strategy. For example, regions with very low coverage, i.e. negative coverage bias, will have high expected benefit but low probability of being covered by future reads. This could lead to over-rejection of fragments, because a large benefit is expected from areas where reads are unlikely to originate in the future.
Accordingly, in our implementation we generalise and account for bias in the origin of sequencing reads. For this, we use the notation $F_{i,o}$ to refer to the probability of a random fragment's first base mapping to position $i$ in orientation $o$, so that

$$\sum_{i=1}^{N} \sum_{o \in \{0,1\}} F_{i,o} = 1 .$$

In this case, the average benefit per fragment $\bar{U}_S$ of strategy $S$ becomes

$$\bar{U}_S = \sum_{i=1}^{N} \sum_{o \in \{0,1\}} F_{i,o}\, U^S_{i,o} ,$$

and its average fragment-wise cost $\bar{t}_S$ is given by

$$\bar{t}_S = \sum_{i=1}^{N} \sum_{o \in \{0,1\}} F_{i,o}\, t^S_{i,o} .$$

Note that in the simplest case, with no bias in read origin or orientation, $F_{i,o} = 1/(2N)$. Modeling and updating of the distribution of read origins $F_{i,o}$ is further detailed in Suppl. Sect. 1.6.
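A sketch of these strategy-level averages (our own code; the names U_full for $U_{i,o}$, U_mu for $S^\mu_{i,o}$ and lam for $\lambda_{i,o}$ are assumptions, not identifiers from the BOSS-RUNS codebase):

import numpy as np

def strategy_averages(accept, F, U_full, U_mu, lam, alpha, mu, rho):
    # accept: boolean array I^S_{i,o} over all 2N position-orientation pairs;
    # F: fragment-origin probabilities; U_full / U_mu: expected benefit of
    # full reads / of the first mu bases; lam: expected full-read time.
    benefit = np.where(accept, U_full, U_mu)
    time = np.where(accept, alpha + lam, alpha + mu + rho)
    U_bar = np.sum(F * benefit)
    t_bar = np.sum(F * time)
    return U_bar, t_bar, U_bar / t_bar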
Formalised procedure to find the optimal strategy. To find optimal strategies, we first rank all the $2N$ position-orientation pairs $(i, o)$ according to decreasing value of $U_{i,o} - S^\mu_{i,o}$ and index them such that $(i_1, o_1)$ takes the highest value, $(i_2, o_2)$ the next, and so on. For a given size $\sigma$, we then define $S_\sigma$ as the strategy with $I^{S_\sigma}_{i_s,o_s} = 1$ for all $s \leq \sigma$ and 0 otherwise; it is the optimal strategy of size $\sigma$. Starting with $\sigma = 0$, the empty strategy, we successively increase $\sigma$, at each stage testing whether $S_{\sigma+1}$ gives an improvement over $S_\sigma$. Once we reach a value $\sigma^*$ such that there is no further improvement, we have the optimal $S = S_{\sigma^*}$.
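A minimal sketch of this greedy scan (our own code, reusing the hypothetical strategy_averages helper from the previous sketch; the real implementation uses the efficiency improvements described below rather than recomputing the averages from scratch):

import numpy as np

def find_optimal_strategy(F, U_full, U_mu, lam, alpha, mu, rho):
    # Rank all pairs by decreasing marginal gain U_{i,o} - S^mu_{i,o}.
    order = np.argsort(-(U_full - U_mu))
    accept = np.zeros(len(F), dtype=bool)
    _, _, best_rate = strategy_averages(accept, F, U_full, U_mu,
                                        lam, alpha, mu, rho)
    for idx in order:
        accept[idx] = True
        _, _, rate = strategy_averages(accept, F, U_full, U_mu,
                                       lam, alpha, mu, rho)
        if rate <= best_rate:      # no further improvement: stop at sigma*
            accept[idx] = False
            break
        best_rate = rate
    return accept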

Accounting for variation in the distribution of sequencing reads
In Suppl. Sect. 1.5 we show how we account for variation in the origin of sequenced fragments by incorporating a distribution $F_{i,o}$. In most scenarios, simply incorporating observed read starting positions into an empirical distribution would suffice. However, there are occasions when this is not enough: specifically, when parts of the reference genome are not present in the sequenced sample, e.g. when sequencing diverged bacterial strains, some regions might not receive any coverage at all and would continuously dominate the ranking of yet-to-be-gained expected benefit.
We thus model variation in $F_{i,o}$ and estimate it from currently observed data using a Bayesian approach. Since the $F_{i,o}$ define a discrete multinomial probability distribution, it makes sense to choose a Dirichlet distribution, its conjugate prior, as the prior over the $F_{i,o}$ parameters. We expect, however, a non-negligible proportion of $F_{i,o}$ to be exactly zero (sites at which no read's mapping starts, for example due to deletions relative to the reference). Therefore, we use a mixed distribution of a point mass at 0 and a Dirichlet distribution as a prior for $F_{i,o}$. Under this mixture, any $F_{i,o}$ has a marginal prior distribution $P(F_{i,o} = p)$ given by

$$P(F_{i,o} = p) = p_0\, \mathbb{1}(p = 0) + (1 - p_0)\, \beta_{a,\,(2N-1)a}(p)\, \partial p ,$$

where $\partial p$ is an abuse of notation representing a differential in $p$: that is, while the probability of $F_{i,o}$ being exactly 0 is $p_0$, with probability $1 - p_0$ the prior is continuously distributed with a beta distribution (the marginal distribution of the Dirichlet distribution) with parameters $a$ and $(2N-1)a$, rescaled by $1 - p_0$. Analogously to $a$, we use previously observed information to define $p_0$.

Denoting by $C_{i,o}$ the number of observed read starts at $(i, o)$ out of a total of $C$ observed read starts, sites with $C_{i,o} > 0$ follow the usual conjugate update, with posterior expectation $(a + C_{i,o})/(2Na + C)$. Updating the posterior expectation is more complicated for sites where $C_{i,o} = 0$, since we must account for the possibility that $F_{i,o}$ is exactly 0. Conditional on $F_{i,o} \neq 0$, $C_{i,o}$ has a binomial likelihood, and we are restricting ourselves to a beta density for $F_{i,o}$; Bayes' theorem then gives

$$P(F_{i,o} = 0 \mid C_{i,o} = 0) = \frac{p_0}{p_0 + (1 - p_0)\, P(C_{i,o} = 0 \mid F_{i,o} \neq 0)} .$$

The only term above that needs further derivation is $P(C_{i,o} = 0 \mid F_{i,o} \neq 0)$. Using the definition of the density functions of the beta and binomial distributions, this is equal to

$$P(C_{i,o} = 0 \mid F_{i,o} \neq 0) = \frac{B(a,\, (2N-1)a + C)}{B(a,\, (2N-1)a)} ,$$

where $B(a, b)$ is the typical normalizing factor of the beta distribution density $\beta(a, b)$; that is, we do not use the Gamma function directly, due to potential overflow, but instead use the natural logarithm of the beta function to get the logarithm of $B$. Finally, the estimators $F_{i,o}$ are normalised to ensure summation to 1.
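A sketch of this update (our own code; scipy's betaln computes $\log B(a, b)$, avoiding the overflow mentioned above):

import numpy as np
from scipy.special import betaln

def posterior_F(C_io, C, N, a, p0):
    # Posterior expectation of F_{i,o} under the point-mass + Dirichlet prior.
    if C_io > 0:
        # Site was observed: the point mass at 0 is excluded.
        return (a + C_io) / (2 * N * a + C)
    # P(C_{i,o}=0 | F_{i,o} != 0): beta-binomial probability of zero counts.
    log_p_zero = betaln(a, (2 * N - 1) * a + C) - betaln(a, (2 * N - 1) * a)
    p_point = p0 / (p0 + (1 - p0) * np.exp(log_p_zero))
    # Mixture expectation: 0 with probability p_point, beta mean otherwise.
    return (1 - p_point) * a / (2 * N * a + C)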
In some scenarios, for example when sequencing only certain species from within a metagenomic sample, some reads will not map to the reference genomes considered. In such cases, we additionally account for the frequency of fragments in the input material that are not of interest.
Otherwise, if a species of interest constitutes a low percentage of the input DNA, our model would overestimate the expected benefit of a new read and reject more reads than it ideally should. To circumvent this issue, we estimate the frequency of reads not mapping onto the reference by recording the number of on- and off-target reads, and use that ratio for normalisation of the estimators $F_{i,o}$.

Proof of strategy optimality
Here we prove the results stated in the main text regarding the optimality of the proposed fragment selection strategies. We start by presenting two very general algebraic results that come in handy, represented here in terms of the variables we will use later in our proof. Assuming $t, T > 0$, the mediant inequality

$$\min\!\left(\frac{U}{T}, \frac{u}{t}\right) \leq \frac{U+u}{T+t} \leq \max\!\left(\frac{U}{T}, \frac{u}{t}\right) \tag{S.25}$$

holds, and

$$\frac{U+u}{T+t} \geq \frac{U}{T} \iff \frac{u}{t} \geq \frac{U}{T} \quad \text{(and likewise with strict inequalities).} \tag{S.26}$$

Now returning to optimal strategies, for greater generality we consider the case of multiple distinct reference chromosomes, potentially from different species. The only change needed is to refer to locations within a reference as $(k, i)$ to indicate their chromosome identifier $k$ as well as nucleotide location $i$; consequently, DNA fragments that are candidates for sequencing are referred to by triplets $(k, i, o)$ that indicate their chromosome, starting position within it, and orientation $o$. Now, we assume that the positions $(k, i, o)$ have been ranked in an ordered list $\imath_1, \imath_2, \ldots$ by decreasing benefit-to-cost ratio, except those for which $t_{k,i,o} \leq 0$, which we assume have been ranked first or, equivalently, their score set to $\infty$. We define a strategy $S$ to be better than strategy $S'$ (represented as $S \succ S'$) if and only if it has greater expected benefit per unit time: $\bar{U}_S/\bar{t}_S > \bar{U}_{S'}/\bar{t}_{S'}$. Next, we show that the best strategy is one that accepts positions $\imath_s$ with $s \leq s^*$ and rejects positions $\imath_s$ with $s > s^*$, for some value of $s^*$. We ignore for simplicity the effects of positions with equal rank and scores. We use reductio ad absurdum, starting from the assumption that the best strategy includes a position $\imath$ but not a position $\jmath$ with $u_\jmath/t_\jmath > u_\imath/t_\imath$. If we denote the expected per-fragment benefit of this strategy excluding $\imath$ by $U$, and the corresponding expected time by $T$, then the expected value of the best strategy is $(U + u_\imath)/(T + t_\imath)$. Then $(U + u_\imath)/(T + t_\imath) \geq U/T$ by the assumption of optimality, and so $u_\imath/t_\imath \geq U/T$ (Eq. S.26); $u_\jmath/t_\jmath > u_\imath/t_\imath$ by assumption, and hence $u_\jmath/t_\jmath > U/T$; and thus $u_\jmath/t_\jmath > (U + u_\imath)/(T + t_\imath)$ (Eq. S.25). This means that including position $\jmath$ would improve the best strategy (Eq. S.26 in the reverse direction), which is a contradiction.
We now show that if we start from the null strategy $S_0$ (i.e. the strategy that rejects all fragments), successively add positions $\imath_1, \imath_2, \ldots$ (creating the series of strategies $S_1, S_2, \ldots$), and stop at the first $s^*$ such that $S_{s^*} \succ S_{s^*+1}$, then $S_{s^*}$ is the optimal strategy. From above, we already know that one of the $S_s$ must be the best strategy, and obviously for each $s < s^*$ we have $S_{s^*} \succ S_s$. We only have to show that for each $\delta \geq 1$, $S_{s^*} \succ S_{s^*+\delta}$. This is true for $\delta = 1$ (definition of $s^*$), so thanks to Eq. S.26 we have $U_{s^*}/T_{s^*} > u_{\imath_{s^*+1}}/t_{\imath_{s^*+1}}$, where we represent the expected benefit and cost per fragment of strategy $S_{s^*}$ as $U_{s^*} = \bar{S}^\mu + \sum_{s=1}^{s^*} u_{\imath_s}$ and $T_{s^*} = \alpha + \mu + \rho + \sum_{s=1}^{s^*} t_{\imath_s}$, respectively. By definition we have $U_{s^*+\delta} = U_{s^*} + \sum_{j=1}^{\delta} u_{\imath_{s^*+j}}$ and $T_{s^*+\delta} = T_{s^*} + \sum_{j=1}^{\delta} t_{\imath_{s^*+j}}$, and therefore $S_{s^*} \succ S_{s^*+\delta}$ follows if $U_{s^*}/T_{s^*} > \sum_{j=1}^{\delta} u_{\imath_{s^*+j}} \big/ \sum_{j=1}^{\delta} t_{\imath_{s^*+j}}$, with the last step coming from Eq. S.26. Since we know that $U_{s^*}/T_{s^*} > u_{\imath_{s^*+1}}/t_{\imath_{s^*+1}}$, it is sufficient to prove that for any $\delta \geq 1$ we have $u_{\imath_{s^*+1}}/t_{\imath_{s^*+1}} \geq \sum_{j=1}^{\delta} u_{\imath_{s^*+j}} \big/ \sum_{j=1}^{\delta} t_{\imath_{s^*+j}}$. This is obviously true for $\delta = 1$, and we use the induction principle to prove it for any $\delta \geq 2$: assuming that $u_{\imath_{s^*+1}}/t_{\imath_{s^*+1}} \geq \big(\sum_{j=1}^{\delta-1} u_{\imath_{s^*+j}}\big) \big/ \big(\sum_{j=1}^{\delta-1} t_{\imath_{s^*+j}}\big)$, then from the fact that $u_{\imath_{s^*+1}}/t_{\imath_{s^*+1}} \geq u_{\imath_{s^*+\delta}}/t_{\imath_{s^*+\delta}}$ and from Eq. S.25 we obtain the required result.

Efficiency improvements required to ensure strategy optimality
To ensure optimality of dynamic sequencing strategies we require an efficient algorithm that can update the strategy in short intervals, in order to keep up with the data stream from the sequencing device. One of the bottlenecks is ranking sites by their expected benefit. We therefore conceived a fast algorithm based on approximating the expected benefit of reads by discretized values. To do this, we decompose the floating point benefit values into their significand and exponent components, and use the exponents to form a grid approximation while ignoring the significands. This way, instead of sorting $2N$ floats (for total reference genome size $N$), we tally the counts of integers, which is easily parallelizable. Consequently, this also means that for finding the size $\sigma$ of the decision strategy, i.e. the number of ranked positions from which to accept reads, we do not operate on a per-site basis, but instead test for an improvement of our optimality criterion after adding all sites from one point of the discretised grid at once.
Therefore, we reduce the number of considered instances from $2N$ sites to the number of points in the grid approximation. In reality the truly optimal threshold will likely lie between two points on the grid, so as a trade-off for computational speed we likely accept or reject reads from a few additional sites compared to a strategy calculated with an exact approach. This algorithm is described in Algorithm 1.
This procedure speeds up the calculation of new strategies but is still computationally impractical when considering large genomes or multiple species in a single experiment. In such settings, the real-time data stream from the sequencing device could outpace the generation of new decision strategies. To prevent this, we use a further approximation based on the assumption that neighboring sites will often have very similar values of expected benefit. This is justified by the fact that reads starting at some position will have a high probability of covering very similar consecutive sites as a read with the same orientation starting at a neighbouring site. Therefore, we reduce the resolution of the generated decision strategy by taking the sum of positional benefit scores in non-overlapping windows of size $w$, i.e. we calculate exact posterior probabilities and scores for each site, but obtain values of expected benefit for reads starting within a window of $w$ bases instead of calculating the benefit of reads starting at any position $i$. Subsequently, we also generate decisions for windows of $w$ bases instead of decisions for individual positions, i.e. we might use index $w$ in the indicator function $I^S_{w,o}$ to describe whether a read from that window should be accepted.

Algorithm 1: Finding approximate decision strategies. By using a grid approximation, we consider whether reads from points on the grid, i.e. all sites at one point of the grid instead of individual sites, should be added to the strategy. Since the number of grid points considered is typically much smaller than the number of sites, this leads to a significant reduction in the number of comparisons needed, in addition to avoiding the step of sorting a vector of $2N$ values.

Input: vector $U$ of expected benefits per site; vector $t$ of sequencing time cost per site
Output: decision strategy $S$

1   $U' \leftarrow$ scale expected benefit $U$ by $\max(U)$
2   $S, E \leftarrow$ discretize benefit by decomposing $U'$ into significands and exponents
3   $G \leftarrow$ dictionary storing counts of points in the grid approximation, with keys $k$
4   for $e$ in $E$ do                  // count occurrences to form grid approximation
5       $G_{|e|} \leftarrow G_{|e|} + 1$
6   end
7   $U^G \leftarrow$ vector for expected benefit of grid approximation
8   $K \leftarrow$ sorted vector of grid points (keys $k$ of $G$)
9   for $k$ in $K$ do                  // recreate approximate expected benefit of points in grid
10      $U^G_k \leftarrow 2^k \max(U)$
11  end
12  $\bar{U}_0 \leftarrow$ average benefit if all fragments are rejected (mean benefit of initial, length-$\mu$ fragment)
13  $\bar{t}_0 \leftarrow$ average time cost if all fragments are rejected (time to acquire a fragment, sequence $\mu$ bases and make a decision)
14  $t' \leftarrow$ average time cost of sites at grid points
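To make the procedure concrete, here is a compact Python sketch of the grid approximation (our own reconstruction of the full loop, including the threshold search that the prose above describes; it simplifies by assuming positive benefits and a uniform per-site time cost, and math.frexp provides the significand/exponent decomposition):

import math
from collections import Counter

def grid_strategy(U, t, U0_bar, t0_bar):
    # U, t: per-site expected benefit and time cost; U0_bar, t0_bar:
    # average benefit/time if all fragments are rejected.
    u_max = max(U)
    # Discretize: keep only the binary exponent of each scaled benefit.
    exponents = [math.frexp(u / u_max)[1] for u in U]
    grid = Counter(exponents)                 # counts per grid point
    points = sorted(grid, reverse=True)       # highest benefit first
    t_mean = sum(t) / len(t)                  # simplifying assumption
    U_bar, t_bar = U0_bar, t0_bar
    best_rate, threshold = U_bar / t_bar, None
    for k in points:
        # Approximate benefit of all sites at this grid point (listing line 10).
        U_bar += grid[k] * (2.0 ** k) * u_max
        t_bar += grid[k] * t_mean
        if U_bar / t_bar <= best_rate:        # criterion stops improving
            break
        best_rate, threshold = U_bar / t_bar, k
    return threshold

Reads are then accepted from all sites whose benefit exponent lies at or above the returned grid point.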

Sequencing ROIs of two species in the presence of abundance bias
In the main text we present a sequencing experiment of a microbial mixture where we are interested in the entire genome of every species. In contrast, we might only want to investigate a smaller proportion of a few genomes in a community. This could be, for example, in order to quickly interrogate the presence of antimicrobial resistance-associated (AMR) loci in a clinical setting. We chose to use the same microbial mock community with logarithmically distributed abundances (ZymoBIOMICS DNA Standard II D6311, Zymo Research), but focused our sequencing effort on AMR loci of the two most abundant species, L. monocytogenes and P. aeruginosa.
For this, we employed the same closely related but not identical reference genomes as described in the main text and used them to inform priors for the genotype probability distribution. Before sequencing we identified AMR loci using the CARD database [1], resulting in ROIs covering 12.4% and 9.8% of the bacterial genomes, respectively. We added 10kb of flanking regions ahead of each ROI to account for reads that start close enough to cover them. This resulted in initial strategy sizes of 64.7% and 57.9% of the genome for P. aeruginosa and L. monocytogenes, respectively.
In this experiment we compared BOSS-RUNS not only to sequencing without adaptive sampling, but also to readfish. Readfish is an established tool that is well-suited to targeting ROIs in genomes. With this comparison we aim to highlight the additional benefit of dynamically adjusted adaptive sampling in acquiring more sequencing data at the most relevant positions, e.g. at regions within genomes affected by coverage bias.
As with the application presented in the main text, sequencing was conducted on a GridION using R9.4 flowcells. Out of the 512 total channels on the flowcell, each of the three conditions was assigned 128 sequencing channels. Readfish was configured to reject reads if they were found to map to one or more off-target sites, i.e. to sites outside of the specified ROIs, or if they did not map at all or failed to produce reliable basecalls. Again, we used a data chunk of 0.8 s as the initial fragment for inferring the genomic origin and orientation in order to make decisions.
As expected, the proportion of sites from which reads are accepted decreases very quickly for L. monocytogenes, followed by a more steady decline for the less abundant P. aeruginosa (Suppl. Fig. 10A). The sites remaining in the strategy the longest are those that are hardest to resolve, such as homopolymers (Suppl. Fig. 10A, inset).
The mean coverage depth of the two species demonstrates that we successfully trade off additional data from L. monocytogenes in order to boost the coverage of P. aeruginosa. Especially in the first few hours of sequencing, we achieve the highest coverage values for the rarer of the two species compared to both the control and readfish (Suppl. Fig. 10B).
The advantage of focusing data collection on ROIs with BOSS-RUNS compared to readfish becomes more evident, however, when we consider how the sequencing data is distributed within ROIs. Here, we observe that the proportion of sites that remain covered at less than 5× is lowest for BOSS-RUNS. Equally, the remaining total uncertainty, i.e. the remaining entropy of the genotype probability distributions across all sites of interest, is also lowest when using our new method (Suppl. Fig. 10C,D). In fact, BOSS-RUNS surpasses the total acquired information content of the control and readfish in the ROIs of P. aeruginosa after 27.1% and 43.0% of the sequencing run, respectively.
Visualising the entire distribution of coverage depth at multiple time points throughout the experiment is another way to show how BOSS-RUNS redistributes data within species. Readfish continues to collect data from all specified ROIs throughout the run, reflected in both the mean coverage (Suppl. Fig. 10B) and the entire distribution (Suppl. Fig. 11C). BOSS-RUNS, on the other hand, does not continue to collect data from all of the specified ROIs, but instead stops sampling data from regions after observing enough data to resolve the genotypes, and subsequently manages to enrich coverage at specific areas where it is needed most. This results not only in fewer sites at coverage <5×, but also in coverage distributions with lower variance (Suppl. Fig. 10B).
Analysing the enrichment of on-target versus off-target regions confirms the previous observations. By normalising the total sequencing yield to the control sector of the flowcell, we see that readfish collected ∼1.5× more data at on-target sites in both bacteria compared to the control, and 0.6× the amount of data from the remaining parts of the genome (Suppl. Fig. 12).
BOSS-RUNS instead accumulates only 50% of the amount of data in L. monocytogenes compared to the control, as this bacterium's chromosome is quickly resolved. In the case of P. aeruginosa, BOSS-RUNS collects only slightly more data from ROIs but depletes off-target regions equally well as readfish. This smaller enrichment compared to readfish when measuring total yield is the trade-off for the redistribution of data described previously. For this experiment we used the same input material and preparation methods as described in the main text, resulting in a similar distribution of read lengths (see Fig. 2F). Since the amount of enrichment (and depletion) we can achieve largely depends on input read lengths, i.e. the expected difference between the initial fragment used to make decisions and the length of entire reads, the differences in yield presented here are well-suited to demonstrate the advantage of a dynamic aspect to adaptive sampling, but they could potentially be much larger given longer reads. Whereas we did not observe obvious differences between fragments from different bacterial species, we noticed that the peak at ∼450 bp composed of rejected reads was visible even for some species where we did not expect to eject reads. The vast majority of these false negative decisions is caused by an inability to map the DNA fragment given its initial $\mu$ bases. Increasing the amount of data used for the decision process could decrease the number of false rejections, but would in turn decrease the advantage of rejection relative to reading entire fragments. An alternative approach would be to always fully sequence reads of unknown origin, which might be appropriate given prior knowledge about the input DNA.

Supplementary Figure 3: Coverage is effectively redistributed by sampling more data from sites with low coverage. If our method successfully focuses on reads from areas with highest uncertainty, we expect the mean coverage at sites spanned by accepted reads to be lower than at sites spanned by rejected reads. Such an effect might be amplified when considering minimum coverage, since the decision strategy might be driven by low coverage at individual sites as opposed to low average coverage in an area. To investigate this, we separated the reads by the decision made during the experiment, recorded the mean and minimum coverage of all sites that a read maps to, and pooled those measurements at different timepoints. Plotted values are means ± standard error of all newly observed reads since the previous timepoint. We focus on the results for B. subtilis, for which we accept all reads until ∼80 min into the experiment, then reject an increasing proportion, rejecting most reads from ∼200 min onward. Looking at the mean coverage at sites spanned by accepted and rejected reads (left), we see an overlap up until the size of the strategy starts to diminish. Throughout the rest of the experiment, as expected, both mean and minimum coverage (right) of sites spanned by accepted reads are consistently lower than for rejected reads, albeit with larger fluctuation due to the decreasing number of fragments sequenced in their entirety. This demonstrates that data is continually sampled from areas of low coverage even after most of this species' genome has already been resolved. Notably, rejections of reads from this species right from the start are due to reads failing to map using only their initial parts.

Supplementary Figure: State of channels during the sequencing experiment.
A) Left: Barplot showing the cumulative time each channel spent sequencing, idling (i.e. rejecting a fragment or waiting for another one), and inactive (i.e. the final period not transmitting data before the end of the experiment). Right: Boxplots summarising these results across channels show that there is minimal difference between the time spent in each state. Inactive channels include those associated with blocked or damaged pores, which account for the long tails of outliers in the yellow boxplots. The 512 total channels on the flowcell were split into two sections, i.e. n = 256 channels are shown in each panel. Boxplots represent the median and the first and third quartiles, with whiskers extending to include all data points within 1.5 × IQR and points beyond these thresholds plotted individually. B) The number of actively transmitting channels over time shows that using BOSS-RUNS does not significantly impact the rate of flowcell degradation. Active channels are all those that are not inactive (see A). A similar number of channels assigned to both methods were inactive before the experiment started, indicated by the initial drops in both lines shown at t = 0. C) Schematic of fragment processing: a fragment is mapped to determine its location $i$, orientation $o$, and benefit $S^\mu_{i,o}$, after which a new DNA fragment is acquired (green, taking time $\alpha$) and mapped ($\mu$), with expected benefit, prior to determination of position and orientation, of $(\bar{S}^\mu_1 + \bar{S}^\mu_0)/2$. Alternatively, the (blue) fragment can be rejected (taking time $\rho$; no further benefit) and a new fragment acquired (mauve; additional time $\alpha$). This in turn gets mapped ($\mu$; determining its location $j$, orientation $\upsilon$, and benefit $S^\mu_{j,\upsilon}$) and decided upon. Initial effects of other fragments are shown in gold and red. Filled circles mark points where new fragments are acquired; decision points are marked by open diamonds. D-E) Local spatial autocorrelation identifies few hot and cold spots, but with low Moran's I and not conferring any obvious performance disadvantage to either BOSS-RUNS or the control. The quadrants of the scatterplot distinguish the type of local spatial autocorrelation, describing either similarities or dissimilarities between the yield of a channel and its neighboring channels; e.g. "high-high" in the upper right quadrant indicates channels with high yield surrounded by other channels of high yield, and so on. Analyses performed using PySAL 2.6.0 [4] and GeoPandas 0.11.0 [2]. We used two-sided statistical tests and assumed significance at p < 0.05.

Supplementary Figure 9: Divergence between bacterial strains in the mock community and the reference genomes used. In order to mimic a realistic sequencing scenario, in which we do not have prior knowledge about the exact strains contained in a mixture, and to be able to analyse the ability to detect differences from the resulting data, we used references that are closely related but not identical to the true genomes. To quantify the difference we used both (A) the percentage of aligned nucleotide stretches (>90% sequence identity) between the true and reference assemblies and (B) ANI (average nucleotide identity) values determined using JSpecies in blast mode [5]. The references used in our experiment are ordered along the x-axis by their abundance in the microbial community, while highly accurate assemblies [3] are shown along the y-axis. The color scale transitions from red to green with a saddle point (white) at 0.9 in order to emphasize differences at high values.

Notation used throughout the supplementary methods:

$F_{i,o}$: Probability that a fragment starts at $i$ and has orientation $o$
$\pi_i(g)$: Prior probability of genotype $g \in G$ at position $i$
$f_i(g|D)$: Posterior probability of genotype $g \in G$ at position $i$ given data $D$
$\phi(d|g)$: Probability of character $d$ with genotype $g$
$P(d|D)$: Posterior probability of sequencing character $d$ given data $D$