Reconstruction algorithms for DNA-storage systems

Motivated by DNA storage systems, this work presents the DNA reconstruction problem, in which a length-$n$ string is transmitted through the DNA-storage channel, which introduces deletion, insertion, and substitution errors. The channel generates multiple noisy copies of the transmitted string, called traces. A DNA reconstruction algorithm is a mapping that receives $t$ traces as input and produces an estimation of the original string. The goal in the DNA reconstruction problem is to minimize the edit distance between the original string and the algorithm's estimation. In this work, we present several new algorithms for this problem. Our algorithms look globally at the entire sequences of the traces and use the dynamic programming algorithms for the shortest common supersequence and the longest common subsequence problems in order to decode the original string. Our algorithms do not impose any restrictions on the input or on the number of traces, and moreover, they perform well even for error probabilities as high as 0.27. The algorithms have been tested on simulated data, on data from previous DNA storage experiments, and on a new synthesized dataset, and are shown to outperform previous algorithms in reconstruction accuracy.

for every pair $(x, y) \in (\Sigma_q^*)^2$. Note that it is not assumed that the lengths of the input and output sequences are the same, as we also consider deletions and insertions of symbols. As an example, it is well known that if $S$ is the binary symmetric channel (BSC) with crossover probability $0 \le p \le 1/2$, denoted by $BSC(p)$, it holds that
$$\Pr_{BSC(p)}\{y \text{ rec.} \mid x \text{ trans.}\} = p^{d_H(y,x)}(1-p)^{n-d_H(y,x)},$$
for all $(x, y) \in (\Sigma_2^n)^2$, and otherwise (when the lengths of $x$ and $y$ differ) this probability equals 0.
The maximum-likelihood (ML) decoder for a code $C$ with respect to $S$, denoted by $\mathcal{D}_{ML}$, outputs a codeword $c \in C$ that maximizes the probability $\Pr_S\{y \text{ rec.} \mid c \text{ trans.}\}$. That is, for $y \in \Sigma_q^*$,
$$\mathcal{D}_{ML}(y) = \arg\max_{c \in C} \left\{\Pr_S\{y \text{ rec.} \mid c \text{ trans.}\}\right\}.$$
It is well known that for the BSC, the ML decoder simply chooses the closest codeword with respect to the Hamming distance.
The conventional setup of channel transmission is extended to the case of more than a single instance of the channel. Assume a sequence $x$ is transmitted over $t$ identical channels of $S$ and the decoder receives all channel outputs $y_1, \ldots, y_t$. This setup is characterized by the conditional probability
$$\Pr_{(S,t)}\{y_1, \ldots, y_t \text{ rec.} \mid x \text{ trans.}\} = \prod_{i=1}^{t} \Pr_S\{y_i \text{ rec.} \mid x \text{ trans.}\}.$$
Now, the input to the ML decoder is the sequences $y_1, \ldots, y_t$ and the output is the codeword $c$ that maximizes the probability $\Pr_{(S,t)}\{y_1, \ldots, y_t \text{ rec.} \mid c \text{ trans.}\}$.
For two sequences $x, y \in \Sigma_q^*$, the number of times that $y$ can be received as a subsequence of $x$ is called the embedding number of $y$ in $x$ and is defined by
$$Emb(x; y) \triangleq \left|\left\{(i_1, \ldots, i_{|y|}) : i_1 < \cdots < i_{|y|} \text{ and } x_{i_j} = y_j \text{ for all } 1 \le j \le |y|\right\}\right|.$$
Note that if $y$ is not a subsequence of $x$ then $Emb(x; y) = 0$. The embedding number has been studied in several previous works; see e.g. [1,3], and in [9] it was referred to as the binomial coefficient. In particular, this value can be computed with quadratic complexity [3].
While the calculation of the conditional probability $\Pr_S\{y \text{ rec.} \mid x \text{ trans.}\}$ is a rather simple task for many of the known channels, it is not straightforward for channels that introduce insertions and deletions. In the deletion channel with deletion probability $p$, denoted by $Del(p)$, every symbol of the word $x$ is deleted with probability $p$. For the deletion channel it is known, see e.g. [8,9], that for all $(x, y) \in (\Sigma_q^*)^2$ it holds that
$$\Pr_{Del(p)}\{y \text{ rec.} \mid x \text{ trans.}\} = p^{|x|-|y|} \cdot (1-p)^{|y|} \cdot Emb(x; y).$$
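To make the role of $Emb(x; y)$ concrete, the following is a minimal sketch (not the paper's implementation; the function name and interface are ours) of the standard quadratic-time dynamic program mentioned above.

```python
def embedding_number(x: str, y: str) -> int:
    """Emb(x; y): number of ways y occurs in x as a subsequence.

    Standard DP with E[i][j] = number of embeddings of y[:j] into x[:i];
    runs in O(|x| * |y|) time, the quadratic complexity cited above [3].
    """
    n, m = len(x), len(y)
    E = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        E[i][0] = 1  # the empty sequence embeds exactly once
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = E[i - 1][j]  # embeddings that do not use x[i-1]
            if x[i - 1] == y[j - 1]:
                E[i][j] += E[i - 1][j - 1]  # match y[j-1] to x[i-1]
    return E[n][m]
```

With this in hand, the deletion-channel probability above is simply `p ** (len(x) - len(y)) * (1 - p) ** len(y) * embedding_number(x, y)`.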
According to this property, the ML decoder for one or multiple deletion channels is stated as follows [8].
Lemma 1. Assume $c \in C \subseteq (\Sigma_q)^n$ is the transmitted sequence and $y_1, \ldots, y_t \in (\Sigma_q)^*$ are the output sequences from $Del(p)$. Then,
$$\mathcal{D}_{ML}(y_1, \ldots, y_t) = \arg\max_{x \in SCS(y_1, \ldots, y_t)} \prod_{i=1}^{t} Emb(x; y_i).$$
Note that since there is more than a single channel, when the goal is to minimize the average decoding error probability, the ML decoder does not necessarily have to output a codeword, but rather any sequence that minimizes the average decoding error probability. In the next sections it will be shown how to use the concepts of the SCS and the LCS together with the maximum-likelihood decoder in order to build decoding algorithms for the deletion DNA reconstruction and the DNA reconstruction problems.
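As an illustration of Lemma 1, the following hedged sketch (our naming; the candidate enumeration itself is assumed to be done elsewhere) picks the maximum-likelihood candidate from a given candidate set, e.g. the SCS set, by summing per-trace log-likelihoods, with `embedding_number` as sketched above.

```python
import math

def deletion_log_likelihood(x: str, y: str, p: float) -> float:
    """log Pr_Del(p){y rec. | x trans.} = log(p^{|x|-|y|} (1-p)^{|y|} Emb(x; y))."""
    emb = embedding_number(x, y)
    if emb == 0:
        return float("-inf")  # y is not a subsequence of x
    return (len(x) - len(y)) * math.log(p) + len(y) * math.log(1 - p) + math.log(emb)

def ml_decode(candidates, traces, p: float) -> str:
    """Return the candidate maximizing the product of per-trace probabilities.

    For candidates of equal length (e.g., all members of SCS(y_1, ..., y_t)),
    the p-dependent factors coincide, so this reduces to maximizing
    prod_i Emb(x; y_i), exactly as in Lemma 1.
    """
    return max(candidates,
               key=lambda x: sum(deletion_log_likelihood(x, y, p) for y in traces))
```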

An Algorithm for Small Fixed Values of t
A straightforward implementation of this approach on a cluster of size $t$ is to compute the set of shortest common supersequences of $y_1, y_2, \ldots, y_t$, i.e., the set $SCS(y_1, y_2, \ldots, y_t)$, and then return the maximum-likelihood sequence among them. This algorithm has been rigorously studied in [8], which analyzed its Levenshtein error rate for $t = 2$. The length of the SCS is commonly calculated using dynamic programming [5], whose complexity is the product of the lengths of all sequences. Hence, even for moderate cluster sizes, e.g. $t \ge 5$, this solution incurs high complexity and impractical running times. However, for many practical values of $n$ and $p_d$, the original sequence $x$ can be found among the list of SCSs while taking fewer than $t$ traces, or even only two of them. This fact, which we verified empirically, can drastically reduce the complexity of the ML-based algorithm. Furthermore, note that $x$ is always a common supersequence of all traces; however, it is not necessarily the shortest one. Hence, our algorithm works as follows. The algorithm creates sorted sets of $r$-tuples, where each tuple consists of $r$ traces from the cluster. The $r$-tuples are sorted in non-decreasing order of the sum of their lengths. For each $r$-tuple $(y_{i_1}, \ldots, y_{i_r})$, the algorithm first calculates the length of its SCS. Observe that if this length equals $n$, then the sequence $x$ necessarily appears in the set of SCSs of $(y_{i_1}, \ldots, y_{i_r})$, that is, $x \in SCS(y_{i_1}, \ldots, y_{i_r})$; however, it is not necessarily the only sequence in this set. Hence, all that is left to do is to filter the set $SCS(y_{i_1}, \ldots, y_{i_r})$ for sequences that are supersequences of all $t$ traces and finally return the maximum-likelihood sequence among them. The algorithm iterates over all possible $r$-tuples for $r = 2, 3, 4$, and if none of them succeeds, the algorithm computes all SCSs of maximal length that were observed throughout its run and returns the one that minimizes the sum of Levenshtein distances to all copies in the cluster.
In Algorithm 1, we present a pseudo-code of our solution for the deletion DNA reconstruction problem. Note that the algorithm uses another procedure, presented in Algorithm 2, to filter the supersequences and output the maximum-likelihood supersequence. The input to the algorithm is the length $n$ of the original sequence and a cluster $C$ of $t$ traces. Algorithm 1's main loop is in Step 2; first, in Step 2-a, it generates the set $F$, a set of all $r$-tuples of traces sorted by the sum of their lengths. Then, in Step 2-b, it iterates over all $r$-tuples in $F$ and checks for each $r$-tuple $(y_{i_1}, \ldots, y_{i_r})$ whether the length of its SCS equals $n$. If it does, it computes the set of all its SCSs, $SCS(y_{i_1}, \ldots, y_{i_r})$, and invokes Algorithm 2. Algorithm 2 checks if one or more of those SCSs are supersequences of all of the traces in the cluster, and if so, returns the maximum-likelihood sequence among them. If the SCS length is smaller than $n$, the algorithm also checks whether it is greater than or equal to $n_{max}$, the length of the longest SCS found so far; in that case the algorithm updates $C_{max}$, the set of all $r$-tuples whose SCS length equals $n_{max}$. In Step 3, the algorithm computes $S_{max} = \bigcup_{c \in C_{max}} SCS(c)$, the union of the sets of SCSs of the $r$-tuples whose SCS length was $n_{max}$. In Step 4, the algorithm invokes Algorithm 2 again to check whether $S_{max}$ includes supersequences of all traces in $C$, and returns the maximum-likelihood sequence among them. If none of the sequences in $S_{max}$ is a supersequence of all traces in $C$, the algorithm returns in Step 5 the sequence that minimizes the sum of Levenshtein distances to all the traces in $C$.

Algorithm 1 ML-SCS Reconstruction
Input: • Design length $n$. • Cluster $C$ of $t$ noisy traces $y_1, y_2, \ldots, y_t$, sorted by their lengths from the longest to the shortest.
Output: $\widehat{x}$ - estimation of the original sequence.
1. Set $n_{max} = 0$ and $C_{max} = \emptyset$.
2. for $r = 2, 3, 4$ do
   (a) Denote by $F$ the set of all $r$-tuples from $C$, sorted in non-decreasing order of the sum of the lengths of the copies in each tuple.
   (b) for each $r$-tuple $c = (y_{i_1}, \ldots, y_{i_r}) \in F$ do
       i. If the SCS length of $c$ equals $n$, compute $SCS(c)$ and invoke Algorithm 2 with $SCS(c)$ and $C$; if it returns a sequence, return it.
       ii. Otherwise, if the SCS length of $c$ is at least $n_{max}$, update $n_{max}$ and add $c$ to $C_{max}$.
   end for
3. Compute $S_{max} = \bigcup_{c \in C_{max}} SCS(c)$.
4. Invoke Algorithm 2 with $S_{max}$ and $C$; if it returns a sequence, return it.
5. Return the sequence from $S_{max}$ that has the minimum sum of Levenshtein distances to the copies in the cluster.
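The SCS-length test in Step 2-b can be sketched as follows for a pair of traces, using the identity $|SCS(a, b)| = |a| + |b| - |LCS(a, b)|$ and the textbook LCS dynamic program [5]; for $r > 2$ traces an $r$-dimensional DP applies instead, so this pairwise version is only an illustration.

```python
def scs_length(a: str, b: str) -> int:
    """Length of the shortest common supersequence of two sequences.

    Uses |SCS(a, b)| = |a| + |b| - |LCS(a, b)|, where the LCS length is
    computed with the standard O(|a| * |b|) dynamic program.
    """
    n, m = len(a), len(b)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return n + m - L[n][m]
```

A pair $(y_{i_1}, y_{i_2})$ passes the test of Step 2-b exactly when `scs_length(y_i1, y_i2) == n`.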

Simulations
We evaluated the accuracy and efficiency of Algorithm 1 in the following simulations. These simulations were run over sequences of length $n = 200$, clusters of size $4 \le t \le 10$, and deletion probability $p$ in the range $[0.01, 0.10]$. The alphabet size was 4. Each simulation consisted of 100,000 randomly generated clusters. Furthermore, we ran another set of simulations for $n = 100$ with deletion probability $p$ in the range $[0.11, 0.20]$ and clusters of size $4 \le t \le 10$. Each simulation for these values of $p$, $n$, and $t$ included 10,000 randomly selected clusters. We calculated the Levenshtein error rate (LER) of the decoded output sequence as well as the average decoding success probability (referred to as the success rate). We also calculated the $k$-error success rate, which is defined as the fraction of clusters for which the Levenshtein distance between the algorithm's output sequence and the original sequence was at most $k$. Note that for $k = 0$, this is equivalent to calculating the success rate. We also calculated the minimal $k$ for which the $k$-error success rate is at least $q$, and denote this value of $k$ by $k^q_{succ}$. Note that for $q = 1$ this value determines the minimal number of Levenshtein errors that an error-correcting code must correct in order to fully decode the original sequences using Algorithm 1 together with an error-correcting code. In addition, each cluster was also reconstructed using the BMA algorithm [2].

Algorithm 2 ML-Supersequence
Input: • A set $S$ of candidate sequences. • Cluster $C$ of $t$ noisy traces $y_1, y_2, \ldots, y_t$.
Output: The maximum-likelihood candidate of $S$ that is a supersequence of all copies from the cluster. If no such candidate exists, the algorithm returns $\bot$.
1. Filter $S$ so that it contains only sequences that are supersequences of all traces from the cluster.
2. if $S \ne \emptyset$ then return the maximum-likelihood sequence from $S$ with respect to the cluster $C$. end if
3. Return $\bot$.
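For completeness, the evaluation metrics defined above can be computed as in the following sketch (the function names are ours); `levenshtein` is the standard edit-distance DP.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein (edit) distance between a and b via the standard row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution / match
        prev = cur
    return prev[-1]

def k_error_success_rate(outputs, originals, k: int) -> float:
    """Fraction of clusters whose output is within Levenshtein distance k of x.

    For k = 0 this equals the success rate, as noted above.
    """
    hits = sum(levenshtein(o, x) <= k for o, x in zip(outputs, originals))
    return hits / len(outputs)
```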
Figure S1 presents the LER as computed in our simulations of Algorithm 1 and the BMA algorithm for clusters of sizes $t = 7$ and $t = 10$. We also added the trivial lower bound of $p^t$ on the LER [8,9]. This bound corresponds to the case where the same symbol is deleted in all of the traces. In this case, the symbol will not appear in the list of SCSs of any possible $r$-tuple, or even of the entire cluster, since it cannot be recovered. Hence, it is not possible to recover its value, and thus it will be deleted also in the output of the ML decoder.
In order to simulate high deletion probabilities as well, we simulated 1,000 clusters of sequences of length $n = 100$ over a quaternary alphabet, with cluster size $4 \le t \le 10$ and deletion probability $p = 0.25$. Figure S2(a) presents the $k$-error success rate of this simulation and Figure S2(b) presents the values of $k^1_{succ}$ and $k^{0.99}_{succ}$ by the cluster size.

Large Cluster
In case the cluster is of larger size, for example on the order of $\Theta(n)$, we present in Algorithm 3 a variation of Algorithm 1 for large clusters. In this case, since the cluster is large, the probability of finding a pair, triplet, or quadruplet of traces whose set of SCSs contains the original sequence $x$ is very high, if not 1. In fact, in all of our simulations, which we elaborate on below in this section, we were always able to successfully decode the original sequence with no errors, even when the deletion probability was as high as 0.2. Hence, our main goal in this part is to decrease the runtime of Algorithm 1 while keeping the success rate at 1. Algorithm 3 keeps the same structure as Algorithm 1; however, it performs two filters on the cluster in order to reduce the computation time.
The complexity of finding the length of the SCS of a set of $r$ traces is the product of their lengths, i.e., $\Theta(n^r)$ [5]. Therefore, the complexity of finding the length of the SCS of a pair of traces is $\Theta(n^2)$, while there are $\Theta(n^2)$ pairs of traces (assuming the cluster size is $\Theta(n)$). Hence, in this case, calculating the length of the SCS of each pair of traces before considering any triplets is not necessarily the right strategy when the goal is to optimize the algorithm's running time. Instead, in Algorithm 3 we focus on filtering the traces in the cluster in order to check only a subset of the traces that is more likely to succeed and produce the correct sequence.
To define the filtering criteria for Algorithm 3, we simulated Algorithm 1 on large clusters. The length of the original sequence $x$ was $n = 200$ and the cluster size was $t = n/2 = 100$. We generated 10,000 clusters of size $t$, where the deletion probability $p$ was in the range $[0.01, 0.15]$. The success rate of all the simulations was 1. We evaluated the percentage of clusters in which the first $r$-tuple to have an SCS of length $n$ consisted of the longest 20% of the traces in the cluster. We observed that when the deletion probability was at most 0.07, in all of the clusters the first $r$-tuple of traces that had an SCS of length $n$ consisted of the longest 20% of the traces in the cluster. For deletion probabilities between 0.08 and 0.11 these percentages ranged between 94.76% and 99.98%, while for $p = 0.15$ this percentage was 60.88%. Therefore, by filtering for the longest 20% of the traces, it was enough to check only $\binom{20}{2} = 190$ pairs instead of $\binom{100}{2} = 4{,}950$ pairs in order to reach the successful pair. The results of these simulations are depicted in Figure S4(a).
This observation led us to the first filter in Algorithm 3, where we pick the longest 20% of the traces in the cluster. The second filter computes a cost function (in linear time complexity), explained below, on a given $r$-tuple of traces in order to evaluate whether the traces in this $r$-tuple are likely to have an SCS of length $n$. Thus, the algorithm skips the SCS computation of $r$-tuples that are less likely to have an SCS of length $n$. First, before performing the first filter, the algorithm calculates the average length of the traces in the cluster and uses it to estimate the deletion probability $p$. Then, if $\widehat{p} > 0.1$, the algorithm calculates the cost function on every $r$-tuple and checks whether it is higher than some fixed threshold. This threshold depends on the estimated value of $p$, and the cost function is based on a characterization of the sequences, as described in Section 1.4.2.
Supplementary Figure S3: $k$-mer distance demonstration for 4 traces. The original strand $x$ is of length $n = 20$.

An Algorithm for Large Values of t
In this section we present Algorithm 3. We list here the steps that differ from Algorithm 1. In Step 2 the algorithm estimates the deletion probability in the cluster by computing the average length $\overline{n}$ of the traces and then calculating $\widehat{p} = 1 - \overline{n}/n$. In Step 3, the algorithm filters the cluster so that it contains only the longest 20% of the traces. The last difference between Algorithm 3 and Algorithm 1 can be found in Step 4-b. In this step, before the computation of the SCS of a given $r$-tuple of traces, the algorithm computes the $k$-mer cost function (for $k$-mers of size $k = 2$) and checks whether it is larger than the threshold $T_{\widehat{p}}$.
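Steps 2 and 3 are straightforward; the following sketch (our naming) estimates $p$ from the mean trace length, since each symbol survives the deletion channel with probability $1 - p$, and keeps the longest 20% of the traces.

```python
def estimate_p(cluster, n: int) -> float:
    """Estimate the deletion probability: E[|y|] = n(1 - p), so p ~= 1 - mean/n."""
    mean_len = sum(len(y) for y in cluster) / len(cluster)
    return 1.0 - mean_len / n

def longest_fraction(cluster, fraction: float = 0.2):
    """Keep the longest `fraction` of the traces (the first filter of Algorithm 3)."""
    kept = max(1, int(fraction * len(cluster)))
    return sorted(cluster, key=len, reverse=True)[:kept]
```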
We evaluated the performance of Algorithm 3 and verified our filters by simulations. Each simulation consisted of 10,000 clusters of size $t = 100$; the length of the original strand was $n = 200$, the alphabet size was $q = 4$, and the deletion probability $p$ was in the range $[0.01, 0.2]$. Algorithm 3 reconstructed the exact sequence $x$ in all of the tested clusters. A comparison between the runtimes of Algorithm 1 and Algorithm 3 can be found in Figure S4(b). Note that we did not compare the running time with the BMA algorithm since its success rate was significantly lower; for example, when the deletion probability was 15%, its success rate was roughly 0.46.

Algorithm 3 ML-SCS Reconstruction for Large Clusters
Input: • Cluster $C$ of $t = \Theta(n)$ noisy traces $y_1, y_2, \ldots, y_t$, sorted by their lengths from the longest to the shortest. • Design length $n$.
Output: $\widehat{x}$ - estimation of the original sequence.
1. Set $n_{max} = 0$ and $C_{max} = \emptyset$.
2. Compute $\overline{n}$, the mean length of the traces in $C$, and define $\widehat{p} = 1 - \overline{n}/n$.
3. Filter the traces in $C$ so that it contains only the $t' = 0.2t$ longest traces in the cluster.
4. for $r = 2, 3, 4$ do
   (a) Denote by $F$ the set of all $r$-tuples from $C$, sorted in non-decreasing order of the sum of the lengths of the copies in each tuple.
   (b) for each $r$-tuple $c_i \in F$ do: if $\widehat{p} > 0.1$ and $c_{k\text{-mer}}(c_i) \le T_{\widehat{p}}$, skip $c_i$; otherwise proceed as in Step 2-b of Algorithm 1.
   end for
5. As in Algorithm 1, compute $S_{max}$ and invoke Algorithm 2 on it; if no supersequence of all traces is found, return the sequence from $S_{max}$ that has the minimum sum of Levenshtein distances to the copies in the cluster.

1.4.2 The k-mer Distance and the k-mer Cost Function

The $k$-mer vector of a sequence $y$, denoted by $k\text{-mer}(y)$, is a vector that counts the frequency in $y$ of each substring of length $k$ (a $k$-mer). The frequencies are ordered in lexicographic order of their corresponding $k$-mers. For example, for the sequence $y = \text{ACCTCC}$ and $k = 2$, its $k$-mer vector is $k\text{-mer}(y) = (0,1,0,0,0,2,0,1,0,0,0,0,0,1,0,1)$, according to the following calculation of the frequencies: $\{AA{:}\,0,\ AC{:}\,1,\ AG{:}\,0,\ AT{:}\,0,\ CA{:}\,0,\ CC{:}\,2,\ CG{:}\,0,\ CT{:}\,1,\ GA{:}\,0,\ GC{:}\,0,\ GG{:}\,0,\ GT{:}\,0,\ TA{:}\,0,\ TC{:}\,1,\ TG{:}\,0,\ TT{:}\,1\}$. We define the $k$-mer distance between two sequences $y_1$ and $y_2$ as the $L_1$ distance between their $k$-mer vectors, and denote it by $d_{k\text{-mer}}(y_1, y_2)$.
For a given set of $r$ sequences $Y = \{y_1, y_2, \ldots, y_r\}$, we define its $k$-mer cost function, denoted by $c_{k\text{-mer}}(y_1, y_2, \ldots, y_r)$, as the sum of the $k$-mer distances of all pairs of sequences in $Y$. That is,
$$c_{k\text{-mer}}(y_1, y_2, \ldots, y_r) = \sum_{1 \le i < j \le r} d_{k\text{-mer}}(y_i, y_j).$$
Observe that the $k$-mer distance between a sequence $x$ and a trace $y_1$ that results from $x$ by one deletion is at most $2k - 1$: every deleted symbol of $x$ decreases the value of at most $k$ entries of $k\text{-mer}(x)$ and increases the value of at most $k - 1$ entries. Hence, each deletion increases the $k$-mer distance by at most $2k - 1$, which means that an upper bound on the $k$-mer distance between the original strand $x$ and a trace $y_i$ with $np$ deletions is $np(2k - 1)$. However, when comparing the $k$-mer distance of two traces $y_1$ and $y_2$, each with more than one deletion, the $k$-mer distance can also decrease; an example of such a case is depicted in Figure S3. Combining these two observations, Algorithm 3 estimates whether two traces have a relatively large Levenshtein distance. If the traces have a large Levenshtein distance, it is more likely that they will have an SCS of length $n$. Hence, the algorithm checks whether the $k$-mer distance is larger than the threshold $T_{\widehat{p}} = 0.25\, n \widehat{p}\, (2k - 1)$ and continues to compute the SCS only if this condition holds. A similar computation is done for tuples with more than two traces. We use the factor 0.25 in the threshold to account for the cases where the $k$-mer distance decreases, as depicted in Figure S3. We selected this value after simulating other values as well, reaching the best results with 0.25. An optimization of this value is left for future research.
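The quantities of this subsection translate directly into code; the following is a sketch under the paper's parameters ($k = 2$, DNA alphabet), with the threshold test of Algorithm 3 shown for a pair of traces. The function names are ours.

```python
from itertools import combinations, product

def kmer_vector(y: str, k: int = 2, alphabet: str = "ACGT"):
    """Frequencies of all length-k substrings of y, in lexicographic k-mer order."""
    index = {"".join(t): i for i, t in enumerate(product(alphabet, repeat=k))}
    v = [0] * len(index)
    for i in range(len(y) - k + 1):
        v[index[y[i:i + k]]] += 1
    return v

def kmer_distance(y1: str, y2: str, k: int = 2) -> int:
    """d_kmer(y1, y2): L1 distance between the k-mer vectors."""
    return sum(abs(a - b) for a, b in zip(kmer_vector(y1, k), kmer_vector(y2, k)))

def kmer_cost(traces, k: int = 2) -> int:
    """c_kmer(y_1, ..., y_r): sum of pairwise k-mer distances."""
    return sum(kmer_distance(a, b, k) for a, b in combinations(traces, 2))

def passes_threshold(y1: str, y2: str, n: int, p_hat: float, k: int = 2) -> bool:
    """Compute the SCS of the pair only if its k-mer distance exceeds T_p."""
    return kmer_distance(y1, y2, k) > 0.25 * n * p_hat * (2 * k - 1)
```

On the running example, `kmer_vector("ACCTCC")` returns `[0,1,0,0,0,2,0,1,0,0,0,0,0,1,0,1]`, matching the vector above.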

The DNA Reconstruction Problem
This section studies the DNA reconstruction problem. Assume that a cluster consists of $t$ traces $y_1, y_2, \ldots, y_t$, all of which are noisy copies of a synthesized strand. This model assumes that every trace is a sequence that is independently received by the transmission of a length-$n$ sequence $x$ (the synthesized strand) through a deletion-insertion-substitution channel with some fixed probabilities $p_d$ for deletion, $p_i$ for insertion, and $p_s$ for substitution. Our goal is to propose an efficient algorithm that returns $\widehat{x}$, an estimation of the transmitted sequence $x$, with the intention of minimizing the edit distance between $x$ and $\widehat{x}$. In our simulations, we consider several values of $t$ and a wide range of error probabilities, as well as data from previous DNA storage experiments.
Before we present the algorithms, we list several more notations and definitions. An error vector of $y$ and $x$, denoted by $EV(y, x)$, is a vector of a minimum number of edit operations that transform $y$ into $x$. Each entry of $EV(y, x)$ consists of the index in $y$, the original symbol at this index, and the edit operation; in case the operation is an insertion or a substitution, the entry also includes the inserted or substituted symbol, respectively. Note that for two sequences $y$ and $x$, there can be more than one minimum sequence of edit operations that transforms $y$ into $x$. The edit distance between a pair of sequences is computed using a dynamic programming table, and the error vector is computed by backtracking on this table. Hence, $EV(y, x)$ is not unique in general, but it can be made unique by giving priorities to the different operations in case of ambiguity. That is, if there is an entry of the vector $EV(y, x)$ (scanning from the last entry to the first) where more than one edit operation can be selected, the operation is selected according to these given priorities. The error vector $EV(y, x)$ also maps each symbol of $y$ to a symbol of $x$ (and vice versa). We denote this mapping by $V_{EV}(y, x) : \{1, 2, \ldots, |y|\} \to \{1, 2, \ldots, |x|\} \cup \{?\}$, where $V_{EV}(y, x)(i) = j$ if and only if the $i$-th symbol of $y$ appears as the $j$-th symbol of $x$ with respect to the error vector $EV(y, x)$. In the case where the $i$-th symbol of $y$ is classified as a deleted symbol in $EV(y, x)$, $V_{EV}(y, x)(i) = {?}$. This mapping can also be represented as a vector of size $|y|$, where the $i$-th entry of this vector is $V_{EV}(y, x)(i)$. The reversed cluster of a cluster $C$, denoted by $C^R$, consists of the traces of $C$, each of them reversed. The reverse of a trace $y_i \in C$ is denoted by $y_i^R \in C^R$; the symbols of $y_i^R$ are arranged in reverse order compared to how they appear in $y_i$. For example, for the cluster $C = \{\text{AACG}, \text{GGT}\}$, the reversed cluster is $C^R = \{\text{GCAA}, \text{TGG}\}$.
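A minimal sketch of how an error vector in the spirit of $EV(y, x)$ can be extracted (the function name and output format are ours): compute the edit-distance table, then backtrack with a fixed tie-breaking order. Here ties prefer substitutions, then insertions, then deletions, matching the priorities used by Algorithm 5 below.

```python
def error_vector(y: str, x: str):
    """One minimum edit script transforming y into x, via DP plus backtracking.

    Returns a list of operation tuples: ('sub', i, old, new), ('ins', i, new),
    ('del', i, old), with i an index in y, as in the definition of EV(y, x).
    """
    n, m = len(y), len(x)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = D[i - 1][j - 1] + (y[i - 1] != x[j - 1])
            D[i][j] = min(sub, D[i][j - 1] + 1, D[i - 1][j] + 1)
    ops, i, j = [], n, m
    while i > 0 or j > 0:  # backtrack from the last entry to the first
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (y[i - 1] != x[j - 1]):
            if y[i - 1] != x[j - 1]:
                ops.append(("sub", i - 1, y[i - 1], x[j - 1]))
            i, j = i - 1, j - 1  # match or substitution (highest priority)
        elif j > 0 and D[i][j] == D[i][j - 1] + 1:
            ops.append(("ins", i, x[j - 1]))  # insertion (second priority)
            j -= 1
        else:
            ops.append(("del", i - 1, y[i - 1]))  # deletion (lowest priority)
            i -= 1
    return list(reversed(ops))
```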

The Iterative Reconstruction Algorithm (ITR Algorithm)
In this section we present Algorithm 4, the ITR Algorithm. The algorithm receives a cluster $C$ of $t$ traces and the design length $n$. Algorithm 4 uses several methods to revise the traces of the cluster and to generate from the revised traces a multiset of candidates. Then, Algorithm 4 returns the candidate that is most likely to be the original sequence $x$. The methods used to revise the traces are described in this section as Algorithm 5 and Algorithm 6. Algorithm 4 invokes Algorithm 5 and Algorithm 6 on the cluster in two different procedures, described in Algorithm 7 and Algorithm 8. The first method, described in Algorithm 5, is the EV Algorithm. The algorithm receives $C$, a cluster of $t$ traces, the design length $n$, and $y_i$, a trace from the cluster. Algorithm 5 calculates, for every $1 \le k \le t$, $k \ne i$, the vector $EV(y_i, y_k)$. In some of the cases, there may be more than one error vector $EV(y_i, y_k)$ corresponding to the edit operations that transform $y_i$ into $y_k$. In these cases, the algorithm prioritizes substitutions, then insertions, then deletions, in order to choose one unique vector. These priorities were selected to match our definition of the deletion-insertion-substitution channel; however, for practical uses, one can easily change these priorities if some preliminary knowledge of the error rates in the data is given. Following that, the algorithm performs a majority vote at each index over these vectors and creates $S$, a vector of edit operations. Lastly, Algorithm 5 performs the edit operations on $y_i$ and returns the result as an output for Algorithm 7 and Algorithm 8. Algorithm 5 is used as a procedure in Algorithm 7 and Algorithm 8 to correct substitution and insertion errors in the traces of the cluster.
The second method, described in Algorithm 6, is the PP Algorithm. Similarly to Algorithm 5, Algorithm 6 receives $C$, a cluster of $t$ traces, the design length $n$, and $y_k$, a trace from the cluster. Algorithm 6 computes similar patterns (defined in Section 2.1.1) on each pair of traces and creates a weighted graph from them. Each vertex of the graph represents a pattern, and an edge connects patterns with identical prefix and suffix. The weight of each edge represents the frequency of the incoming pattern, i.e., the number of pairs of traces in the cluster that have this pattern in their sequences. Algorithm 6 is described in detail in Section 2.1.1. Algorithm 6 is used as a procedure in Algorithm 7 and Algorithm 8 to correct deletion errors in the traces of the cluster.
Algorithm 7, the HR Algorithm, receives a cluster $C$ of $t$ traces and the design length $n$. Algorithm 7 performs $k$ cycles, where in each cycle it iterates over all the traces in the cluster. For each trace $y_k$, it first uses Algorithm 5 to correct substitution errors, then it uses Algorithm 6 to correct deletion errors, and lastly it uses Algorithm 5 to correct insertion errors. When it finishes iterating over the traces in the cluster, Algorithm 7 updates the cluster with all the revised traces and continues to the next cycle. At the end, Algorithm 7 performs the same procedure on $C^R$. Algorithm 7 returns a multiset of all the revised traces.
Algorithm 8, the VR Algorithm, also receives a cluster $C$ of $t$ traces and the design length $n$. Algorithm 8 uses the same procedures as Algorithm 7; however, in each cycle, it first corrects substitutions in all of the traces in the cluster using Algorithm 5, then it invokes Algorithm 6 on each trace to correct deletions, and finally it invokes Algorithm 5 on each trace to correct insertions. Similarly to Algorithm 7, Algorithm 8 performs the same operations also on $C^R$ and returns a multiset of the results.
Algorithm 4 invokes Algorithms 7 and 8 with $k = 2$ cycles and combines the resulting multisets into the multiset $S$. If one or more sequences of length $n$ exist in the multiset $S$, it returns the one that minimizes the sum of edit distances to the traces in the cluster. Otherwise, it checks whether there are sequences of length $n - 1$ or $n + 1$ in $S$, and returns the most frequent among them. If no such sequence exists, it returns the first sequence in $S$. The number of cycles is $k = 2$ since we found that for most of the clusters, the set of candidates converges after two cycles; thus, a possible third cycle does not improve the results in most cases.
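The final selection just described (Steps 3-5 of Algorithm 4 below) can be sketched as follows, reusing the `levenshtein` routine from the simulations section as the edit distance; `S` is the combined multiset of candidates.

```python
from collections import Counter

def select_candidate(S, cluster, n: int) -> str:
    """Final selection of the ITR Algorithm: a sketch of Steps 3-5 of Algorithm 4."""
    exact = [s for s in S if len(s) == n]
    if exact:  # Step 3: a length-n candidate minimizing the sum of edit distances
        return min(exact, key=lambda s: sum(levenshtein(s, y) for y in cluster))
    near = [s for s in S if len(s) in (n - 1, n + 1)]
    if near:   # Step 4: most frequent candidate of length n-1 or n+1
        return Counter(near).most_common(1)[0][0]
    return S[0]  # Step 5: fall back to the first candidate
```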

The Pattern Path Algorithm (PP Algorithm)
In this section we present the Pattern Path Algorithm. The algorithm, described in Algorithm 6, is the main procedure of the iterative algorithm (the ITR Algorithm) that corrects edit errors.

Algorithm 4 Iterative Reconstruction (The ITR Algorithm)
Input: • Cluster $C$ of $t$ noisy traces $y_1, y_2, \ldots, y_t$. • Design length $n$.
Output: • $\widehat{x}$ - estimation of the original sequence.
1. $S = \emptyset$.
2. Use Algorithm 7 and Algorithm 8, with $C$, $n$, and $k = 2$ as parameters, to compute a multiset of candidates. Save the candidates and their frequencies in $S$.
3. If $S$ has one or more sequences of length $n$, return one that minimizes the sum of edit distances to the traces in $C$ (ties are broken randomly).
4. Otherwise, if $S$ includes sequences of length $n - 1$ or $n + 1$, return the one that is most frequent in the multiset $S$ (if there is more than one, choose randomly).
5. Return the first sequence in $S$.

Denote by $w$ an arbitrary LCS sequence of $x$ and $y$ of length $\ell$. The sequence $w$ is a subsequence of $x$, and hence all of its symbols appear at some indices of $x$; assume these indices are given by $i^x_1 < i^x_2 < \cdots < i^x_\ell$. It should be noted that a subsequence can have more than one set of such indices, and the number of such sets is exactly the embedding number [1,3]. In our algorithm, we choose one of these sets arbitrarily. Furthermore, given a set of such indices $i^x_1 < i^x_2 < \cdots < i^x_\ell$, we define the embedding sequence of $w$ in $x$, denoted by $u_{x,w}$, as a sequence of length $|x|$ where for $1 \le j \le \ell$, $u_{x,w}(i^x_j)$ equals $x(i^x_j)$, and otherwise it equals ?. The gap of $x$, $y$ and their length-$\ell$ LCS sequence $w$ at index $1 \le j \le |x|$ with respect to $u_{x,w}$ and $u_{y,w}$, denoted by $gap^{u_{y,w}}_{u_{x,w}}(j)$, is defined as follows. In case the $j$-th or the $(j-1)$-th symbol of $u_{x,w}$ equals ?, $gap^{u_{y,w}}_{u_{x,w}}(j)$ is defined as the empty sequence. Otherwise, the symbol $u_{x,w}(j)$ also appears in $w$; denote by $j'$ the index of the symbol $u_{x,w}(j)$ in $w$. Recall that the sequence $w$ is an LCS of $x$ and $y$, and $u_{y,w}$ is the embedding sequence of $w$ in $y$. Given $u_{y,w}$, we can define the sequence of indices $i^y_1 < i^y_2 < \cdots < i^y_\ell$ such that $w(j') = y(i^y_{j'})$ for $1 \le j' \le \ell$. Given such a sequence of indices, $gap^{u_{y,w}}_{u_{x,w}}(j)$ is defined as the sequence $y_{[i^y_{j'-1}+1 \,:\, i^y_{j'}-1]}$, which is the sequence between the appearances of the $(j'-1)$-th and the $j'$-th symbols of $w$ in $y$. Note that since $i^y_{j'}$ can be equal to $i^y_{j'-1} + 1$, $gap^{u_{y,w}}_{u_{x,w}}(j)$ can be the empty sequence. Roughly speaking, $gap^{u_{y,w}}_{u_{x,w}}(j)$ holds every symbol that appears in $y$ between the $(j'-1)$-th and $j'$-th symbols of the LCS $w$, based on the embedding sequence $u_{y,w}$.
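The LCS sequence $w$ and the embedding sequences $u_{x,w}$, $u_{y,w}$ used in these definitions can be computed as in the following sketch (our naming); ties in the backtracking are broken by a fixed rule, matching the arbitrary choice of index set made above.

```python
def lcs_with_embeddings(x: str, y: str):
    """Return an LCS w of x and y and the embedding sequences u_{x,w}, u_{y,w}.

    Positions of x and y not used by the chosen embedding are marked '?'.
    """
    n, m = len(x), len(y)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    w, ux, uy = [], ["?"] * n, ["?"] * m
    i, j = n, m
    while i > 0 and j > 0:  # backtrack one (arbitrary) set of embedding indices
        if x[i - 1] == y[j - 1]:
            w.append(x[i - 1])
            ux[i - 1], uy[j - 1] = x[i - 1], y[j - 1]
            i, j = i - 1, j - 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(w)), "".join(ux), "".join(uy)
```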
The pattern of $x$ and $y$ with respect to the LCS sequence $w$, its embedding sequences $u_{x,w}$ and $u_{y,w}$, an index $1 \le i \le |x|$, and a length $m \ge 2$, denoted by $Ptn(x, y, w, u_{x,w}, u_{y,w}, i, m)$, is defined as
$$Ptn(x, y, w, u_{x,w}, u_{y,w}, i, m) \triangleq \big(u_{x,w}(i-1),\; gap^{u_{y,w}}_{u_{x,w}}(i),\; u_{x,w}(i),\; \ldots,\; gap^{u_{y,w}}_{u_{x,w}}(i+m-1),\; u_{x,w}(i+m-1)\big),$$
where for $i < 1$ and $i > |x|$, the symbol $u_{x,w}(i)$ is defined as the null character and $gap^{u_{y,w}}_{u_{x,w}}(i)$ is defined as the empty sequence. The parameter $m$ defines the length of the pattern, that is, the number of embedding-sequence symbols and gaps that comprise the pattern. In our implementation of the algorithm, the length of the patterns is $m = 2$.
The Pattern Path Algorithm receives a cluster $C$ of $t$ traces and one of the traces in the cluster, $y_k$. First, the algorithm initializes $L[y_k]$, a set of $|y_k|$ empty lists; for $1 \le i \le |y_k|$, the $i$-th list of $L[y_k]$ is denoted by $L[y_k]_i$. The algorithm pairs $y_k$ with each of the other traces in $C$. For each pair of traces $y_k$ and $y_h$, the algorithm computes an arbitrary LCS sequence $w$ and arbitrary embedding sequences $u_{y_k,w}$ and $u_{y_h,w}$. Then it uses them to compute $Ptn(y_k, y_h, w, u_{y_k,w}, u_{y_h,w}, i, m)$ for $1 \le i \le |y_k|$, and saves each such pattern in $L[y_k]_i$. Then, the algorithm builds the pattern graph $G_{pat}(y_k) = (V(y_k), E(y_k))$, which is a directed acyclic graph defined as follows.
1. $V(y_k) = \{(Ptn(y_k, y_h, w, u_{y_k,w}, u_{y_h,w}, i, m),\, i) : 1 \le i \le |y_k|,\ y_h \in C \setminus \{y_k\}\} \cup \{S, U\}$. The vertices are pairs of patterns and their indices. Note that the same pattern can appear in several vertices with different indices $i$; the value $|V|$ equals the number of distinct pattern-index pairs.
2. $E(y_k)$ connects consecutive overlapping patterns: there is an edge from a vertex $(Ptn, i)$ to a vertex $(Ptn', i + 1)$ if the suffix of $Ptn$ and the prefix of $Ptn'$ are identical.
3. The weights of the edges are defined by $w : E \to \mathbb{N}$ as follows. For $e = (v, u)$, where $u = (Ptn(y_k, y_h, w, u_{y_k,w}, u_{y_h,w}, i, m), i)$, it holds that $w(e) = |\{Ptn \in L[y_k]_i : Ptn = Ptn(y_k, y_h, w, u_{y_k,w}, u_{y_h,w}, i, m)\}|$, which is the number of appearances of $Ptn(y_k, y_h, w, u_{y_k,w}, u_{y_h,w}, i, m)$ in $L[y_k]_i$.
4. The vertex $S$, which does not correspond to any pattern, is connected to all vertices of the first index. The weight of each such edge is the number of appearances of the pattern of the incoming vertex.
5. The vertex $U$ has incoming edges from all vertices of the last index, and the weight of each such edge is zero.
Finally, the Pattern Path Algorithm identifies a longest (maximum-weight) path from $S$ to $U$ in the graph. This path induces a sequence, denoted by $\widehat{y}_k$, which is formed by concatenating the patterns of $y_k$ (including their gaps, if such exist) that appear in the vertices of the longest path in the pattern graph. Note that $\widehat{y}_k$ is a revised version of $y_k$ that incorporates the patterns, and any gaps, inherited from the vertices along the longest path. To illustrate the Pattern Path Algorithm's workflow, we provide an example in the following section.
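Since the pattern graph is a DAG whose vertices are naturally ordered by their index, the longest (maximum-weight) path can be found by dynamic programming; the following generic sketch (our naming) assumes the vertices are supplied in topological order and that edges map each vertex to (successor, weight) pairs.

```python
def heaviest_path(topo_order, edges, source, sink):
    """Maximum-weight source->sink path in a DAG given in topological order.

    `edges[v]` is a list of (successor, weight) pairs. Returns the path as a
    list of vertices; in the pattern graph, concatenating the patterns (and
    gaps) along this path yields the revised trace.
    """
    best = {v: float("-inf") for v in topo_order}
    parent = {v: None for v in topo_order}
    best[source] = 0
    for v in topo_order:
        if best[v] == float("-inf"):
            continue  # v is unreachable from the source
        for u, w in edges.get(v, []):
            if best[v] + w > best[u]:
                best[u] = best[v] + w
                parent[u] = v
    path, v = [], sink
    while v is not None:
        path.append(v)
        v = parent[v]
    return list(reversed(path))
```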

Example of the PP Algorithm
We present here a short example of the Pattern Path Algorithm and its related definitions.

Input of the algorithm
The original strand in this example is $x = \text{GTAGTGCCTG}$, with design length $n = 10$. The cluster of its $t = 5$ traces is $C = \{y_1, \ldots, y_5\}$; the traces are noisy copies of $x$ and include deletions, insertions, and substitutions. In this example, Algorithm 6 receives the cluster $C$ and the trace $y_k = y_1$ as its input.

Computation of the LCSs and the patterns in cluster
After receiving the input, the PP Algorithm continues with the next step of computing the LCSs and the patterns of the cluster. Figure S5 presents the process of computing the patterns of the pairs $(y_1, y_2), (y_1, y_3), (y_1, y_4), (y_1, y_5)$. For each pair $y_1$ and $y_i$, Figure S5 depicts $w_i$, an LCS of the sequences $y_1$ and $y_i$. Then, the figure presents $u_{y_1,w_i}$ and $u_{y_i,w_i}$, the embedding sequences that the Pattern Path Algorithm uses in order to compute the patterns. Lastly, the list of patterns of each pair is depicted in increasing order of their indices. Note that lowercase symbols denote gaps and X denotes the symbol ?.

Evaluating the patterns and their accuracy
The following list summarizes the patterns and their frequencies. Each list includes the patterns of a specific index, and the number to the right of each pattern represents the pattern's frequency.

Creating the pattern path graph
Next, based on the list of patterns above, the pattern path graph is created. Every pattern is a vertex in the graph, represented by its sequence and its position in the sequence ($1 \le i \le 10$). Furthermore, the weights of the edges in the graph represent the frequencies of the patterns. As can be seen in Figure S6, the created graph is a directed acyclic graph.

Output
It is not hard to observe that the longest path in the pattern path graph of this example is
S → GT → GTA → TAX → AXG → XGT → GTG → TGC → GCC → CCtG → CtG → U,
and the algorithm's output is $\widehat{y}_1 = \text{GTAGTGCCTG} = x$.

Algorithm 8 Iterative Reconstruction - Vertical (The VR Algorithm)
Input: • Cluster $C$ of $t$ noisy traces $y_1, y_2, \ldots, y_t$. • Design length $n$. • Number of cycles $k$.
Output: • $S = \{s_1, s_2, \ldots, s_p\}$, a multiset of $p$ candidate sequences that estimate the original sequence of the cluster.
1. $S = \emptyset$, $C_{orig} = C$.
2. for $j = 1, 2, \ldots, k$ do
   (a) $C_{tmp} = \emptyset$
   (b) for $y_i \in C$ do
       i. Perform Algorithm 5 on $y_i$ to correct substitution errors.
       ii. $C_{tmp} = C_{tmp} \cup \{y_i\}$.
   (c) end for
   (d) $C = C_{tmp}$
   (e) $C_{tmp} = \emptyset$
   (f) for $y_i \in C$ do
       i. Perform Algorithm 6 on $y_i$ to correct deletion errors.
       ii. $C_{tmp} = C_{tmp} \cup \{y_i\}$.
   (g) end for
   (h) $C = C_{tmp}$
   (i) $C_{tmp} = \emptyset$
   (j) for $y_i \in C$ do
       i. Perform Algorithm 5 on $y_i$ to correct insertion errors.
       ii. $C_{tmp} = C_{tmp} \cup \{y_i\}$.
   (k) end for
   (l) $C = C_{tmp}$
   end for
3. $S = S \cup C$.
4. Set $C_{orig} = C^R_{orig}$ and repeat Steps 2-3 on $C^R_{orig}$. Add the results to $S$.

Supplementary Figure S1: Levenshtein error rate by deletion probability $p$ for clusters of size $t = 7$ (left) and $t = 10$ (right). The figure presents results from Algorithm 1, the BMA algorithm [2], and the $p^t$ lower bound. Note that the LER was 0 for $p \le 0.03$ and $p \le 0.06$ for $t = 7$ and $t = 10$, respectively. The X-axis represents the deletion probability in the range $[0.01, 0.20]$ and the Y-axis represents the average LER of the clusters.

Supplementary Figure S2: (a) $k$-error success rate by cluster size $t$ and different values of $k$; the X-axis represents the cluster size $t$, the Y-axis the value of $k$, and the Z-axis the $k$-error success rate. (b) The values of $k^1_{succ}$ and $k^{0.99}_{succ}$ by cluster size $t$; note that $k^1_{succ}$ is the minimal number of Levenshtein errors that an error-correcting code must correct in order to fully reconstruct the tested clusters using Algorithm 1.

Supplementary Figure S4: Performance evaluation of Algorithm 1 and Algorithm 3. The simulations used clusters of size $t = 100$ and design length $n = 200$; for each deletion probability $p$ we simulated 10,000 clusters, and in all of the simulations the original sequence was reconstructed by the algorithms, yielding a success rate of 1. (a) Percentage of clusters in which the $r$-tuple of traces used by Algorithm 1 to reconstruct the original sequence $x$ consisted of the longest 20% of the traces, by deletion probability. (b) Running time, in minutes, of Algorithm 1 and Algorithm 3 on the 10,000 clusters, by deletion probability.

Supplementary Figure S5: Algorithm 6 example - patterns of $y_1$. (a) Patterns of $y_1$ and $y_2$. (b) Patterns of $y_1$ and $y_3$. (c) Patterns of $y_1$ and $y_4$. (d) Patterns of $y_1$ and $y_5$.

Supplementary figure captions (edit error rate): (a) Edit error rate by the error probability for $t = 10$. (b) Edit error rate by the error probability for $t = 20$. In both panels the X-axis represents the error probability of the simulated clusters and the Y-axis the edit error rate.

Supplementary figure captions ($k^1_{succ}$): (a) $k^1_{succ}$ values by the error probability for $t = 10$. (b) $k^1_{succ}$ values by the error probability for $t = 20$. In both panels the X-axis represents the error probability and the Y-axis the value of $k^1_{succ}$.