Introduction

Lineage tracing arose from embryonic fate mapping in developmental biology. Labeling a single cell with heritable markers provides information on the population, location, and lineage branching of its progeny1. The earliest lineage tracing dates back to studies of cell fate determination by direct observation of leech and nematode embryos2. Embryonic lineage analysis became more effective and informative later, when cells could be labeled with dyes or radioactive tracers3. However, direct observation under a microscope is infeasible for tracing somatic cell lineages in larger organisms, such as Drosophila larvae and mammals, whose total cell numbers are many orders of magnitude larger. The evolution of genetic studies and molecular cloning techniques in the 1970s expanded the toolbox for cell lineage studies to include somatic mosaicism4, retroviral labeling5, Cre/lox cassette-mediated recombination6, and transposable element insertions7. These technological advancements, in turn, support lineage studies of other biological questions and pathological conditions, such as hematopoiesis8, hemostasis1, and cancer cell heterogeneity9. Recent efforts have been made to integrate lineage labeling methods into single-cell sequencing techniques to improve lineage tracing resolution and to characterize comprehensive cellular dynamics in vivo10.

Single-cell RNA sequencing (scRNA-seq) has become the state of the art for cell identification and clustering through integrated transcriptomic profiling11. Although its accompanying trajectory inference analysis gives valuable insights into cell kinetics12, this pseudotemporal ordering has limitations. First, most trajectory analyses subjectively assume fixed start points, branch points, and endpoints, yet cells may follow more complex and dynamic differentiation or transformation patterns13. Second, mapping cellular dynamic paths solely on the basis of global gene expression similarity may distort the interpretation of genuine lineage commitment14; analytic bias may be introduced during data preprocessing when statistically insignificant but biologically essential features, or genes, are filtered out. As a result, dominant populations may obscure lineage branches15. Recent advances in sequencing-based lineage tracing resolve lineage information across the transcriptional landscape by sequencing heritable nucleic acid barcodes16,17. Parallel advances in genome editing tools, such as clustered regularly interspaced short palindromic repeats (CRISPR), have increased barcoding capacity by multiplexing editing targets18,19,20,21, engineering guide RNAs22,23, or engineering Cas9 proteins24. CRISPR array repair lineage tracing (CARLIN) is one such approach: stochastic mutations are generated as the cell repairs Cas9-mediated double-strand breaks (DSBs), while Cas9 activity is controlled by doxycycline induction. The edited CARLIN barcodes are recorded as transcripts and sequenced for lineage tree construction, and standard scRNA-seq analysis independently examines the cellular identities25.

CRISPR-mediated, sequencing-based lineage tracing techniques such as CARLIN have been applied in many studies of clonal dynamics. While they offer advantages such as high throughput, multiplexing, and compatibility with screening, they bring computational challenges when building lineage trees. One challenge is estimating the accuracy of the reconstructed lineage tree: very few studies have quantitatively assessed the accuracy of their lineage reconstruction methods26, owing to the lack of a ground truth or benchmark lineage. Yang et al27 used lineage tracing to study tumor initiation and progression. Phylogenies were constructed with Cassiopeia28, and the validity of the lineage reconstruction was estimated from the correlation between phylogenetic and allelic distances across the rebuilt trees. Using an algorithm called STAR Methods, Bowling et al25 built general consensus trees that are not of the binomial type; for example, one node may have 20+ direct children, which does not accurately reflect the cell division mechanism. This complicates accuracy estimation and hinders higher-resolution lineage tracing. The quality of the rebuilt lineage was assessed by comparing a computed pairwise similarity matrix of the tissues with known lineage relationships at tissue-level resolution. Chan et al29 applied two methods, biased search and greedy search, to reconstruct lineage trees, and determined the best reconstruction by summing the log-likelihoods of all indels that appear in the tree, using likelihoods estimated from embryo data. Quinn et al19 also used the Cassiopeia method28 to reconstruct lineage trees, and used the agreement between phylogenetic distance and allelic distance as a measure of the accuracy of the rebuilt lineage.

Technical difficulties also pose challenges for lineage tracing. A frequent challenge is the loss of lineage information, probably because of (i) inefficient barcode capture during scRNA-seq25, (ii) incomplete barcode labeling due to a short editing duration30, and (iii) ancestors’ barcodes being overwritten in their progeny23. In addition, PCR errors in library preparation may introduce artifacts into the barcodes and lead to inaccurate lineage tree reconstruction31. Furthermore, noisy data from oversaturated barcodes make it difficult to separate lineages and cell types, and constrain the tracing capacity; this is caused by over-induction of the barcoding system and poor experimental planning16. Insufficient sampling may also dramatically affect the completeness of lineage reconstruction for rare cell populations32. Without corresponding approaches to overcome these limitations and challenges, the uninformed use of the new generation of lineage tracing techniques may produce misleading outcomes. We reason that detailed simulation before large-scale screening might lead to a better understanding of the limitations and inform experimental design.

Simulations make it possible to estimate the accuracy of lineage trees because the actual lineage trees (reference or benchmark trees) are known, though most work has adopted simplified models. For example, Cassiopeia28 assumes that each barcode consists of n targets, where each target can mutate to one of m possible states, with n and m fixed. Furthermore, once a target has mutated it can no longer change. The phylogenetic tree is then built according to maximum parsimony. Since the reads of edited barcodes contain a lot of randomness, it remains a challenge to assign a mutated target to one of the m states. Similar models are adopted in several other papers; see, for example, 33 and 34. Zafar et al17 counted the number of unique Cas9 edits that are used to mark distinct barcodes. However, experiments show that during CRISPR-Cas9 barcode editing, nucleotides and nucleotide segments are continually and randomly deleted, mutated, or inserted25. Cas9 edit events therefore keep changing, and it may be difficult to distinguish edit events clearly, especially when two edits differ by only a few nucleotides.

In this work, unlike most of the simplified models, we design a purely stochastic simulation program that mimics CRISPR-Cas9 editing and the double-strand break (DSB) repair procedure. In this framework, barcodes are cut at random sites at a rate controlled by an inducible reagent (doxycycline), and DSB repair leads to the following events, each with its own probability: deletion of a large segment, nucleotide insertion, nucleotide substitution, and nucleotide deletion. After each edit event, the DSB repair mechanism rejoins the barcode pieces, and CRISPR-Cas9 editing repeats. This editing procedure proceeds along with cell divisions. All cells are labeled so that the “true” lineage tree is known. The simulation program can handle full or partial binomial lineage trees, and can also include non-dividing cells and/or dead cells in the tree. We then sample the barcodes at the leaf level of the tree and reconstruct the lineage, which is compared with the “true” lineage tree to estimate the accuracy. Simulations with different lineage reconstruction algorithms are carried out under a variety of parameter settings. A simulation with 2 independent barcodes is also performed, and factors that affect lineage tracing accuracy are probed. We believe that high-quality simulation could help determine and optimize the conditions for using the new generation of lineage tracing models such as CARLIN25 and DARLIN35.

Results

Comparison of alignment methods

Our modified alignment algorithm is compared with the well-established gap penalty alignment. The gap penalty method introduces a cost (\(gapstart<0\)) whenever a new gap is opened in the alignment, so the reward of a gap becomes \(gapstart+mismat\cdot L\), where L is the gap length. This is equivalent to the affine gap penalty algorithm, as can be seen by rewriting the reward as \((gapstart+mismat)+mismat\cdot (L-1)\). When \(gapstart=0\), this method is identical to the regular alignment. Figure 1 illustrates a sample of alignment results using the different algorithms. Figure 1A is obtained in a similar manner to Fig. 20A-bottom, and contains the real structure of the edited leaf barcodes. In an actual experiment with DNA sequencing, this structural information is lost because only the collapsed barcode sequences are obtained (see Fig. 20B). In simulation, however, we may use this information as a benchmark to test the efficacy of different alignment algorithms.
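The equivalence between the gap penalty reward and the affine form can be checked numerically. The short Python sketch below is only an illustration: the function names are hypothetical, and the values \(gapstart=-2\) and \(mismat=-2\) are example choices taken from the parameter ranges discussed in this section.

```python
def gap_penalty_reward(gap_length, gapstart=-2, mismat=-2):
    """Reward of one gap of length L in the gap penalty method: gapstart + mismat * L."""
    return gapstart + mismat * gap_length


def affine_gap_reward(gap_length, gap_open, gap_extend):
    """Standard affine form: one opening cost plus an extension cost for each further position."""
    return gap_open + gap_extend * (gap_length - 1)


# Hypothetical example values; gap_open and gap_extend follow the rewriting in the text.
gapstart, mismat = -2, -2
gap_open = gapstart + mismat   # cost of the first gapped position
gap_extend = mismat            # cost of every additional gapped position

for L in range(1, 6):
    print(L, gap_penalty_reward(L, gapstart, mismat), affine_gap_reward(L, gap_open, gap_extend))
```

With \(gapstart=0\) the opening cost equals the extension cost, so every gapped position is penalized like a mismatch, which is the regular alignment case.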

Using the root barcode as the pivot, we align the collapsed leaf barcodes using regular alignment, modified alignment, and gap penalty alignment, in an effort to rebuild the structural information of the edited barcodes (as in Fig. 1A); Fig. 1B–D show a sample for each method, respectively. We set the reward of a match at 1 and the penalty of a mismatch at -2. The reward for consecutive matches is \(consmat=0.2\), and the fraction of the penalty applied to consecutive mismatches is \(fr=0.5\). The root barcode has a length of 50, and the tree depth is 5. In the gap penalty alignment, the root barcode serves as the pivot, so we believe it is not viable to penalize a gap in the root barcode. Therefore, we only put a penalty on the leaf side. Figure 1B is comparable to Fig. 20C, where many nucleotides are placed in positions that belong to deleted large segments, resulting in many small gaps; Fig. 1C is comparable to Fig. 20D, where many large gaps are well recovered. Figure 1D shows a result of the gap penalty method; as with the modified alignment, large gaps are also recovered.

Figure 1

Comparison of Alignment Algorithms. (A) Leaf barcodes that contain structural information. This information is available in simulations but is lost in actual DNA sequencing. (B) Aligned barcodes using regular alignment. (C) Aligned barcodes using modified alignment. (D) Aligned barcodes using gap penalty alignment. (E) Leaf-leaf comparison scores using the three alignment algorithms.

To test the efficacy of each alignment algorithm, we compare the aligned leaf barcodes, as shown in Fig. 1B–D, pairwise with the actual leaf barcodes, as shown in Fig. 1A. Because the structure of each leaf barcode is supposed to be recovered, in this leaf-leaf comparison we simply apply the regular alignment. Since Fig. 1A contains the actual structure of each leaf barcode, a mismatch to an empty spot is treated the same as a mismatch to a nucleotide, and hence has the same penalty. Let \(Rew(s_1,s_2)\) be the reward score, using regular alignment, of comparing two strings \(s_1,s_2\) whose leading and trailing empty entries have been removed. Then each pairwise comparison yields a contribution score

$$\begin{aligned} \frac{Rew(s_1,s_2)}{\sqrt{Rew(s_1,s_1)}\sqrt{Rew(s_2,s_2)}}, \end{aligned}$$

which is similar to the correlation formula. The contribution score is less than or equal to 1, but could be less than -1 depending on the penalty of mismatches. Since the reward of a match is set at 1, the term \(Rew(s_1,s_1)\) (\(Rew(s_2,s_2)\)) simplifies to the length of \(s_1\) (\(s_2\)). Figure 1E shows a sample of the pairwise leaf-leaf comparison, for the 32 leaves in Fig. 1A, using the different alignment algorithms. Afterwards, the average contribution score over all leaves is calculated for each algorithm. This simulation is then run 100 times, and the average for each alignment algorithm is calculated and compared.
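The contribution score can be sketched in Python as follows. This is a minimal illustration, not the program used in this work: "regular alignment" is taken here to be a standard global (Needleman-Wunsch) alignment, and the gap cost is assumed to equal the mismatch penalty of -2. The function names are hypothetical, and both strings are assumed to be non-empty.

```python
import math


def regular_alignment_reward(s1, s2, match=1.0, mismatch=-2.0, gap=-2.0):
    """Global alignment reward Rew(s1, s2) by dynamic programming.

    Assumption: a gapped position costs the same as a mismatch (gap = mismatch).
    """
    prev = [j * gap for j in range(len(s2) + 1)]
    for i in range(1, len(s1) + 1):
        curr = [i * gap]
        for j in range(1, len(s2) + 1):
            diag = prev[j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def contribution_score(s1, s2):
    """Rew(s1, s2) / (sqrt(Rew(s1, s1)) * sqrt(Rew(s2, s2))) for non-empty strings."""
    numerator = regular_alignment_reward(s1, s2)
    denominator = math.sqrt(regular_alignment_reward(s1, s1)) * math.sqrt(
        regular_alignment_reward(s2, s2)
    )
    return numerator / denominator


print(contribution_score("ACGT", "ACGT"))  # identical strings reach the upper bound of 1
print(contribution_score("AAAA", "CCCC"))  # all mismatches: the score falls below -1
```

With a match reward of 1, `regular_alignment_reward(s, s)` equals the length of `s`, so the denominator reduces to the geometric mean of the two string lengths, as noted above.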

For the gap penalty method, we first test the parameter gapstart from -1 to -5 (when \(gapstart=0\) it gives the same result as regular alignment). The algorithm performs best when \(gapstart=-2\), and its performance degrades as gapstart becomes more negative. The reason is that, although a large negative gapstart reduces the number of gaps by putting a large penalty on opening a gap, it also tends to stack the nucleotides, which results in many mismatches. With gapstart set at -2, a simulation of 100 runs yields the following average scores: regular alignment = 0.488, modified alignment = 0.698, gap penalty alignment = 0.543. We then increase the barcode length to 100 and the generation number to 10, and test the gap penalty method again to find that \(gapstart=-1\) yields the best result. With this setting, the average of 100 runs yields the following scores: regular alignment = -0.0855, modified alignment = 0.346, gap penalty alignment = 0.0176. Both results indicate that the modified alignment outperforms the gap penalty method, and the gap penalty method outperforms regular alignment. The question of which parameter settings in the gap penalty algorithm yield the best result is beyond the scope of this paper. From the simulation results, we conclude that the modified alignment is at least comparable to the gap penalty method. Therefore, we adopt the modified alignment algorithm in an effort to recover the structure of leaf barcodes.

Evolution of barcode matching scores

To see how the matching scores, relative to the root, evolve as cells divide in the simulation, we run the program from generation 1 to generation 15, and for each generation the program is repeated 10 times. This part of the simulation does not involve sequence alignment or lineage tree reconstruction, so the computational burden is not heavy even at generation 15. The simulation parameters are tuned according to experimental results. For instance, from Figs. 1C–E and 3C–E in25, Fig. 3F.I in23, and Fig. 3A in36 we conclude that large segment deletion is a major phenomenon in barcode editing; that single nucleotide insertion is more frequent than insertion of two or more nucleotides; that single nucleotide deletion is more frequent than deletion of two or more nucleotides; and that unedited target sites (perfect repairs in our program) account for a considerable proportion of barcode editing events. Therefore, at a cutting site on the barcode, we set the probability of perfect repair at 0.7; the probabilities of inserting 1, 2, and 3 nucleotides at 0.1, 0.03, and 0.02, respectively; the probability of a substitution at 0.05; and the probability of a single nucleotide deletion at 0.1. The Cas9 mutation rate, defined as the cutting probability at each nucleotide, is set at \(mupb=0.1\). Given two or more cuts occurring on the barcode, the probability of a large segment deletion is 0.15.
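To make the parameter setting concrete, one round of editing with these probabilities could be sketched as below. This is a hedged illustration rather than the simulation program itself. The function names `edit_barcode` and `simulate_leaves` are hypothetical, and several details are assumptions: inserted nucleotides are placed right after the cut site, a large segment deletion removes the stretch between two randomly chosen cut sites, and each daughter cell is edited independently at every division.

```python
import random

# Repair outcomes at a cut site and their probabilities (values from the text).
REPAIR_OUTCOMES = ["perfect", "ins1", "ins2", "ins3", "sub", "del1"]
REPAIR_PROBS = [0.70, 0.10, 0.03, 0.02, 0.05, 0.10]


def edit_barcode(barcode, mupb=0.1, p_large_del=0.15, rng=random):
    """Apply one round of stochastic cutting and repair to a barcode string."""
    # Every nucleotide is a potential cutting site with probability mupb.
    cuts = [i for i in range(len(barcode)) if rng.random() < mupb]
    # With two or more cuts, a large segment may be lost and the ends rejoined.
    # Assumption: the deleted segment lies between two randomly chosen cuts.
    if len(cuts) >= 2 and rng.random() < p_large_del:
        start, end = sorted(rng.sample(cuts, 2))
        return barcode[:start] + barcode[end:]
    cut_sites = set(cuts)
    edited = []
    for i, nucleotide in enumerate(barcode):
        if i not in cut_sites:
            edited.append(nucleotide)
            continue
        outcome = rng.choices(REPAIR_OUTCOMES, weights=REPAIR_PROBS)[0]
        if outcome == "perfect":
            edited.append(nucleotide)
        elif outcome.startswith("ins"):
            # Assumption: inserted nucleotides follow the cut site.
            edited.append(nucleotide)
            edited.extend(rng.choice("ACGT") for _ in range(int(outcome[3])))
        elif outcome == "sub":
            edited.append(rng.choice([x for x in "ACGT" if x != nucleotide]))
        # "del1": the nucleotide is dropped, so nothing is appended.
    return "".join(edited)


def simulate_leaves(root, generations, rng=random, **edit_params):
    """Divide every cell once per generation; each daughter is edited independently."""
    level = [root]
    for _ in range(generations):
        level = [edit_barcode(parent, rng=rng, **edit_params)
                 for parent in level for _ in range(2)]
    return level


rng = random.Random(0)
root = "".join(rng.choice("ACGT") for _ in range(50))
leaves = simulate_leaves(root, generations=5, rng=rng)
print(len(leaves))  # 2**5 leaf barcodes, with siblings adjacent in the list
```

Passing a seeded `random.Random` instance keeps a run reproducible, which is convenient when the same tree has to be rebuilt by several reconstruction methods.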

The average matching scores as a function of generation number are shown in Fig. 2A, where n is the barcode length. The scoring scheme used to compute the matching scores is as follows.

$$\begin{aligned} \begin{array}{c|rrrr} & A & C & G & T\\ \hline A & 1 & -2 & -2 & -2\\ C & -2 & 1 & -2 & -2\\ G & -2 & -2 & 1 & -2\\ T & -2 & -2 & -2 & 1 \end{array} \end{aligned}$$

Figure 2

Evolution of Matching Scores. (A) Average matching score decays geometrically as cell division/barcode mutation continues. (B) Standard deviation of matching scores also decreases as cell division/barcode mutation continues.

It is seen from Fig. 2A that when the mutation rate (mupb) is small, the matching score decreases slowly as cells divide; and when the mutation rate is high the matching score decays fast, roughly at a geometric rate.

The standard deviation of the matching scores over the 10 runs is shown in Fig. 2B. Mutations accumulate in each generation, which introduces variation in the matching scores; meanwhile, the remaining barcode shortens gradually, which causes the standard deviation to decrease. As cells divide, the number of descendants doubles in each generation, and this factor further reduces the standard deviation of the matching scores.

Test on RMP and NBJ methods

Before we apply the RMP and NBJ methods, with or without filtering, to reconstruct the lineage tree and test the effect of some parameter settings, we need to test the efficacy of these methods. The mechanism of barcode editing in our simulation differs from most of the existing literature in that each nucleotide, with 4 possible states (A, C, G, T), is a potential cutting site, and the barcode length varies from generation to generation. Furthermore, Cas9 editing continues along with cell division; that is, any nucleotide in the current barcode could be cut again. Therefore, many existing public datasets do not fit this framework. For example, in the simulated in silico datasets in33, each barcode consists of a fixed number (a few hundred) of Cas9 targets, and each target has 30 possible states with individually assigned switching probabilities. In addition, it is assumed that this switch/jump may occur at most once, which greatly simplifies the simulation scheme.

In the in vitro dataset of Challenge 1 in33, however, each Cas9 target has 3 states (0, 1, 2), and it is possible to test our RMP and NBJ methods on this dataset. We built a simplified version of our program that works on the 76 in vitro training sets of Challenge 1 in33, where the barcode length is fixed, each target has 3 states, and pairwise entry-entry comparison is used (in place of alignment) when comparing two barcodes. The tree depth is estimated from the number of cells/barcodes. For the RMP method we set \(propm=0.7\), and for the NBJ method \(propm=0.3\). Rebuilding the parent node in our methods involves some randomness (see section "Reconstruct parent node"), so for each of the 76 training sets we run RMP and NBJ, with and without filtering, 50 times, and record the average (Avg) and maximum (Max) accuracy of dividing/paired nodes over these 50 runs (detailed data are available at github.com/xzhanglab/CRISPR-based-Lineage-Tracing-Simulation). The overall average over the 76 training sets is then calculated, as shown in Table 1.

Table 1 Accuracy (%) of in vitro data test.

Recall that in the case of a full binomial tree, the accuracy of paired nodes in our simulation is the complement of the RF distance. Thus, if NBJNF Avg has an accuracy of 42.038% in the case of a full binomial tree, it corresponds to an RF distance of 0.58.
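The relation between paired-node accuracy and the RF distance can be illustrated with a small Python sketch. It rests on assumptions that go beyond what is stated here: trees are written as nested tuples of leaf names, an internal node of the rebuilt tree counts as matched when it has exactly the same set of descendant leaves as a node of the true tree, and the root is excluded because it always matches. The function names are hypothetical.

```python
def collect_clades(tree, clades):
    """Add the leaf set of every internal node to `clades`; return this node's leaf set."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(collect_clades(child, clades) for child in tree))
    clades.add(leaves)
    return leaves


def paired_node_accuracy(true_tree, rebuilt_tree):
    """Fraction of non-root internal nodes of the true tree recovered in the rebuilt tree."""
    true_clades, rebuilt_clades = set(), set()
    true_root = collect_clades(true_tree, true_clades)
    rebuilt_root = collect_clades(rebuilt_tree, rebuilt_clades)
    true_clades.discard(true_root)        # the root clade is trivial
    rebuilt_clades.discard(rebuilt_root)
    return len(true_clades & rebuilt_clades) / len(true_clades)


# Toy example with four leaves: one of the two non-root clades is recovered.
true_tree = ((("a", "b"), "c"), "d")
rebuilt_tree = ((("a", "b"), "d"), "c")
accuracy = paired_node_accuracy(true_tree, rebuilt_tree)
print(accuracy, 1 - accuracy)  # accuracy and the corresponding normalized RF distance
```

For two full binomial trees on the same leaves, both trees have the same number of non-root clades, so the normalized RF distance is one minus this accuracy; an accuracy of 0.42038 therefore gives a distance of about 0.58.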

The 76 trees are further classified as small (cell number<10), medium (10\(\le\)cell number<20), and large (cell number\(\ge\)20), and the accuracy of each method on each class is also computed. This result is provided in Figure S1, which is comparable to Figure 2-F in33 (except that we use accuracy rather than RF distance when comparing two lineage trees). As an example, Fig. 3 shows the accuracy of the NBJNF method. In view of the result in Box 1-Figure F in33, we conclude that our RMP and NBJ methods are comparable to some benchmarked approaches for the reconstruction of in vitro cell lineages, such as DCLEAR(WHD), DCLEAR(KRD), the Liu method, and the Guan method.

Figure 3

Accuracy of NBJNF method on in vitro dataset. (x ticks represent the upper-bound of the sub-interval. For example, 8.3 represents the interval [0, 8.3]; 16.6 represents the interval (8.3, 16.6], etc.).

In the second test of our RMP and NBJ methods, we use our simulation to generate a lineage tree with a barcode length of 50 and a generation number of 5, and apply the RMP and NBJ methods to rebuild it. The other parameters are as follows: the probability of perfect repair is 0.5; the probabilities of inserting 1, 2, and 3 nucleotides are 0.15, 0.1, and 0.05, respectively; the probability of a substitution is 0.1; the probability of a single nucleotide deletion is 0.1; the probability of a large segment deletion is 0.1; the sample size is \(ss=1\); the Cas9 cutting rate is \(mupb=0.1\); and the pairing threshold propm varies from 0.4 to 0.9. Compared with the parameter settings in section "Evolution of barcode matching scores", the large segment deletion probability is slightly lower, and the insertion probabilities are slightly higher. The leaf barcodes generated under this setting are expected to possess more diversity/entropy. This modification is supported by a recent experimental result35 in which Cas9-TdT resulted in fewer deletions but twice the insertion events per allele compared with Cas9 expression, with all four nucleotides well represented in the inserted sequences.

The rebuilt tree is then compared to the actual lineage tree and accuracy is calculated. For each propm value and each lineage rebuilding method we run the simulation 100 times and the average (Avg) and maximum (Max) accuracy are computed, as shown in Table 2.

Table 2 Lineage accuracy (%) of 100 runs.

We see from Table 2 that our RMP and NBJ methods, with or without filtering, can rebuild the lineage well. In particular, the lineage accuracy of the RMPNF and NBJNF methods can reach 100%, which means the full binomial lineage tree is perfectly rebuilt. Figure 4 shows a scenario of barcode evolution in which the NBJNF method with \(propm=0.8\) rebuilds the lineage with 100% accuracy. Other parameter settings are the same as described for Table 2.

Figure 4

A barcode evolution scenario where lineage is perfectly rebuilt using NBJNF method. The rebuilt lineage is identical to this original tree, hence is omitted.

Further examination of the barcode evolution in Fig. 4 reveals that, with each cell division, sibling nodes acquire similar mutations, which build up in such a way that barcodes belonging to different branches are easily distinguishable; see the paired leaf nodes in Fig. 4.

Among the 100 runs using the NBJNF method with \(propm=0.8\), as seen in Table 2, we also show in Fig. 5 the barcode evolution scenario with the lowest accuracy of rebuilt lineage. In this case, after the first division, the two children barcodes both experience a large segment deletion, which occurs, although independently, around the same sites. Further mutations do not introduce enough variety to distinguish leaf nodes that belong to different branches. As a result, the leaf barcodes all look similar, which causes a homoplasy effect37, and that is why the NBJNF method cannot pair them accurately. This finding suggests that in an actual experiment it is better to control the mutation rate so as to avoid large segment deletions in early generations, for example, by reducing Dox induction in the first few divisions.

Figure 5

A barcode evolution scenario where the rebuilt lineage has lowest accuracy in 100 runs using NBJNF method. (A) Original lineage tree. (B) Rebuilt lineage tree. Green nodes represent matched nodes to the original counterparts. The accuracy is \(7/30=23.3\%\).

The effect of barcode length and sampling proportion

Owing to the large computational burden, we cannot perform exhaustive tests on all combinations of methods and parameter settings. We assume that attributes found in one method also apply to the other methods unless there is a legitimate reason against it. In this section we first test the effects of barcode length and sampling size on the accuracy of the reconstructed lineage using the RMP and NBJ methods.

We pick different barcode lengths, \(n=100,200,300,400\) (bp), and different sampling proportions (ss, the sampling size) to test their effects on the accuracy of the reconstructed lineage trees, as explained in sections "Binomial tree comparison" and "Fractional sampling". We simulate cell divisions for 10 generations with barcode mutations under a given mutation rate. The simulation is run 10 times for each setting, and the average accuracy is plotted in Fig. 6A for the RMP method with \(propm=0.85\), and in Fig. 6B for the NBJ method with \(propm=0.4\). The effect of propm on accuracy is investigated in section "The effect of matching proportion in pairing".

Figure 6

Whole tree accuracy with different barcode lengths and sampling proportions. (A) Lineage accuracy using RMP method. (B) Lineage accuracy using NBJ method.

Figure 6 shows that the accuracy of both the internal nodes and the paired nodes generally increases with longer barcodes, though the increase is not dramatic. In a few scenarios the lineage accuracy slightly decreases with a longer barcode. This is somewhat counterintuitive, because we expected a longer barcode to greatly increase the accuracy of the reconstructed lineage. We find that this is an inherent property of purely stochastic models, and explain it in detail in the next section together with the effect of the mutation rate.

Nevertheless, the sampling proportion plays a much more influential role in the accuracy of the lineage. When the sampling proportion is high, most internal nodes are paired nodes, and as a result the accuracy of internal nodes and the accuracy of paired nodes are very close. As the sampling proportion decreases, many internal nodes become singleton nodes, and these two accuracy measurements differ greatly: a large proportion of internal nodes are matched in the rebuilt lineage tree, while a much smaller proportion of the paired nodes are correctly matched.
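The shift from paired to singleton nodes under sparse sampling can be illustrated by counting node types in a full binomial tree. The sketch below is an assumption-laden illustration, not part of the simulation program: an internal node is counted as paired when both of its children have at least one sampled leaf beneath them, and as a singleton when only one child does. The function name and the uniform leaf sampling are hypothetical choices.

```python
import random


def count_node_types(depth, ss, rng):
    """Count paired and singleton internal nodes after sampling a fraction ss of the leaves."""
    n_leaves = 2 ** depth
    n_sampled = max(1, round(ss * n_leaves))
    # Leaves are indexed 0..n_leaves-1; the parent of node i on the next level up is i // 2.
    level = set(rng.sample(range(n_leaves), n_sampled))
    paired = singleton = 0
    for _ in range(depth):
        children_per_parent = {}
        for idx in level:
            children_per_parent[idx // 2] = children_per_parent.get(idx // 2, 0) + 1
        paired += sum(1 for count in children_per_parent.values() if count == 2)
        singleton += sum(1 for count in children_per_parent.values() if count == 1)
        level = set(children_per_parent)
    return paired, singleton


rng = random.Random(0)
for ss in (1.0, 0.5, 0.1):
    print(ss, count_node_types(10, ss, rng))
```

With k sampled leaves there are always k-1 paired nodes, while the number of singleton nodes grows as the sample becomes sparser, which is why the two accuracy measurements diverge at low sampling proportions.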

The RMP method with a high propm is stringent in pairing nodes, while the NBJ method with a low propm encourages the pairing of nodes. Therefore, when most nodes are singleton nodes, the RMP method yields higher accuracy than the NBJ method, and when most nodes are paired nodes, the NBJ method generally outperforms the RMP method.

The effect of mutation rate

Control of the mutation rate (or Cas9 cutting rate) is one of the key factors that affect the accuracy of cell lineage reconstruction. We pick barcode length \(n=300\) and generation level 10, and run the simulation for a variety of mutation rates (mupb) and sampling proportions. Both the RMP and NBJ methods are applied. The accuracy of the rebuilt lineage tree, in terms of the percentage of all internal nodes that are matched and the percentage of paired nodes that are matched, is illustrated in Fig. 7 for the RMP method and the NBJ method.

Figure 7

Whole tree accuracy with different mutation rates. (A) Lineage accuracy using RMP method with \(propm=0.85\); (B) Lineage accuracy using NBJ method with \(propm=0.4\).

For the RMP method in Fig. 7A, when the sampling size ss is large, most sampled internal nodes are dividing/paired nodes, and the two percentages, all matched internal nodes vs matched paired nodes (see Fig. 23), are very close. In this case, as mupb increases, both percentages increase and then flatten out. When ss is small, most internal nodes are singleton nodes, and there is a big difference between the two percentages: the percentage of all matched internal nodes is high, while the percentage of matched paired nodes is low. Figure 7A also shows that for \(ss=0.5\), the percentage of matched paired nodes reaches a local maximum at mupb around 0.06.

The NBJ method with a low propm encourages the pairing of two barcodes. Figure 7B shows that the accuracy of paired nodes stays stable as the mutation rate increases. The situation is different for all internal nodes. When the mutation rate is low and the sampling size is small, the percentage of all nodes matched is low; as the mutation rate increases, this accuracy increases quickly, and then flattens out when the mutation rate is very high. The reasons are as follows: a low mutation rate does not generate much variation in the barcodes, so it becomes harder to distinguish barcodes that belong to different clades; a small sampling size results in more singleton nodes in the lineage tree, which is sparse; and the NBJ method with a low propm encourages the pairing of two barcodes. As a result, many singleton nodes are paired incorrectly, which lowers the percentage of matched nodes. As the mutation rate increases, more variation is created in the barcodes, which prevents them from being incorrectly paired, and the accuracy improves greatly. The trade-off between all matched nodes and paired nodes is observed in both Fig. 7A, B.

Furthermore, when mutation rate is high, lineage accuracy tends to flatten out or slightly decrease. To see the combined effect of barcode length and mutation rate, we run RMP and NBJ methods 10 times, respectively, with various barcode lengths (\(n=100, 200, 300, 400\)) and various mutation rates (\(mupb=0.04, 0.08, 0.12, 0.16, 0.2\)). The average full tree (\(ss=1\)) accuracy is provided in Fig. 8. Other parameter settings remain the same as in this section.

Figure 8

Full tree accuracy (\(ss=1\)) with different barcode lengths and different mutation rates. (A) RMP method; (B) NBJ method.

Some interesting findings from Fig. 8 are as follows. When the mutation rate is low, a longer barcode consistently yields higher lineage accuracy. When the mutation rate is high, however, lineage accuracy tends to decrease, and a longer barcode shows no advantage over a shorter one, which is a little surprising. The first and more obvious reason is that a higher mutation rate brings a higher chance of large segment deletions, which wipe out lineage information, and a longer barcode tends to have longer segment deletions. The second reason, which appears to be deeper, is the following. In many widely adopted simulation models, each target in the barcode is assumed to mutate at most once. This is a strong and idealized assumption, because it implies that the lineage information, which is accurately contained in these mutated targets, is retained throughout the whole editing process. A longer barcode, which contains more lineage information, will then certainly yield higher accuracy in the rebuilt lineage. In the purely stochastic model, however, Cas9 editing proceeds continuously and randomly, and newly mutated nucleotides may mutate again later on. As a result, lineage information contained in mutated nucleotides may be wiped out by further mutations. Moreover, these further mutations could alter the previously established lineage information, and consequently promote mismatches in rebuilding the lineage. To the best of our knowledge, this phenomenon has not been addressed in other related work.

The effect of matching proportion in pairing

The parameter propm, as introduced in sections "Pairwise alignment and barcodes pairing—RMP method" and "Neighbor joining method (NBJ) on barcodes pairing", controls the pairing of two barcodes. A high value of propm sets a high bar for pairing two barcodes, because they must match on a large proportion of nonzero nucleotides, while a low value of propm allows two barcodes to be paired even if only a few nonzero nucleotides match. The lineage tree accuracy for high and low propm at mutation rate \(mupb=0.1\) is compared in Fig. 9A for the RMP method, and in Fig. 9B for the NBJ method.
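The role of propm as a pairing threshold can be sketched as follows. This is only an illustration under stated assumptions, not the pairing rule defined in the methods sections: the two barcodes are assumed to be already aligned to equal length, empty (zero) positions are written as "-", and the matching proportion is taken over positions where at least one of the two barcodes carries a nucleotide. The function names are hypothetical.

```python
def matching_proportion(aligned_a, aligned_b, empty="-"):
    """Fraction of non-empty aligned positions at which the two barcodes agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("barcodes must be aligned to the same length")
    # Assumption: positions that are empty in both barcodes carry no information.
    informative = [(x, y) for x, y in zip(aligned_a, aligned_b)
                   if x != empty or y != empty]
    if not informative:
        return 0.0
    matches = sum(1 for x, y in informative if x == y)
    return matches / len(informative)


def should_pair(aligned_a, aligned_b, propm):
    """Pair two barcodes only if their matching proportion reaches the threshold propm."""
    return matching_proportion(aligned_a, aligned_b) >= propm


a, b = "AC-GT", "AC-GA"
print(matching_proportion(a, b))      # 3 of 4 informative positions agree
print(should_pair(a, b, propm=0.4))   # a low threshold encourages pairing
print(should_pair(a, b, propm=0.85))  # a high threshold rejects this pair
```

The same pair of barcodes is accepted at \(propm=0.4\) and rejected at \(propm=0.85\), which is the behavior that drives the trade-off between singleton and paired nodes discussed in this section.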

Figure 9

Whole tree accuracy with different matching proportions. A: all internal nodes, P: paired nodes, L: low \(propm=0.4\), H: high \(propm=0.85\). (A) RMP Method; (B) NBJ Method.

For RMP method, Fig. 9A shows that the setting \(propm=0.85\) yields higher accuracy than that of \(propm=0.4\) for all internal nodes. But when the sample size is small, lower propm yields a better result on paired nodes than that of higher propm. Recall that lower propm means the pairing of two barcodes is encouraged.

For the NBJ method, when the sampling size (ss) is high, high and low propm have similar matching accuracies. However, when ss is small, high propm yields higher accuracy on all internal nodes, but low propm yields higher accuracy on paired nodes. Therefore, there appears to be a trade-off between matched singleton nodes and paired nodes. If ss is small, the lineage tree is ‘sparse’ in that most nodes are singleton nodes, and the few paired nodes are sparsely distributed in the tree, hence hard to pair correctly. An algorithm with a given parameter setting may match more paired nodes but mismatch many singleton nodes, and vice versa. It is therefore important to choose the algorithm and parameter settings properly in order to balance this trade-off.

Comparison of RMP and NBJ methods

The results of section "The effect of matching proportion in pairing" may be combined to compare the performances of RMP method to NBJ method. As an illustration, we first choose the same \(propm=0.4\), \(mupb=0.1\), for both RMP and NBJ methods, and the accuracy comparison is shown in Fig. 10A, B.

Figure 10

Comparison of RMP and NBJ methods. (A) Accuracy of all internal nodes, same \(propm=0.4\) for both RMP and NBJ methods. (B) Accuracy of paired nodes, same \(propm=0.4\) for both RMP and NBJ methods. (C) Accuracy of all internal nodes, \(propm=0.85\) for RMP and \(propm=0.4\) for NBJ. (D) Accuracy of paired nodes, \(propm=0.85\) for RMP and \(propm=0.4\) for NBJ.

Recall that low propm indicates encouragement of barcode pairing, and from Fig. 10A, B it is seen that under this condition, NBJ method outperforms RMP method. From the result in section "The effect of matching proportion in pairing" it appears that RMP method performs better when \(propm=0.85\), so we compare the result of RMP method with \(propm=0.85\) and the result of NBJ method with \(propm=0.4\), as shown in Fig. 10C, D.

The sub-graph of Fig. 10C shows the accuracy of all internal nodes of RMP and NBJ methods, respectively, with different sampling sizes (ss). We see that when ss is high, there is no big difference between the accuracy of these two methods. When ss is low, RMP method has higher accuracy than NBJ method. However, we recall that when ss is low, most sampled barcodes are singleton. RMP tends to not pair these barcodes, so it successfully constructs lineage for most singleton barcodes, but misses those that should be paired.

The sub-graph of Fig. 10D shows accuracy on paired/dividing internal nodes. Again when ss is large, RMP and NBJ do not show much difference in accuracy of paired/dividing nodes. However, when ss is small, we see that NBJ outperforms RMP, that is, when ss is small, NBJ correctly pairs more barcodes than RMP.

The effect of pulse induction

In the work of Bowling et al25, pulse induction of Doxycycline is implemented to control the barcode mutation. To test the effect of pulse induction on the accuracy of the rebuilt lineage tree, we alternate mupb between a set rate and a base value (0.005) as barcode editing continues. The ’No-Dox’ line in Fig. 1G in25 indicates that even without Dox induction, a small percentage of edited alleles is observed. We therefore set mupb to a very low rate to represent the scenario of no Dox application. For Dox induction, we tested set rates of 0.05 and 0.1. Other parameter settings are the same as in section "The effect of barcode length and sampling proportion" so that the results can be compared. The results of the RMP method are shown in Fig. 11, and the results of the NBJ method are shown in Fig. 12.
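The alternating schedule described above can be sketched as follows. This is only an illustration: the text does not state how many generations each pulse lasts or whether editing starts with Dox on, so the function name `pulse_schedule` and the parameters `period` and `start_induced` are hypothetical.

```python
def pulse_schedule(generations, set_rate, base_rate=0.005, period=1,
                   start_induced=True):
    """Return one cutting rate (mupb) per generation under pulse induction.

    Assumption (not from the source): induction alternates in blocks of
    `period` generations between `set_rate` (Dox on) and `base_rate`
    (Dox off, background Cas9 activity).
    """
    schedule = []
    for g in range(generations):
        block = g // period                      # index of the current block
        induced = (block % 2 == 0) == start_induced
        schedule.append(set_rate if induced else base_rate)
    return schedule


constant = [0.1] * 10              # constant induction, for comparison
pulsed = pulse_schedule(10, 0.1)   # alternates between 0.1 and 0.005
```

Passing such a list to the editing step, one rate per generation, is one way to compare constant and pulse induction under otherwise identical settings.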

Figure 11

Effect of Pulse Induction, RMP Method, \(propm =0.85\). (A) Lineage accuracy with \(mupb=0.05\). (B) Lineage accuracy with \(mupb=0.1\).

Figure 12

Effect of Pulse Induction, NBJ Method, \(propm =0.4\). (A) Lineage accuracy with \(mupb=0.05\). (B) Lineage accuracy with \(mupb=0.1\).

From the simulation results of RMP method in Fig. 11 it is seen that when sampling size is large, pulse Dox induction yields higher accuracy than that of constant induction. When the sampling proportion is low and mupb is low, as in Fig. 11A, constant Dox induction yields higher accuracy on internal nodes, though in this situation most of the internal nodes in the sample are singleton nodes. However, when mupb is high as in Fig. 11B, pulse Dox induction generally yields better lineage accuracy than constant induction, especially for paired nodes.

Similar observations are seen for NBJ method in Fig. 12, i.e., when sampling size is big and mupb is high, pulse induction outperforms constant induction. One difference is that when sampling size is small, constant Dox induction yields better accuracy than pulse induction on both the internal nodes and paired nodes.

Comparing the results in Figs. 11 and 12, it appears that RMP method with \(propm=0.85\) and pulse induction with \(mupb=0.1\) yields the best lineage accuracy, see Fig. 11B.

Effect of Dox induction at specific time

For successful lineage tracing with molecular barcodes, it is important to have sufficient diversity and randomness in the barcodes. However, the cellular DSB repair system has a bias towards large sequence deletions. To prevent the loss of lineage information, it is recommended to use an intermediate dosage of Dox (25ug/g mice) with the CARLIN model, as suggested by Bowling et al25. Despite this recommendation, there is a lack of justification for this experimental setting, and optimizing the usage conditions for CARLIN in mice remains a costly and variable process.

We conducted a series of simulations to evaluate the efficacy of paused intermediate Dox inductions for the CARLIN system. The simulations were performed using a set of parameters including a barcode length of 276, propm of 0.95, a sample size of 0.1, large deletion probability of 0.3, divp of 0.99, and clive of 0.99, which accounts for the possibility of cell death and non-dividing. Mutations were introduced at generation numbers 4, 6, and/or 8, with varying rates. The simulation was run 10 times for each setting, and the average accuracy was calculated. To evaluate the accuracy of the rebuilt lineage tree, we calculated the percentages of paired nodes and internal nodes for three scenarios: one generation above the leaf level, two generations above the leaf level, and the entire tree. The focus of our analysis was on tracing back a limited number of generations, as real experiments often involve single time point sampling from animal tissues, allowing for the reconstruction of only the most recent one or a few generations. Both RMP and NBJ methods were employed, and accuracy was calculated using either all internal nodes or just paired nodes.

Figure 13

Effect of Dox induction at specific time. (A) and (B): accuracy is calculated using the RMP method under different induction patterns and mock Dox concentrations; the accuracy curves are plotted in Fig. S2. The average accuracy of each curve is calculated and summarized here for comparison. (C) and (D): accuracy is calculated using the NBJ method.

Supplementary Figure S2 presents the results of simulations performed across multiple induction patterns, various Dox dosages (mupb), and both lineage reconstruction methods. The induction time points are highlighted in red on the x-axis, and the 7 Dox dosages are represented by distinctly colored lines. In Fig. 13, we summarize the average accuracy for each mock Dox concentration (mupb) over the 10 cell divisions (d3-d12). Comparing the accuracy calculated on paired nodes with that on all internal nodes shows that in the initial 3 cell divisions the majority of the leaves are singletons: the initial accuracy on all internal nodes is very high, while it is extremely low on paired nodes. These results suggest that Dox induction is unnecessary for lineage tracing at very early stages: irrespective of the induction pattern and mupb value, background Cas9 activity can generate moderate barcode diversity. We also found that accuracy is higher when CARLIN is used to trace back fewer generations. Furthermore, among the induction patterns tested, the double paused pattern has the highest accuracy under the same conditions. In addition, the NBJ method showed an advantage over the RMP method: with NBJ, intermediate Dox dosages reached accuracy much closer to that of high Dox dosages than they did with RMP.

Effect of non-filtering by root

While reconstructing the parent node from children cells, a natural idea is to use available information - such as root barcode - as much as possible. This is how the filtering step was introduced, as illustrated in Fig. 21C. This filtering step will make internal barcodes gradually converge to root while tracing backward. To test whether filtering improves lineage accuracy, we compare the results of NBJ and NBJNF, with and without pulse Dox induction. Barcode length is set at \(n=100\)(bp) and mutation rate is set at \(mupb=0.1\). Other parameter settings are the same as that of section "The effect of barcode length and sampling proportion". propm increases from 0.4 to 0.9, and sampling size takes values \(ss=1,0.8,0.5,0.1\). For each setting the program is run 10 times and the average is calculated. The result with pulse Dox induction is shown in Fig. 14, and the result with constant Dox induction is shown in Figure S3.

Figure 14

Comparison between NBJ and NBJNF with pulse Dox induction. (A–D): accuracy of all internal nodes with different sampling size; (E–H): accuracy of all paired nodes with different sampling size.

From both Figs. 14 and S3, it is seen that NBJNF outperforms NBJ in almost all parameter settings, which, in our opinion, is counter-intuitive. We notice from the simulation results that the filtering mechanism blurs the difference among clones; in other words, internal barcodes become more and more similar as the algorithm traces backward in the lineage tree. Consequently, many internal barcodes are paired incorrectly, which reduces lineage accuracy. We also compared the RMP and non-filtering RMP methods with and without pulse Dox induction (Figs. S4 and S5), where we see that in most cases the non-filtering RMP method outperforms the RMP method. In conclusion, it is better not to perform the filtering with root (shown in Fig. 21C) while rebuilding the parent node; the lineage accuracy is higher without it.

Effect of adding a second barcode

As introduced in section "Adding a second barcode", we test the lineage accuracy with two barcodes for a grid of propm and propmi. Other parameter settings are similar to that of section "Effect of non-filtering by root", except that there are two barcodes that are being edited independently as cells divide. Both the NBJ and NBJNF methods are applied, with and without pulse Dox induction. The simulation is run 10 times for each setting and the average accuracy is calculated. The results of NBJ method are shown in Figs. S6 and S7; the result of NBJNF method with constant Dox induction is shown in Figure S8. Figure 15 shows the result of NBJNF method with pulse Dox induction.

Figure 15

Lineage accuracy with 2 Independent Barcodes using NBJNF Method and Pulse Dox Induction. (A–D) accuracy of all internal nodes with different sampling size; (E–H) accuracy of all paired nodes with different sampling size.

The accuracy of paired nodes tells us when two children cells are generated from a common parent, so we focus mostly on the accuracy of paired nodes. Comparing Figure S6 with Figure S7, and Figure S8 with Fig. 15, we see that pulse Dox induction results in higher accuracy of paired nodes. Comparing Figure S6 with Figure S8, and Figure S7 with Fig. 15, we see that the NBJNF method generally outperforms the NBJ method, which agrees with the findings in section "Effect of non-filtering by root".

Next, comparing Figure S6 with Figure S3, Figure S8 with Figure S3, Figure S7 with Fig. 14, and Fig. 15 with Fig. 14, we see that adding a second barcode improves the lineage accuracy. We therefore anticipate that adding more barcodes will improve lineage accuracy, provided that this additional information is used properly and a good lineage reconstruction algorithm is applied.

In our NBJ and NBJNF algorithms on double barcodes, choosing the parameters propm and propmi appears to be tricky. The simulation results show that the propm-propmi combination at which the highest accuracy occurs also depends on other factors such as the mode of Dox application (constant vs. pulse), the sampling size (ss), and the nodes of interest (all internal nodes or paired nodes only).

Effect of changing indel probabilities

As introduced in section "Test on RMP and NBJ Methods", the use of Cas9-TdT in35 results in fewer deletions but twice the insertion events per allele than Cas9 expression, with all four nucleotides well-represented in the inserted sequences. We therefore test the effect of changing indel probabilities on lineage tracing. These indel probabilities are chosen to be the same as in section "Test on RMP and NBJ Methods": the probability of perfect repair is 0.5; the probabilities of inserting 1, 2, and 3 nucleotides are 0.15, 0.1, and 0.05, respectively; the probability of a substitution is 0.1; the probability of a single nucleotide deletion is 0.1; and the probability of a large segment deletion is 0.1. We call this change of indel probabilities the simulation with Cas9-TdT. Barcode length is set at \(n=100\) and the generation number is 10. Two independent barcodes are used, and the NBJ and NBJNF methods are performed with and without pulse Dox induction. The simulation is run 10 times for each parameter setting. The average lineage accuracy using the NBJ method is shown in Figures S9 (constant Dox induction) and S10 (pulse Dox induction); the result of the NBJNF method is shown in Fig. 16 (constant Dox induction) and Fig. 17 (pulse Dox induction).
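As a quick check on these settings, the six site-level probabilities listed above sum to one, which is consistent with treating the large segment deletion probability as a separate parameter. The sketch below encodes the values and draws repair outcomes from them; the dictionary keys and the helper `sample_repair_events` are illustrative names, not identifiers from the simulation program.

```python
import random

# Site-level repair outcomes for the Cas9-TdT setting, values as listed in the text.
CAS9_TDT_INS_SUB = {
    "perfect_repair": 0.5,
    "insert_1": 0.15,
    "insert_2": 0.1,
    "insert_3": 0.05,
    "substitution": 0.1,
    "single_deletion": 0.1,
}
# Large segment deletion is kept separate here (assumption: it plays the role
# of a standalone parameter and is not part of the site-level vector).
CAS9_TDT_LGDELPROB = 0.1


def sample_repair_events(k, probs=CAS9_TDT_INS_SUB, rng=None):
    """Draw k independent repair outcomes according to `probs`."""
    rng = rng or random.Random()
    events = list(probs)
    weights = [probs[e] for e in events]
    return rng.choices(events, weights=weights, k=k)


total = sum(CAS9_TDT_INS_SUB.values())             # should be 1 up to rounding
draws = sample_repair_events(1000, rng=random.Random(0))
```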

Figure 16

Lineage accuracy with 2 Independent Barcodes using NBJNF Method and Cas9-TdT. (A–D) accuracy of all internal nodes with different sampling size; (E–H) accuracy of all paired nodes with different sampling size.

Figure 17

Lineage accuracy with 2 Independent Barcodes using NBJNF Method and Cas9-TdT with Pulse Dox Induction. (A–D) accuracy of all internal nodes with different sampling size; (E–H) accuracy of all paired nodes with different sampling size.

Comparing the results under the new indel probabilities (which favor insertion) with the results of the original settings (as in section "Effect of adding a second barcode"), we see that the new indel probabilities substantially improve lineage accuracy. As a further examination, we plot the maximum accuracy over the 10 runs for the NBJNF method with pulse Dox induction in Fig. 18, where the maximum accuracy exceeds 75% for some parameter settings.

Figure 18

Maximum lineage accuracy of 10 runs with 2 Independent Barcodes using NBJNF Method and Cas9-TdT with Pulse Dox Induction. (A–D) accuracy of all internal nodes with different sampling size; (E–H) accuracy of all paired nodes with different sampling size.

Furthermore, we observe once again that the results with pulse Dox induction generally yield better accuracy than that of constant induction, and the NBJNF method outperforms NBJ method in the accuracy of lineage reconstruction. Our understanding is that the new indel probabilities, that favor insertion and inhibit large deletion, promote the diversity/differentiation of alleles as the editing process proceeds, yet filtering by the root barcode acts against this differentiation. As a conclusion, the root barcode information should be discarded while rebuilding the lineage tree.

Discussion

We were inspired by experimental results (see Fig 1C in CARLIN model25) to design this stochastic simulation scheme in order to reflect the randomness in CRISPR-Cas9 editing and DSB repair. Our findings from the simulations are summarized as follows.

  • When aligning two barcodes, we find that a dynamic programming algorithm that rewards consecutive matches better recovers the structure of the mutated barcodes, which is shaped by large segment deletions.

  • The average matching score decreases approximately geometrically as cell division and barcode mutation continue, and the decay rate is controlled by mutation probability due to Dox induction. The standard deviation of the matching scores also decreases as generation number increases.

  • Large segment deletions in early generations could greatly compromise lineage accuracy. It is better to restrain Dox induction to reduce the chance of large segment deletions in early generations.

  • A properly chosen barcode length and a moderate mutation rate yield higher lineage accuracy. A very long barcode with a high mutation rate does not necessarily improve lineage accuracy; on the contrary, lineage accuracy may even decrease.

  • The sampling proportion of the leaf barcodes has a major influence on the lineage tree comparison. If most leaf nodes are sampled, the lineage tree is near a full binomial tree, and most internal nodes are paired nodes; if very few leaf nodes are sampled, the lineage tree becomes a ‘sparse’ tree and most internal nodes are singleton nodes.

  • When both paired and singleton nodes exist in the lineage tree, i.e., sampling proportion is relatively small, there appears to be a trade-off between the accuracy of paired nodes and accuracy of singleton nodes. If an algorithm encourages pairing of two barcodes, the accuracy of paired nodes may increase but the accuracy of singleton nodes will decrease, and vice versa.

  • The mutation rate controlled by Dox induction also plays an important role. If the mutation rate is very low, it does not generate much variation in the leaf barcodes, and therefore many nodes cannot be matched correctly. As the mutation rate increases, it generates a reasonable amount of variation in the leaf barcodes, and this information is used to successfully match more nodes in the reconstructed lineage tree. However, if the mutation rate is too high, the accuracy does not increase, and it even decreases for some parameter settings. A high mutation rate causes more large segment deletions and alters lineage information that was already established by previous mutations. As a consequence, more mismatches could occur while rebuilding the lineage.

  • The accuracy of pulse induction appears to be sensitive to the sampling size. If the sampling size is large, pulse induction yields better accuracy; if the sampling size is small, constant induction generally yields slightly better results.

  • When comparing the CARLIN performance in tracing back different generations, the accuracy is higher when tracing back fewer generations. Furthermore, among the induction patterns tested, the double paused pattern has the highest accuracy under the same conditions. In addition, the NBJ method showed an advantage over the RMP method: with NBJ, intermediate Dox dosages reached accuracy much closer to that of high Dox dosages than they did with RMP.

  • While reconstructing the parent node from children cells, not performing the filtering with root barcode results in higher lineage accuracy.

  • Adding a second independent barcode will bring in additional information, and if this information is used properly, the lineage tracing accuracy will be improved. However, so far we do not observe significant increase on lineage accuracy by adding a second barcode while using the same indel probabilities.

  • Changing indel probabilities would substantially affect lineage accuracy. If the barcode editing mechanism favors nucleotide insertion while inhibiting large segment deletion, the lineage tracing accuracy could be significantly improved.

In a recent work, Choi et al38 designed a genome editing system called DNA typewriter, in which the recording medium (DNA tape) consists of a sequence of partial CRISPR-Cas9 targets. At the beginning of genome editing only the first target is editable. Each of the many prime editing gRNAs (pegRNAs) is designed to mediate the insertion of a k-mer consisting of a pegRNA-specific barcode segment and a 3-bp key which, once inserted, automatically concludes the current target and activates the next. Because genome editing is sequential in this scheme, the temporal order of recorded events can be read out from their physical order along the edited DNA tape. This temporal order can improve lineage resolution by reducing ambiguity about the order in which editing events occurred, and the scheme eschews DSBs, thereby minimizing the risk of inter-target deletion. This idea of lineage reconstruction using DNA typewriter initiates a new field of lineage tracing with higher resolution. It is promising to incorporate the ordered recording of the DNA tape into lineage reconstruction in our simulation program because it helps determine the temporal order in which to merge/pair the clones; thus, higher accuracy is expected. In a further extension we plan to replace the stochastic barcode with a DNA tape and simulate sequential editing events on the tape according to observations from actual experiments. In the lineage reconstruction module, we plan to implement a new algorithm that efficiently uses the order of recorded events on the DNA tape, and then compare the rebuilt tree to the original one and calculate accuracy.

Method

In this section we introduce the details of the program that simulates the evolution of barcodes and the cell DSB repair mechanism induced by CRISPR-Cas9.

Cell division with evolving barcodes

Starting from a root cell with an original barcode, we assume that in each generation a cell divides into two children cells, so the entire evolution process is modeled as a binomial tree. The barcode of the parent cell is inherited by the children cells with random mutations caused by the DNA DSB repair mechanism. Doxycycline (Dox) is used to induce CRISPR-Cas9 editing, which creates double-strand DNA breaks; the cell's repair of these breaks produces a variety of barcode alterations, such as large segment deletions, nucleotide insertions, and substitutions.

A probability mupb is assigned as the Cas9 cutting rate that simulates the introduction of Dox, whose concentration can be controlled. Suppose the parent barcode has n base pairs (bp), and for each bp there is a probability mupb that a cut (DNA break) occurs. Then the total number of cuts follows a binomial distribution \(b(n,mupb)\). If the total number of cuts is less than or equal to one, there is no large deletion in the barcode; if the number of cuts is greater than one, we assign a probability lgdelprob that a large deletion occurs. The actual probability that a large deletion occurs is calculated as

$$\begin{aligned} &lgdelprob\cdot P(\text {at least two cuts})\\ &\quad =lgdelprob\cdot (1-P(\text {no cut})-P(\text {only one cut}))\\ &\quad =lgdelprob\cdot (1-(1-mupb)^n-n\cdot mupb\cdot (1-mupb)^{n-1}). \end{aligned}$$
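The closed form above can be checked numerically. The sketch below evaluates it and compares it with a direct sum over the binomial distribution; the function names are illustrative, and the example values (\(n=100\), \(mupb=0.1\), \(lgdelprob=0.3\)) are parameter values that appear elsewhere in the text.

```python
from math import comb


def large_deletion_probability(n, mupb, lgdelprob):
    """Closed form: lgdelprob * P(at least two cuts among n sites)."""
    p_no_cut = (1 - mupb) ** n
    p_one_cut = n * mupb * (1 - mupb) ** (n - 1)
    return lgdelprob * (1 - p_no_cut - p_one_cut)


def large_deletion_probability_bruteforce(n, mupb, lgdelprob):
    """Same quantity, summing the binomial pmf over k = 2..n cuts."""
    p_at_least_two = sum(
        comb(n, k) * mupb ** k * (1 - mupb) ** (n - k) for k in range(2, n + 1)
    )
    return lgdelprob * p_at_least_two


p_closed = large_deletion_probability(100, 0.1, 0.3)
p_brute = large_deletion_probability_bruteforce(100, 0.1, 0.3)
```

For a barcode of 100 bp with \(mupb=0.1\), at least two cuts are almost certain, so the per-division large deletion probability is nearly equal to lgdelprob itself.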

Whenever a large deletion occurs, two of the cut sites are randomly chosen and the bps in-between are deleted. For the remaining cut sites, if any, we assign probabilities for each of the following events, respectively: perfect repair, insertion of one nucleotide, insertion of two nucleotides, insertion of three nucleotides, substitution, and single nucleotide deletion. Each inserted nucleotide is randomly chosen among ACGT, and for a substitution a letter other than the original nucleotide is randomly chosen. These probabilities are stored in a vector \(ins\_sub\). Figure 19 shows a simplified example of various barcode editing events on a segment of one strand. To the best of our knowledge, the exact mechanism by which a large deletion occurs and the associated parameters are unknown. In the current version of the simulation program, Cas9 nuclease activity is stochastic. In a future extension, we will replace the stochastic barcode with an experimentally designed barcode and assign nucleotide-dependent cutting probabilities to mimic the hotspots of nuclease loci.
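One editing round as described above can be sketched as follows. This is a simplified illustration rather than the simulation program itself: the barcode is a plain string without the padded layout, the deleted segment is assumed to include both chosen cut sites, and insertions are assumed to be placed immediately after the cut position. The names `edit_barcode`, `EVENTS`, and `EXAMPLE_INS_SUB` are hypothetical; the example probabilities are the Cas9-TdT values quoted in the text.

```python
import random

NUCLEOTIDES = "ACGT"
# Order of events in the ins_sub vector, following the list in the text.
EVENTS = ["perfect", "ins1", "ins2", "ins3", "sub", "del1"]
EXAMPLE_INS_SUB = [0.5, 0.15, 0.1, 0.05, 0.1, 0.1]  # Cas9-TdT values from the text


def edit_barcode(barcode, mupb, lgdelprob, ins_sub, rng):
    """Apply one round of stochastic cutting and repair to `barcode`."""
    n = len(barcode)
    slots = list(barcode)  # output text contributed by each original position
    cuts = [i for i in range(n) if rng.random() < mupb]

    # Large deletion: only possible with at least two cuts.
    if len(cuts) >= 2 and rng.random() < lgdelprob:
        a, b = sorted(rng.sample(cuts, 2))
        for i in range(a, b + 1):        # assumption: both end sites are removed
            slots[i] = ""
        cuts = [i for i in cuts if i < a or i > b]

    # Remaining cut sites are repaired independently.
    for i in cuts:
        event = rng.choices(EVENTS, weights=ins_sub, k=1)[0]
        if event.startswith("ins"):
            k = int(event[3])            # number of inserted nucleotides
            slots[i] += "".join(rng.choice(NUCLEOTIDES) for _ in range(k))
        elif event == "sub":
            slots[i] = rng.choice([c for c in NUCLEOTIDES if c != barcode[i]])
        elif event == "del1":
            slots[i] = ""
        # "perfect": the site is left unchanged
    return "".join(slots)


rng = random.Random(0)
root = "ACGT" * 25
left_child = edit_barcode(root, 0.1, 0.3, EXAMPLE_INS_SUB, rng)
right_child = edit_barcode(root, 0.1, 0.3, EXAMPLE_INS_SUB, rng)
```

Recording edits per original position avoids index shifts when several insertions and deletions occur in the same round.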

Figure 19

Barcode editing (A) There are six random cuts on the barcode. (B) Remaining barcode after editing. At cut position 1, there is a single nucleotide mutation/substitution; At cut position 2, there is a single nucleotide deletion; the segment from cut position 3 to 5 is deleted; at cut position 6 there is a single nucleotide insertion.

We tuned the parameters in our model so that the output barcodes mimic the results of actual experiments23,25,36. For example, one may compare the leaf barcode profile in Fig. 20A with Fig. 1C in25 to find a similar pattern of mutations after about 5\(\sim\)6 divisions. We then set a somewhat higher probability for perfect repair at each cutting site, in view of the considerable percentage of unedited alleles in Fig. 1E in25. From Fig. 1D in25, Fig. 3I in23, and Fig. 2A in36, we see that insertion and deletion at a cutting site (not large segment deletion) mostly involve a few nucleotides, say, 1\(\sim\)3 bps, and we set the respective probabilities for these events accordingly.

For each cell division, this procedure is performed twice to generate the barcodes of left child and right child, respectively. In the simulated evolution tree we label the cells in the way as shown in Fig. 20A.

Figure 20

Cell division and evolving barcodes. (A) Cells/barcodes are labeled. (B) Sampled barcodes from the last generation. All intermediate empty entries in the barcodes are collapsed. Barcodes are renumbered, but the order of sampled barcodes is random. (C) Barcode alignment using classical dynamic programming algorithm. Many nucleotides are inserted in spaces which are supposed to be deleted large segments. (D) Barcode alignment using modified algorithm.

Each labeled cell belongs to a certain generation in the tree, which is also referred to as the tree level. The root cell belongs to generation 0, cells 2 and 3 belong to generation one, cells 4,5,6 and 7 belong to generation two, and so forth.
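The labeling above is consistent with the usual heap-style numbering in which the root is cell 1 and cell k has children 2k and 2k+1. The source does not state the root label or the child rule explicitly, so both are assumptions inferred from the generations listed; the helper names are illustrative.

```python
def generation(label):
    """Tree level of a cell: 1 -> 0, 2..3 -> 1, 4..7 -> 2, and so on."""
    return label.bit_length() - 1


def children(label):
    """Labels of the left and right child of a cell."""
    return 2 * label, 2 * label + 1


def parent(label):
    """Label of the parent cell; the root (label 1) has none."""
    if label <= 1:
        raise ValueError("the root cell has no parent")
    return label // 2


def cells_in_generation(g):
    """All labels at tree level g."""
    return range(2 ** g, 2 ** (g + 1))


leaves = cells_in_generation(5)   # leaf labels after 5 rounds of division
```

With this numbering the leaves after 5 divisions carry labels 32 to 63, matching the leaf range quoted for Fig. 20A.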

A sample of the evolving barcodes is shown in Fig. 20A where the original barcode length is 50 with 5 cell division rounds. In simulation each barcode has leading and tail spaces to accommodate nucleotide insertions. Large segment deletions are observed.

Barcode alignment

In an actual experiment, only the original barcode and the leaf barcodes (barcodes of the last generation) are known, and the intermediate barcode information is hard, if not impossible, to extract. Furthermore, the leaf barcodes are all collapsed, see Fig. 20B; that is, all the empty entries in the barcodes in Fig. 20A are missing. We therefore do not know exactly which nucleotides in the barcode have been mutated or which segments have been deleted, and it is necessary to perform barcode alignment to recover, to the greatest extent possible, the structure of the leaf barcodes.

We compare each leaf barcode to the root barcode and perform alignment. Because the intermediate barcodes are not known, the leaf barcodes are renumbered starting from 1 for convenience. Classical DNA alignment, such as the Needleman-Wunsch algorithm, uses dynamic programming to find the best alignment of two DNA sequences according to a scoring scheme. Given two strings s and t with respective lengths ls and lt, the (i, j)th entry of the reward/score matrix, rewm(i, j), is calculated according to the following formula

$$\begin{aligned} rewm(i,j)=&\max \{ rewm(i-1,j-1)+scm(s(i-1),t(j-1)),\\&rewm(i-1,j)+mismat, rewm(i,j-1)+mismat\}, \end{aligned}$$

where \(mismat (<0)\) is a penalty on mismatches. Here mismatch includes cases where two different letters (ACGT) are aligned, or a valid letter corresponding to an empty spot. After the reward matrix is computed, the algorithm traces backward to find the alignment of s and t that has the maximum matching score. We call this method regular alignment for reference.

Experimental results showed that large segment deletion is very common in mutated barcode25. We therefore modify this algorithm to capture this feature, where consecutive matches are more rewarded, while consecutive mismatches are less penalized. The new reward matrix is hereby calculated as follows

$$\begin{aligned} rewm(i,j)=&\max \{rewm(i-1,j-1)+scm(s(i-1),t(j-1))+consmat,\\&rewm(i-1,j)+mismat\cdot (fr)^{mismat_t(i-1,j)},\\&rewm(i,j-1)+mismat\cdot (fr)^{mismat_s(i,j-1)}\}. \end{aligned}$$

In the above expression, \(consmat(>0)\) gives a small reward to consecutive matches, fr (\(0<fr\le 1\)) indicates the fraction of penalty to be applied on mismatches, and \(mismat_s, mismat_t\) are two matrices that record the number of consecutive mismatches. The results of these two algorithms are shown in Fig. 20C, D.
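The two recurrences can be sketched in a single function, since the modified formula reduces to the regular one when \(consmat=0\) and \(fr=1\). The sketch rests on several assumptions that the text does not settle: scm is taken to be +1 for identical letters and -1 otherwise, the counters \(mismat_s\) and \(mismat_t\) are interpreted as the lengths of the current run of gap moves in each direction, the first row and column are filled with the same discounted penalty, and consmat is added on every diagonal move exactly as written in the formula. The function and variable names are illustrative.

```python
def align(s, t, match=1.0, mismatch=-1.0, mismat=-1.0, consmat=0.0, fr=1.0):
    """Global alignment of s and t; returns (score, aligned_s, aligned_t)."""
    ls, lt = len(s), len(t)
    rewm = [[0.0] * (lt + 1) for _ in range(ls + 1)]
    run_t = [[0] * (lt + 1) for _ in range(ls + 1)]  # consecutive vertical gap moves
    run_s = [[0] * (lt + 1) for _ in range(ls + 1)]  # consecutive horizontal gap moves
    move = [[""] * (lt + 1) for _ in range(ls + 1)]

    for i in range(1, ls + 1):                       # first column: gaps only
        rewm[i][0] = rewm[i - 1][0] + mismat * fr ** run_t[i - 1][0]
        run_t[i][0] = run_t[i - 1][0] + 1
        move[i][0] = "U"
    for j in range(1, lt + 1):                       # first row: gaps only
        rewm[0][j] = rewm[0][j - 1] + mismat * fr ** run_s[0][j - 1]
        run_s[0][j] = run_s[0][j - 1] + 1
        move[0][j] = "L"

    for i in range(1, ls + 1):
        for j in range(1, lt + 1):
            scm = match if s[i - 1] == t[j - 1] else mismatch
            diag = rewm[i - 1][j - 1] + scm + consmat
            up = rewm[i - 1][j] + mismat * fr ** run_t[i - 1][j]
            left = rewm[i][j - 1] + mismat * fr ** run_s[i][j - 1]
            best = max(diag, up, left)
            rewm[i][j] = best
            if best == diag:
                move[i][j] = "D"                     # run counters reset to 0
            elif best == up:
                move[i][j] = "U"
                run_t[i][j] = run_t[i - 1][j] + 1
            else:
                move[i][j] = "L"
                run_s[i][j] = run_s[i][j - 1] + 1

    # Trace back from the bottom-right corner; '-' marks an empty spot.
    a, b, i, j = [], [], ls, lt
    while i > 0 or j > 0:
        if move[i][j] == "D":
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif move[i][j] == "U":
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return rewm[ls][lt], "".join(reversed(a)), "".join(reversed(b))


regular = align("ACGT", "AT")             # consmat = 0, fr = 1: regular alignment
modified = align("ACGT", "AT", fr=0.5)    # longer gaps are penalized less
```

In this toy case both settings place the two missing nucleotides in one contiguous gap, but the modified scoring charges less for the second gap position, which is the property that favors large-gap structures in longer barcodes.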

The numbered barcodes/cells (1–32) in Fig. 20C, D are the leaves that correspond to the last 32 barcodes in Fig. 20A (32–63), respectively. From Fig. 20C, D we can see that the classical alignment algorithm may occasionally insert matched nucleotides in places that are supposed to be large gaps/deletions, while our modified algorithm successfully recovers the large-gap structure. To rebuild the lineage tree more accurately, it is critical to recover the structure of the leaf barcodes so that we can tell, with more confidence, which nucleotides in a barcode have been mutated or which segments have been deleted. In section "Comparison of alignment methods" of the simulation results, we compare our alignment methods with the well-established gap penalty algorithm and test their efficacy.

Lineage tree reconstruction

The real cell lineage tree is a binomial tree. In many papers, however, phylogenetic trees that are not of the binomial type are built to approximate it17,34, mainly because the intermediate barcode information, i.e., internal node information, is missing. In this paper we present an algorithm to rebuild the binomial lineage tree and estimate its accuracy. In this procedure we rebuild pseudo internal nodes whose barcodes gradually converge to the root barcode; details are introduced in what follows.

Pairwise alignment and barcodes pairing—RMP method

We start from the leaf barcodes, which is known information, and work backwards. As we mentioned earlier, the leaf barcodes have been renumbered and an example is given in Fig. 20D with generation number 5. When the barcodes are sampled and sequenced, their order becomes random, see Fig. 21A.

Figure 21

Lineage tree reconstruction. (A) Leaf barcodes are sequenced and aligned, and their order is random. (B) Leaf barcodes are paired and parent nodes are rebuilt. (C) Parent barcodes are filtered by the root barcode so that gradually they converge to the root. (D) The rebuilding process continues retrospectively until the root node is reached.

Next we sort the leaf cells in descending order of their respective matching scores to the root, as explained in section "Barcode alignment". Barcodes with higher matching scores to the root preserve more information, so we start from the barcode (call it s) with the highest matching score and search for its pair. In other words, the pairing priority is given to those barcodes that best match the root barcode, and we call this method Root Matching Prioritized (RMP). Next we perform pairwise comparisons/alignments among leaf barcodes using a new scoring matrix as follows:

$$\begin{array}{c|rrrrr} & O & A & C & G & T\\ \hline O & 0.25 & -1.5 & -1.5 & -1.5 & -1.5\\ A & -1.5 & 1 & -2 & -2 & -2\\ C & -1.5 & -2 & 1 & -2 & -2\\ G & -1.5 & -2 & -2 & 1 & -2\\ T & -1.5 & -2 & -2 & -2 & 1 \end{array}$$

In this new scoring matrix the letter O represents an empty nucleotide that has been created by the barcode alignment, as explained in section "Barcode alignment". This alignment recovered the ‘structure’ of the leaf barcodes, and empty entries in the recovered barcodes are part of this structure, so they should not be discarded in the pairwise searching step. In other words, if an empty entry in one (recovered) barcode matches an empty entry in another (recovered) barcode, there is a reward associated with this match; and if an empty entry corresponds to a nonempty entry, the penalty is smaller than that for a mismatch between two nonempty nucleotides.
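The scoring rule can be illustrated with a minimal sketch. It assumes that the two recovered barcodes already have the same length after alignment to the root, so the score is a position-by-position sum; the function names `entry_score` and `pairwise_score` are hypothetical and not taken from the program described here.

```python
EMPTY = "O"  # empty entry created by the alignment to the root

def entry_score(a: str, b: str) -> float:
    """Score one pair of aligned entries with the matrix above."""
    if a == EMPTY and b == EMPTY:
        return 0.25               # empty matches empty: small reward
    if a == EMPTY or b == EMPTY:
        return -1.5               # empty vs nucleotide: milder penalty
    return 1.0 if a == b else -2.0  # nucleotide match / mismatch

def pairwise_score(s: str, t: str) -> float:
    """Assumption: s and t are recovered barcodes of equal length."""
    if len(s) != len(t):
        raise ValueError("recovered barcodes must have equal length")
    return sum(entry_score(a, b) for a, b in zip(s, t))

# Three matches, two empty/empty matches, one mismatch: 3 + 0.5 - 2
print(pairwise_score("ACOOGT", "ACOOGA"))  # 1.5
```

If the pairwise step in practice involves a gapped alignment rather than a position-wise sum, the same matrix would serve as its substitution table.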

Suppose we have found another barcode t that has the highest pairwise matching score to s. We then compare this score to \(propm\cdot nnz(s)\), where nnz(s) is the number of nonzero (nonempty) entries in s and \(propm\in (0,1)\) is a fraction. This step is necessary to avoid pairing two barcodes that consist mostly of empty entries. If the pairwise matching score of s and t is greater than \(propm\cdot nnz(s)\), then s and t are paired and removed from the candidate cohort. Otherwise, the barcode s is left as a singleton barcode, which will be explained in more detail later. This procedure is then repeated on the remaining barcodes until all leaf barcodes have been processed, see Fig. 21B. Because the pairing priority is given to the leaf that is closest to the root, which is not necessarily a member of the closest pair of leaves, we think it is reasonable to pair two leaves cautiously. We therefore suggest choosing \(propm\in [0.7,0.9]\) for the RMP method.
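The RMP pairing loop described above can be sketched as follows. This is only an illustration under stated assumptions: the position-wise `pairwise_score` stands in for both the root alignment score and the leaf-to-leaf score, barcodes are assumed to have equal length, and the names `rmp_pairing` and `nnz` are hypothetical.

```python
EMPTY = "O"

def pairwise_score(s, t):
    """Position-wise score with the matrix above (equal lengths assumed)."""
    def entry(a, b):
        if a == EMPTY and b == EMPTY:
            return 0.25
        if a == EMPTY or b == EMPTY:
            return -1.5
        return 1.0 if a == b else -2.0
    return sum(entry(a, b) for a, b in zip(s, t))

def nnz(barcode):
    """Number of nonempty entries."""
    return sum(1 for c in barcode if c != EMPTY)

def rmp_pairing(leaves, root, propm=0.8):
    """Return (pairs, singletons) as lists of leaf indices."""
    # Leaves that match the root best get pairing priority.
    order = sorted(range(len(leaves)),
                   key=lambda i: pairwise_score(leaves[i], root),
                   reverse=True)
    pairs, singletons = [], []
    while order:
        s = order.pop(0)
        if order:
            # Best partner for s among the remaining candidates.
            t = max(order, key=lambda j: pairwise_score(leaves[s], leaves[j]))
            if pairwise_score(leaves[s], leaves[t]) > propm * nnz(leaves[s]):
                pairs.append((s, t))
                order.remove(t)
                continue
        singletons.append(s)  # no acceptable partner found
    return pairs, singletons

leaves = ["ACGTACGTAO", "TTOOOOOOOO", "ACGTACGTAC", "GGGGOOOOOO"]
pairs, singletons = rmp_pairing(leaves, root="ACGTACGTAC", propm=0.7)
print(pairs, singletons)  # [(2, 0)] [3, 1]
```

In this toy input, leaf 2 equals the root and is processed first; it pairs with leaf 0 (score 7.5 against a threshold of 7), while the two heavily edited leaves stay singletons.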

Neighbor joining method (NBJ) on barcodes pairing

The Neighbor Joining Method (NBJ) differs from RMP in that, after the leaf barcodes have been aligned pairwise, the pairing priority is given to the pair that has the highest pairwise matching score, regardless of the root barcode. The pairing process is as follows: the highest pairwise matching score is found in the matrix of pairwise matching scores, with corresponding barcodes s and t, and if this score is greater than \(propm\cdot \max (nnz(s),nnz(t))\), then s and t are paired and removed from the matrix. Then the next highest pairwise matching score is sought. In each iteration the pair of barcodes that has the highest matching score among the remaining cells is sought, so these barcodes are encouraged to pair, and the parameter propm can therefore be lowered. This process continues until the highest pairwise matching score among the remaining cells falls below its threshold.

In the NBJ method the pairing priority is given to the pair of leaves that are closest to each other, so we believe it is reasonable to encourage pairing in this case, and we therefore suggest choosing \(propm\in [0.3,0.6]\) for the NBJ method.
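A minimal sketch of the NBJ pairing step is given below. It assumes a position-wise score on equal-length barcodes and reads the stopping rule as "stop as soon as the best remaining pair fails its own threshold"; the names `nbj_pairing`, `pairwise_score`, and `nnz` are hypothetical.

```python
from itertools import combinations

EMPTY = "O"

def pairwise_score(s, t):
    """Position-wise score with the matrix above (equal lengths assumed)."""
    def entry(a, b):
        if a == EMPTY and b == EMPTY:
            return 0.25
        if a == EMPTY or b == EMPTY:
            return -1.5
        return 1.0 if a == b else -2.0
    return sum(entry(a, b) for a, b in zip(s, t))

def nnz(barcode):
    return sum(1 for c in barcode if c != EMPTY)

def nbj_pairing(leaves, propm=0.5):
    """Greedily pair the closest leaves; return (pairs, singletons)."""
    # Rank all candidate pairs once, best score first.
    ranked = sorted(
        ((pairwise_score(leaves[i], leaves[j]), i, j)
         for i, j in combinations(range(len(leaves)), 2)),
        reverse=True,
    )
    used, pairs = set(), []
    for score, i, j in ranked:
        if i in used or j in used:
            continue  # one of the two leaves is already paired
        if score <= propm * max(nnz(leaves[i]), nnz(leaves[j])):
            break     # best remaining pair fails the threshold: stop
        pairs.append((i, j))
        used.update((i, j))
    singletons = [i for i in range(len(leaves)) if i not in used]
    return pairs, singletons

leaves = ["ACGTACGTAO", "TTOOOOOOOO", "ACGTACGTAC", "TTGOOOOOOO"]
print(nbj_pairing(leaves, propm=0.5))  # ([(0, 2), (1, 3)], [])
```

Unlike the root-prioritized sketch, no root barcode enters the pairing decision here, which is why the two heavily edited but mutually similar leaves 1 and 3 can still be paired.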

Both RMP and NBJ will need to go through Pairing Validity Check, as explained in section "Pairing validity check". In the simulation part, both methods will be tested.

Pairing validity check

To build parent nodes of generation \(m-1\) from generation m, some leaf barcodes of generation m have been paired and the rest are left as singleton barcodes. Suppose there are n barcodes in generation m and 2k of them are paired; then the remaining \(n-2k\) barcodes are singleton barcodes, and in generation \(m-1\) there are \(k+n-2k=n-k\) parent nodes. A natural constraint is that in each generation, say m, the number of nodes should not exceed \(2^m\). Therefore, in generation \(m-1\) the condition \(n-k\le 2^{m-1}\) should be satisfied. If this constraint is not met, we keep pairing more barcodes in generation m (by lowering the parameter propm introduced in section "Pairwise alignment and barcodes pairing—RMP method") until the condition \(n-k\le 2^{m-1}\) is satisfied. This validity check is performed in each generation as we work backward to reconstruct the lineage tree.
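The validity check can be sketched as a loop around any pairing routine. The step size by which propm is lowered and the lower bound `floor` are assumptions made for illustration, as the text does not specify them; `pair_fn` is a hypothetical callable that returns the list of pairs obtained for a given propm.

```python
def pair_with_validity_check(n, m, pair_fn, propm, step=0.05, floor=0.0):
    """Lower propm until the parent generation fits in 2**(m-1) nodes.

    n       -- number of barcodes in generation m
    pair_fn -- callable propm -> list of (i, j) pairs (hypothetical)
    Returns (pairs, propm actually used).
    """
    while True:
        pairs = pair_fn(propm)
        k = len(pairs)
        # n - k parents: k from pairs, n - 2k from singletons.
        if n - k <= 2 ** (m - 1) or propm <= floor:
            return pairs, propm
        propm = max(floor, propm - step)

# Toy pairing routine: three disjoint candidate pairs with normalized
# scores; a pair is accepted when its score exceeds propm.
candidates = [((0, 1), 0.9), ((2, 3), 0.62), ((4, 5), 0.4)]
toy_pair_fn = lambda p: [pair for pair, score in candidates if score > p]

pairs, used_propm = pair_with_validity_check(n=6, m=3, pair_fn=toy_pair_fn, propm=0.8)
print(pairs, round(used_propm, 2))  # [(0, 1), (2, 3)] 0.6
```

With six barcodes in generation 3, at most four parents are allowed, so at least two pairs are needed; the loop relaxes propm from 0.8 until the second pair is accepted.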

Reconstruct parent node

Suppose the barcodes s and t with generation number m have been paired after the pairwise alignment, and we want to reconstruct their parent barcode. The alignment algorithm has assigned s and t new structures, denoted \(new\_s\) and \(new\_t\), respectively. We compare these two new barcodes entry by entry: if the corresponding entries in \(new\_s\) and \(new\_t\) are equal, we keep this entry; if they are not equal, we randomly pick one of the two with equal probability. The resulting barcode is adopted as a temporary parent barcode, denoted \(t\_par\). Next we compare \(t\_par\) to the root by performing another alignment, followed by another entry by entry comparison: if an entry in \(t\_par\) matches the corresponding entry in the root, it is left unchanged; if it does not match, it is replaced with the entry in the root with probability 1/m, see Fig. 21C. We call this step ‘filtered by the root’. After all the parent barcodes have been reconstructed, we obtain generation \(m-1\). This generation is then taken as the new set of leaf barcodes and the procedure is repeated until we reach the root, see Fig. 21D. In this way, the parent barcodes gradually converge to the root.
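The parent reconstruction can be sketched in a few lines. The sketch assumes that \(new\_s\), \(new\_t\), and the root have already been brought to a common length by the alignment steps, so that entries can be compared by position; the function name `reconstruct_parent` and the `filter_by_root` switch are hypothetical.

```python
import random

def reconstruct_parent(new_s, new_t, root, m, rng, filter_by_root=True):
    """Rebuild a parent barcode from two paired children of generation m.

    Assumption: new_s, new_t and root are aligned to the same length.
    """
    if not (len(new_s) == len(new_t) == len(root)):
        raise ValueError("aligned barcodes must have equal length")
    # Step 1: keep equal entries, otherwise pick one child at random.
    t_par = [a if a == b else rng.choice((a, b))
             for a, b in zip(new_s, new_t)]
    # Step 2 ('filtered by the root'): a mismatch with the root is
    # replaced by the root entry with probability 1/m.
    if filter_by_root:
        t_par = [r if (p != r and rng.random() < 1.0 / m) else p
                 for p, r in zip(t_par, root)]
    return "".join(t_par)

rng = random.Random(0)
# With m = 1 every mismatch is replaced, so the parent equals the root.
print(reconstruct_parent("ACOT", "ACGA", "ACGT", m=1, rng=rng))  # ACGT
```

Setting `filter_by_root=False` gives the non-filtering variant, in which only the first step is applied.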

Filtering vs non-filtering with root barcode

The idea of filtering with the root barcode, as introduced in section "Reconstruct parent node", is to make full use of any available information, such as the root barcode. While rebuilding the parent node, it is therefore natural to use the root barcode as a benchmark and let the internal barcodes converge to it gradually. However, it remains a question whether this filtering, as seen in Fig. 21C, improves the accuracy of the reconstructed phylogenetic tree. In this work, we compare the simulation results of the RMP and NBJ methods with and without filtering with the root, and find that the methods without filtering give slightly higher accuracy, which is somewhat counterintuitive. Details are given in section "Results".

Binomial tree comparison

After all the pseudo internal nodes have been constructed, as explained in section "Reconstruct parent node", we obtain the reconstructed lineage tree, which needs to be compared to the real lineage tree. In the simulation of evolving barcodes, see e.g., Fig. 21A, the real lineage tree is known. The next task is therefore to compare two binomial trees to test the accuracy of the lineage reconstruction. We adopt a method similar to the Robinson-Foulds (RF) method, in which we compare the corresponding clades of two trees. A clade is the set of all leaves that originate from a certain ancestry node (internal node, excluding the root). If a clade from one tree exactly matches a clade from the other tree, we count it as a match of the corresponding ancestry node. An example is shown in Fig. 22, where Fig. 22A is the real lineage, Fig. 22B is the lineage rebuilt from the leaf barcodes and the root barcode, and Fig. 22C illustrates how the accuracy is calculated when the lineage in Fig. 22B is compared to that in Fig. 22A.

Figure 22

Full lineage tree comparison. (A) The real lineage tree is assumed to be a full binomial tree, and the leaf nodes are ordered. (B) The reconstructed lineage tree is compared to the real lineage tree. (C) The percentage of matched internal nodes is calculated.

Recall that leaf barcodes have been sorted in the RMP method according to their respective matching scores to the root. This reordering does not affect the tree comparison in our method. For the example in Fig. 22B, we see, for instance, that node A with clade \(\{5,6\}\), node B with clade \(\{5,6,7,8\}\), and node C with clade \(\{1,2,3,4,5,6,7,8\}\) have exact matches in the lineage tree in Fig. 22A, while node D with clade \(\{9,10,12,13\}\) does not have a match. In the full binomial tree every internal node has two children, and we call such nodes paired nodes. In Fig. 22A there are a total of 14 internal nodes (all of them paired nodes), and 7 of them are matched in Fig. 22B, so the accuracy of the reconstructed lineage in Fig. 22B is \(7/14=50\%\). We point out that this matching criterion is very strict: if two clades differ by just one leaf node, they are not matched. We also point out that in the case of a full binomial tree, the accuracy as calculated in our program (for either internal or paired nodes, which are equal) is exactly the complement of the widely used (normalized) RF distance, in other words, \((accuracy)+(RF\ distance)=1\). This is because each clade in this case corresponds to a unique internal node of the tree.
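The clade comparison for full binomial trees can be illustrated with a small sketch. Trees are represented here as nested tuples with integer leaf labels, which is an assumed representation chosen for brevity; the names `internal_clades` and `clade_accuracy` are hypothetical. The generation condition is not checked in this sketch.

```python
def internal_clades(tree):
    """Clades (frozensets of leaves) of all internal nodes except the root."""
    found = set()

    def visit(node, is_root):
        if isinstance(node, int):          # leaf
            return frozenset([node])
        leaves = frozenset().union(*(visit(child, False) for child in node))
        if not is_root:
            found.add(leaves)
        return leaves

    visit(tree, True)
    return found

def clade_accuracy(real, rebuilt):
    """Fraction of internal nodes of the real tree with an exact clade match."""
    real_clades = internal_clades(real)
    return len(real_clades & internal_clades(rebuilt)) / len(real_clades)

real = (((1, 2), (3, 4)), ((5, 6), (7, 8)))
rebuilt = (((1, 2), (3, 5)), ((4, 6), (7, 8)))  # leaves 4 and 5 swapped
print(clade_accuracy(real, rebuilt))  # 2 of 6 clades match
```

Swapping only two leaves breaks four of the six clades, which shows how strict the exact-match criterion is.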

Fractional sampling

In an actual experiment, only a small sample, rather than all the cells in a whole organ, is sequenced. As a result, the lineage tree will not be a full binomial tree; see, for example, Fig. 23, where half of the leaf barcodes are sampled. In the simulation, the full binomial tree is still generated, but in the last step we collect a random sample of a fraction of the leaf barcodes.

Figure 23

Partial lineage tree comparison. (A) The real lineage tree is not a full binomial tree. (B) Reconstructed lineage tree is compared to the real lineage tree. (C) Percentage of matched internal nodes and matched paired nodes are calculated, respectively.

In Fig. 23A, where the lineage tree is not a full binomial tree, nodes that have two children are again called paired (or dividing) nodes, while nodes that have only one child are called singleton nodes. Recall from section "Pairwise alignment and barcodes pairing—RMP method" that if we cannot find a counterpart for a certain barcode, that barcode is left as a singleton barcode. Tracing back from a singleton barcode is similar to tracing back from paired barcodes, except that the parent barcode is derived directly from its only child before it is filtered by the root barcode, see section "Reconstruct parent node".

Our tree comparison method still applies to partial binomial trees; see, for example, Fig. 23B, which is compared to Fig. 23A. For instance, nodes A, B, and C in Fig. 23B find their respective exact matches in Fig. 23A, and they are all paired nodes. Node D has clade \(\{1,2\}\), but in Fig. 23A node D is a singleton node, so this is a matched singleton node. The same reasoning applies to nodes E, F, and G.

If the sampling proportion ss is small, it is possible that the percentage of matched internal nodes is still high (see, e.g., Fig. 23), even though most of those nodes are singleton nodes. The information on cell association is contained in paired nodes, but not in singleton nodes. Therefore, in our simulation, in addition to calculating the percentage of all matched internal nodes, we also calculate the percentage of matched paired nodes. If the sampling fraction \(ss=100\%\), most internal nodes are paired nodes, except for a few that become singletons because some of their child barcodes are completely cut. Therefore, if ss is close to \(100\%\), the percentage of matched internal nodes and the percentage of matched paired nodes should be very close. However, if \(ss\ll 1\), we expect a considerable difference between these two quantities. Together, the percentage of all matched internal nodes and the percentage of matched paired nodes provide valuable information on the accuracy of the reconstructed lineage tree.

We claim that the matching of two nodes from the two trees, respectively, is strict because the two nodes must belong to the same level/generation and have the same clades. For example, node D in Fig. 23B matches node D in Fig. 23A at the tree level 2, but it does not match the direct parent node of leaf \(\{1,2\}\) in Fig. 23A at level 3, even though they have the same clades. Future work may investigate less strict measures such as partially matched clades, or matched nodes but belonging to different generations.
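The strict matching rule for partial trees can be sketched as follows. The nested-tuple representation (a one-element tuple for a singleton node, integers for leaves) and the names `node_keys` and `accuracies` are assumptions for illustration. The sketch also assumes that an internal-node match requires only the same level and clade, while a paired-node match additionally requires both nodes to have two children.

```python
def node_keys(tree):
    """Map (level, clade) -> number of children, for non-root internal nodes."""
    keys = {}

    def visit(node, level):
        if isinstance(node, int):          # leaf
            return frozenset([node])
        leaves = frozenset().union(*(visit(c, level + 1) for c in node))
        if level > 0:
            keys[(level, leaves)] = len(node)
        return leaves

    visit(tree, 0)
    return keys

def accuracies(real, rebuilt):
    """Return (matched internal fraction, matched paired fraction)."""
    real_keys, new_keys = node_keys(real), node_keys(rebuilt)
    matched = [k for k in real_keys if k in new_keys]
    paired_real = [k for k, n in real_keys.items() if n == 2]
    paired_matched = [k for k in paired_real if new_keys.get(k) == 2]
    paired = (len(paired_matched) / len(paired_real)
              if paired_real else float("nan"))
    return len(matched) / len(real_keys), paired

# Real tree: clade {1, 2} occurs at level 1 (singleton) and level 2 (paired).
real = (((1, 2),), ((3,), (4,)))
rebuilt = (((1,), (2,)), ((3,), (4,)))
print(accuracies(real, rebuilt))  # (0.8, 0.5)
```

In the example, the rebuilt tree splits leaves 1 and 2 one generation too early, so the level-2 node with clade \(\{1,2\}\) is unmatched even though a node with the same clade exists at level 1.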

Handling non-dividing cells and dead cells

To simulate the scenario where some cells do not divide in certain cycles and some cells die at some point of tumor progression, we introduce two parameters, divp and clive, into our program, where divp is the probability that a cell divides in a given cycle, and clive is the probability that a cell survives into the next cycle. Each node/cell in the evolving tree has an indicator that tells whether the cell is alive or dead. All descendants of a dead cell have empty barcodes. If a cell is alive and marked non-dividing at some point, the program generates one valid child barcode for it, in which mutation still occurs, and the other child is empty. In this way, the structure of the full binomial tree of evolution, as seen in Fig. 23A, is preserved. For example, leaf nodes \(\{13,14,15,16\}\) are all empty, which may be due to a dead ancestry node at level 2.

Occasionally a barcode is completely cut due to accumulated mutations, and it is then marked dead. It still generates pseudo children barcodes that are all empty/dead. When we sample or partially sample (as explained in section "Fractional sampling") leaf barcodes in the last generation, we only consider non-empty barcodes.
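The bookkeeping for non-dividing and dead cells can be sketched as below. The order in which survival and division are tested, the use of `None` for an empty barcode, and the placeholder `mutate` function (the identity by default) are assumptions for illustration; the actual editing model is not reproduced here.

```python
import random

def evolve(generations, divp, clive, rng, root="ACGT",
           mutate=lambda barcode, rng: barcode):
    """Return a list of levels; level g holds 2**g entries (None = empty)."""
    levels = [[root]]
    for _ in range(generations):
        nxt = []
        for bc in levels[-1]:
            if bc is None or rng.random() >= clive:
                nxt += [None, None]                    # dead: empty children
            elif rng.random() < divp:
                nxt += [mutate(bc, rng), mutate(bc, rng)]  # cell divides
            else:
                nxt += [mutate(bc, rng), None]         # alive, not dividing
        levels.append(nxt)
    return levels

rng = random.Random(0)
levels = evolve(generations=4, divp=0.8, clive=0.95, rng=rng)
leaves = levels[-1]
# Only non-empty leaf barcodes are eligible for sampling.
non_empty = [i for i, bc in enumerate(leaves) if bc is not None]
print(len(leaves), len(non_empty))
```

Every level keeps its full size of \(2^g\) slots, so the positions of the surviving leaves in the full binomial tree remain well defined.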

Adding a second barcode

The DARLIN mice35 contain three independent target arrays (barcodes), which raises the question of how to reconstruct the lineage tree with multiple barcodes. In our program we add a second barcode that is edited independently. In this case of double barcodes, the RMP method faces a challenge because for many alleles one barcode may match the root well while the other contains many mutations and does not closely match the root, and it is therefore hard to order the cells well. The NBJ method, however, appears to be capable of handling this situation. Given two cells, namely Cell I with barcodes (\(I_1\), \(I_2\)) and Cell II with barcodes (\(II_1\), \(II_2\)), let \(M_1\) be the alignment score between \(I_1\) and \(II_1\), and let \(M_2\) be the alignment score between \(I_2\) and \(II_2\). Set \(M=M_1+M_2\); cells are then paired, starting from the currently highest M, if \(M>propm\cdot \max (nnz(I_1),nnz(I_2),nnz(II_1),nnz(II_2))\). Here we may set propm a bit bigger because M is the total matching score of two barcodes. This round of pairing pairs cells in which both barcodes match well. Among the remaining cells we perform a second round of pairing, in which we search for the largest \(M_1\) and \(M_2\). Suppose \(M_1>M_2\); then these two cells are paired if \(M_1>propmi\cdot \max (nnz(I_1),nnz(II_1))\), where propmi stands for proportion of match - individual (barcode). The case \(M_1<M_2\) is handled in a similar manner. This round of pairing pairs cells in which one of the two barcodes matches well, i.e., \(I_1\) vs \(II_1\) or \(I_2\) vs \(II_2\), while the other contains many mutations and differs considerably due to independent editing events. Choosing propm and propmi so as to maximize lineage accuracy is a delicate task. In simulation we calculate the lineage accuracy for a grid of propm and propmi values and discuss the effect of both parameters.
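The two-round pairing with two barcodes can be sketched as follows. The sketch assumes position-wise scores on equal-length barcodes and that each round stops once the best remaining candidate fails its threshold; the names `pair_two_barcodes` and `_greedy` are hypothetical.

```python
from itertools import combinations

EMPTY = "O"

def pairwise_score(s, t):
    def entry(a, b):
        if a == EMPTY and b == EMPTY:
            return 0.25
        if a == EMPTY or b == EMPTY:
            return -1.5
        return 1.0 if a == b else -2.0
    return sum(entry(a, b) for a, b in zip(s, t))

def nnz(barcode):
    return sum(1 for c in barcode if c != EMPTY)

def _greedy(candidates, used):
    """candidates: (score, threshold, i, j); pair best-first, stop on failure."""
    pairs = []
    for score, threshold, i, j in sorted(candidates, reverse=True):
        if i in used or j in used:
            continue
        if score <= threshold:
            break
        pairs.append((i, j))
        used.update((i, j))
    return pairs

def pair_two_barcodes(cells, propm, propmi):
    """cells: list of (barcode_1, barcode_2). Returns (round1, round2) pairs."""
    idx = list(combinations(range(len(cells)), 2))
    m = {(i, j): (pairwise_score(cells[i][0], cells[j][0]),
                  pairwise_score(cells[i][1], cells[j][1])) for i, j in idx}
    used = set()
    # Round 1: total score M = M1 + M2 against all four barcodes.
    round1 = _greedy(
        ((m1 + m2, propm * max(nnz(b) for b in cells[i] + cells[j]), i, j)
         for (i, j), (m1, m2) in m.items()), used)

    # Round 2: the better of M1 and M2 against that barcode alone.
    def best_single(i, j):
        m1, m2 = m[(i, j)]
        b = 0 if m1 >= m2 else 1
        return (max(m1, m2),
                propmi * max(nnz(cells[i][b]), nnz(cells[j][b])), i, j)

    round2 = _greedy((best_single(i, j) for i, j in idx), used)
    return round1, round2

cells = [("ACGTACGT", "TTGGCCAA"), ("ACGTACGT", "TTGGCCAA"),
         ("ACGTACGO", "OOOOOOAA"), ("ACGTACGO", "TTGGOOOO")]
print(pair_two_barcodes(cells, propm=0.9, propmi=0.8))  # ([(0, 1)], [(2, 3)])
```

Cells 0 and 1 agree on both barcodes and are paired in the first round; cells 2 and 3 agree only on their first barcode and are picked up in the second round.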

Extensions

Our simulation program builds a framework for CRISPR-Cas9 barcode editing and lineage reconstruction, with plenty of flexibility for further extensions. In this paper we generate the barcode purely at random, and we assume that the Cas9 cutting rate is uniform throughout the entire barcode. In our program two alleles are considered identical only if they are exactly the same, bp by bp. For this reason, the chance of obtaining identical alleles in our simulation is small. However, similar alleles may also cause homoplasy. Simulation results suggest that early Dox induction should be restrained to reduce the chance of large segment deletions in early generations, which helps to limit homoplasy. A direct extension is therefore to use a nonuniform cutting rate that depends on the generation number as well as on the type of each nucleotide. In addition, instead of purely random nucleotides, the barcode may consist of several copies of a target of a certain length, for example, 10 copies of a 27 bp target, which gives a barcode of length 270 bp. In a future exploration we will implement this extension based on an actual experimental design and investigate effects such as homoplasy.

Computational count

We take the NBJNF method as an example. Let n be the initial barcode length and let k be the maximum generation number; then in the case of a full binomial tree there are \(2^k\) leaves, and in generation i, \(1\le i \le k\), there are \(2^i\) nodes. Let \(\alpha \in (0,1]\) be a shrinking factor due to barcode editing: because of large segment deletions, we expect barcodes to become shorter and shorter as cells divide, at least on average. A barcode in generation i then has an estimated length of \(\alpha ^i n\). Each pairwise alignment in generation i has a computational count of \(O(\alpha ^{2i}n^2)\), and in generation i there are \(2^i(2^i-1)/2\) alignments. So for generation i the computational count is estimated at \(O(\alpha ^{2i}n^22^{2i-1})\). Summing from \(i=1\) to \(i=k\), where the last generation dominates the sum as long as \(\alpha >1/2\), the total computational count is \(O(2^{2k-1}n^2\alpha ^{2k})\), and if \(\alpha =1\), it is \(O(2^{2k-1}n^2)\).
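The summation step can be written out explicitly. The condition \(\alpha >1/2\), i.e., a ratio \(r=4\alpha ^2>1\), is an assumption added here to make the last term dominate; it is not stated in the estimate above.

$$\begin{aligned} \sum _{i=1}^{k}\frac{2^i(2^i-1)}{2}\,\alpha ^{2i}n^2&\le \frac{n^2}{2}\sum _{i=1}^{k}r^{i}=\frac{n^2}{2}\cdot \frac{r\,(r^{k}-1)}{r-1}\\&\le \frac{r}{r-1}\cdot \frac{n^2}{2}\,r^{k}=\frac{r}{r-1}\,2^{2k-1}n^2\alpha ^{2k},\qquad r=4\alpha ^2>1. \end{aligned}$$

For \(\alpha =1\) the prefactor is \(r/(r-1)=4/3\), which is consistent with the bound \(O(2^{2k-1}n^2)\).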

Conclusions

In this work we designed a program to simulate the evolution of barcodes and DSBs in dividing cells due to CRISPR-Cas9 editing, in which features such as segment deletion, nucleotide insertion, and mutation are all included, and a phylogenetic tree is built as the cells divide. We also designed algorithms (the RMP, RMPNF, NBJ, and NBJNF methods) to rebuild the lineage tree from the information of the leaf nodes and the root node, and compared the result to the original phylogenetic tree to test their accuracy. We also performed simulations and reported results with 2 independent barcodes. In the comparison of two lineage trees we count matched nodes in a strict sense, in that the following two conditions must be satisfied: (1) the two nodes must have exactly the same clade of leaf nodes, and (2) matched nodes must belong to the same generation. In this work the comparison results for both paired nodes and internal nodes are provided. The program is also able to calculate the accuracy of matched internal nodes and paired nodes for each generation. Among the many factors that could affect lineage tracing accuracy, we found that sampling size and indel probabilities are the two major factors. In addition, large segment deletions in early generations could greatly impact lineage accuracy. A larger sampling size, together with indel probabilities that favor insertion while inhibiting deletion, produces the best lineage accuracy. These simulation results provide important guidance for actual experimental design. Furthermore, our program builds a framework for simulating CRISPR-Cas9 barcode editing that is flexible enough to incorporate further extensions. Future work may integrate this program with actual experiments to generate insightful information that improves lineage tracing accuracy.