Abstract
We designed a simulation program that mimics CRISPR-Cas9 editing of an evolving barcode and the double-strand break repair procedure along with cell divisions. Emerging barcode mutations tend to build upon previously existing mutations, occurring sequentially with each generation. This process results in a unique mutation profile in each cell. We sample the barcodes in leaf cells and reconstruct the lineage, comparing it to the original lineage tree to test algorithm accuracy under different parameter settings. Our computational simulations validate the reasonable assumptions deduced from experimental observations, emphasizing that factors such as sampling size, barcode length, multiple barcodes, indel probabilities, and Cas9 activity are critical for accurate and successful lineage tracing. Among these factors, we found that sampling size and indel probabilities are the two that most strongly affect lineage tracing accuracy. Large segment deletions in early generations could greatly impact lineage accuracy. These simulation results offer insightful recommendations for enhancing the design and analysis of Cas9-mediated molecular barcodes in actual experiments.
Introduction
Lineage tracing arose from embryonic fate mapping in developmental biology. Labeling a single cell with heritable markers provides information on progeny population, location, and lineage branching^{1}. The earliest lineage tracing dates back to studies of cell fate determination by observing leech and nematode embryos^{2}. Embryonic lineage analysis became effective and informative later, when cells could be labeled by dyes or radioactive tracers^{3}. However, direct observation under a microscope is infeasible for tracing the somatic cell lineage in larger organisms, such as Drosophila larvae and mammals, whose total number of cells is many orders of magnitude larger. The evolution of genetic studies and molecular cloning techniques in the 1970s expanded the toolbox for cell lineage studies to include somatic mosaicism^{4}, retroviral labeling^{5}, Cre/lox cassette-mediated recombination^{6}, and transposable element insertions^{7}. Technological advances, in turn, support lineage studies of other biological questions and pathological conditions, such as hematopoiesis^{8}, hemostasis^{1}, and cancer cell heterogeneity^{9}. Recent efforts have been made to integrate lineage labeling methods into single-cell sequencing techniques to improve lineage tracing resolution and to characterize comprehensive cellular dynamics in vivo^{10}.
Single-cell RNA sequencing (scRNA-seq) has become the state of the art for cell identification and clustering through integrated transcriptomic profiling^{11}. Although its accompanying trajectory inference analysis gives valuable insights into cell kinetics^{12}, this pseudotemporal ordering has limitations. First, most trajectory analyses subjectively assume fixed start, branching, and end points, yet cells may follow more complex and dynamic differentiation or transformation patterns^{13}. Second, mapping cellular dynamic paths solely on the basis of global gene expression similarities may distort the interpretation of genuine lineage commitment^{14}; analytic bias may be introduced during data preprocessing when particular features, or genes, that are statistically insignificant but biologically essential are filtered out. As a result, dominant populations may obscure lineage branches^{15}. Recent advances in sequencing-based lineage tracing resolve the lineage information across the transcriptional landscape by sequencing heritable nucleic acid barcodes^{16,17}. Parallel advancements and modifications in genome editing tools, such as clustered regularly interspaced short palindromic repeats (CRISPR), have elevated the barcoding capacity by multiplexing editing targets^{18,19,20,21}, engineering guide RNAs^{22,23}, or engineering Cas9 proteins^{24}. CRISPR array repair lineage tracing (CARLIN) is one approach that generates stochastic mutations as the cell repairs Cas9-mediated double-strand breaks (DSBs), while Cas9 activity is controlled by doxycycline induction. The edited CARLIN barcodes are recorded as transcripts and sequenced for lineage tree construction, and standard scRNA-seq analysis independently examines the cellular identities^{25}.
CRISPR-mediated, sequencing-based lineage tracing techniques like CARLIN have been applied to many clonal dynamics studies. While they offer advantages such as high throughput, multiplexing, and compatibility with screening, they bring computational challenges when building the lineage trees. One challenge is estimating the accuracy of the reconstructed lineage tree: very few studies have quantitatively assessed the accuracy of their lineage reconstruction methods^{26}, because ground truth or benchmark lineages are lacking. Yang et al.^{27} used lineage tracing to study tumor initiation and progression. Phylogenies were constructed with Cassiopeia^{28}, and the validity of the lineage reconstruction was estimated from the correlation between phylogenetic and allelic distances across the rebuilt trees. Using an algorithm called STAR Methods, Bowling et al.^{25} built general consensus trees that are not of the binomial type; for example, one node may have 20+ direct children, which does not accurately reflect the cell division mechanism. This complicates accuracy estimation and hinders higher-resolution lineage tracing. The quality of the rebuilt lineage is assessed by comparing a computed pairwise similarity matrix of the tissues to known lineage relationships at tissue-level resolution. Chan et al.^{29} applied two methods, namely biased search and greedy search, to reconstruct lineage trees and determined the best reconstruction by summing the log-likelihoods for all indels that appear in the tree, using likelihoods estimated from embryo data. Quinn et al.^{19} also used the Cassiopeia method^{28} to reconstruct lineage trees, and used the agreement between phylogenetic distance and allelic distance as a measure of the accuracy of the rebuilt lineage.
Technical difficulties also impose challenges on lineage tracing. A frequent challenge is the loss of lineage information, probably because of (i) inefficient barcode capture during scRNA-seq^{25}, (ii) incomplete barcode labeling due to the short editing duration^{30}, and (iii) overwriting of ancestral barcodes in progeny^{23}. In addition, PCR errors in library preparation may introduce artifacts into the barcodes and lead to inaccurate lineage tree reconstruction^{31}. Furthermore, noisy data from oversaturated barcodes make it difficult to separate lineages and cell types and constrain the tracing capacity. This results from over-induction of the barcoding system and poor experimental planning^{16}. Insufficient sampling may also dramatically affect complete lineage construction for rare cell populations^{32}. Without corresponding approaches to overcome these limitations and challenges, the uninformed use of the new generation of lineage tracing techniques may produce misleading outcomes. We reason that detailed simulation before large-scale screening might lead to a better understanding of the limitations and inform experimental design.
Simulations make accuracy estimation of lineage trees possible because the actual lineage trees (reference or benchmark trees) are known, though most work has adopted simplified models. For example, Cassiopeia^{28} assumes that each barcode consists of n targets, where each target can mutate to one of m possible states, with n and m being fixed. Furthermore, once a target has mutated, it can no longer change. A phylogenetic tree is then built according to maximum parsimony. Since the reads of edited barcodes contain a lot of randomness, it remains a challenge to fit a mutated target into one of the m states. Similar models are adopted in several other papers; see, for example,^{33} and^{34}. Zafar et al.^{17} counted the number of unique Cas9 edits that are used to mark distinct barcodes. However, experiments show that during CRISPR-Cas9 barcode editing, nucleotides and nucleotide segments are continually and randomly deleted, mutated, or inserted^{25}. Cas9 edit events are therefore continually changing, and it may be difficult to clearly distinguish edit events, especially when two edits differ by only a few nucleotides.
In this work, unlike most of the simplified models, we design a purely stochastic simulation program that mimics CRISPR-Cas9 editing and the double-strand break (DSB) repair procedure. In this framework, barcodes are cut at random sites at a rate controlled by an inducing reagent (doxycycline), and DSB repair produces each of the following events with a certain probability: deletion of large segments, nucleotide insertion, nucleotide substitution, and nucleotide deletion. After each edit event, the DSB repair mechanism in cells rejoins the barcode pieces, and CRISPR-Cas9 editing repeats. This editing procedure proceeds along with cell divisions. Cells are all labeled so that the “true” lineage tree is known. The simulation program can handle full or partial binomial lineage trees, and can also include nondividing cells and/or dead cells in the tree. We then sample the barcodes at the leaf level of the tree and reconstruct the lineage, which is compared to the “true” lineage tree in order to estimate the accuracy. Simulations with different lineage reconstruction algorithms are carried out under a variety of parameter settings. A simulation with 2 independent barcodes is also performed, and factors that affect lineage tracing accuracy are probed. We believe that high-quality simulation could help determine and optimize the conditions for using the new generation of lineage tracing models like CARLIN^{25} and DARLIN^{35}.
Results
Comparison of alignment methods
Our modified alignment algorithm is compared to the well-established gap penalty alignment. The gap penalty method introduces a cost/penalty (\(gapstart<0\)) whenever the sequence starts a new gap in the alignment, so the reward becomes \(gapstart+mismat\cdot L\), where L is the gap length. This is equivalent to the affine gap penalty algorithm if one rewrites the reward as \((gapstart+mismat)+mismat\cdot (L-1)\). When \(gapstart=0\), this method is identical to the regular alignment. Figure 1 illustrates a sample of alignment results using different algorithms. Figure 1A is obtained in a similar manner as Fig. 20A (bottom), which contains the real structure of the edited leaf barcodes. In an actual experiment with DNA sequencing, this structural information is lost because only the collapsed sequences of barcodes are obtained (see Fig. 20B); in simulation, however, we may use this information as a benchmark to test the efficacy of different alignment algorithms.
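The equivalence between the gap penalty reward and the affine form can be checked numerically. The following is a minimal sketch; the function names and the numeric values of gapstart and mismat are illustrative assumptions and are not taken from the simulation program.

```python
def gap_reward(gapstart: float, mismat: float, length: int) -> float:
    """Reward of one gap spanning `length` positions: gapstart + mismat * L."""
    return gapstart + mismat * length


def affine_gap_reward(gap_open: float, gap_extend: float, length: int) -> float:
    """Affine form: `gap_open` for the first position, `gap_extend` for each further one."""
    return gap_open + gap_extend * (length - 1)


# Illustrative values only (assumed): both are negative rewards, i.e. penalties.
gapstart, mismat = -2.0, -2.0
for gap_len in range(1, 6):
    direct = gap_reward(gapstart, mismat, gap_len)
    affine = affine_gap_reward(gapstart + mismat, mismat, gap_len)
    assert direct == affine  # the two forms agree for every gap length
```

With gapstart set to 0, `gap_reward` reduces to one mismatch penalty per gap position, which is the regular alignment case described above.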
Using the root barcode as the pivot, we align the collapsed leaf barcodes using regular alignment, modified alignment, and gap penalty alignment, in an effort to rebuild the structural information of the edited barcodes (such as Fig. 1A); Fig. 1B–D show a sample for each method, respectively. We set the reward of a match at 1 and the penalty of a mismatch at 2. The reward for consecutive matches is \(consmat=0.2\), and the fraction of the penalty applied to consecutive mismatches is \(fr=0.5\). The root barcode has a length of 50, and the tree depth is 5. In the gap penalty alignment, the root barcode serves as the pivot, so we believe it is not viable to penalize a gap in the root barcode. Therefore, we only put a penalty on the leaf side. Figure 1B is comparable to Fig. 20C, where many nucleotides are placed in positions that should belong to deleted large segments, resulting in many small gaps; Fig. 1C is comparable to Fig. 20D, where many large gaps are well recovered. Figure 1D shows a result of the gap penalty method; as with the modified alignment, large gaps are also recovered.
To test the efficacy of each alignment algorithm, we compare the aligned leaf barcodes, as shown in Fig. 1B–D, pairwise to the actual leaf barcodes, as shown in Fig. 1A. Because the structure of each leaf barcode is supposed to have been recovered, in this leaf-leaf comparison we simply apply the regular alignment. Since Fig. 1A contains the actual structure of each leaf barcode, a mismatch to an empty spot is considered the same as a mismatch to a nucleotide and hence carries the same penalty. Let \(Rew(s_1,s_2)\) be the reward score, using regular alignment, of comparing two strings \(s_1,s_2\) whose leading and trailing empty entries have been removed; then each pairwise comparison will yield a contribution score
which is similar to the correlation formula. The contribution score is less than or equal to 1, but could be less than \(-1\) depending on the penalty of mismatches. Since the reward of a match is set at 1, the term \(Rew(s_1,s_1)\) (\(Rew(s_2,s_2)\)) may be simplified to the length of \(s_1\) (\(s_2\)). Figure 1E shows a sample of the pairwise leaf-leaf comparison, for the 32 leaves in Fig. 1A, using different alignment algorithms. Afterwards, the average contribution score over all leaves is calculated for each algorithm. This simulation is then run 100 times, and the average for each alignment algorithm is calculated and compared.
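The contribution score can be illustrated with a short sketch. It assumes the normalized form \(Rew(s_1,s_2)/\sqrt{Rew(s_1,s_1)\,Rew(s_2,s_2)}\), which is inferred from the analogy with the correlation formula and from the remark that \(Rew(s_1,s_1)\) reduces to the length of \(s_1\). The function names `rew` and `contribution` are hypothetical, and the regular alignment is implemented here as a plain global alignment in which a gap costs the same as a mismatch.

```python
import math

MATCH, MISMATCH = 1.0, -2.0  # match reward 1, mismatch penalty 2; gaps scored as mismatches


def rew(s1: str, s2: str) -> float:
    """Global alignment reward (Needleman-Wunsch) with gap cost equal to MISMATCH."""
    prev = [j * MISMATCH for j in range(len(s2) + 1)]
    for i in range(1, len(s1) + 1):
        curr = [i * MISMATCH]
        for j in range(1, len(s2) + 1):
            diag = prev[j - 1] + (MATCH if s1[i - 1] == s2[j - 1] else MISMATCH)
            curr.append(max(diag, prev[j] + MISMATCH, curr[j - 1] + MISMATCH))
        prev = curr
    return prev[-1]


def contribution(s1: str, s2: str) -> float:
    """Assumed normalized score: Rew(s1, s2) / sqrt(Rew(s1, s1) * Rew(s2, s2))."""
    return rew(s1, s2) / math.sqrt(rew(s1, s1) * rew(s2, s2))


print(contribution("ACGT", "ACGT"))  # identical strings give the maximum score
print(contribution("AAAA", "CCCC"))  # all mismatches push the score below -1
```

Under this assumed form, identical barcodes score exactly 1, and a pair that mismatches at every position scores \(-2\) with the mismatch penalty used here, consistent with the bounds stated above. Both strings must be nonempty.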
For the gap penalty method, we first test the parameter gapstart from –1 to –5 (when \(gapstart=0\) it gives the same result as regular alignment). The algorithm performs best when \(gapstart=-2\), and its performance degrades as gapstart becomes more negative. The reason is that, although a large negative gapstart reduces the number of gaps by imposing a large penalty on opening a gap, it also tends to stack the nucleotides, which results in many mismatches. With gapstart set at \(-2\), a simulation of 100 runs yields the following average scores: regular alignment = 0.488, modified alignment = 0.698, gap penalty alignment = 0.543. We then increase the barcode length to 100 and the generation number to 10, and test the gap penalty method to find that \(gapstart=-1\) yields the best result. With this setting, the average of 100 runs yields the following scores: regular alignment = \(-0.0855\), modified alignment = 0.346, gap penalty alignment = 0.0176. Both results indicate that the modified alignment outperforms the gap penalty method, and the gap penalty method outperforms regular alignment. The question of which parameter settings in the gap penalty algorithm yield the best result is beyond the scope of this paper. From the simulation results, we conclude that the modified alignment is at least comparable to the gap penalty method. Therefore, we adopt the modified alignment algorithm in an effort to recover the structure of leaf barcodes.
Evolution of barcode matching scores
To see how the matching scores, when compared to the root, evolve as cells divide in the simulation, we run the program from generation 1 to generation 15, and for each generation the program is repeated 10 times. This part of the simulation does not involve sequence alignment or lineage tree reconstruction, so the computational burden is not heavy even at generation 15. The simulation parameters are tuned according to experimental results. For instance, from Figs. 1C–E and 3C–E in^{25}, Fig. 3F.I in^{23}, and Fig. 3A in^{36} we conclude that large segment deletion is a major phenomenon in barcode editing; single-nucleotide insertion is more frequent than insertion of two or more nucleotides; single-nucleotide deletion is more frequent than deletion of two or more nucleotides; and unedited target sites (perfect repairs in our program) account for a considerable proportion of barcode editing events. Therefore, at a cutting site on the barcode, we set the probability of perfect repair at 0.7; the probabilities of inserting 1, 2, and 3 nucleotides at 0.1, 0.03, and 0.02, respectively; the probability of a substitution at 0.05; and the probability of a single-nucleotide deletion at 0.1. The Cas9 mutation rate, defined as the cutting probability at each nucleotide, is set at \(mupb=0.1\). Given two or more cuts occurring on the barcode, the probability of a large segment deletion is 0.15.
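The parameter settings above can be written down as a small sampling sketch. The probabilities are those stated in the text; the dictionary keys, function names, and the use of Python's `random` module are illustrative assumptions. How a large segment deletion is combined with the per-site repair outcomes is not specified here, so the sketch only samples cut sites and a repair outcome at a single site.

```python
import random

# Repair outcome probabilities at one cut site, as stated in the text (they sum to 1).
REPAIR_OUTCOMES = {
    "perfect_repair": 0.70,
    "insert_1": 0.10,
    "insert_2": 0.03,
    "insert_3": 0.02,
    "substitution": 0.05,
    "delete_1": 0.10,
}
MUPB = 0.1             # cutting probability at each nucleotide
LARGE_DELETION = 0.15  # probability of a large segment deletion given two or more cuts


def sample_cut_sites(barcode_length: int, rng: random.Random) -> list:
    """Each nucleotide position is cut independently with probability MUPB."""
    return [i for i in range(barcode_length) if rng.random() < MUPB]


def sample_repair(rng: random.Random) -> str:
    """Draw one repair outcome according to REPAIR_OUTCOMES."""
    names = list(REPAIR_OUTCOMES)
    weights = list(REPAIR_OUTCOMES.values())
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so that a run is reproducible
cuts = sample_cut_sites(100, rng)
large_deletion = len(cuts) >= 2 and rng.random() < LARGE_DELETION
outcomes = [sample_repair(rng) for _ in cuts]
```

Keeping the probabilities in one table makes it easy to check that they sum to 1 and to swap in an alternative setting.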
The average matching scores as a function of generation number are shown in Fig. 2A, where n is the barcode length. The scoring scheme adopted to compute matching scores is given by the following.
It is seen from Fig. 2A that when the mutation rate (mupb) is small, the matching score decreases slowly as cells divide; and when the mutation rate is high the matching score decays fast, roughly at a geometric rate.
The standard deviation of the matching scores over the 10 runs is shown in Fig. 2B. Mutations accumulate in each generation, which introduces variation in the matching scores; meanwhile, the remaining barcode shortens gradually, which causes the standard deviation to decrease. As cells divide, the number of descendants doubles in each generation, and this factor further reduces the standard deviation of the matching scores.
Test on RMP and NBJ methods
Before we apply the RMP and NBJ methods, with or without filtering, to reconstruct the lineage tree and test the effect of parameter settings, we need to test the efficacy of these methods. The mechanism of barcode editing in our simulation differs from most of the existing literature in that each nucleotide, with 4 possible states (A, C, G, T), could be a potential cutting site, and the barcode length varies from generation to generation. Furthermore, Cas9 editing continues along with cell division; that is, any nucleotide in the current barcode could be cut again. Therefore, many existing public datasets do not fit this framework. For example, in the simulated in silico datasets in^{33}, each barcode consists of a fixed number (a few hundred) of Cas9 targets, and each target has 30 possible states with individually assigned switching probabilities. In addition, it is assumed that this switch/jump may occur at most once, which greatly simplifies the simulation scheme.
In the in vitro dataset of Challenge 1 in^{33}, however, each Cas9 target has 3 states (0, 1, 2), and it is possible to test our RMP and NBJ methods on this dataset. We built a simplified version of our program that works on the 76 in vitro training sets of Challenge 1 in^{33}, where the barcode length is fixed, each target has 3 states, and pairwise entry-by-entry comparison is used (in place of alignment) when comparing two barcodes. The tree depth is estimated from the number of cells/barcodes. For the RMP method we set \(propm=0.7\), and for the NBJ method \(propm=0.3\). Rebuilding the parent node in our methods involves some randomness (see section "Reconstruct parent node"), so for each of the 76 training sets we run RMP and NBJ, with and without filtering, 50 times, and record the average (Avg) and maximum (Max) accuracy of dividing/paired nodes over these 50 runs (detailed data are available at github.com/xzhanglab/CRISPRbasedLineageTracingSimulation). The overall averages over the 76 training sets are then calculated, as shown in Table 1.
Recall that in the case of a full binomial tree, the accuracy of paired nodes in our simulation is the complement of the RF distance. Thus, if NBJ-NF Avg has an accuracy of 42.038% in the case of a full binomial tree, then it corresponds to an RF distance of 0.58.
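The conversion stated above is a one-line computation. The sketch below assumes the RF distance is expressed on a 0 to 1 scale and the accuracy in percent; the function name is hypothetical.

```python
def rf_distance_from_accuracy(accuracy_percent: float) -> float:
    """Complement of the paired-node accuracy for a full binomial tree."""
    return 1.0 - accuracy_percent / 100.0


# The example from the text: 42.038% accuracy corresponds to an RF distance of about 0.58.
print(round(rf_distance_from_accuracy(42.038), 2))
```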
The 76 trees are further classified as small (cell number < 10), medium (10 \(\le\) cell number < 20), and large (cell number \(\ge\) 20), and the accuracy of each method on each class is also computed. This result is provided in Figure S1, which is comparable to Figure 2F in^{33} (except that we use accuracy rather than RF distance when comparing two lineage trees). As an example, Fig. 3 shows the accuracy of the NBJ-NF method. In view of the result in Box 1 Figure F in^{33}, we conclude that our RMP and NBJ methods are comparable to some benchmarked approaches for reconstruction of in vitro cell lineages, such as DCLEAR (WHD), DCLEAR (KRD), the Liu method, and the Guan method.
In the second test of our RMP and NBJ methods, we use our simulation to generate a lineage tree with a barcode length of 50 and a generation number of 5, and apply the RMP and NBJ methods to rebuild it. The other parameters are as follows: the probability of perfect repair is 0.5; the probabilities of inserting 1, 2, and 3 nucleotides are 0.15, 0.1, and 0.05, respectively; the probability of a substitution is 0.1; the probability of a single-nucleotide deletion is 0.1; the probability of a large segment deletion is 0.1; the sample size is \(ss=1\); the Cas9 cutting rate is \(mupb=0.1\); and the pairing threshold propm varies from 0.4 to 0.9. Compared with the parameter settings in section "Evolution of barcode matching scores", the large segment deletion probability is slightly lower, and the probabilities of insertions are slightly higher. The leaf barcodes generated under this setting are expected to possess more diversity/entropy. This modification is supported by a recent experimental result^{35} in which Cas9-TdT results in fewer deletions but twice the insertion events per allele compared with Cas9 expression, with all four nucleotides well represented in the inserted sequences.
The rebuilt tree is then compared to the actual lineage tree and accuracy is calculated. For each propm value and each lineage rebuilding method we run the simulation 100 times and the average (Avg) and maximum (Max) accuracy are computed, as shown in Table 2.
We see from Table 2 that our RMP and NBJ methods, with or without filtering, can rebuild the lineage well; in particular, the lineage accuracy of the RMP-NF and NBJ-NF methods can reach 100%, which means the full binomial lineage tree is perfectly rebuilt. Figure 4 shows a scenario of barcode evolution in which the NBJ-NF method with \(propm=0.8\) rebuilds the lineage with 100% accuracy. Other parameter settings are the same as described for Table 2.
Further examination of the barcode evolution in Fig. 4 reveals that, with each cell division, sibling nodes have similar mutations, which build up in such a way that barcodes belonging to different branches are easily distinguishable; see the paired leaf nodes in Fig. 4.
From the 100 runs using the NBJ-NF method with \(propm=0.8\), as seen in Table 2, we also show in Fig. 5 the barcode evolution scenario that has the lowest accuracy of rebuilt lineage. In this case, we see that after the first division, the two child barcodes both experience large segment deletions, which occur, although independently, around the same sites. Further mutations do not introduce enough variety to distinguish leaf nodes that belong to different branches. As a result, the leaf barcodes all look similar, which causes a homoplasy effect^{37}, and that is why the NBJ-NF method cannot pair them accurately. This finding suggests that in an actual experiment, it is better to control the mutation rate to avoid large segment deletions in early generations, for example by reducing Dox induction in the first few divisions.
The effect of barcode length and sampling proportion
Due to the large computational burden, we cannot perform exhaustive tests on all combinations of methods and parameter settings. We assume that attributes found in one method also apply to other methods unless there is a legitimate reason against it. In this section we first test the effects of barcode length and sampling size on the accuracy of the reconstructed lineage using the RMP and NBJ methods.
We pick different barcode lengths, \(n=100,200,300,400\) (bp), and different sampling proportions (ss, the sampling size) to test their effects on the accuracy of the reconstructed lineage trees, which is explained in sections "Binomial tree comparison" and "Fractional sampling". We simulate cell divisions for 10 generations with barcode mutations under a given mutation rate. The simulation is run 10 times for each setting, and the average accuracy is plotted in Fig. 6A for the RMP method with \(propm=0.85\), and in Fig. 6B for the NBJ method with \(propm=0.4\). The effect of propm on accuracy is investigated in section "The effect of matching proportion in pairing".
It is seen from Fig. 6 that the accuracy of both the internal nodes and the paired nodes generally increases for longer barcodes, though the increase is not dramatic. In a few scenarios the lineage accuracy slightly decreases with a longer barcode. This is somewhat counterintuitive, because we expected that a longer barcode would greatly increase the accuracy of the reconstructed lineage. We find that this is an inherent property of purely stochastic models, and explain it in detail in the next section together with the effect of mutation rate.
Nevertheless, the sampling proportion plays a much more influential role in the accuracy of lineage. When the sampling proportion is high, most internal nodes are paired nodes, and as a result the accuracy of internal nodes and the accuracy of paired nodes are very close. As the sampling proportion decreases, many internal nodes become singleton nodes, and these two accuracy measurements differ greatly: a large proportion of internal nodes are matched in the rebuilt lineage tree, while a much smaller proportion of the paired nodes are correctly matched.
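The shift from paired to singleton nodes as the sampling proportion decreases can be illustrated with a small counting sketch. It assumes a full binary tree whose leaves are indexed from left to right, and it assumes that an internal node counts as paired when both of its child subtrees contain sampled leaves and as singleton when only one does. These definitions and the function name are illustrative assumptions, not the program's implementation.

```python
import random


def classify_internal_nodes(depth: int, sampled_leaves) -> tuple:
    """Count (paired, singleton) internal nodes of a full binary tree of the given depth."""
    sampled = set(sampled_leaves)
    paired = singleton = 0
    for level in range(depth):       # level 0 is the root; leaves sit at level `depth`
        span = 2 ** (depth - level)  # number of leaves below each node at this level
        half = span // 2
        for start in range(0, 2 ** depth, span):
            left = any(leaf in sampled for leaf in range(start, start + half))
            right = any(leaf in sampled for leaf in range(start + half, start + span))
            if left and right:
                paired += 1
            elif left or right:
                singleton += 1
    return paired, singleton


rng = random.Random(0)
depth = 10
for ss in (1.0, 0.5, 0.1):
    k = max(1, round(ss * 2 ** depth))           # number of sampled leaf barcodes
    leaves = rng.sample(range(2 ** depth), k)
    print(ss, classify_internal_nodes(depth, leaves))
```

Under these assumed definitions, every internal node is paired when all leaves are sampled, and singleton nodes dominate as the sampling proportion shrinks.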
The RMP method with high propm is stringent in pairing the nodes, while the NBJ method with low propm encourages the pairing of nodes. Therefore, when most nodes are singleton nodes, the RMP method yields higher accuracy than the NBJ method, and when most nodes are paired nodes, the NBJ method generally outperforms the RMP method.
The effect of mutation rate
The control of the mutation rate (or Cas9 cutting rate) is one of the key factors that affect the accuracy of cell lineage reconstruction. We pick barcode length \(n=300\) and generation level 10, and run the simulation for a variety of mutation rates (mupb) and sampling proportions. Both the RMP and NBJ methods are performed. The accuracy of the rebuilt lineage tree, in terms of the percentage of all internal nodes that are matched and the percentage of paired nodes that are matched, is illustrated in Fig. 7 for the RMP method and the NBJ method.
For the result of the RMP method in Fig. 7A, when the sampling size ss is large, most sampled internal nodes are dividing/paired nodes, and the two percentages, all matched internal nodes versus matched paired nodes (see Fig. 23), are very close. In this case, as mupb increases, both percentages increase and then flatten out. When ss is small, most internal nodes are singleton nodes, and there is a big difference between these two percentages: the percentage of all matched internal nodes is high, while the percentage of matched paired nodes is low. It is also seen from Fig. 7A that for \(ss=0.5\), the percentage of matched paired nodes achieves a local maximum at mupb around 0.06.
The NBJ method with low propm encourages the pairing of two barcodes. Figure 7B shows that the accuracy of paired nodes stays stable as the mutation rate increases. The situation is different for all internal nodes. When the mutation rate is low and the sampling size is small, the percentage of all matched nodes is low; as the mutation rate increases, this accuracy increases quickly, and then flattens out when the mutation rate is very high. The reasons are as follows: a low mutation rate does not generate much variation in the barcodes, so it becomes harder to distinguish barcodes that should belong to different clades; a small sampling size results in more singleton nodes in the lineage tree, which is sparse; and the NBJ method with low propm encourages the pairing of two barcodes. As a result, many singleton nodes are paired incorrectly, which lowers the percentage of matched nodes. As the mutation rate increases, more variation is created in the barcodes, which prevents them from being incorrectly paired, and the accuracy improves greatly. The tradeoff between all matched nodes and paired nodes is observed in both Fig. 7A and Fig. 7B.
Furthermore, when mutation rate is high, lineage accuracy tends to flatten out or slightly decrease. To see the combined effect of barcode length and mutation rate, we run RMP and NBJ methods 10 times, respectively, with various barcode lengths (\(n=100, 200, 300, 400\)) and various mutation rates (\(mupb=0.04, 0.08, 0.12, 0.16, 0.2\)). The average full tree (\(ss=1\)) accuracy is provided in Fig. 8. Other parameter settings remain the same as in this section.
Some interesting findings from Fig. 8 are as follows: when the mutation rate is low, a longer barcode consistently yields higher lineage accuracy; when the mutation rate is high, however, lineage accuracy tends to decrease, and a longer barcode shows no advantage over a shorter barcode, which is a little surprising. The first reason is that a higher mutation rate brings a higher chance of large segment deletions, which wipe out lineage information, and a longer barcode tends to have longer segment deletions. The second reason, which appears to be deeper, is the following. In many widely adopted simulation models, each target in the barcode is assumed to mutate at most once. This is a strong and idealized assumption in that the lineage information, which is accurately contained in these mutated targets, is retained throughout the whole editing process. A longer barcode, which contains more lineage information, will then certainly yield higher accuracy in the rebuilt lineage. In the purely stochastic model, however, Cas9 editing proceeds continuously and randomly, and newly mutated nucleotides may mutate again later on. As a result, lineage information contained in mutated nucleotides may be wiped out by further mutations. Moreover, these further mutations could alter the previously established lineage information and consequently promote mismatches in rebuilding the lineage. To the best of our knowledge, this phenomenon has not been addressed in other related work.
The effect of matching proportion in pairing
The parameter propm, as introduced in sections "Pairwise alignment and barcodes pairing—RMP method" and "Neighbor joining method (NBJ) on barcodes pairing", controls the pairing of two barcodes. A high value of propm sets a high bar for pairing two barcodes because they must match in a large proportion of nonzero nucleotides, while a low value of propm allows two barcodes to be paired even if only a few nonzero nucleotides are matched. The comparison of the lineage tree accuracy for high and low propm at mutation rate \(mupb=0.1\) is shown in Fig. 9A for the RMP method, and in Fig. 9B for the NBJ method.
For RMP method, Fig. 9A shows that the setting \(propm=0.85\) yields higher accuracy than that of \(propm=0.4\) for all internal nodes. But when the sample size is small, lower propm yields a better result on paired nodes than that of higher propm. Recall that lower propm means the pairing of two barcodes is encouraged.
For the NBJ method, when the sample size is high, high and low propm have similar matching accuracies. However, when ss is small, high propm yields higher accuracy of all internal nodes, but low propm yields higher accuracy of paired nodes. Therefore, there appears to be a tradeoff between matched singleton nodes and paired nodes. If ss is small, the lineage tree is ‘sparse’ in that most nodes are singleton nodes, and the few paired nodes are sparsely distributed in the tree and hence hard to pair correctly. An algorithm with a given parameter setting may match more paired nodes but mismatch many singleton nodes, and vice versa. It is therefore important to choose the algorithm and parameter settings properly in order to balance this tradeoff.
Comparison of RMP and NBJ methods
The results of section "The effect of matching proportion in pairing" may be combined to compare the performance of the RMP method with that of the NBJ method. As an illustration, we first choose the same \(propm=0.4\) and \(mupb=0.1\) for both the RMP and NBJ methods, and the accuracy comparison is shown in Fig. 10A, B.
Recall that low propm indicates encouragement of barcode pairing, and from Fig. 10A, B it is seen that under this condition, NBJ method outperforms RMP method. From the result in section "The effect of matching proportion in pairing" it appears that RMP method performs better when \(propm=0.85\), so we compare the result of RMP method with \(propm=0.85\) and the result of NBJ method with \(propm=0.4\), as shown in Fig. 10C, D.
The subgraph of Fig. 10C shows the accuracy of all internal nodes of RMP and NBJ methods, respectively, with different sampling sizes (ss). We see that when ss is high, there is no big difference between the accuracy of these two methods. When ss is low, RMP method has higher accuracy than NBJ method. However, we recall that when ss is low, most sampled barcodes are singleton. RMP tends to not pair these barcodes, so it successfully constructs lineage for most singleton barcodes, but misses those that should be paired.
The subgraph of Fig. 10D shows accuracy on paired/dividing internal nodes. Again when ss is large, RMP and NBJ do not show much difference in accuracy of paired/dividing nodes. However, when ss is small, we see that NBJ outperforms RMP, that is, when ss is small, NBJ correctly pairs more barcodes than RMP.
The effect of pulse induction
In the work of Bowling et al^{25}, pulse induction of Doxycycline is implemented to control the barcode mutation. To test the effect of pulse induction on the accuracy of the rebuilt lineage tree, we assign mupb alternately between a set rate and a base value (0.005) as barcode editing continues. The ‘NoDox’ line in Fig. 1G in^{25} indicates that even without Dox induction, a small percentage of edited alleles is observed. Therefore we set mupb at a very low rate to represent the scenario of no Dox application. In the case of Dox induction, we tested the set rate at 0.05 and 0.1. Other parameter settings are the same as in section "The effect of barcode length and sampling proportion" so that we may compare the results. The results of the RMP method are shown in Fig. 11, while the results of the NBJ method are shown in Fig. 12.
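The alternating assignment of mupb can be pictured with a short sketch. The text does not state the exact period of the pulses, so the version below assumes (for illustration only) that induction switches on and off every other generation; the function name `pulse_schedule` and its arguments are hypothetical.

```python
def pulse_schedule(generations, set_rate, base_rate=0.005, start_on=True):
    """Per-generation mupb values for pulse induction.

    Assumption (not stated in the text): the rate alternates every
    generation between set_rate (Dox on) and base_rate (Dox off).
    """
    schedule = []
    for g in range(generations):
        dox_on = (g % 2 == 0) == start_on
        schedule.append(set_rate if dox_on else base_rate)
    return schedule


def constant_schedule(generations, set_rate):
    """Per-generation mupb values for constant induction."""
    return [set_rate] * generations
```

For example, `pulse_schedule(4, 0.1)` gives `[0.1, 0.005, 0.1, 0.005]`, whereas constant induction keeps mupb at 0.1 in every generation.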
From the simulation results of RMP method in Fig. 11 it is seen that when sampling size is large, pulse Dox induction yields higher accuracy than that of constant induction. When the sampling proportion is low and mupb is low, as in Fig. 11A, constant Dox induction yields higher accuracy on internal nodes, though in this situation most of the internal nodes in the sample are singleton nodes. However, when mupb is high as in Fig. 11B, pulse Dox induction generally yields better lineage accuracy than constant induction, especially for paired nodes.
Similar observations are seen for NBJ method in Fig. 12, i.e., when sampling size is big and mupb is high, pulse induction outperforms constant induction. One difference is that when sampling size is small, constant Dox induction yields better accuracy than pulse induction on both the internal nodes and paired nodes.
Comparing the results in Figs. 11 and 12, it appears that RMP method with \(propm=0.85\) and pulse induction with \(mupb=0.1\) yields the best lineage accuracy, see Fig. 11B.
Effect of Dox induction at specific time
For successful lineage tracing with molecular barcodes, it is important to have sufficient diversity and randomness in the barcodes. However, the cellular DSB repair system has a bias towards large sequence deletions. To prevent the loss of lineage information, it is recommended to use an intermediate dosage of Dox (25 μg/g mice) with the CARLIN model, as suggested by Bowling et al^{25}. Despite this recommendation, there is a lack of justification for this experimental setting, and optimizing the usage conditions for CARLIN in mice remains a costly and variable process.
We conducted a series of simulations to evaluate the efficacy of paused intermediate Dox inductions for the CARLIN system. The simulations used a barcode length of 276, propm of 0.95, a sample size of 0.1, a large deletion probability of 0.3, divp of 0.99, and clive of 0.99; the last two parameters together account for the possibility of cell death and non-division. Mutations were introduced at generation numbers 4, 6, and/or 8, with varying rates. The simulation was run 10 times for each setting, and the average accuracy was calculated. To evaluate the accuracy of the rebuilt lineage tree, we calculated the percentages of paired nodes and internal nodes for three scenarios: one generation above the leaf level, two generations above the leaf level, and the entire tree. Our analysis focused on tracing back a limited number of generations, as real experiments often involve single time point sampling from animal tissues, allowing for the reconstruction of only the most recent one or a few generations. Both RMP and NBJ methods were employed, and accuracy was calculated using either all internal nodes or just paired nodes.
Supplementary Figure S2 presents the results of simulations performed across multiple induction patterns, various Dox dosages (mupb), and lineage reconstruction methods. The x-axis shows the induction time points, which are depicted in red, and the 7 Dox dosages are represented by distinct colored lines. In Fig. 13, we calculated and summarized the average accuracy for each mock Dox concentration (mupb) over the 10 cell divisions (d3-d12). By comparing the accuracy calculated on paired nodes with that on all internal nodes, we observed that in the initial 3 cell divisions the majority of the leaves are singletons. This is evidenced by the fact that the initial accuracy on all internal nodes is very high, while it is extremely low on paired nodes. These results suggest that Dox induction is unnecessary for lineage tracing at very early stages: irrespective of the induction patterns and mupb values, background Cas9 activity could generate moderate barcode diversity. We also found that, when comparing the CARLIN performance in tracing back different generations, the accuracy is higher when tracing back fewer generations. Furthermore, testing the induction patterns revealed that the double paused pattern has the highest accuracy under the same conditions. In addition, the NBJ method demonstrated advantages over RMP, as the intermediate dosage of Dox induction showed accuracy much closer to that of high-dosage induction than the same groups did under the RMP method.
Effect of nonfiltering by root
While reconstructing the parent node from children cells, a natural idea is to use available information, such as the root barcode, as much as possible. This is how the filtering step was introduced, as illustrated in Fig. 21C. This filtering step makes internal barcodes gradually converge to the root while tracing backward. To test whether filtering improves lineage accuracy, we compare the results of NBJ and NBJNF, with and without pulse Dox induction. Barcode length is set at \(n=100\) (bp) and mutation rate is set at \(mupb=0.1\). Other parameter settings are the same as in section "The effect of barcode length and sampling proportion". propm increases from 0.4 to 0.9, and sampling size takes values \(ss=1,0.8,0.5,0.1\). For each setting the program is run 10 times and the average is calculated. The result with pulse Dox induction is shown in Fig. 14, and the result with constant Dox induction is shown in Figure S3.
From both Figs. 14 and S3, it is seen that NBJNF outperforms NBJ in almost all the parameter settings, which, in our opinion, is counterintuitive. We notice from the simulation results that the filtering mechanism blurs the difference among different clones; in other words, internal barcodes become more and more similar as the algorithm traces backward in the lineage tree. Consequently, many internal barcodes are paired incorrectly, which reduces lineage accuracy. We also compared the RMP and nonfiltering RMP methods with and without pulse Dox induction, as in Figs. S4 and S5, where we see that in most cases the nonfiltering RMP method outperforms the RMP method. In conclusion, it is better not to perform the filtering with the root (shown in Fig. 21C) while rebuilding the parent node, since lineage accuracy is higher without it.
Effect of adding a second barcode
As introduced in section "Adding a second barcode", we test the lineage accuracy with two barcodes for a grid of propm and propmi. Other parameter settings are similar to that of section "Effect of nonfiltering by root", except that there are two barcodes that are being edited independently as cells divide. Both the NBJ and NBJNF methods are applied, with and without pulse Dox induction. The simulation is run 10 times for each setting and the average accuracy is calculated. The results of NBJ method are shown in Figs. S6 and S7; the result of NBJNF method with constant Dox induction is shown in Figure S8. Figure 15 shows the result of NBJNF method with pulse Dox induction.
The accuracy of paired nodes tells when two children cells are generated from a common parent, so we mostly focus on the accuracy of paired nodes. Comparing Figure S6 to S7, and Figure S8 to Fig. 15, we see that pulse Dox induction results in higher accuracy of paired nodes. If we compare Figure S6 with S8, and Figure S7 with Fig. 15, we see that the NBJNF method generally outperforms the NBJ method, which agrees with the findings in section "Effect of nonfiltering by root".
Next, comparing Figure S6 to S3, Figure S8 to S3, Figure S7 to Fig. 14, and Fig. 15 to Fig. 14, we see that adding a second barcode improves the lineage accuracy. We therefore anticipate that adding more barcodes will improve lineage accuracy, provided this additional information is used properly and a good lineage reconstruction algorithm is applied.
In our NBJ and NBJNF algorithms on double barcodes, choosing the parameters propm and propmi appears to be tricky. From the simulation results, the propm-propmi combination at which the highest accuracy occurs also depends on other factors such as the way of Dox application (constant vs. pulse), sampling size (ss), and the nodes of interest (all internal nodes or paired nodes only).
Effect of changing indel probabilities
As introduced in section Test on RMP and NBJ Methods, the use of Cas9-TdT in^{35} results in fewer deletions but twice the insertion events per allele compared with Cas9 expression, with all four nucleotides well-represented in the inserted sequences. We therefore test the effect of changing indel probabilities on lineage tracing. These indel probabilities are chosen the same as in section Test on RMP and NBJ Methods: the probability of perfect repair is 0.5; the probabilities of inserting 1, 2, 3 nucleotides are 0.15, 0.1, 0.05, respectively; the probability of a substitution is 0.1; the probability of a single nucleotide deletion is 0.1; and the probability of large segment deletion is 0.1. We call this change of indel probabilities the simulation with Cas9-TdT. Barcode length is set at \(n=100\) and the generation number is 10. Two independent barcodes are used, and the NBJ and NBJNF methods are performed with and without pulse Dox induction. The simulation is run 10 times for each parameter setting. The average lineage accuracy using the NBJ method is shown in Figures S9 (constant Dox induction) and S10 (pulse Dox induction); the result of the NBJNF method is shown in Fig. 16 (constant Dox induction) and Fig. 17 (pulse Dox induction).
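As a quick arithmetic check of the setting above, the sketch below tabulates the six per-cut repair outcomes and their net effect on barcode length. It assumes that the large segment deletion probability (0.1) is applied separately from these six outcomes, as in the Method section; the dictionary layout and variable names are illustrative only.

```python
# Per-cut repair outcomes quoted in the text: (probability, net length change in nt).
# Assumption: large segment deletion (0.1) is handled separately and is not listed here.
events = {
    "perfect repair": (0.50, 0),
    "insert 1 nt": (0.15, +1),
    "insert 2 nt": (0.10, +2),
    "insert 3 nt": (0.05, +3),
    "substitution": (0.10, 0),
    "delete 1 nt": (0.10, -1),
}

total_probability = sum(p for p, _ in events.values())
insertion_probability = sum(p for p, d in events.values() if d > 0)
deletion_probability = sum(p for p, d in events.values() if d < 0)
expected_net_change = sum(p * d for p, d in events.values())
```

The six probabilities sum to 1, insertions (0.30 in total) are three times as likely as single nucleotide deletions (0.10), and the expected net change is +0.4 nucleotides per cut, which illustrates how this setting favors insertion.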
Comparing the results with the new indel probabilities (which favor insertion) with the results of the original settings (as in section "Effect of adding a second barcode"), it is easily seen that the new indel probabilities substantially improve lineage accuracy. As a further examination, we plot the maximum accuracy of the 10 runs for the NBJNF method with pulse Dox induction in Fig. 18, where we see that the maximum accuracy could exceed 75% for some parameter settings.
Furthermore, we observe once again that the results with pulse Dox induction generally yield better accuracy than that of constant induction, and the NBJNF method outperforms NBJ method in the accuracy of lineage reconstruction. Our understanding is that the new indel probabilities, that favor insertion and inhibit large deletion, promote the diversity/differentiation of alleles as the editing process proceeds, yet filtering by the root barcode acts against this differentiation. As a conclusion, the root barcode information should be discarded while rebuilding the lineage tree.
Discussion
We were inspired by experimental results (see Fig. 1C in CARLIN model^{25}) to design this stochastic simulation scheme in order to reflect the randomness in CRISPR-Cas9 editing and DSB repair. Our findings from the simulations are summarized as follows.

When comparing two barcodes and performing alignment, we find that a dynamic programming algorithm that rewards consecutive matches better recovers the structure of the mutated barcodes, a structure that is largely shaped by large segment deletions.

The average matching score decreases approximately geometrically as cell division and barcode mutation continue, and the decay rate is controlled by mutation probability due to Dox induction. The standard deviation of the matching scores also decreases as generation number increases.

Large segment deletions in early generations could greatly compromise lineage accuracy. It is better to restrain Dox induction to reduce the chance of large segment deletions in early generations.

A properly chosen barcode length and a moderate mutation rate will yield higher lineage accuracy. A very long barcode with a high mutation rate does not necessarily improve lineage accuracy; on the contrary, lineage accuracy may even decrease.

The sampling proportion of the leaf barcodes has a major influence on the lineage tree comparison. If most leaf nodes are sampled, the lineage tree is near a full binomial tree, and most internal nodes are paired nodes; if very few leaf nodes are sampled, the lineage tree becomes a ‘sparse’ tree and most internal nodes are singleton nodes.

When both paired and singleton nodes exist in the lineage tree, i.e., sampling proportion is relatively small, there appears to be a tradeoff between the accuracy of paired nodes and accuracy of singleton nodes. If an algorithm encourages pairing of two barcodes, the accuracy of paired nodes may increase but the accuracy of singleton nodes will decrease, and vice versa.

The mutation rate controlled by Dox induction also plays an important role. If the mutation rate is very low, it does not generate much variation in the leaf barcodes, and therefore many nodes cannot be matched correctly. As the mutation rate increases, it generates a reasonable amount of variation in the leaf barcodes, and this information is used to successfully match more nodes in the reconstructed lineage tree. However, if the mutation rate is too high, the accuracy does not increase, and it even decreases for some parameter settings. A high mutation rate causes more large segment deletions and alters lineage information that was already established by previous mutations. As a consequence, more mismatches could occur while rebuilding the lineage.

The accuracy of pulse induction appears to be sensitive to the sampling size. If the sampling size is large, pulse induction gives better accuracy; if the sampling size is small, constant induction generally yields slightly better results.

When comparing the CARLIN performance in tracing back different generations, the accuracy is higher when tracing back fewer generations. Furthermore, testing the induction patterns revealed that the double paused pattern has the highest accuracy under the same conditions. In addition, the NBJ method demonstrated advantages over RMP, as the intermediate dosage of Dox induction showed accuracy much closer to that of high-dosage induction than the same groups did under the RMP method.

While reconstructing the parent node from children cells, not performing the filtering with root barcode results in higher lineage accuracy.

Adding a second independent barcode will bring in additional information, and if this information is used properly, the lineage tracing accuracy will be improved. However, so far we do not observe significant increase on lineage accuracy by adding a second barcode while using the same indel probabilities.

Changing indel probabilities would substantially affect lineage accuracy. If the barcode editing mechanism favors nucleotide insertion while inhibiting large segment deletion, the lineage tracing accuracy could be significantly improved.
In a recent work, Choi et al^{38} designed a genome editing system called DNA typewriter, where the recording medium (DNA tape) consists of a sequence of partial CRISPR-Cas9 targets. At the beginning of genome editing only the first target is editable. Each of the many prime editing gRNAs (pegRNAs) is designed to mediate the insertion of a k-mer that consists of a pegRNA-specific barcode segment and a 3-bp key, which, after being inserted, automatically concludes the current target and activates the next. Because genome editing is sequential in this scheme, the temporal order of recorded events can be read out from their physical order along the edited DNA tape. The temporal order of recorded events can improve lineage resolution by reducing ambiguity about the order in which editing events occurred; the scheme also eschews DSBs, thereby minimizing the risk of inter-target deletion. This idea of lineage reconstruction using DNA typewriter opens a new direction for lineage tracing with higher resolution. It is promising to incorporate the ordered recording of DNA tape into lineage reconstruction in our simulation program because it helps determine the temporal order in which to merge/pair the clones. Thus, higher accuracy is expected. In a further extension we plan to replace the stochastic barcode with a DNA tape and simulate sequential editing events on the tape according to observations from actual experiments. In the lineage reconstruction module, we plan to implement a new algorithm that efficiently uses the order of recorded events on the DNA tape, and then compare the rebuilt tree to the original one and calculate accuracy.
Method
In this section we introduce the details of the program that simulates the evolution of barcodes and the cell DSB repair mechanism induced by CRISPRCas9.
Cell division with evolving barcodes
Starting from a root cell with an original barcode, we assume that in each generation a cell divides into two children cells, and therefore the entire evolution process is modeled as a binomial tree. The barcode of the parent cell is inherited by the children cells with random mutations that are caused by the DNA DSB repair mechanism. Doxycycline (Dox) is used in CRISPR-Cas9 editing to induce double-strand DNA breaks, and the cell's repair of these breaks results in a variety of barcode alterations, such as large segment deletion, nucleotide insertion, and substitution.
A probability mupb is assigned as the Cas9 cutting rate that simulates the introduction of Dox, whose concentration can be controlled. Suppose the parent barcode has n base pairs (bp), and for each bp there is a probability mupb that a cut (DNA break) occurs. Then the total number of cuts follows a binomial distribution \(\sim b(n,mupb)\). If the total number of cuts is less than or equal to one, then there is no large deletion in the barcode; if the number of cuts is greater than one, then we assign a probability lgdelprob that a large deletion occurs. The actual probability that a large deletion occurs is therefore lgdelprob multiplied by the probability that more than one cut occurs.
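Written out under the binomial cut model above, with \(X\) denoting the number of cuts (a symbol introduced here for illustration), this probability would read as follows. This is a reconstruction from the stated definitions, assuming lgdelprob is applied once whenever at least two cuts occur.

\[
P(\text{large deletion}) = lgdelprob\cdot P(X\ge 2) = lgdelprob\cdot \left[ 1-(1-mupb)^{n}-n\cdot mupb\cdot (1-mupb)^{n-1}\right], \qquad X\sim b(n,mupb).
\]

For instance, with \(n=100\) and \(mupb=0.1\), the bracketed term is approximately 0.9997, so the probability of a large deletion is very close to lgdelprob itself; with a much lower rate such as \(mupb=0.005\), the bracketed term is far smaller and large deletions become correspondingly rare.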
Whenever a large deletion occurs, two of the cut sites are randomly chosen and the bps in between are deleted. For the remaining cut sites, if any, we assign probabilities for each of the following events, respectively: perfect repair, insertion of one nucleotide, insertion of two nucleotides, insertion of three nucleotides, substitution, and single nucleotide deletion. Each inserted nucleotide is randomly chosen among ACGT, and for substitution a letter other than the original nucleotide is randomly chosen. These probabilities are stored in a vector \(ins\_sub\). Figure 19 shows a simplified example of various barcode editing events on a segment of one strand. To the best of our knowledge, the exact mechanism of how a large deletion occurs and the associated parameters are unknown. In the current version of the simulation program, Cas9 nuclease cutting is stochastic. In a future extension, we will replace the stochastic barcode with an experimentally designed barcode, and assign nucleotide-dependent cutting probabilities to mimic the hotspots of nuclease loci.
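The editing step described above can be sketched in a few lines of Python. This is a minimal illustration, not the simulation program itself: the slot-based representation (one string per parent position, empty when deleted), the function name `edit_barcode`, the default values in `INS_SUB`, and the choice to leave the two selected cut sites themselves intact are all assumptions made here.

```python
import random

NUCLEOTIDES = "ACGT"
# Event order follows the text: perfect repair, insertion of 1/2/3 nt,
# substitution, single nucleotide deletion. Values are illustrative.
EVENTS = ["repair", "ins1", "ins2", "ins3", "sub", "del1"]
INS_SUB = [0.5, 0.15, 0.10, 0.05, 0.10, 0.10]


def edit_barcode(parent, mupb, lgdelprob, ins_sub=INS_SUB, rng=random):
    """Return one child barcode as a list of slots (one per parent position)."""
    child = list(parent)
    # Each non-empty position is cut independently with probability mupb.
    cuts = [i for i, slot in enumerate(child) if slot and rng.random() < mupb]
    # With more than one cut, a large deletion occurs with probability lgdelprob.
    if len(cuts) > 1 and rng.random() < lgdelprob:
        a, b = sorted(rng.sample(cuts, 2))
        for i in range(a + 1, b):  # assumption: delete strictly between the two sites
            child[i] = ""
        cuts = [i for i in cuts if i < a or i > b]
    # Remaining cut sites are repaired according to the ins_sub probabilities.
    for i in cuts:
        event = rng.choices(EVENTS, weights=ins_sub, k=1)[0]
        if event.startswith("ins"):
            count = int(event[3])
            child[i] += "".join(rng.choice(NUCLEOTIDES) for _ in range(count))
        elif event == "sub":
            others = [c for c in NUCLEOTIDES if c != child[i][0]]
            child[i] = rng.choice(others) + child[i][1:]
        elif event == "del1":
            child[i] = child[i][1:]
    return child
```

Calling `edit_barcode` twice on the same parent gives the left and right child; `"".join(child)` yields the collapsed barcode that would be observed after sequencing.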
We tuned the parameters in our model so that the output barcodes mimic the result of some actual experiments^{23,25,36}. For example, one may compare the leaf barcode profile in Fig. 20A with Fig. 1C in^{25} to find similar pattern of mutations after about 5\(\sim\)6 divisions. We then set a bit higher probability for perfect repair at each cutting site in observation of the considerable percentage of unedited alleles in Fig. 1E in^{25}. Then from Fig. 1D in^{25}, Fig. 3I in^{23}, and Fig. 2A in^{36}, we see that insertion and deletion at a cutting site (not large segment deletion) mostly involve a few nucleotides, say, 1\(\sim\)3 bps, and therefore we set respective probabilities for these events accordingly.
For each cell division, this procedure is performed twice to generate the barcodes of left child and right child, respectively. In the simulated evolution tree we label the cells in the way as shown in Fig. 20A.
Each labeled cell belongs to a certain generation in the tree, which is also referred to as the tree level. The root cell belongs to generation 0, cells 2 and 3 belong to generation one, cells 4,5,6 and 7 belong to generation two, and so forth.
A sample of the evolving barcodes is shown in Fig. 20A where the original barcode length is 50 with 5 cell division rounds. In simulation each barcode has leading and tail spaces to accommodate nucleotide insertions. Large segment deletions are observed.
Barcode alignment
In an actual experiment, only the original barcode and the leaf barcodes (barcodes of the last generation) are known, and the intermediate barcode information is hard, if not impossible, to extract. Furthermore, the leaf barcodes are all collapsed, see Fig. 20B. That is to say, all the empty entries in the barcodes in Fig. 20A are missing, so we do not know exactly which nucleotides in the barcode have been mutated or which segments have been deleted. Therefore, it is necessary to perform barcode alignment to recover, as far as possible, the structure of the leaf barcodes.
We compare each leaf barcode to the root barcode and perform alignment. Because the intermediate barcodes are not known, the leaf barcodes are renumbered starting from 1 for convenience. Classical DNA alignment, like the Needleman-Wunsch algorithm, uses dynamic programming to find the best alignment of two DNA sequences according to a scoring scheme. Given two strings s and t with respective lengths ls and lt, the (i, j)th entry of the reward/score matrix, rewm(i, j), is calculated according to the following formula
where \(mismat (<0)\) is a penalty on mismatches. Here mismatch includes cases where two different letters (ACGT) are aligned, or a valid letter corresponding to an empty spot. After the reward matrix is computed, the algorithm traces backward to find the alignment of s and t that has the maximum matching score. We call this method regular alignment for reference.
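For reference, regular alignment can be sketched as the textbook Needleman-Wunsch recursion. The sketch below assumes a match reward of 1 and, following the description above, applies the same penalty \(mismat\) both to mismatched letters and to a letter aligned with an empty spot; the function name and the `-` gap symbol are choices made here, not taken from the program.

```python
def regular_alignment(s, t, match=1, mismat=-1):
    """Needleman-Wunsch global alignment; returns (score, aligned_s, aligned_t)."""
    ls, lt = len(s), len(t)
    # rewm[i][j] is the best score for aligning s[:i] with t[:j].
    rewm = [[0] * (lt + 1) for _ in range(ls + 1)]
    for i in range(1, ls + 1):
        rewm[i][0] = i * mismat
    for j in range(1, lt + 1):
        rewm[0][j] = j * mismat
    for i in range(1, ls + 1):
        for j in range(1, lt + 1):
            pair = match if s[i - 1] == t[j - 1] else mismat
            rewm[i][j] = max(
                rewm[i - 1][j - 1] + pair,  # align s[i-1] with t[j-1]
                rewm[i - 1][j] + mismat,    # s[i-1] faces an empty spot
                rewm[i][j - 1] + mismat,    # t[j-1] faces an empty spot
            )
    # Trace backward from the bottom-right corner to recover one best alignment.
    out_s, out_t = [], []
    i, j = ls, lt
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismat
        if i > 0 and j > 0 and rewm[i][j] == rewm[i - 1][j - 1] + pair:
            out_s.append(s[i - 1]); out_t.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and rewm[i][j] == rewm[i - 1][j] + mismat:
            out_s.append(s[i - 1]); out_t.append("-"); i -= 1
        else:
            out_s.append("-"); out_t.append(t[j - 1]); j -= 1
    return rewm[ls][lt], "".join(reversed(out_s)), "".join(reversed(out_t))
```

Aligning the root `"ACGT"` with a collapsed leaf `"AGT"` places one empty spot opposite the deleted `C`, giving the aligned leaf `"A-GT"` with score 2.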
Experimental results showed that large segment deletion is very common in mutated barcode^{25}. We therefore modify this algorithm to capture this feature, where consecutive matches are more rewarded, while consecutive mismatches are less penalized. The new reward matrix is hereby calculated as follows
In the above expression, \(consmat(>0)\) gives a small reward to consecutive matches, fr (\(0<fr\le 1\)) indicates the fraction of penalty to be applied on mismatches, and \(mismat_s, mismat_t\) are two matrices that record the number of consecutive mismatches. The results of these two algorithms are shown in Fig. 20C, D.
The numbered barcodes/cells (1–32) in Fig. 20C, D are the leaves that correspond to the last 32 barcodes in Fig. 20A (32–63), respectively. From Fig. 20C, D we can see that the classical alignment algorithm may occasionally insert matched nucleotides in places where large gaps/deletions are supposed to be, while our modified algorithm successfully recovers the large-gap structure. To rebuild the lineage tree more accurately, it is critical to recover the structure of leaf barcodes so that we can tell, with more confidence, which nucleotides in a barcode have been mutated or which segments have been deleted. In section "Comparison of alignment methods" of the simulation results, we compare our alignment methods with the well-established gap penalty algorithm and test their efficacy.
Lineage tree reconstruction
The real cell lineage tree is a binomial tree; in many papers, however, phylogenetic trees that are not of the binomial type are built to approximate the binomial lineage tree^{17,34}. This is mainly because the intermediate barcode information, i.e., internal node information, is missing. In this paper, we present an algorithm to rebuild the binomial lineage tree and estimate its accuracy. In this procedure we rebuild the pseudo internal nodes whose barcodes gradually converge to the root barcode; details are given in what follows.
Pairwise alignment and barcodes pairing—RMP method
We start from the leaf barcodes, which is known information, and work backwards. As we mentioned earlier, the leaf barcodes have been renumbered and an example is given in Fig. 20D with generation number 5. When the barcodes are sampled and sequenced, their order becomes random, see Fig. 21A.
Next we sort the leaf cells in descending order according to their respective matching scores to the root, as explained in section "Barcode alignment". Barcodes with higher matching scores to the root preserve more information, so we start from the barcode (call it s) with the highest matching score and search for its pair. In other words, the pairing priority is given to those barcodes that best match the root barcode, and we call this method Root Matching Prioritized (RMP). Next we perform pairwise comparisons/alignments among leaf barcodes using a new scoring matrix as follows:
In this new scoring matrix the letter O represents an empty nucleotide that has been created by the barcode alignment as explained in section "Barcode alignment". This alignment (section "Barcode alignment") recovered the ‘structure’ of the leaf barcodes, and empty entries in the recovered barcodes are part of this structure, hence should not be discarded in the pairwise searching step. In other words, if an empty entry in one (recovered) barcode matches an empty entry in another (recovered) barcode, there is a reward associated with this match; and if an empty entry corresponds to a nonempty entry, the penalty is less than that of mismatched nonempty nucleotide.
Suppose we have found another barcode t that has the highest pairwise matching score to s. We then compare this score to \(propm\cdot nnz(s)\), where nnz(s) is the number of nonzero entries in s and \(propm\in (0,1)\) is a fraction. This step is necessary to avoid pairing two barcodes with massive zero entries. If the pairwise matching score of s and t is greater than \(propm\cdot nnz(s)\), then s, t are paired and removed from the candidate cohort. If the pairwise matching score is less than \(propm\cdot nnz(s)\), the barcode s is left as a singleton barcode, which will be explained in more detail later. This procedure is then repeated on the remaining barcodes until all leaf barcodes have been processed, see Fig. 21B. Because the pairing priority is given to a leaf that is closest to the root, but may not be given to the closest pair of leaves, we think it is reasonable to pair two leaves cautiously. Therefore we suggest choosing \(propm\in [0.7,0.9]\) for the RMP method.
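The RMP pairing loop can be sketched as follows. To keep the example independent of the exact scoring matrix, it assumes the root matching scores, the pairwise matching scores, and the nnz counts have already been computed; the function name `rmp_pairing` and its argument layout are hypothetical.

```python
def rmp_pairing(root_scores, pair_scores, nnz, propm):
    """Pair leaf barcodes, giving priority to leaves that best match the root.

    root_scores[i]    : matching score of leaf i to the root
    pair_scores[i][j] : pairwise matching score of leaves i and j (assumed precomputed)
    nnz[i]            : number of nonzero entries of leaf i
    Returns (pairs, singletons) as lists of leaf indices.
    """
    # Sort leaves in descending order of their matching score to the root.
    remaining = sorted(range(len(root_scores)),
                       key=lambda i: root_scores[i], reverse=True)
    pairs, singletons = [], []
    while remaining:
        s = remaining.pop(0)
        if not remaining:
            singletons.append(s)
            break
        # Best partner for s among the leaves not yet processed.
        t = max(remaining, key=lambda j: pair_scores[s][j])
        if pair_scores[s][t] > propm * nnz[s]:
            pairs.append((s, t))
            remaining.remove(t)
        else:
            singletons.append(s)
    return pairs, singletons
```

With a high propm only strongly matching leaves are paired and the rest stay singletons; lowering propm lets weaker matches pair as well.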
Neighbor joining method (NBJ) on barcodes pairing
The Neighbor Joining Method (NBJ) is different from RMP in that, after leaf barcodes have been aligned pairwise, the pairing priority is given to the pair that has the highest pairwise matching score, regardless of the root barcode. The pairing process is as follows: the highest pairwise matching score is found in the matrix, with corresponding barcodes s and t, and if this matching score is greater than \(propm*\max (nnz(s),nnz(t))\), then s and t are paired and removed from the matrix of pairwise matching. Then the next highest pairwise matching score is sought. In each iteration the pair of barcodes that has the highest matching score (among the remaining cells) is sought, so they are encouraged to pair, and therefore the parameter propm can be lowered. This process is continued until the highest pairwise matching score among the remaining cells is lower than the threshold.
In the NBJ method the pairing priority is given to the pair of leaves that are closest to each other, so we believe it is reasonable to encourage pairing in this case, and therefore we suggest choosing \(propm\in [0.3,0.6]\) for the NBJ method.
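A minimal sketch of the NBJ pairing loop is given below. It assumes a precomputed matrix of pairwise matching scores and interprets the stopping rule as "stop as soon as the best remaining pair fails its threshold"; that interpretation, together with the function name `nbj_pairing`, is an assumption made for illustration.

```python
def nbj_pairing(pair_scores, nnz, propm):
    """Greedily pair the two remaining leaves with the highest pairwise score.

    pair_scores[i][j] : pairwise matching score of leaves i and j (assumed precomputed)
    nnz[i]            : number of nonzero entries of leaf i
    Returns (pairs, singletons) as lists of leaf indices.
    """
    remaining = sorted(range(len(nnz)))
    pairs = []
    while len(remaining) > 1:
        # Highest pairwise score among all remaining leaves, regardless of the root.
        s, t = max(
            ((i, j) for i in remaining for j in remaining if i < j),
            key=lambda p: pair_scores[p[0]][p[1]],
        )
        # Assumed stopping rule: the best remaining pair does not reach its threshold.
        if pair_scores[s][t] <= propm * max(nnz[s], nnz[t]):
            break
        pairs.append((s, t))
        remaining = [i for i in remaining if i not in (s, t)]
    return pairs, remaining
```

Because the best available pair is examined first in every iteration, a lower propm (for example 0.4) can be used here without pairing leaves that are far apart before closer ones.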
Both RMP and NBJ will need to go through Pairing Validity Check, as explained in section "Pairing validity check". In the simulation part, both methods will be tested.
Pairing validity check
To build parent nodes of generation \(m-1\) from generation m, some leaf barcodes of generation m have been paired and the rest are left as singleton barcodes. Suppose there are n barcodes in generation m and 2k of them are paired; then the remaining \(n-2k\) barcodes are singleton barcodes, and in generation \(m-1\) there are \(k+n-2k=n-k\) parent nodes. A natural constraint is that in each generation, say m, the number of nodes should not exceed \(2^m\). Therefore, in generation \(m-1\) the condition \(n-k\le 2^{m-1}\) should be satisfied. If this constraint is not met, we keep pairing more barcodes in generation m (by lowering the parameter propm introduced in the section on the RMP method) until the condition \(n-k\le 2^{m-1}\) is satisfied. This validity check is performed in each generation as we work backward to reconstruct the lineage tree.
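The validity check amounts to simple counting, as the sketch below shows. The helper names, the callable `pair_fn` (any pairing routine that takes propm and returns pairs and singletons), and the decrement `step=0.05` are hypothetical; the text does not specify by how much propm is lowered.

```python
def passes_validity_check(n, k, m):
    """True if n barcodes in generation m with k pairs give at most 2**(m-1) parents."""
    return n - k <= 2 ** (m - 1)


def min_pairs_required(n, m):
    """Smallest number of pairs k that satisfies n - k <= 2**(m-1)."""
    return max(0, n - 2 ** (m - 1))


def pair_with_validity_check(pair_fn, n, m, propm, step=0.05):
    """Lower propm until the pairing returned by pair_fn passes the check.

    Assumption: pair_fn(propm) returns (pairs, singletons); step is illustrative.
    """
    while True:
        pairs, singletons = pair_fn(propm)
        if passes_validity_check(n, len(pairs), m) or propm <= 0:
            return pairs, singletons, propm
        propm = max(0.0, propm - step)
```

For example, with \(n=10\) barcodes in generation \(m=4\), a single pair leaves 9 parent nodes, which exceeds \(2^{3}=8\), so at least 2 pairs are required.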
Reconstruct parent node
Suppose the barcodes s and t with generation number m have been paired after the pairwise alignment; we then try to reconstruct their parent barcode. The alignment algorithm has assigned s and t new structures, denoted \(new\_s\) and \(new\_t\), respectively. For these two new barcodes we do an entry by entry comparison: if the corresponding entries in \(new\_s\) and \(new\_t\) are equal, we keep this entry; if the two corresponding entries are not equal, we randomly pick one with equal probability. The resulting barcode is adopted as the temporary parent barcode, denoted \(t\_par\). Next we compare \(t\_par\) to the root by performing another alignment, and then do an entry by entry comparison: if an entry in \(t\_par\) matches the corresponding entry in the root, it is left unchanged; if an entry in \(t\_par\) does not match the corresponding entry in the root, with probability 1/m it is replaced with the entry in the root, see Fig. 21C. We call this step ‘filtered by the root’. After all the parent barcodes have been reconstructed, we obtain generation \(m-1\). This generation is then taken as the set of leaf barcodes and this procedure is repeated until we reach the root, see Fig. 21D. In this way, the parent barcodes gradually converge to the root.
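The parent reconstruction step can be sketched as below. For simplicity the sketch assumes that \(new\_s\), \(new\_t\), and the root have already been aligned to the same length (with a placeholder such as `O` for empty entries), so the second alignment against the root is not repeated here; the function name and the `filter_by_root` switch are illustrative.

```python
import random


def reconstruct_parent(new_s, new_t, root, m, filter_by_root=True, rng=random):
    """Rebuild a parent barcode from two paired, aligned children of generation m.

    Assumption: new_s, new_t and root are equal-length aligned strings.
    """
    # Entry by entry: keep equal entries, otherwise pick one child at random.
    t_par = [a if a == b else rng.choice((a, b)) for a, b in zip(new_s, new_t)]
    if filter_by_root:
        # 'Filtered by the root': a mismatching entry is replaced by the
        # root entry with probability 1/m.
        t_par = [
            r if (p != r and rng.random() < 1.0 / m) else p
            for p, r in zip(t_par, root)
        ]
    return "".join(t_par)
```

Setting `filter_by_root=False` gives the nonfiltering variant, in which the temporary parent barcode is used directly.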
Filtering vs nonfiltering with root barcode
The idea of filtering with the root barcode, as introduced in section "Reconstruct parent node", is to make extensive use of any available information such as the root barcode. Therefore, while rebuilding the parent node, it is natural to use the root barcode as a benchmark and let the internal barcodes converge to it gradually. However, it remains a question whether this filtering, as seen in Fig. 21C, will improve the accuracy of the reconstructed phylogenetic tree. In this work, we compare the simulation results of the RMP and NBJ methods with and without filtering with the root, and find that the methods without filtering give slightly higher accuracy, which is somewhat counterintuitive. Details are given in section "Results".
Binomial tree comparison
After all the pseudo internal nodes have been constructed, as explained in section "Reconstruct parent node", we obtain the reconstructed lineage tree, which needs to be compared to the real lineage tree. In the simulation of evolving barcodes, see e.g., Fig. 21A, the real lineage tree is known. So the next task is to compare two binomial trees to test the accuracy of lineage reconstruction. We adopt a method similar to the Robinson-Foulds (RF) method, where we compare the corresponding clades of two trees. A clade is the set of all leaves that originate from a certain ancestral node (an internal node, excluding the root). If two clades, one from each tree, match exactly, we count it as a match of the corresponding ancestral node. An example is shown in Fig. 22, where Fig. 22A is the real lineage, Fig. 22B is the lineage rebuilt from the leaf barcodes and the root barcode, and Fig. 22C illustrates how the accuracy is calculated when the lineage in Fig. 22B is compared to that in Fig. 22A.
Recall that leaf barcodes have been sorted in the RMP method according to their respective matching scores to the root. This reordering does not affect the tree comparison in our method. For the example in Fig. 22B, we see, for instance, that node A with clade \(\{5,6\}\), node B with clade \(\{5,6,7,8\}\), and node C with clade \(\{1,2,3,4,5,6,7,8\}\) have exact matches in the lineage tree in Fig. 22A, while node D with clade \(\{9,10,12,13\}\) does not have a match. In the full binomial tree every internal node has two children, and we call such nodes paired nodes. In Fig. 22A there are a total of 14 internal nodes (all of them paired nodes), and 7 of them are matched in Fig. 22B, so the accuracy of the reconstructed lineage in Fig. 22B is \(7/14=50\%\). We point out that this matching criterion is very strict: if two clades differ by just one leaf node, they are not matched. We also point out that in the case of a full binomial tree, the accuracy as calculated in our program (over either internal or paired nodes, which are equal) is exactly the complement of the widely used RF distance, in other words, \((accuracy)+(RF\ distance)=1\). That is because each clade in this case corresponds to a unique internal node of the tree.
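The clade comparison can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: trees are represented as nested tuples with integer leaf labels, the helper names `clades` and `clade_accuracy` are hypothetical, and only the clade condition is checked (the additional requirement that matched nodes belong to the same generation is not enforced here).

```python
def clades(tree):
    """Return (leaves, found): the leaf set of `tree` and the clades of all
    its internal nodes in post-order, so the clade of `tree` itself is last."""
    if not isinstance(tree, tuple):      # a leaf label
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found


def clade_accuracy(real, rebuilt):
    """Fraction of non-root internal nodes of `real` whose clade also
    appears as a clade of a non-root internal node of `rebuilt`."""
    _, real_clades = clades(real)
    _, rebuilt_clades = clades(rebuilt)
    real_set = set(real_clades[:-1])        # drop the root clade
    rebuilt_set = set(rebuilt_clades[:-1])  # drop the root clade
    return len(real_set & rebuilt_set) / len(real_set)


# Toy example with 8 leaves: only clades {1,2} and {7,8} are recovered.
real = (((1, 2), (3, 4)), ((5, 6), (7, 8)))
rebuilt = (((1, 2), (3, 5)), ((4, 6), (7, 8)))
accuracy = clade_accuracy(real, rebuilt)    # 2 of 6 internal nodes match
```

In this toy example the real tree has 6 internal nodes below the root and 2 of them are recovered, so the accuracy is 1/3. Swapping a single leaf between two subtrees already breaks four clades, which illustrates how strict exact clade matching is.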
Fractional sampling
In an actual experiment, only a small sample of cells, rather than all the cells in a whole organ, is sequenced. As a result, the lineage tree will not be a full binomial tree; see, for example, Fig. 23, where half of the leaf barcodes are sampled. In the simulation, the full binomial tree is still generated, but in the last step we collect a random sample of a fraction of the leaf barcodes.
In Fig. 23A where the lineage tree is not a full binomial tree, nodes that have two children are again called paired (or dividing) nodes, while nodes that have only one child are called singleton nodes. Recall that in section "Pairwise alignment and barcodes pairing—RMP method" if we cannot find a counterpart for a certain barcode, that barcode is left as a singleton barcode. The method to trace back from singleton barcodes is similar to that of paired barcodes, except that to build the parent barcode, it is derived directly from its only child, and then it is filtered by the root barcode, see section "Reconstruct parent node".
Our tree comparison method still applies to partial binomial trees; see, for example, Fig. 23B, which is to be compared to Fig. 23A. For instance, nodes A, B, C in Fig. 23B find their exact matches in Fig. 23A, and they are all paired nodes. Node D has clade \(\{1,2\}\), and in Fig. 23A node D is a singleton node, so this is a matched singleton node. The same reasoning applies to nodes E, F and G.
If the sampling proportion ss is small, it is possible that the percentage of matched internal nodes is still high (see, e.g., Fig. 23), even though most of those nodes are singleton nodes. The information on cell association is contained in paired nodes, but not in singleton nodes. Therefore, in our simulation, in addition to calculating the percentage of total matched internal nodes, we also calculate the percentage of matched paired nodes. If the sampling fraction is \(ss=100\%\), most internal nodes are paired nodes, except for a few nodes that become singletons because some of their child barcodes are completely cut. Therefore, if ss is close to \(100\%\), the percentage of matched internal nodes and the percentage of matched paired nodes should be very close. However, if \(ss \ll 1\), we expect a considerable difference between these two quantities. Together, the percentage of all matched internal nodes and the percentage of matched paired nodes provide valuable information on the accuracy of the reconstructed lineage tree.
We call the matching of two nodes from the two trees strict because the two nodes must belong to the same level/generation and have the same clade. For example, node D in Fig. 23B matches node D in Fig. 23A at tree level 2, but it does not match the direct parent node of leaves \(\{1,2\}\) in Fig. 23A at level 3, even though they have the same clade. Future work may investigate less strict measures, such as partially matched clades, or matched nodes that belong to different generations.
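To see how the sampling fraction shifts the balance between paired and singleton nodes, the following Python sketch counts both kinds of internal nodes in a sampled full tree. The function name `classify_nodes`, the leaf indexing 0..\(2^k-1\), and the use of `random.sample` are assumptions made for illustration; they are not taken from the authors' program.

```python
import random


def classify_nodes(k, sampled):
    """Count paired and singleton internal nodes (levels 1..k-1) of a full
    tree with 2**k leaves, given the set of sampled leaf indices."""
    paired, singleton = 0, 0
    for level in range(1, k):
        width = 2 ** (k - level)      # number of leaves below each node
        half = width // 2             # number of leaves below each child
        for j in range(2 ** level):
            start = j * width
            left = any(start <= x < start + half for x in sampled)
            right = any(start + half <= x < start + width for x in sampled)
            if left and right:
                paired += 1           # both children have sampled descendants
            elif left or right:
                singleton += 1        # only one child survives the sampling
    return paired, singleton


# Hypothetical usage: sample a fraction ss of the leaves at random.
k, ss = 4, 0.5
rng = random.Random(0)
sampled = set(rng.sample(range(2 ** k), int(ss * 2 ** k)))
paired, singleton = classify_nodes(k, sampled)
```

With all 16 leaves sampled, every one of the 14 internal nodes is paired; with a single sampled leaf, the three ancestors on its path to the root are all singletons. Small values of ss therefore push the tree toward singleton nodes, which is why the two accuracy measures are reported separately.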
Handling nondividing cells and dead cells
To simulate the scenario where some cells do not divide in certain cycles, and some cells die at some point of tumor progression, we introduce two parameters, namely divp and clive, into our program, where divp is the probability that a cell divides in a given cycle, and clive is the probability that a cell survives into the next cycle. Each node/cell in the evolving tree has an indicator that tells whether the cell is alive or dead. Dead cells have all empty barcodes in their descendants. If a cell is alive and marked nondividing at some point, the program generates one valid child barcode for it, in which mutation still occurs, and the other child is empty. In this way, the structure of the full binomial tree of evolution, as seen in Fig. 23A, is preserved. For example, leaf nodes \(\{13,14,15,16\}\) are all empty, which may be due to a dead ancestral node at level 2.
Occasionally a barcode is completely cut due to accumulated mutations, and it is then marked dead. It still generates pseudo children barcodes that are all empty/dead. When we sample or partially sample (as explained in section "Fractional sampling") leaf barcodes in the last generation, we only consider nonempty barcodes.
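The bookkeeping for nondividing and dead cells can be sketched as follows. This is a simplified illustration, not the authors' implementation: barcodes and mutations are omitted, each cell is reduced to an alive/empty flag, and the order in which survival (clive) and division (divp) are tested within a cycle is an assumption of this sketch. The function name `grow_tree` is hypothetical.

```python
import random


def grow_tree(k, divp, clive, rng):
    """Return a list of generations; each generation is a list of cell states,
    True for a live cell and None for an empty placeholder."""
    generations = [[True]]                    # generation 0: the root cell
    for _ in range(k):
        nxt = []
        for alive in generations[-1]:
            if not alive or rng.random() >= clive:
                nxt += [None, None]           # dead/empty: both children empty
            elif rng.random() < divp:
                nxt += [True, True]           # the cell divides normally
            else:
                nxt += [True, None]           # nondividing: one valid child
        generations.append(nxt)
    return generations


# Hypothetical usage: count the non-empty leaves available for sampling.
tree = grow_tree(4, divp=0.9, clive=0.95, rng=random.Random(0))
live_leaves = [i for i, cell in enumerate(tree[-1]) if cell]
```

Because every cell always produces two slots in the next generation, generation i has exactly \(2^i\) entries, so the full binomial structure is preserved and only non-empty leaves are passed on to the sampling step.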
Adding a second barcode
The DARLIN mice introduced in^{35} contain three independent target arrays (barcodes), and this experiment raises the question of how to reconstruct the lineage tree with multiple barcodes. In our program we add a second barcode that is edited independently. In this case of double barcodes, the RMP method faces a challenge: for many alleles, one barcode may match the root well while the other contains many mutations and is not closely matched to the root, so it is hard to order the cells well. The NBJ method, however, appears capable of handling this situation. Given two cells, namely Cell I with barcodes (\(I_1\), \(I_2\)) and Cell II with barcodes (\(II_1\), \(II_2\)), let \(M_1\) be the alignment score between \(I_1\) and \(II_1\), and let \(M_2\) be the alignment score between \(I_2\) and \(II_2\). Set \(M=M_1+M_2\); then cells are paired, starting from the currently highest M, if \(M>propm\cdot \max (nnz(I_1),nnz(I_2),nnz(II_1),nnz(II_2))\). Here we may set propm somewhat larger than in the single-barcode case, because M is the total matching score of two barcodes. This round of pairing pairs cells in which both barcodes are matched well. Among the remaining cells we perform a second round of pairing. We search for the largest \(M_1\) and \(M_2\) among the remaining cells. Suppose \(M_1>M_2\); then these two cells are paired if \(M_1>propmi\cdot \max (nnz(I_1),nnz(II_1))\), where propmi stands for proportion of match, individual (barcode). The case \(M_1<M_2\) is handled in a similar manner. This round of pairing pairs cells in which one of the two barcodes matches well, i.e., \(I_1\) vs \(II_1\) or \(I_2\) vs \(II_2\), while the other contains many mutations and differs substantially due to independent editing events. Choosing propm and propmi to maximize lineage accuracy is a delicate task. In simulation we calculate the lineage accuracy for a grid of propm and propmi and discuss the effect of both parameters.
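The first round of double-barcode pairing can be sketched in Python as follows. Several pieces are assumptions made only for illustration: `match_score` is a simple position-wise stand-in for the alignment score used in the paper, `nnz` is taken to count non-gap entries, barcodes are pre-aligned strings with `-` for deleted sites, and the function name `first_round_pairing` is hypothetical.

```python
from itertools import combinations


def nnz(barcode):
    """Number of non-gap entries (assumed meaning of nnz)."""
    return sum(1 for ch in barcode if ch != "-")


def match_score(a, b):
    """Stand-in for the alignment score: count of identical non-gap sites."""
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")


def first_round_pairing(cells, propm):
    """Greedily pair cells by the combined score M = M1 + M2.

    cells maps a cell name to its two barcodes (barcode1, barcode2).
    Returns the list of pairs and the list of cells left unpaired.
    """
    candidates = []
    for u, v in combinations(sorted(cells), 2):
        (u1, u2), (v1, v2) = cells[u], cells[v]
        m_total = match_score(u1, v1) + match_score(u2, v2)
        threshold = propm * max(nnz(u1), nnz(u2), nnz(v1), nnz(v2))
        if m_total > threshold:
            candidates.append((m_total, u, v))
    candidates.sort(key=lambda item: -item[0])   # highest M first
    pairs, used = [], set()
    for _, u, v in candidates:
        if u not in used and v not in used:
            pairs.append((u, v))
            used.update((u, v))
    return pairs, [c for c in sorted(cells) if c not in used]


cells = {
    "a": ("ACGT", "TTGA"),
    "b": ("ACGT", "TTGC"),
    "c": ("GGGG", "CCCC"),
    "d": ("GGGA", "CC-C"),
}
pairs, leftover = first_round_pairing(cells, propm=1.2)
```

In this toy input, cells a and b share M = 7 and cells c and d share M = 6, so both pairs pass the threshold 1.2 × 4 = 4.8. Raising propm to 1.6 keeps only the pair (a, b) and leaves c and d for the second round, which would then test \(M_1\) or \(M_2\) individually against propmi.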
Extensions
Our simulation program builds a framework for CRISPR-Cas9 barcode editing and lineage reconstruction, with plenty of flexibility for further extensions. In this paper we generate barcodes purely at random, and we assume the Cas9 cutting rate is uniform throughout the entire barcode. In our program two alleles are considered identical only if they are exactly the same, bp by bp. For this reason, the chance of obtaining identical alleles in our simulation is small. However, similar alleles may also cause a homoplasy effect. Simulation results suggest that early Dox induction should be restrained to reduce the chance of large segment deletions in early generations, which helps inhibit the homoplasy effect. So, a direct extension is to use a nonuniform cutting rate that depends on the generation number as well as the type of each nucleotide. In addition, instead of purely random nucleotides, the barcode may consist of several copies of a target of a certain length, for example, 10 copies of a 27 bp target, which gives a barcode of length 270 bp. In a future exploration we will carry out this extension based on actual experimental design, and investigate effects such as homoplasy.
Computational count
We take the NBJ-NF method as an example. Let n be the initial barcode length and let k be the maximum generation number. In the case of a full binomial tree there are \(2^k\) leaves, and generation i, \(1\le i \le k\), contains \(2^i\) nodes. Let \(\alpha \in (0,1]\) be a shrinking factor due to barcode editing: because of large segment deletions, we expect barcodes to become shorter and shorter as cells divide, at least on average. A barcode in generation i then has an estimated length of \(\alpha ^i n\). Each pairwise alignment in generation i has a computational count of \(O(\alpha ^{2i}n^2)\), and generation i requires \(2^i(2^i-1)/2\) alignments. So for generation i the computational count is estimated at \(O(\alpha ^{2i}n^2 2^{2i-1})\). Summing from \(i=1\) to \(i=k\), the total computational count is \(O(2^{2k-1}n^2\alpha ^{2k})\), and if \(\alpha =1\), it is \(O(2^{2k-1}n^2)\).
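The summation step can be written out explicitly. The following is a worked sketch under the same estimates as above; the condition \(\alpha > 1/2\), which makes the geometric ratio \(4\alpha^2\) exceed 1, is an assumption added here so that the last generation dominates the sum.

```latex
\begin{align*}
\sum_{i=1}^{k} \frac{2^i(2^i-1)}{2}\,\alpha^{2i} n^2
  &\le \frac{n^2}{2} \sum_{i=1}^{k} (4\alpha^2)^i
   = \frac{n^2}{2}\cdot
     \frac{4\alpha^2\left((4\alpha^2)^k - 1\right)}{4\alpha^2 - 1}
   \qquad (4\alpha^2 \ne 1), \\
  &\le \frac{4\alpha^2}{4\alpha^2 - 1}\; 2^{2k-1}\, n^2\, \alpha^{2k}
   \qquad \left(\alpha > \tfrac{1}{2}\right), \\
\alpha = 1:\quad
\frac{n^2}{2}\cdot\frac{4\,(4^k - 1)}{3}
  &< \frac{4}{3}\, 2^{2k-1}\, n^2 .
\end{align*}
```

The prefactor \(4\alpha^2/(4\alpha^2-1)\) does not depend on k or n, so the total is \(O(2^{2k-1}n^2\alpha^{2k})\), and the cost is dominated by the pairwise alignments among the leaves in generation k.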
Conclusions
In this work we designed a program to simulate the evolution of barcodes and DSB repair in dividing cells under CRISPR-Cas9 editing, where features such as segment deletion, nucleotide insertion and mutation are all included, and a phylogenetic tree is built as cells divide. We also designed four algorithms, the RMP, RMP-NF, NBJ, and NBJ-NF methods, to rebuild the lineage tree from the information of the leaf nodes and the root node, and compared the result to the original phylogenetic tree to test their accuracy. We also performed simulations and reported results with 2 independent barcodes. In the comparison of two lineage trees, nodes are matched in a strict sense, in that the following two conditions must be satisfied: 1) the two nodes must have exactly the same clade of leaf nodes, and 2) matched nodes must belong to the same generation. In this work the comparison results for both paired nodes and internal nodes are provided. The program is also able to calculate the accuracy of matched internal nodes and paired nodes for each generation. Among the many factors that could affect lineage tracing accuracy, we found that sampling size and indel probabilities are the two major ones. Also, large segment deletions in early generations could greatly impact lineage accuracy. A larger sampling size, together with indel probabilities that favor insertion while inhibiting deletion, produces the best lineage accuracy. These simulation results provide important guidance for actual experimental design. Furthermore, our program builds a framework for simulating CRISPR-Cas9 barcode editing that is flexible enough to incorporate further extensions. Future work may integrate this program with actual experiments to generate insightful information that improves lineage tracing accuracy.
Data and code availability
Simulation data and program code are available at https://github.com/xzhanglab/CRISPRbasedLineageTracingSimulation.
References
Kretzschmar, K. & Watt, F. Lineage tracing. Cell 148, 33–45 (2012).
Conklin, E. The Organization and Cell-lineage of the Ascidian Egg. Arch Entwicklungsmech Organismen (Academy of Natural Sciences, 1905).
Vogt, W. Gestaltungsanalyse am Amphibienkeim mit örtlicher Vitalfärbung. II. Teil. Gastrulation und Mesodermbildung bei Urodelen und Anuren. Wilhelm Roux Arch Entwicklungsmech Organismen 120, 384–706 (1929).
Garcia-Bellido, A., Ripoll, P. & Morata, G. Developmental compartmentalisation of the wing disk of Drosophila. Nat. New Biol. 245, 251–253 (1973).
Turner, D. & Cepko, C. A common progenitor for neurons and glia persists in rat retina late in development. Nature 328, 131–136 (1987).
Sauer, B. & Henderson, N. Site-specific DNA recombination in mammalian cells by the Cre recombinase of bacteriophage P1. Proc Natl Acad Sci USA 85, 5166–5170 (1988).
Sun, J. et al. Clonal dynamics of native haematopoiesis. Nature 514, 322–327 (2014).
Lemischka, I., Raulet, D. & Mulligan, R. Developmental potential and dynamic behavior of hematopoietic stem cells. Cell 45, 917–927 (1986).
Zafar, H., Tzen, A., Navin, N., Chen, K. & Nakhleh, L. SiFit: Inferring tumor trees from single-cell sequencing data under finite-sites models. Genome Biol. 18, 178 (2017).
Wagner, D. & Klein, A. Lineage tracing meets single-cell omics: Opportunities and challenges. Nat. Rev. Genet. 21, 410–427 (2020).
Grun, D. et al. Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature 525, 251–255 (2015).
Qiu, X. et al. Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods 14, 979–982 (2017).
Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 37, 547–554 (2019).
Blasi, T., Buettner, F., Strasser, M., Marr, C. & Theis, F. cgCorrect: A method to correct for confounding cell-cell variation due to cell growth in single-cell transcriptomics. Phys. Biol. 14, 036001 (2017).
Tritschler, S. et al. Concepts and limitations for learning developmental trajectories from single cell genomics. Development 146 (2019).
VanHorn, S. & Morris, S. Next-generation lineage tracing and fate mapping to interrogate development. Dev. Cell 56, 7–21 (2021).
Zafar, H., Lin, C. & Bar-Joseph, Z. Single-cell lineage tracing by integrating CRISPR-Cas9 mutations with transcriptomic data. Nat. Commun. 11, 3055 (2020).
Cotterell, J., Vila-Cejudo, M., Batlle-Morera, L. & Sharpe, J. Endogenous CRISPR/Cas9 arrays for scalable whole-organism lineage tracing. Development 147 (2020).
Quinn, J. et al. Single-cell lineages reveal the rates, routes, and drivers of metastasis in cancer xenografts. Science 371 (2021).
Raj, B. et al. Simultaneous single-cell profiling of lineages and cell types in the vertebrate brain. Nat. Biotechnol. 36, 442–450 (2018).
Spanjaard, B. et al. Simultaneous lineage tracing and cell-type identification using CRISPR–Cas9-induced genetic scars. Nat. Biotechnol. 36, 469–473 (2018).
Kalhor, R., Mali, P. & Church, G. Rapidly evolving homing CRISPR barcodes. Nat. Methods 14, 195–200 (2017).
Kalhor, R. et al. Developmental barcoding of whole mouse via homing CRISPR. Science 361 (2018).
Anzalone, A., Koblan, L. & Liu, D. Genome editing with CRISPR–Cas nucleases, base editors, transposases and prime editors. Nat. Biotechnol. 38, 824–844 (2020).
Bowling, S. et al. An engineered CRISPR-Cas9 mouse line for simultaneous readout of lineage histories and gene expression profiles in single cells. Cell 181, 1410–1422 (2020).
Masuyama, N., Mori, H. & Yachie, N. DNA barcodes evolve for high-resolution cell lineage tracing. Curr. Opin. Chem. Biol. 52, 63–71 (2019).
Yang, D., Jones, M., Naranjo, S., Rideout III, W. et al. Lineage tracing reveals the phylodynamics, plasticity, and paths of tumor evolution. Cell 185, 1–19 (2022).
Jones, M., Khodaverdian, A., Quinn, J., Chan, M. et al. Inference of single-cell phylogenies from lineage tracing data using Cassiopeia. Genome Biol. 21, 1–27 (2020).
Chan, M., Smith, Z., Grosswendt, S., Kretzmer, H., Norman, T. et al. Molecular recording of mammalian embryogenesis. Nature 570, 77–82 (2019).
Klein, A. et al. Droplet barcoding for singlecell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201 (2015).
Choudhary, S. & Satija, R. Comparison and evaluation of statistical error models for scRNA-seq. Genome Biol. 23 (2022).
Scavuzzo, M. et al. Endocrine lineage biases arise in temporally distinct endocrine progenitors during pancreatic morphogenesis. Nat. Commun. 9, 3356 (2018).
Gong, W. et al. Benchmarked approaches for reconstruction of in vitro cell lineages and in silico models of C. elegans and M. musculus developmental trees. Cell Syst. 12, 810–826 (2021).
Salvador-Martinez, I., Grillo, M., Averof, M. & Telford, M. Is it possible to reconstruct an accurate cell lineage using CRISPR recorders? eLife 8, e40292 (2019).
Li, L. et al. A mouse model with high clonal barcode diversity for joint lineage, transcriptomic, and epigenomic profiling in single cells. Cell 186, 5183–5199 (2023).
Chen, W. et al. Massively parallel profiling and predictive modeling of the outcomes of CRISPR/Cas9-mediated double-strand break repair. Nucleic Acids Res. 47, 7989–8003 (2019).
Sashittal, P., Schmidt, H., Chan, M. & Raphael, B. Startle: A star homoplasy approach for CRISPR-Cas9 lineage tracing. Cell Syst. 14, 1113–1121 (2023).
Choi, J., Chen, W., Minkina, A. et al. A time-resolved, multi-symbol molecular recorder via sequential genome editing. Nature 608, 98–107 (2022).
Funding
FS Liu and X Zhang are supported by NCI R01 CA251950 and R01 CA183878. YP Yang was supported by the Faculty Development Leave Fund at University of Houston-Clear Lake in Spring 2022.
Author information
Contributions
X Zhang initiated this research and provided critical instructions to conduct this research. Y Yang built the foundation of the simulation program. FS Liu ran the program, collected, organized, and reported simulation data. Y Yang and FS Liu wrote the main body of this paper.
Ethics declarations
Competing interest
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/byncnd/4.0/.
About this article
Cite this article
Liu, F., Zhang, X. & Yang, Y. Simulation of CRISPR-Cas9 editing on evolving barcode and accuracy of lineage tracing. Sci Rep 14, 19213 (2024). https://doi.org/10.1038/s41598024701547