Lineage tracing using a Cas9-deaminase barcoding system targeting endogenous L1 elements

Determining cell lineage and function is critical to understanding human physiology and pathology. Although advances in lineage tracing methods provide new insight into cell fate, defining cellular diversity at the mammalian level remains a challenge. Here, we develop a genome editing strategy using a cytidine deaminase fused with nickase Cas9 (nCas9) to specifically target endogenous interspersed repeat regions in mammalian cells. The resulting mutation patterns serve as a genetic barcode, which is induced by targeted mutagenesis with single-guide RNA (sgRNA), leveraging substitution events, and subsequent read out by a single primer pair. By analyzing interspersed mutation signatures, we show the accurate reconstruction of cell lineage using both bulk cell and single-cell data. We envision that our genetic barcode system will enable fine-resolution mapping of organismal development in healthy and diseased mammalian states.


INTRODUCTION
Understanding the history of a cell is attractive to developmental biologists and genetic technologists because the lineage relationship illuminates the mechanisms underlying both normal development and certain disease pathologies. Researchers have developed a vast arsenal of robust genomic tools to interrogate cells. Traditionally, determining the history of individual cells has been accomplished using fluorescent proteins 1 , Cre-loxP recombinase 2 , somatic transposon events 3 , and accumulated microsatellite mutations 4 . More recently, various cellular barcoding strategies have been developed [5][6][7][8][9] . To this end, the clustered regularly interspaced short palindromic repeats (CRISPR)/Cas9 genome editing system [10][11][12][13] has been primarily used to develop many cellular barcoding methods. However, creating repeat copies of array elements is difficult using methods that incorporate the CRISPR/Cas9 system, resulting in the need to introduce multiple exogenous DNA barcoding array elements into the genome to create a stable transgenic line for model organisms.
The majority of the human genome consists of repetitive elements that have been utilized for live imaging 14 and fetal aneuploidy testing 15 . Recently, researchers have shown that the editing of multiple endogenous retrovirus genes can be achieved without altering normal development 16 .
Inspired by this evidence, we reasoned that multiple target regions in various interspersed sites in repeat elements could serve as barcodes for lineage analysis. We hypothesized that the induction of double-strand breaks (DSBs) in numerous endogenous interspersed sites present in the genome would have a toxic effect on cells 17 and affect cell lineage. Thus, we used the recently proposed base editing method involving dCas9 fused with cytidine deaminase (referred to as BE3 in previous literature 18 but hereafter referred to as targeted deaminase), which converts C:G base pairs to T:A base pairs in the 4-8 nucleotide region from the protospacer adjacent motif (PAM) on the distal side of the protospacer sequence without inducing DSBs.
Consequently, we developed a new cellular barcoding method for lineage tracing using dCas9 fused with cytidine deaminase to target the long interspersed nuclear element-1 (L1) in the genome.
We believed that the interspersed target regions of the targeted deaminase system could be leveraged to introduce various substitution patterns into the regions using only a single-guide RNA (sgRNA).
We confirmed that these unique marker combinations were gradually introduced and that tracking the patterns enabled the accurate reconstruction of the lineage relationship. In addition, we validated our approach at the single-cell level using time-lapse imaging experiments as ground truth data. We expect that our genetic barcoding system could provide insight into normal cell development as well as molecular pathology.

Characterization and selection of target regions for the cell barcoding system
First, we evaluated the amplified regions that were known to be human L1 retrotransposon regions 15 . A single primer pair was used to amplify the L1 retrotransposon region (Figure 1a and Supplementary Figure 1a). The region was designed to maximize the number of distinct sequences and to increase the potential for uniform amplification 15 . As expected, the amplicon size was bimodally distributed, with 99% of the regions overlapping with known L1 subfamilies Next, we searched for the endogenous regions that could serve as barcodes after genome editing using targeted deaminase. Like the existing CRISPR/Cas9 system, targeted deaminase targets a 20 bp spacer adjacent to the 'NGG' PAM sequence and induces C-to-T conversion in the specific protospacer sequence window. For proof-of-concept, we screened all targetable spacer regions to identify those with the 'NGG' PAM sequence in the retrotransposon region and compiled the candidate spacer regions with a C in the 4-8 nucleotide window from PAM on the distal side of the protospacer. We confirmed that there were several sgRNA candidates with identical spacer sequences that satisfied our established conditions. We then selected two spacer sequences with the highest number of precisely matched sites (top two, highest ranking [sgRNA-1] and third ranked  in the targeted repeat regions (Supplementary Figures 1b and 3). The sequences of sgRNA-2 and sgRNA-1 were nearly identical (only a single base difference); thus, sgRNA-2 was omitted from further experiments. We expected that one sgRNA would introduce a substitution at multiple target sites in the L1 regions and defined 'cell barcode' as C-to-T substitutions (see Methods). Although the multiple target sites of each sgRNA had identical spacer sequences, they were distinguishable after amplification of the target region via the surrounding sequences, which could be uniquely aligned to the specific genomic position (Figure 1b).

Analysis of genome editing in repeat regions
We first applied the targeted deaminase system (Supplementary Table 1 (Figure 1c), implying a low number of edited cells. In contrast, sgRNA-3 exhibited an average 4-fold higher editing efficiency (6.3% and 9.3% in HEK293T and HeLa cells, respectively) compared with sgRNA-1, and the correlation of the editing efficiency between the two 'C's in the spacer sequence windows were high in the multiple targets (Figure 1d). We also observed differences in the editing efficiency between the two cell lines across the target sites. No significant background mutations were detected compared with the vehicle control without targeted deaminase and sgRNA (P-value < 2.2e-16, Mann-Whitney U test). As well, aside from Cs in the known 4-8 nucleotide window in the PAM distal end, very low editing frequency of non-target Cs was observed (P-value < 2.2e-16, Mann-Whitney U test, Supplementary Figure 5).

Cumulative introduction of cell barcodes using targeted deaminase
Next, we investigated whether the targeted deaminase system could continuously introduce genetic barcodes in the target repeat regions. To explore the extent of the continuous editing strategy at known time points, we compared sgRNA-1 and sgRNA-3 via repeat serial transfection of the targeted deaminase system at approximately 3-day intervals (Figure 2a). The average editing efficiency increased linearly with the gradual accumulation of edited sites following serial transfection, and the observed editing rate was faster using sgRNA-3 compared with sgRNA-1. Thus, we concluded that the continuous introduction of genetic cell barcodes was feasible using our method.

Lineage tracing at the bulk level using a controlled in vitro cell line experiment
We devised an in vitro cell expansion experiment to investigate whether the cell barcodes were properly introduced each generation for the tree construction. HEK293T and HeLa cells were transfected using the PiggyBac™ transposon system containing the mCherry fluorescence protein, targeted deaminase, and sgRNA (Supplementary Figure 4). After sorting by FACS into 96-well plates, single-cells from the two cell lines underwent clonal expansion. The mCherry-positive singlecells were then sorted and cultured in different wells, and the process was repeated (Figure 2b). The target regions from the expanded cells were amplified using a single primer pair and subjected to next-generation sequencing (NGS). The system was first applied to the HEK293T cells. The known tree topology within each node (a node represents the sum of the cell barcodes) represented by the bulk cells enabled the validation of the lineage tracing. On average, 95% of the unique reads were aligned and these reads were processed further as multiple alignments can occur because of homology between the flanking regions. After alignment and variant calling, we detected an average of five cell barcodes per node in the first generation of the tree and found that ~93% of the target sites were unedited by sgRNA-1.
Next, tree building was performed using an iterative graph approach 7 and an additional posthoc cell barcode selection step (see Methods). For sgRNA-1, the tree could not be identified correctly because of a low number of informative cell barcodes. Conversely, sgRNA-3 showed an average of 29 cell barcodes for the first generation nodes. The accuracy of the reconstructed tree was defined based on the fraction of the correct node placements on the known depth and position of the tree. For sgRNA-3, we achieved 81% accuracy (29/36) of the reconstructed tree based on the variant calling approach. To remove spurious connections and refine the reconstructed tree, we developed a custom algorithm to obtain confident cell barcodes. We used a probabilistic approach with a posterior calculation to select the final candidate cell barcode for the tree construction (see Methods).
Compared to the conventional variant calling approach, we observed an average of 70 cell barcodes for sgRNA-3 and the reconstruction accuracy improved to 92% (33/36) (Figure 2b). For the HeLa cells, we also observed slightly improved performance in accuracy (88% and 93% for conventional variant calling vs. custom algorithm, respectively) (Supplementary Figure 6).

Reconstruction of the lineage relationship at the single-cell level
Next, we explored whether our targeted deaminase system could be extended to the reconstruction of lineage relationships at the single-cell level. Because the tree reconstruction efficiency was better with sgRNA-3 than sgRNA-1, we focused on sgRNA-3 only and chose HeLa cells for the single-cell level lineage tracing for the ease of isolating single-cells. HeLa cells were transfected with the PiggyBac™ transposon system containing the mCherry fluorescence protein, targeted deaminase, and sgRNA-3. We used time-lapse imaging to generate ground truth tree data and picked individual cells (n = 32 total single-cells analyzed [3 different trees]) determined to be mCherry marker-positive by manual inspection (Figure 2c). To prepare for the sequencing experiment, we first performed whole-genome amplification (WGA) and subsequent PCR amplification of the selected single-cells. However, uneven distribution of the sequencing reads prohibited robust cell barcode identification. Furthermore, it has been reported that increased background C-to-T mutations can occur because of the high denaturation temperature during WGA 19 .
Thus, we elected to optimize the single-cell PCR conditions. After the optimization, we achieved more uniform distribution of depth over the target regions ( Supplementary Figure 7 and Methods).
A standard agglomerative hierarchical clustering approach was used for three different experimental trees for three to four divisions (8-16 cells) of the expansion. Confident cell barcodes were encoded in the binary state and pairwise cell-to-cell distances were calculated for the tree reconstruction.
Hierarchical clustering using an average-linkage method consistently recovered the known trees

DISCUSSION
Here, we demonstrated a lineage analysis platform using targeted deaminase with sgRNA.
Further, we showed that library preparation could be simplified using a single flanking primer pair to amplify the edited endogenous repeat element (L1 retrotransposon region 20 ). As well, our method relies on more predictable barcodes based on substitution patterns. Most importantly, our method does not rely on introducing exogenous barcodes without making dsDNA breaks. We used endogenous repeat elements as potential barcodes, eliminating the need to create and insert complex repeat arrays.
The genetic editing pattern in endogenous sites can serve as a cell barcode to reconstruct the cell lineage tree. A limitation of conventional Cas9-induced lineage tracing is the rapid saturation of genetic barcodes, which might hinder long-term labelling experiments. In contrast, the low editing rate of our method (0.06 vs 0.4 per hour 7 for our method vs conventional Cas9-induced method, respectively) enables long-term lineage tracing and can be further controlled using different sgRNA designs as demonstrated by the observed differences between sgRNA-1 and sgRNA-3. Notably, the high theoretical diversity (sgRNA-3, 10 250 ) of the cell barcodes is sufficient to cover approximately 37 trillion cells in the human body. Although we have only used targeted deaminase for the lineage analysis, the mutation rate could be controlled using other base editing systems 21,22 or the rational design of sgRNA that target other regions. In addition, other engineered dCas9 proteins or other classes of CRISPR proteins 23-26 might influence the editing efficiency or target diversity.
Alternatively, we could simultaneously integrate multiple sgRNAs into the genome and track the resulting mutation patterns to reconstruct the lineage in a more predictable manner.
Our method has some technical challenges. Although the potential diversity of our proposed method is substantially larger than the diversity achieved using previous methods, we could not fully explore this aspect because of the limited number of single-cells sampled and the low editing rate. In the conventional Cas9 system, edited barcode arrays are amplified and can be readout by paired-end sequencing 5 . Conversely, clonal information could not be directly inferred from our bulk cell analysis because of the interspersed nature of the barcode regions. The analyzable clonal sequence space is likewise limited by current sequencing length (300 bp for the Illumina platform using the 150 bp paired-end mode). Instead, we devised a lineage relationship between the bulk cells (isolated from a single well) that could be determined based on the integrated cell barcode pattern of the cell population.
We employed a custom algorithm to select robust cell barcodes for bulk cell-based lineage tracing because conventional variant calling based on the stringent FreeBayes algorithm returned a small number of variants. This allowed flexible control of the various parameters and eliminated the need for simple, sequencing depth-based filtering. In a previous study, the original tree topology was preserved with high accuracy using a graph-based approach 7 . In our single-cell analysis, tree building for three different sets of three to four generation single-cell division was also accurate. In simulations, we detected higher accuracy of the tree reconstruction with an increasing number of barcodes. In a practical scenario in which conventional Cas9 causes barcode dropout, our base editing scheme (without dsDNA breaks) could potentially be useful for lineage construction. In the future, additional comparative studies are required to investigate whether the base editing strategy confers an advantage over the Cas9 nuclease-based method from the perspective of the effect of barcode dropout on tree reconstruction.
In summary, we have shown a proof-of-concept for using the targeted deaminase system for lineage tracing through in vitro tree expansion experiments and time-lapse imaging. In doing so, we have laid the foundation for developing an alternative technique to trace cell lineages using endogenous targets. In future studies, we hope that our system can be used to determine the fates of different cells for organ development in a transgenic model organism. In the long-term, if coupled with transcriptional information, our method could provide a high-resolution view of developmental lineages.

Plasmid construction
The PB CMV-BE3 EF1α-mCherry-T2A-puro sgRNA (Supplementary Table 1  The experiment that entailed the continuous introduction of genetic barcodes was conducted using the aforementioned transfection method. Half of the cells were harvested for gDNA 3 days after transfection and the other half of the cells were cultured for the serial transfection.

Generation of cells containing the genetic barcoding system
In the experiment to measure barcode editing efficiency, only the genetic barcoding system without transposase was transfected using Lipofectamine™ 3000. The transfected cells were harvested at 4,8,12,16,20,24,30,36,42, and 48 hours, followed by gDNA extraction.

A controlled in vitro tree experiment
The procedures used to generate the barcodes were identical to the protocols described above, and successfully transfected cells were selected using puromycin (

Endogenous genetic barcode amplification -bulk cells
The gDNA extracted from the cells described above were used for the amplification of  Table 3.

Transfection for time-lapse imaging
The HeLa cells were prepared by subculturing in a 35-mm dish with 2 mL DMEM, supplemented with 1% P/S, 1% non-essential amino acids (NEAA) (Gibco, USA), 100 mM 2mercaptoethanol (2-ME) (Sigma-Aldrich), and 10% FBS. The cells were transfected with plasmids using an electroporation system (Neon®, Invitrogen, voltage:1140 v, pulse width range: 40 ms, pulse number: 1). After 4 hours of transfection, the culture medium was replaced with fresh medium to remove the dead cells.

Manual cell picking and monitoring for time-lapse imaging
All of the single-cell manipulations were conducted using a micromanipulator device (Nikon- l of nuclease-free water. The PCR reaction was performed using the following protocol: 2 min at 98°C followed by 10 cycles of 10 s at 98°C, 2 min at 57°C, and 2 min at 72°C and a final 2 min step at 72°C. After purification using AMPure beads, the initial PCR product was loaded into a second index PCR reaction and PCR was performed using the following protocol: 30 s at 98°C followed by 15 cycles of 10 s at 98°C, 30 s at 60°C, and 30 s at 72°C and a final 10 min step at 72°C. The final product was then purified using AMPure beads. Single-cell PCR amplification of the manually picked single-cells was performed using the PCR reaction composition ratio described above and following protocols: 1. Adaptor (initial) PCR: 2 min at 98°C followed by 30 cycles of 10 s at 98°C, 2 min at 57°C, and 2 min at 72°C and a final 2 min step at 72°C. The product was then purified using AMPure beads. 2. Index (second) PCR: 30 s at 98°C followed by 15 cycles of 10 s at 98°C, 30 s at 60°C, and 30 s at 72°C and a final 10 min step at 72°C. The products of the second PCR were then loaded onto a 2% agarose gel and separated by gel electrophoresis. The band at the expected size was purified using the QIAquick Gel Extraction Kit (QIAGEN, USA) according to the manufacturers' protocols. Likewise, sequencing was performed on the Illumina NextSeq 500 using the NextSeq 500/550 High Output v2 kit (300 cycles).

Analysis of amplified post-alignment processing
Reads were aligned to hg19 using BWA (v0.7.12-r1039) and realignment around indels (RealignerTargetCreator, IndelRealigner) was performed using GATK (v3.3-0). Per-position base calling was accomplished using the SAMtools (v1.1) mpilup function and the pileup file was used for custom variant calling (details in the next section). The aligned regions were annotated using RepeatMasker (http://www.repeatmasker.org) and the sizes of the amplified regions were plotted to calculate the overlap fraction.

Identification of confident sites for lineage reconstruction
We first adopted a variant calling approach using FreeBayes (v1.1.0-3-g961e5f3) to extract confident markers (C>T substitutions) for the lineage reconstruction. The variant calling used FreeBayes (input from BAM after indel realignment) and filtered positions (depth > 10) considered candidate markers, and only included the markers with higher allele frequency than the value calculated for the background control using an empty vector. For the bulk and single-cell linage tracing experiments involving HeLa cells, variant calling was performed using modified parameters (--ploidy 3, --pooled-discrete). To handle both the bulk and single-cell data efficiently, we developed a custom algorithm for a variant calling strategy that was based on our targeted deaminase system. We adopted a probabilistic approach using a binomial mixture model with conditional probabilities, as described in a previous study 28 . An expectation-maximization algorithm was used to estimate the model parameters to account for the inherent deviation of allele frequencies in unstable genomes (e.g., genomes with different ploidies). Every candidate position in the target region, depth >10×, variant allele count > 2, and posterior probabilities ≥ 0.95 was selected as a final marker. After performing a union operation for all the markers present in the bulk nodes, we selected confident markers using following criteria: 1) sites having a minimum editing efficiency of 0.1 and 2) removal of highly correlated markers (≥0.9). The sum of these markers for each cell represented the final 'cell barcode.'

Tree building for bulk and single-cell experiments
For the bulk cell experiment, tree building was performed using an approach similar to the previous method (https://bitbucket.org/Bastiaanspanjaard/linnaeus). We used the base editing pattern to build the substitution graph. For simplicity, the nodes were identified as CIGAR string-like sequences (i.e., 1E10E means the first and the 10-th C position in the perfect on-target region was edited). Inherently, the first ancestor node (bulk cells) had a maximal degree of connections with the other nodes. Iteratively, we removed this node and identified the ancestor node from the remaining connected components. This procedure was repeated until all the nodes were designated. Once all the pairwise cell networks were built, the cells were placed in the graph. We did not use the 'cell doublet' detection threshold because scRNA-seq was not used in this study.
For the single-cell-based lineage tracing, the information was restricted regardless of whether the site was edited. To identify confident markers, blacklist candidate regions (integration of the single-cell results exhibiting no mCherry signal or vehicle control single-cells) were also filtered out.
Unlike the bulk cell lineage construction, the time-lapse-based single-cell experiment contained the cells from the last depth of the expansion. Thus, the lineage tracing was accomplished using a different logic. The distance between the cells was calculated using the Jaccard index and hierarchical clustering was performed using the pvclust function in R. Approximately unbiased probability values (p-values) were calculated based on 1,000 iterations.

Inferring editing efficiency dynamics and simulations
The cells were transfected with the targeted deaminase vector as described above. After 4,8,12,16,20,24,30,36,42, and 48 hours, the bulk cells were collected and amplified for sequencing (two replicates). Assuming a wild type fraction of 1 (100%) in t=0, we estimated the editing rate (λ) by fitting the exponential function. We designated the wild type percentage as the fraction of unedited sites (C>T candidate positions) in the perfect on-target regions.

Data analysis
All statistical analyses were performed using the R computing environment and data plotting was conducted with the seaborn and ggplot2 packages in R. For Figure 1c and Figure 2a, two-tailed Man-Whitney U test was used to test for significance of difference between the control and sample means.

Data availability
The sequencing data supporting the findings of this study are available in the Sequence Read Archive with the identifier SRA SRP151792.