A key goal of developmental biology is to understand how a single cell is transformed into a full-grown organism comprising many different cell types. Single-cell RNA-sequencing (scRNA-seq) is commonly used to identify cell types in a tissue or organ1. However, organizing the resulting taxonomy of cell types into lineage trees to understand the developmental origin of cells remains challenging. Here we present LINNAEUS (lineage tracing by nuclease-activated editing of ubiquitous sequences)—a strategy for simultaneous lineage tracing and transcriptome profiling in thousands of single cells. By combining scRNA-seq with computational analysis of lineage barcodes, generated by genome editing of transgenic reporter genes, we reconstruct developmental lineage trees in zebrafish larvae, and in heart, liver, pancreas, and telencephalon of adult fish. LINNAEUS provides a systematic approach for tracing the origin of novel cell types, or known cell types under different conditions.
Measuring lineage relationships between cell types is important for understanding fundamental mechanisms of cell differentiation in development and disease2,3. In early development and in adult systems with a constant turnover of cells, short-term lineage predictions can be computed directly on scRNA-seq data by ordering cells along pseudotemporal trajectories according to transcriptome similarity4,5,6. However, the developmental origin of cells in the adult body cannot be identified using these approaches alone. Several approaches for lineage tracing exist. Genetically encoded fluorescent proteins are widely used as lineage markers7,8, but due to limited spectral resolution, optical lineage tracing methods have mostly been restricted to relatively small numbers of cells. Pioneering studies based on viral barcoding9,10, transposon integration sites11, microsatellite repeats12, somatic mutations13,14, Cre-mediated recombination15, and genome editing of reporter constructs16,17 have used sequence information to increase the diversity of lineage labels. However, these methods have not been coupled with single-cell transcriptome sequencing and therefore do not provide any information on cell type.
Here we present LINNAEUS for simultaneous measurement of single-cell transcriptomes and lineage markers in vivo. The approach is based on the observation that, in the absence of a template for homologous repair, Cas9 produces short insertions or deletions at its target sites, which are variable in their length and position16,18,19. We reasoned that these insertions or deletions (hereafter referred to as genetic 'scars') constitute heritable cellular barcodes that can be used for lineage analysis and read-out by scRNA-seq (Fig. 1a). To ensure that genetic scarring does not interfere with normal development, we targeted an RFP transgene in the existing zebrafish line zebrabow M, which has 16–32 independent integrations of the transgenic construct20. Since these integrations are in different genomic loci (as opposed to being in tandem), we could make sure that scars cannot be removed or overwritten by Cas9-mediated excision. We injected Cas9 and an sgRNA for RFP into one-cell-stage embryos to mark individual cells with genetic scars at an early time point in development (Fig. 1b). Loss of RFP fluorescence in injected embryos served as a direct visual confirmation of efficient scar formation (Supplementary Fig. 1). At a later stage, we dissociated the animals into a single-cell suspension and analyzed the scars by targeted sequencing of RFP transcripts (Online Methods). Simultaneously, we sequenced the transcriptome of the same cells by conventional scRNA-seq, using droplet microfluidics21 (Fig. 1c and Supplementary Figs. 2 and 3).
We analyzed single-cell transcriptomes of >70,000 cells from dissociated larvae at 5 days post-fertilization (dpf). On average, we detected ∼3,000 unique transcripts from ∼700 genes per cell (Supplementary Data set 1). Unsupervised clustering of single-cell transcriptomes22 revealed 70 groups of cells with distinct gene expression programs (Fig. 1d and Supplementary Fig. 4). We assigned these clusters to cell types based on differentially expressed genes (Supplementary Fig. 5 and Supplementary Data set 2). We found that Cas9 generated hundreds of unique scars per animal when targeting a single site in RFP (Fig. 1e,f and Supplementary Fig. 6), suggesting that analysis of genetic scars constitutes a useful approach for whole-organism lineage analysis. Bulk analysis of 32 individual larvae revealed that some scar sequences are more likely to be created than others, probably through mechanisms like microhomology-mediated repair23 (Fig. 1e). The scars with the highest intrinsic probabilities may be created multiple times per embryo and are therefore uninformative for lineage reconstruction. We therefore excluded the most frequent scars (P > 0.01) from further analysis. We found that scarring continued until around 10 h after fertilization, a stage at which zebrafish already have thousands of cells (Fig. 1g). Thus, our injection-based approach for Cas9 induction allowed us to label cells in an important developmental period during which the germ layers are formed and precursor cells for most organs are specified.
We detected variable numbers of scars in single cells, with the average number of scars per cell ranging from ∼2 for erythrocytes to ∼5 for epidermal cells (Supplementary Data set 3). This indicated that some lineage information was lost due to the sparsity of scRNA-seq data. To investigate this issue in more detail, we analyzed single cells from the offspring of fish injected with Cas9 (Supplementary Fig. 7). In these fish, all cells (independent of the cell type) have the same scar profile, as they are derived from the same pair of germ cells. This analysis confirmed that scar detection efficiencies are dependent on cell type, which probably reflects differences in cell size or promoter strength. Furthermore, we observed some variance in scar-detection efficiency between the different transgenic integrations, which may be linked to genomic features of the integration sites. Notably, we did not find any highly expressed scars that were undetectable in specific cell types, suggesting that developmental silencing of specific integrations is of no major concern.
To validate that genetic scars contain useful information about lineage relationships, we calculated enrichment or depletion of scar connections between pairs of cell types (Supplementary Fig. 8). Clustering cell types by scar connection strength revealed three groups, each of which contained either mostly ectodermal or mesendodermal cell types (Supplementary Fig. 9). We suggest this pattern was caused by a small number of scars that were created during the first cell divisions and then expanded locally. These groups of cell types formed contiguous domains on the zebrafish fate map24, but did not strictly correspond to germ layers, since the domain boundaries of scar clones do not necessarily align with the boundaries between germ layers.
Next, we set out to analyze the data at higher resolution and reconstruct lineage trees on the level of single cells instead of cell types. As our previous filtering of frequent scars removed scars that may have been created multiple times, the lineage tree should fulfill the maximum parsimony principle, with every scar being created exactly once. Indeed, maximum parsimony approaches have previously been used for inferring trees from CRISPR–Cas9 lineage data16. Earlier studies indicated that missing data do not need to be detrimental to maximum parsimony tree-building methods25. However, such studies typically focused on a regime with an order of magnitude fewer taxa than we have cells, and more characters than we have scars. Using two simulated data sets, we found that Camin–Sokal maximum parsimony failed to reconstruct the correct tree for our system (Supplementary Figs. 10 and 11). While it might be possible to solve this issue using modified versions of maximum parsimony or other established tree-reconstruction algorithms, we developed an algorithm that is custom-tailored to our experimental system. Our custom-built strategy also facilitated integration of a filtering step to remove spurious connections. We therefore developed a computational method that fulfills the maximum parsimony criterion and allows for reconstruction of the correct tree in our system even if not all scars are detected in every cell (Supplementary Fig. 10 and Supplementary Note 1). Our algorithm is based on the observation that there is a correspondence between the underlying lineage tree and the resulting scar network graph, a representation of all pairwise combinations of scars that are experimentally observed together in single cells (Fig. 2a). If all scar connections are detected, the scar that is created first has the most connections in the scar network graph, followed by scars that were created next, enabling lineage tree reconstruction in an iterative manner (Fig. 2b). 
To remove spurious connections, caused by cell doublets, for example, scar connections that do not occur in enough cells were not taken into consideration (Supplementary Note 1). Using a simulated data set with realistic parameters, we found that our computational method correctly reconstructed lineage trees (Supplementary Fig. 11). Finally, we placed all single cells in the lineage tree based on the detected scars (Fig. 2c). Scar dropouts meant that we did not have full lineage information about every single cell. However, the reconstructed lineage tree allowed us to infer a large part of the missing scar information (Supplementary Fig. 12). The resulting single-cell lineage trees were then converted to a condensed representation for easier interpretation (Fig. 2d).
For the 5-dpf larvae we found that, as expected, the major developmental lineages shown in Figure 1d were separated at least partially from each other in the reconstructed lineage trees (Fig. 2e and Supplementary Figs. 13 and 14). These data can be explored at different levels of granularity, and we decided to focus next on the cell types of the lateral plate mesoderm (Fig. 2f). We found that the different blood cell types had a shared lineage, but we observed that the erythrocytes were also found in an additional branch that did not contain any immune cells. This observation probably reflects the transition from primitive to definitive hematopoiesis in early zebrafish development, as primitive hematopoiesis produces mostly erythrocytes, whereas definitive hematopoietic stem cells are capable of generating all blood cell types26. The primitive and definitive hematopoietic stem cells are known to have different developmental origins. We found that the putative definitive hematopoietic cells have a shared lineage origin with endothelial cells (Fig. 2f and Supplementary Fig. 14), which is to be expected, as the definitive hematopoietic stem cells (but not the primitive ones) are derived from endothelial cells of the dorsal aorta. For endodermal and neuronal/neural crest cell types, we observed a similar structure of partially cell-type-specific lineage branches (Supplementary Fig. 15). Due to the stochastic nature of cell labeling in LINNAEUS, scar creation is not synchronized with mitosis. It is therefore important to note that reconstructed lineage trees did not necessarily contain all cell divisions (Supplementary Fig. 11). Furthermore, early zebrafish development is highly variable27. We cannot, therefore, expect to find exact correspondence of early lineage trees for all cell types in different animals.
In another set of experiments, we applied LINNAEUS to dissected organs of adult fish (Supplementary Data sets 4 and 5). Analysis of >40,000 cells from the telencephalon, heart, liver, and the primary pancreatic islet by scRNA-seq allowed us to identify many different cell types in these organs (Fig. 3a and Supplementary Figs. 16 and 17). We first analyzed the resulting lineage trees at low granularity, which revealed a strong separation of the individual organs (Fig. 3b and Supplementary Fig. 18). However, we also detected several cell types, mostly from the immune system, that were present in multiple organs. We found that, as expected, the immune cells from different organs were grouped together in the lineage tree (Fig. 3c), which provided additional validation of our approach for scar filtering and lineage tree reconstruction. We next zoomed into cardiac and pancreatic cell types (Supplementary Fig. 19). In agreement with the literature, we detected an early separation of myocardial and endocardial lineages28. In the primary pancreatic islet, we observed scars that covered all three major endocrine cell types (alpha, beta, delta). However, we also found a smaller scar clone (scar 1204) in which delta cells were strongly underrepresented compared to the other scars, suggesting that the progenitors carrying this scar predominantly contributed to the alpha and beta cell lineages (Supplementary Fig. 19). Further studies would be necessary to corroborate potential biases of endocrine progenitors toward particular cell fates.
Related single-cell lineage tracing methods based on CRISPR–Cas9 technology have recently been used to study brain development as well as the clonal history of different organ systems in the zebrafish29,30. An important advantage of CRISPR–Cas9 lineage tracing compared to competing technologies, such as viral barcoding and other inducible sequence-based lineage tracing methods, is the ability to move beyond clonal analysis and to computationally reconstruct full lineage trees on the single-cell level. This is made possible by our computational approach for tree reconstruction that is robust to dropout events under realistic experimental conditions, and by our experimental strategy that uses independent scarring sites whose scars, once created, cannot be changed again. Within a single experiment, data analysis can be performed at different levels of granularity, from germ layers to organs and cell types. Our combined experimental and computational platform thus provides a strategy for dissecting the lineage origin of uncharacterized cell types and for measuring the capacity of lineage trees to adapt to genetic or environmental perturbations. Our approach is based on an existing transgenic animal with multiple integrations of a transgenic construct, which should facilitate adaptation of the method to other model systems.
The observation that Camin–Sokal maximum parsimony failed to reconstruct the correct tree for our system (Supplementary Figs. 10 and 11) serves as a cautionary note regarding computational analysis of CRISPR–Cas9 lineage data. However, additional studies would be necessary to systematically compare our algorithm to existing methods for tree reconstruction under different parameter regimes. Developing a general statistical framework for disentangling biological and technological variability of CRISPR–Cas9 lineage tracing remains another important open challenge for the future. We anticipate that future modifications of the experimental platform, such as, for instance, inducible systems, will enable longer periods of lineage tracing and molecular recording of cellular signaling events during cell fate decisions.
Zebrafish lines and animal husbandry.
We used the transgenic zebrafish line zebrabow M20 for LINNAEUS. This line has multiple integrations of a transgenic construct that expresses RFP from the ubi promoter, which is constitutively active in all cell types. Fish were maintained according to standard laboratory conditions. All animal procedures were conducted as approved by the local authorities (LAGeSo, Berlin, Germany) under license number G0211/16. We set up crosses between zebrabow M adults with high RFP fluorescence, and we injected the embryos at the one-cell stage with 2 nl Cas9 protein (NEB, final concentration 350 ng/μl) in combination with an sgRNA targeting RFP (final concentration 50 ng/μl, sequence: GGTGTCCACGTAGTAGTAGCGTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTT). Since injection efficiencies may vary (Supplementary Fig. 1), we selected embryos with low RFP fluorescence for single-cell analysis. For control experiments in Supplementary Figures 2 and 7 we set up crosses between pairs of adult Cas9-injected fish.
The sgRNA was transcribed in vitro from a template using the MEGAscript T7 Transcription Kit (Thermo Scientific). The sgRNA template was synthesized with T4 DNA polymerase (New England Biolabs) by partially annealing two single-stranded DNA oligonucleotides containing the T7 promoter and the RFP binding sequence, and the tracrRNA sequence, respectively. In the experiments described here, we did not use the ability of the line zebrabow M to switch from RFP to YFP or CFP expression upon addition of Cre20.
Preparation of single-cell suspensions.
Single larvae at 5 dpf were transferred into 50 μl HBSS containing 1× TrypLE (Thermo Fisher Scientific) and incubated at 33 °C for ∼20 min with intermittent mixing with a pipette (every 5 min) until the larva was no longer visible. 500 μl cold HBSS (Thermo Fisher Scientific) supplemented with 1% BSA was then added to the suspension, and the cells were pelleted in a table-top centrifuge at 4 °C and 300g for 5 min. The pellet was washed with 500 μl cold HBSS supplemented with 0.05% BSA and centrifuged down again. The resulting pellet was resuspended in the same buffer and filtered through a cell strainer of 35-μm diameter.
Adult zebrafish were euthanized by an overdose of tricaine in combination with low water temperature. Afterwards, heart, brain, pancreas islets, and liver were isolated from the fish. Single-cell suspensions of the organs were obtained using different protocols:
Heart. The zebrafish heart including atrium, ventricle and bulbus arteriosus was transferred into cold HBSS and opened carefully with forceps, allowing most of the erythrocytes to be washed away. Afterwards, the heart tissue was transferred into 500 μl HBSS containing Liberase enzyme mix (Sigma-Aldrich, 0.26 U/mL final concentration) and Pluronic F-68 (Thermo Fisher Scientific, 0.1%). The reaction was incubated at 37 °C for 30 min while shaking at 750 r.p.m. with intermittent pipette mixing. Afterwards, most of the tissue was dissociated. The reaction was stopped by adding 500 μl cold HBSS supplemented with 1% BSA. The cells were pelleted by centrifuging at 200g in a table-top centrifuge at 4 °C, then washed and filtered following the procedure described above for 5-dpf larvae.
Brain. The telencephalon without olfactory bulbs was isolated in cold HBSS and immediately transferred to a solution of HBSS with 0.81% D-glucose and 15 mM HEPES. Dissociation was initiated by adding 0.1× TrypLE and 0.1% Pluronic F-68 (final concentrations). The tissue was incubated for 30 min at 37 °C while shaking at 750 r.p.m., with occasional gentle mixing. The dissociation reaction was stopped by addition of an equal volume of EBSS solution containing 4% BSA and 20 mM HEPES. The sample was filtered using a 70-μm filter and centrifuged at 300g for 5 min, after which the pellet was washed once with PBS and resuspended in HBSS with 0.04% BSA. Finally, the suspension was filtered with a 35-μm filter.
Pancreas and liver. The pancreatic tissue containing preferentially the primary pancreatic islet was isolated under a stereomicroscope and transferred into 500 μl HBSS containing 1× TrypLE and 0.1% Pluronic F-68. The liver was isolated and dissected into small pieces, one of which was transferred into 500 μl HBSS containing 1× TrypLE and 0.1% Pluronic F-68. After 30 min of incubation at 37 °C with intermittent pipetting, the suspensions were pelleted, washed and filtered following the procedure described above for 5-dpf larvae.
All final single-cell suspensions were quantified and controlled for quality by microscopy using a hemocytometer.
Scar detection in bulk samples.
DNA-based scar detection. DNA of single animals was extracted by heating the samples in 50 μl of 50 mM NaOH at 95 °C for 20 min. 1/10 volume of 1 M Tris-HCl, pH = 8.4 was then used to neutralize the mixture. We took 20 μl of the DNA for amplification of scar sequences using RFP-specific barcoded primers. The RFP primers were chosen such that the cut site of Cas9 was positioned approximately in the middle of the sequencing read. We then pooled the PCR products, performed a clean-up reaction using magnetic beads (AMPure Beads, Beckman Coulter), and added Illumina sequencing adapters in a second PCR reaction. Primer sequences are provided in Supplementary Table 1.
RNA-based scar detection.
RNA of single or pooled animals was extracted with TRIzol Reagent (Thermo Fisher Scientific) according to the manufacturer's protocol. The RNA was precipitated using isopropanol, and the pellet was washed twice with 75% ethanol, air-dried, and resuspended in 10 μl of reverse transcription mix (0.3 μM poly-T primer, 1× first strand buffer (Thermo Fisher Scientific), 10 μM DTT, 1 mM dNTPs, 0.5 μl RNAseOUT (Thermo Fisher, Cat. No. 10777019), 0.5 μl SuperScript II (Thermo Fisher, Cat. No. 18064-014)). The reaction was incubated at 42 °C for 2 h for reverse transcription, followed by scar-specific PCR amplification as described above for DNA-based scar detection.
Transcriptome and scar detection in single cells.
Single cells were captured using Chromium (10X Genomics, PN-120233), a droplet-based scRNA-seq device, according to the manufacturer's recommendations. Briefly, the instrument encapsulates single cells with barcoded beads, followed by cell lysis and reverse transcription in droplets. Reverse transcription was performed with polyT primers containing cell-specific barcodes, Unique Molecular Identifiers31 (UMI), and adaptor sequences. After pooling and a first round of amplification, the library was split in half. The first half was fragmented and processed into a conventional scRNA-seq library using the manufacturer's protocols. We used the second, unfragmented, half to amplify scar reads by two rounds of PCR, using two nested forward primers that are specific to RFP, and reverse primers binding to the adaptor site. The RFP primers were chosen such that the cut site of Cas9 was positioned approximately in the middle of the sequencing read, ensuring that a broad range of deletion lengths can be reliably detected. Primer sequences are provided in Supplementary Table 1. We confirmed successful library preparation by Bioanalyzer (DNA HS kit, Agilent). Samples were sequenced on Illumina NextSeq 500 (2 × 75 bp) and Illumina HiSeq 2500 (2 × 100 bp).
Mapping and extraction of single-cell mRNA transcript counts.
A zebrafish transcriptome was created with Cell Ranger 2.0.2 from GRCz10, release 90. Alignment and transcript counting of libraries were done using Cell Ranger. Cell numbers to be extracted were set at a minimum of 6,000 but were increased if there were substantially more cells with more than 500 unique transcripts. Exact numbers can be found in Supplementary Data set 1.
Mapping and filtering of single-cell scar data.
Scar reads have the same structure as transcript reads: they consist of a barcode, a UMI and a scar. The scar sequences were aligned using bwa mem32 to a reference of RFP. Valid cell barcodes were identified based on the single-cell transcriptome data (see previous paragraph). We removed reads that were unmapped, had an incorrect barcode, or did not start with the exact PCR primer we used. We truncated all scar sequences to 75 nucleotides and filtered out shorter sequences.
To mitigate the effect of sequencing errors, we implemented several rounds of scar filtering (Supplementary Fig. 2). We started by counting the number of times each molecule was sequenced. Sequencing errors will typically have fewer reads than the actual scars they originate from. As a first filtering step, we therefore removed all molecules seen only once, which reduces the complexity of the data set for the subsequent filtering steps.
In the second filtering step, we aimed to remove easily recognizable sequencing errors and chimeric reads33. To this end, we consecutively considered scar sequences that have the same cellular barcode and UMI, UMIs that have the same cellular barcode and scar sequence, and cellular barcodes that have the same UMI and scar sequence. In each step, we kept only the molecule with the highest number of reads. The rationale behind this is that it is very improbable to have two valid scar sequences in the same cell with the same UMI, or to have a scar sequence with the same UMI appear in two different cells. The observation of two different UMIs for the same scar in the same cell is much more likely and corresponds to detection of multiple transcripts from the same locus, but information about scar expression levels was not required in our downstream analysis.
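The three successive comparisons of the second filtering step can be sketched in Python as follows. This is a minimal illustration, not the original pipeline: it assumes molecules are stored as a dictionary from (cell barcode, UMI, scar) tuples to read counts, and the function names `collapse` and `second_filter`, the tie-breaking rule, and the toy read counts are all hypothetical.

```python
def collapse(molecules, vary):
    """Among molecules that agree on every field except `vary`,
    keep only the one with the most reads (first seen wins ties).

    molecules: dict mapping (barcode, umi, scar) -> read count
    vary: index of the field allowed to differ (0=barcode, 1=UMI, 2=scar)
    """
    best = {}
    for key, reads in molecules.items():
        group = tuple(v for i, v in enumerate(key) if i != vary)
        if group not in best or reads > best[group][1]:
            best[group] = (key, reads)
    return {key: reads for key, reads in best.values()}


def second_filter(molecules):
    # Order follows the text: scars sharing barcode and UMI, then UMIs
    # sharing barcode and scar, then barcodes sharing UMI and scar.
    for vary in (2, 1, 0):
        molecules = collapse(molecules, vary)
    return molecules


# Toy input; scars are written as CIGAR shorthand for readability.
example = {
    ("cellA", "UMI1", "47M6D28M"): 40,
    ("cellA", "UMI1", "47M5D28M"): 2,    # same cell and UMI, fewer reads
    ("cellA", "UMI2", "47M6D28M"): 10,   # second UMI for the same scar
    ("cellB", "UMI1", "47M6D28M"): 1,    # same UMI and scar in another cell
    ("cellB", "UMI3", "30M10D45M"): 25,
}
filtered = second_filter(example)
```

Because only one UMI per cell and scar survives, the result records which scars are present in which cells but not their expression levels, which matches the statement that expression information was not needed downstream.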
In the third filtering step, we specifically targeted sequencing errors within each cell. We compared the scar sequences found within a cell to each other. We filtered out sequences that had a Hamming distance of 2 or less to another scar sequence in the same cell that occurred in at least eight times as many reads. Pairs of scar sequences in the same cell that had a Hamming distance of one but a read ratio of less than 8 were tested on three criteria, provided that both of them occurred at least twice in the scar library:
Do both scars have more than one transcript?
Do both scars occur in cells independently from each other?
Do the UMIs of both scars have Hamming distance of two or more?
If at least two of these criteria were true, the scars were kept and the sequences were placed on a list of validated scars that, if they occurred in the same cell in another library, did not have to be tested anymore. If one or zero criteria were true, the scar that had only one transcript, or the scar that did not occur independently, was filtered out.
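The read-ratio rule of the third filtering step can be illustrated with a short Python sketch. It assumes that the scars of one cell are given as a dictionary from equal-length sequences to read counts, and it reads "two of these criteria" as "at least two". The function names and the toy sequences are hypothetical and much shorter than real 75-nucleotide scar reads.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def remove_likely_errors(cell_scars, max_dist=2, min_ratio=8):
    """Drop scars that lie within `max_dist` mismatches of another scar in
    the same cell that has at least `min_ratio` times as many reads.

    cell_scars: dict mapping scar sequence -> read count, for one cell
    """
    kept = {}
    for seq, reads in cell_scars.items():
        shadowed = any(
            other != seq
            and hamming(seq, other) <= max_dist
            and other_reads >= min_ratio * reads
            for other, other_reads in cell_scars.items()
        )
        if not shadowed:
            kept[seq] = reads
    return kept


def keep_ambiguous_pair(multi_transcript, independent, umis_distinct):
    """Vote over the three criteria; assumes at least two must hold."""
    return sum([multi_transcript, independent, umis_distinct]) >= 2


cell = {"ACGTACGT": 80, "ACGTACGA": 5, "ACGTTTTT": 6, "ACGTACCT": 20}
cleaned = remove_likely_errors(cell)
```

In this toy cell, `ACGTACGA` is removed because it is one mismatch away from a sequence with 16 times as many reads. `ACGTACCT` is also one mismatch away from that sequence, but with a read ratio of only 4 it would proceed to the three-criteria test instead.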
In the fourth filtering step, we determined the distribution of reads for the scars we had kept so far. Based on this distribution we set a cut-off and filtered out the scars that did not have at least this number of reads. Finally, for each cell type we determined the distribution of different scars seen per cell and set a maximum number of scars a cell of that type can have. We filtered out cells in which we observed more than this maximum number as possible doublets.
While each scar is identified by its sequence, scars are labeled in the manuscript using their ranking in the bulk scar frequency distribution (e.g., “scar 77”) or their CIGAR code (e.g., “47M6D28M”) as a shorthand notation. Since scars cannot be modified once created, each scar is considered as a separate entity for lineage tracing independent of its sequence.
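For readers unfamiliar with the CIGAR shorthand, the following Python sketch splits a code such as 47M6D28M into its operations and counts the read bases it covers. The helper names are hypothetical; the operation letters are those of the standard SAM format.

```python
import re


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    if "".join(n + op for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in ops]


def read_length(cigar):
    # M, I, S, = and X consume read bases; D and N only consume the reference.
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


ops = parse_cigar("47M6D28M")
length = read_length("47M6D28M")
```

Here the scar is a 6-nucleotide deletion flanked by 47 and 28 matching bases, which together add up to the 75 nucleotides to which single-cell scar reads were truncated.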
Determination of scar probabilities.
We aligned reads from 32 single embryos (DNA-based bulk scar detection) to a reference of RFP. We filtered out unmapped reads and reads that did not start with the exact PCR primer, and truncated all reads to 100 nucleotides, removing shorter ones. To determine the creation probabilities of the different scars, we removed all unscarred RFP reads from each embryo. We normalized the scar content of each embryo to one and calculated scar probabilities as the average ratio with which each scar was observed.
To account for the slightly different sequencing read structure of single-cell and bulk scar detection (see above), we considered only the nucleotides that are shared between the two approaches, and we assigned the bulk scar probabilities to single-cell scars accordingly. Single-cell scars that were not detected in bulk had their probability set to the lowest probability value detected in bulk.
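The probability estimate described above amounts to a per-embryo normalization followed by an average across embryos. The Python sketch below illustrates this under stated assumptions: each embryo is a dictionary from scar identifier to read count, the unscarred sequence carries the hypothetical label `"wildtype"`, and the toy counts are invented for illustration.

```python
def scar_probabilities(embryo_counts, unscarred="wildtype"):
    """Average, over embryos, of each scar's share of the scarred reads.

    embryo_counts: list of dicts (one per embryo), scar -> read count
    """
    n_embryos = len(embryo_counts)
    totals = {}
    for counts in embryo_counts:
        scars = {s: c for s, c in counts.items() if s != unscarred}
        depth = sum(scars.values())
        if depth == 0:
            continue  # an embryo without scar reads adds zero to every scar
        for scar, count in scars.items():
            totals[scar] = totals.get(scar, 0.0) + count / depth
    return {scar: total / n_embryos for scar, total in totals.items()}


def assign_probability(scar, bulk_probs):
    """Scars unseen in bulk get the lowest probability observed in bulk."""
    return bulk_probs.get(scar, min(bulk_probs.values()))


embryos = [
    {"wildtype": 50, "s1": 30, "s2": 10},
    {"wildtype": 10, "s1": 20, "s3": 20},
]
probs = scar_probabilities(embryos)
unseen = assign_probability("s9", probs)
```

In this toy example `s1` makes up 75% and 50% of the scarred reads in the two embryos, giving an estimated probability of 0.625, while the unseen scar `s9` receives the lowest bulk value, 0.125.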
Determination of scarring dynamics.
Embryos were injected with Cas9 and sgRNA at the one-cell stage. After 1, 2, 3, 4, 6, 8, 10, and 24 h, several embryos were collected and pooled (5–6 for earlier stages, 2–3 for later stages), followed by RNA and/or DNA extraction using TRIzol Reagent. Bulk scar libraries were produced as described above. For each sample, we calculated the percentage of unscarred RFP. We fit a negative exponential to these data, assuming that the fraction of unscarred RFP at t = 0 was 1.
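A negative exponential constrained to 1 at t = 0 has the form f(t) = exp(−kt), with a single rate parameter k. The text does not state which fitting procedure was used; the Python sketch below assumes a linear least-squares fit of ln f against t through the origin, which enforces the constraint by construction. The function name and the synthetic input values are hypothetical and are not measurements from this study.

```python
import math


def fit_decay_rate(hours, unscarred_fraction):
    """Least-squares slope of ln f = -k * t through the origin.

    Forcing the line through the origin enforces f(0) = 1.
    All fractions must be strictly positive.
    """
    num = sum(t * math.log(f) for t, f in zip(hours, unscarred_fraction))
    den = sum(t * t for t in hours)
    return -num / den


# Synthetic, noise-free input at the sampling times used in the text.
hours = [1, 2, 3, 4, 6, 8, 10, 24]
synthetic = [math.exp(-0.3 * t) for t in hours]
k = fit_decay_rate(hours, synthetic)
```

A fit on the log scale gives relatively more weight to late time points with small unscarred fractions; a nonlinear least-squares fit on the original scale is an alternative that weights the points differently.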
Identifying cell types.
We used the R package 'Seurat'22 version 2.1.0, for cell-type identification as described below. We removed genes that were not found in at least three cells, and removed cells that had fewer than 200 of those genes. We log-normalized the transcript counts and removed cells with more than 2,500 genes observed. For single cells from 5-dpf larvae and adult pancreas, we filtered out cells with a mitochondrial content of more than 7.5%, and for single cells from adult hearts and telencephalons we filtered out cells with a mitochondrial content of more than 15%; we expect the cardiomyocytes in particular to have high mitochondrial content. We regressed out influences of the number of transcripts, mitochondrial transcripts, and libraries, and kept a total of 2,779 highly variable genes for cells of 5-dpf larvae, 3,775 highly variable genes for cells of adult telencephalon, 4,536 for cells of adult heart, and 3,018 for cells of adult pancreas. We performed a principal component analysis and kept the first 60 components for single cells from 5-dpf larvae, 11 for adult brains, 8 for adult hearts, and 50 for adult pancreases. Clustering, using the smart local moving algorithm34 on a K-nearest neighbor graph of cells, was done on these components with resolution 1.8 for 5-dpf larvae, resolution 0.8 for adult brain cells, resolution 1.0 for cells from adult heart and adult pancreas. Dimensionality reduction, using t-Stochastic Neighbor Embedding35,36 (tSNE), was done on the 60 components for the 5-dpf larvae, and on components 3–22 for the adult organs to reduce the visual impact of batch effects. To calculate differential gene expression, we used the likelihood-ratio test as implemented in Seurat, introduced in McDavid et al.37, with an underlying negative binomial distribution for gene expression. This test aims to detect changes in mean gene expression and expression frequencies over different clusters.
Using these differentially expressed genes, we assigned clusters to cell types based on literature and the ZFIN database38 (Supplementary Data sets 2 and 4). We did not aim to identify all cell types with maximal resolution and focused instead on unequivocal identification of those cell types that are highlighted in the text (such as the larval hematopoietic cells, and adult pancreatic cells). Cell-type assignments of all other clusters should therefore be considered tentative. Clusters were subsequently merged if they were found to have the same cell type, and we applied a mild coarse-graining by merging highly related cell types (e.g., different neuronal subtypes in the adult telencephalon were merged).
Connection enrichment analysis.
We used an analysis of the scars shared between cells to illuminate the overall structure of the sequencing results from 5-dpf larvae. We expect that cells in which we observe the same scar have a shared lineage. To understand the scarring process better, we aimed to find out which cell types share many scars and therefore have a strong lineage relationship, and which cell types share few scars and therefore do not have many immediate shared precursors.
We call cells 'connected' if they share at least one scar that has a creation probability of less than 0.1% and is only present in one organism. To find out whether cell types have a higher number of connections between them than expected by chance, we developed the background model described below (see also Supplementary Fig. 8). The background model starts with the realization that a connection is defined by its endpoints, and that therefore the number of expected connections between two cell types is determined by the number of connection endpoints of the two cell types. More precisely, the chance of forming a connection between cell type A and B is given by p(A–B) = 2 * CE(A)* CE(B)/CE(tot)2, and that of forming a connection within cell type A by p(A–A) = CE(A)2/CE(tot)2, with CE(A) the number of connection endpoints of cell type A, and CE(tot) the total number of connection endpoints. These probabilities define a binomial background model. Using this model, we calculate the enrichment z-score between cell types, that is, how many s.d. the observed number of connections between two cell types is away from the expected number of connections. A positive enrichment score indicates more connections than expected by chance, a negative enrichment score indicates fewer connections than expected by chance.
We define the distance between cell types based on their enrichment z-scores by the following equation: $D(A, B) = 1 - (E(A, B) - E_{\min})/(E_{\max} - E_{\min})$, with $D(A, B)$ the distance between cell types A and B, $E(A, B)$ the enrichment z-score between them, $E_{\min}$ the minimal enrichment z-score and $E_{\max}$ the maximal enrichment z-score. The term $E - E_{\min}$ can be understood as a translation of all enrichment scores to non-negative values. These values are then divided by their maximum, $E_{\max} - E_{\min}$, and subtracted from 1 to create distances scaled between 0 and 1. We performed hierarchical clustering on these distances, using average linkage as implemented by the hclust function in R. We performed this analysis for two larvae, cutting the dendrogram into three and four clusters, respectively (Supplementary Fig. 9).
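As an illustration of the distance definition above, the following Python sketch rescales a set of enrichment z-scores to distances between 0 and 1. The function name and the dictionary input are hypothetical, and the clustering step itself (hclust with average linkage in R) is not reproduced here.

```python
def enrichment_to_distance(z_scores):
    """Map enrichment z-scores to distances in [0, 1].

    `z_scores` maps a pair of cell types to its enrichment z-score
    (hypothetical input format). The most enriched pair gets distance 0,
    the least enriched pair gets distance 1.
    """
    e_min = min(z_scores.values())
    e_max = max(z_scores.values())
    span = e_max - e_min
    if span == 0:
        raise ValueError("all enrichment scores are identical; distances are undefined")
    return {pair: 1 - (e - e_min) / span for pair, e in z_scores.items()}
```

The resulting values could then be arranged in a symmetric matrix and passed to any average-linkage clustering routine.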
Our computational method for lineage tree reconstruction consists of two phases. First, we derive the order of scarring events. To do so, we make use of scar network graphs, a representation of all pairwise combinations of scars that are experimentally observed together in single cells (Fig. 2a). If all scar connections are detected, the scar that is created first has the highest degree of connections in the scar network graph, followed by scars that were created next, enabling lineage tree reconstruction in an iterative manner (Fig. 2b). In the second phase, we place all cells in the lineage tree according to their scar profile (Fig. 2c). Cells are placed as low in the tree as their scars allow. Due to incomplete scar detection, we do not have full lineage information about every single cell. However, the structure of the scar network graph is robust toward scar dropouts, since it is based on the collective information of thousands of single cells (see simulations in Supplementary Fig. 10). To ensure that lineage tree reconstruction is not affected by known experimental biases, we also included the following measures:
Double scarring. Some scars have a higher intrinsic probability than others (Fig. 1e). To minimize the chance of considering scars that may have been created twice or more in the same fish, we excluded all scars that have a probability higher than 0.1%. With this threshold, most scars were unique to a single fish among the replicates studied (Supplementary Fig. 6). Any remaining scars that were not unique to a single replicate were also excluded from the subsequent analysis.
Cell doublets. Co-encapsulation of two cells in one droplet is a known limitation of scRNA-seq techniques that are based on droplet microfluidics. Cell doublets can lead to spurious connections between scars in the network graph. Incomplete tissue dissociation, limited barcode diversity, barcode sequencing errors, and free-floating RNA from cells burst in the microfluidic system may potentially have similar consequences. As a protection against this effect, we only accept connections in the scar network graph that are detected more often than expected by chance given a library-specific doublet rate (typically around 10%, depending on the experimental cell loading rate). See Supplementary Note 1 for details.
Missing connections. In case of very low cell numbers or scar detection efficiencies, it is possible that a connection is missed in the scar network graph. To address this issue, we performed a statistical test for each scar to check whether the number of observed connections is compatible with the scar being on top of the current sub-branch, given the numbers of cells and the observed scar dropout rates (Supplementary Fig. 20, Supplementary Note 1). In each iteration, we tested only those scars whose inferred detection rate (if placed on top of the corresponding sub-branch) was higher than 0.1, a threshold derived experimentally in Supplementary Fig. 7.
Pruning the tree. Especially for later, smaller branches, it is possible that not enough connections are observed to accurately place them in the lineage tree, resulting in positioning of the branch too high up in the tree (Supplementary Fig. 20 and Supplementary Note 1). We prune the lineage tree for such branches by removing branches that have fewer than 25% of the cells their siblings have.
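The two phases of tree reconstruction described above can be sketched in a few lines of Python. This is a simplified illustration under idealized assumptions, not the LINNAEUS code: it omits the filters for double scarring, doublets, missing connections and pruning, breaks ties between equal-degree scars alphabetically, and all function names are hypothetical.

```python
from itertools import combinations


def scar_graph(cells):
    """Build the scar network graph: scars are nodes, co-detection in a cell is an edge."""
    graph = {}
    for scars in cells:
        for s in scars:
            graph.setdefault(s, set())
        for a, b in combinations(sorted(scars), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def components(graph, nodes):
    """Connected components of the graph restricted to `nodes`."""
    nodes, seen, comps = set(nodes), set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend((graph[n] & nodes) - comp)
        seen |= comp
        comps.append(comp)
    return comps


def build_scar_tree(graph, nodes):
    """Phase 1: the highest-degree scar of each component is placed on top, then recurse."""
    tree = {}
    for comp in components(graph, nodes):
        top = max(sorted(comp), key=lambda s: len(graph[s] & comp))
        tree[top] = build_scar_tree(graph, comp - {top})
    return tree


def place_cell(tree, scars, path=()):
    """Phase 2: return the deepest path that ends in a scar detected in the cell."""
    best = path if path and path[-1] in scars else ()
    for scar, subtree in tree.items():
        candidate = place_cell(subtree, scars, path + (scar,))
        if len(candidate) > len(best):
            best = candidate
    return best
```

For example, calling `build_scar_tree(graph, set(graph))` on the graph returned by `scar_graph` yields a nested dictionary of scars, and `place_cell` then assigns a cell to the lowest node supported by its detected scars, even when scars higher in the tree were dropped out.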
Simulations.
We simulated the scarring process during embryo development (Supplementary Figs. 10 and 11). To do this, we used a simple model that starts with one cell, and in which all cells present undergo synchronized mitosis. Every cell cycle, the RFP integrations of the cells can acquire a unique scar. The chance of creating a scar is fixed for every integration for every cell division, and all scars are transmitted to the progeny of the cells.
After simulating the scarring events during development, we also simulate a sequencing experiment that produces data for tree building. To this end, the cells at the bottom of the tree are clonally expanded, generating many copies that all have the same scar profile. The experimental data consist of a sample of these cells, with a scar detection rate determining the chance of seeing a scar that is present in a cell.
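A minimal Python sketch of such a simulation is shown below. It is an illustration only: the function names and parameters are hypothetical, scars are assumed to be acquired by each daughter cell after division, a scarred integration is assumed not to be scarred again, and clonal expansion is approximated by sampling the final cells with replacement.

```python
import random


def simulate_scarring(generations, n_sites, scar_prob, rng):
    """Simulate synchronized divisions starting from one unscarred cell.

    Each cell is a list with one entry per integration: None (unscarred)
    or a unique integer scar identifier.
    """
    cells = [[None] * n_sites]
    next_scar_id = 0
    for _ in range(generations):
        daughters = []
        for cell in cells:
            for _ in range(2):  # every cell divides into two daughters
                daughter = list(cell)  # scars are inherited
                for site in range(n_sites):
                    # Assumption: only unscarred integrations can be scarred.
                    if daughter[site] is None and rng.random() < scar_prob:
                        daughter[site] = next_scar_id
                        next_scar_id += 1
                daughters.append(daughter)
        cells = daughters
    return cells


def sample_cells(cells, n_sample, detection_rate, rng):
    """Simulate sequencing: sample cells and detect each scar with a fixed probability.

    Cells without any detected scar are dropped from the returned list.
    """
    observed = []
    for _ in range(n_sample):
        cell = rng.choice(cells)  # sampling with replacement mimics clonal expansion
        detected = {s for s in cell if s is not None and rng.random() < detection_rate}
        if detected:
            observed.append(detected)
    return observed
```

Passing a seeded generator such as `random.Random(0)` makes a run reproducible; the sets returned by `sample_cells` play the role of the per-cell scar profiles used for tree building.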
We simulated two distinct trees. The first is a simple tree of three generations, where all cell divisions are marked by acquisition of new scars (Supplementary Fig. 10). From this tree we sampled 125 cells with a scar detection rate of 0.3, yielding 99 cells in which at least one scar was detected. This data set was then used to compare LINNAEUS tree building with maximum parsimony tree building.
The second simulated tree was a more realistic tree in which six generations of cells can potentially receive a scar on ten target sites (Supplementary Fig. 11). Here, we used a cell division rate of four per hour, as measured by microscopy39,40. A scarring rate of 0.4 per hour reproduced the scarring dynamics of the exponential fit during the first three hours (Supplementary Fig. 11a). We can use this simulation to estimate the number of new scars per cell division (Supplementary Fig. 11b). In this simulation, we assumed three cell types (fraction 15%, 25%, 60%) with different detection rates (70%, 30%, 10%, respectively). We furthermore assumed that two of the ten target sites are much harder to detect (by a factor of 20, that is, detection rates 3.5%, 1.5%, 0.5%). The resulting developmental tree is shown in Supplementary Figure 11c. Due to the stochasticity of scar creation, scars are not created in all precursor cells, and in Supplementary Figure 11d we show the maximal lineage tree that can be measured by scars. We expand all final branches (not shown) and sample 2,000 cells from the resulting pool with a cell doublet rate of 5%, yielding 1,716 cells (including doublet cells) with at least one scar.
Tree building on simulated data.
To validate our tree building method, we built trees from both simulated trees using the cells sampled as described in the section “Simulations”. We compared our results to maximum parsimony tree building as done by the program “mix” in PHYLIP 3.695 (distributed by the author, J. Felsenstein, Univ. Washington, Seattle), using the Camin-Sokal algorithm with missing states encoded as “0”. If multiple trees were tied for best tree, we took the first tree generated.
The simple developmental tree (Supplementary Fig. 10) was recreated flawlessly by the LINNAEUS tree building algorithm (Supplementary Fig. 10b). However, maximum parsimony was not able to resolve the tree correctly, creating unjustified complexity due to multiple creation events for the same scar (Supplementary Fig. 10c). The more realistic scar tree (Supplementary Fig. 11) was also recreated faithfully by LINNAEUS (Supplementary Fig. 11f). Maximum parsimony again created a large amount of unjustified complexity with a total of 265 scarring events for 46 scars, an average of over five times per scar (Supplementary Fig. 11g).
We assessed RNA scar expression rates by comparing scar abundance in DNA to scar abundance in RNA in three animals 24 hours post fertilization. Using 70,251 data points, every one of which represents the RNA and DNA abundances of a sequence in one fish, we found a Pearson correlation of 0.97 between RNA and DNA abundances (Supplementary Fig. 3).
We used Seurat to identify cell types in four data sets: 72,252 cells from 5-dpf larvae (n = 7 animals) and cells from three different organs in adult fish (n = 3 animals): heart (12,248 cells), telencephalon (7,045 cells), and pancreas/liver (20,777 cells). Distribution of cell numbers over identified clusters can be found in Supplementary Data sets 2 (larvae) and 4 (adults). We determined differential gene expression using Seurat's “negbinom” test that includes a Benjamini–Hochberg correction of P-values.
To determine whether cell types had a statistically significant number of connections (Supplementary Figs. 8 and 9), we first determined the theoretical connection probability of two cell types following the reasoning laid out above. We then used a two-tailed binomial test to assess whether the actual observed number of connections between the two cell types is different from the expected number of connections. The P-values were corrected for multiple testing using the Benjamini–Hochberg correction. Values for all 2,485 tests can be found in Supplementary Data 6.
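The testing procedure above can be sketched in Python as follows. The text does not specify which two-tailed definition was used; this sketch assumes the common convention of summing the probabilities of all outcomes that are at most as likely as the observed one. The function names are hypothetical, and the direct evaluation of the binomial probability mass function is only suitable for moderate numbers of connections.

```python
import math


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def binom_test_two_tailed(k, n, p):
    """Two-tailed binomial P-value.

    Assumption: sums the probabilities of all outcomes that are no more
    likely than the observed count k (small tolerance for rounding).
    """
    p_obs = binom_pmf(k, n, p)
    total = sum(q for q in (binom_pmf(i, n, p) for i in range(n + 1))
                if q <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P-value to enforce monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Here `k` would be the observed number of connections between two cell types, `n` the total number of connections and `p` the theoretical connection probability of the pair; the adjusted values correspond to the corrected P-values reported for all tests.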
Sequencing data are deposited on Gene Expression Omnibus, accession number GSE106121. Interactive single-cell lineage trees are available at http://bimsbstatic.mdc-berlin.de/junker/linnaeus/index.html.
Custom code is provided at https://bitbucket.org/Bastiaanspanjaard/linnaeus. As part of the software package we provide a sample data set on which the code can be run. The function of all scripts is summarized in README.md.
Life Sciences Reporting Summary.
Further information on experimental design is available in the Life Sciences Reporting Summary.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
We thank R. Opitz, M. Guedes Simoes, D. Panakova, T. Durovic, and J. Ninkovic for help with cell-type identification. We also acknowledge support by MDC/BIMSB core facilities (zebrafish, genomics, bioinformatics), and we thank J. Richter for help with zebrafish experiments. Work in J.P.J.'s laboratory was funded by a European Research Council Starting Grant (ERC-StG 715361 SPACEVAR), a Fondation Leducq Transatlantic Networks Grant (16CVD03), and a Helmholtz Incubator grant (Sparse2Big ZT-I-0007). B.H. was supported by a PhD fellowship from Studienstiftung des deutschen Volkes.
Technical information for all sequenced libraries
Cell type information for 5 dpf larvae
Scar detection statistics for 5 dpf larvae
Cell type information for adult organs
Scar detection statistics for adult organs
Statistics connection enrichment analysis
Nature Reviews Genetics (2018)