## Main

High-throughput single-cell RNA sequencing (scRNA-seq) revealed hundreds of molecularly distinct cell types across the entire mouse and human nervous system1,2,3,4,5,6. However, a molecular understanding of the developmental origins of cell diversity remains limited, and a systematic analysis of lineage relationships is hampered by the low throughput of classical fate mapping techniques7,8. Advanced molecular tools have been used to record cell lineages9,10,11,12,13,14 and combined with scRNA-seq to generate fate maps in cultivated cells15,16, zebrafish17,18,19,20 and mice14,16,21,22. However, these technologies are not readily employed to uniquely label many progenitor cells in the mouse brain in vivo, and most approaches require tissue dissociation, although an in situ whole-transcriptome readout is crucial for studies of the nervous system where function arises from both differential gene expression and circuit-specific anatomy23,24,25,26.

Here we describe high-throughput clonal tracking and expression profiling of cells from the mouse forebrain using single-cell and spatial transcriptomics. We found two populations of fate-restricted progenitor cells present as early as embryonic day (E) 9.5 in the murine hippocampus. We discovered that microglia are generated from a limited number of progenitor cells that undergo massive clonal expansion as well as widespread migration across the mouse telencephalon. We recapitulated multiple migration patterns of progeny from brain progenitor cells using spatial transcriptomics of barcoded mouse brain tissue. Our findings demonstrate the utility of high-throughput clonal tracing in the mouse brain to provide molecular insights into brain development at the single-cell and tissue level.

## Results

### Unique labeling of progenitor cells with expressed barcodes

Here we present TREX, which enables TRacking and gene EXpression profiling of clonally related cells in the mouse brain by scRNA-seq (Fig. 1a). TREX relies on a diverse lentivirus library containing random 30-bp barcodes or cloneIDs downstream of a nuclear-localized enhanced green fluorescent protein (EGFP) driven by a strong, ubiquitous EF1a promoter (Extended Data Fig. 1a,b). A typical lentivirus preparation contained about (1.57 ± 0.12) × 10^6 cloneIDs per microliter (mean ± s.d., n = 4) with a largely uniform representation (Gini index = 0.2) and high sequence diversity (Hamming distance = 22 ± 2.4, mean ± s.d., n = 10,000 random samples) (Extended Data Fig. 1c–f).
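The two library quality metrics quoted above can be illustrated with a minimal sketch. It assumes the standard rank-based Gini formula and random sampling of barcode pairs; the simulated library and the function names are hypothetical stand-ins, not the sequenced library or the authors' analysis code.

```python
import random

def gini(counts):
    """Gini index of non-negative abundances (0 = perfectly uniform)."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero count")
    # Rank-based formula on ascending-sorted values (ranks start at 1).
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n

def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def mean_pairwise_hamming(barcodes, n_pairs, rng):
    """Mean Hamming distance over randomly sampled pairs of distinct entries."""
    total = 0
    for _ in range(n_pairs):
        a, b = rng.sample(barcodes, 2)
        total += hamming(a, b)
    return total / n_pairs

rng = random.Random(0)
# Simulated stand-in for a library: 2,000 random 30-bp barcodes.
library = ["".join(rng.choice("ACGT") for _ in range(30)) for _ in range(2000)]
mean_distance = mean_pairwise_hamming(library, 10_000, rng)
print(f"mean Hamming distance: {mean_distance:.1f}")
```

For fully random 30-bp sequences, each position mismatches with probability 3/4, so the expected distance is 30 × 0.75 = 22.5, which is close to the reported value of 22 ± 2.4.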

To label individual progenitor cells in vivo, we used in utero microinjection of lentivirus into the ventricular system of the mouse forebrain at E9.5 (Extended Data Fig. 2a). We injected about 0.6 µl of EGFP-cloneID virus, corresponding to 0.94 × 10^6 unique cloneIDs, which resulted in labeling of 1.8 ± 0.25% of all cells (mean ± s.d., n = 3) or a total of 41,000 ± 3,500 cells (mean ± s.d., n = 3) per E11.5 mouse brain (Extended Data Fig. 2b,c). We estimated that the initial number of labeled progenitors was around 2,600 cells at E9.5 and that 99.6% of cells were uniquely labeled with a cloneID (Extended Data Fig. 2d).

Barcoded EGFP+ cells were mostly evenly distributed throughout the E11.5 neuroepithelium and included Sox2+ radial glia progenitors lining the ventricular zone as well as their Sox2− daughter cells (Extended Data Fig. 2e–g). Long-term EGFP-cloneID expression was maintained in all major cell types of the juvenile mouse brain, and labeled cells were found in various regions (Fig. 1b,c). In conclusion, we present a highly diverse lentivirus library suitable for heritable and brain-wide labeling of thousands of mouse brain progenitors with unique barcodes.

### Molecular identity of barcoded brain cells

To determine the molecular identity of labeled cells, we dissected brains from 2-week-old mice and isolated all EGFP+ barcoded cells separately from cortex (CX), striatum (STR) and hippocampus (HC) for scRNA-seq (Supplementary Fig. 1a–d). We collected transcriptome profiles of 65,160 cells from these three regions per brain from four barcoded brains and one non-injected control brain. Graph-based clustering revealed five main clusters corresponding to astroependymal cells, immune cells, neurons, oligodendrocytes and vascular cells (Supplementary Fig. 1e). Both control and barcoded samples showed a similar cell type composition except for vascular cells, which represented 16.6% of cells in the control dataset and less than 0.5% of barcoded cells (Supplementary Fig. 1f). Blood vessels only begin to sprout into the ventrolateral brain at E9.5 (ref. 27), which results in a low number of cells that can be labeled at the time point of injection. We, therefore, removed the cluster of vascular cells from all datasets and kept a final set of 62,388 single-cell profiles with a mean of 5,444 transcripts and 2,255 genes detected per cell (Supplementary Fig. 1g–l).

We performed subclustering for each major cell type from all brain regions and assigned each cell subclass a unique mnemonic identifier based on an existing mouse brain atlas2 (Supplementary Fig. 2). We found 40 molecularly defined cell classes, including projection neurons (n = 17), GABAergic interneurons (n = 7), immature neuronal cells (n = 3), astroependymal cells (n = 5), oligodendrocyte lineage cells (n = 5) and immune cells (n = 3) (Fig. 1d). We collected the highest number of cells from CX (n = 28,188 cells), followed by HC (n = 18,231 cells) and STR (n = 15,969 cells) (Fig. 1e–g). We compared gene expression profiles and total cell type composition between barcoded and non-injected samples, which indicated that lentivirus-mediated barcoding does not perturb cell physiology (Supplementary Fig. 3). Together, these data show the utility of TREX for barcoding progenitor cells in the developing brain and profiling the identity of their progeny at a postnatal stage.

### Barcode expression metrics across cell types

To specifically study barcoded cells, we removed cells from non-injected control samples from the full dataset and focused on the 49,724 cells isolated from four barcoded mouse brains (Fig. 2a). We detected EGFP transcripts in a total of 21,743 cells (43.7% of all cells), with highest average expression in immune cells and lowest average expression in astroependymal cells (Extended Data Fig. 3a–c). The number of EGFP transcripts per cell class was correlated (Pearson’s r = 0.81) with elongation factor 1-alpha 1 (Eef1a1) levels, indicating that transgene expression under the synthetic EF1a promoter recapitulates endogenous Eef1a1 expression patterns, albeit at lower levels (Extended Data Fig. 3d).

We extracted cloneIDs directly from single-cell transcriptome data as well as targeted amplicon libraries and found a total of 21,433 cloneIDs in 18,570 cells (37.3% of all cells) (Fig. 2b and Extended Data Fig. 4a–d). We captured cloneIDs for most cell types except for one very rare type of interneuron, TEINH18. The average number of cloneIDs per cell was similar when using scRNA-seq or bulk DNA sequencing of barcoded cells (Extended Data Fig. 4e,f), suggesting that cloneID capture is quantitative using single-cell transcriptomics.

Although most cloneID+ cells (89.6%) across all brains expressed only one cloneID (Fig. 2c), the proportion of such cells varied among brains and ranged from 78.6% to 95.8%, with the remaining fraction of cells expressing multiple cloneIDs (Extended Data Fig. 4g,h). Based on the transduction rate of 1.8% and an idealized transduction model28, we expected that 99.08% of cells contain one cloneID and 0.92% of cells contain two or more cloneIDs (Extended Data Fig. 4i). The observed deviation from the theoretical cloneID copy number distribution per brain is not due to undetected doublets in scRNA-seq (Extended Data Fig. 4j–l) but can be attributed to increased transduction rates of local progenitor cells due to position and/or differential expression of receptors required for lentivirus entry (Supplementary Figs. 4 and 5). In summary, all major brain cell types were represented among barcoded cells, and most cells express a single cloneID.
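The expected copy number distribution can be approximated with a short sketch. It assumes that the idealized model is a Poisson infection process conditioned on at least one integration; this is an assumption made here for illustration, and the function name is hypothetical.

```python
import math

def clone_id_copy_distribution(transduction_rate, max_copies=5):
    """P(n cloneIDs | cell is transduced) under a Poisson infection model."""
    # Choose lambda so that P(at least one integration) equals the observed rate.
    lam = -math.log(1.0 - transduction_rate)
    p_transduced = 1.0 - math.exp(-lam)
    return {
        n: math.exp(-lam) * lam**n / math.factorial(n) / p_transduced
        for n in range(1, max_copies + 1)
    }

dist = clone_id_copy_distribution(0.018)
single = dist[1]          # fraction of transduced cells with exactly one cloneID
multiple = 1.0 - single   # fraction with two or more cloneIDs
print(f"one cloneID: {single:.2%}, two or more: {multiple:.2%}")
```

Under these assumptions, roughly 99.1% of transduced cells carry one cloneID. This is close to the quoted 99.08% and 0.92%; the small difference presumably reflects details of the referenced model that are not reproduced here.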

### Clonal relationships across forebrain regions

We identified clonally related cells based on the Jaccard similarity of cloneIDs for each pair of cloneID-containing cells15, and we defined clones as groups of two or more related cells. We reconstructed 2,360 clones containing 11,569 cells (23.3% of all cells; Fig. 2d) with an average size of 4.9 ± 0.3 cells per clone (mean ± s.e.m.). The number of clones per brain ranged from 201 to 1,106 (11.1% to 38.6% of all cells per brain; Extended Data Fig. 5). Interestingly, clones containing mesoderm-derived myeloid cells were about 7.4 times larger than those with neuroectoderm-derived cells and contained 29.5 ± 7.6 cells per clone (mean ± s.e.m., n = 84 clones) compared to 4 ± 0.1 cells per clone (mean ± s.e.m., n = 2,276 clones), respectively (Fig. 2e). This difference in clone size probably reflects the massive proliferation of brain macrophages required to colonize the entire central nervous system (CNS) after only a small number of precursors enter the brain before closure of the blood–brain barrier around E13 (ref. 29).
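Clone calling from pairwise Jaccard similarity can be sketched as follows. The sketch assumes that cells are linked when their similarity reaches a threshold and that clones are the connected components of the resulting graph; the threshold of 0.7, the function names and the toy cells are illustrative assumptions, not values taken from the text.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two sets of cloneIDs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def call_clones(cell_to_clone_ids, threshold=0.7):
    """Group cells into clones of two or more cells.

    Cells are linked when their Jaccard similarity is >= threshold
    (assumed value); clones are connected components of that graph.
    """
    parent = {cell: cell for cell in cell_to_clone_ids}

    def find(cell):
        # Union-find lookup with path halving.
        while parent[cell] != cell:
            parent[cell] = parent[parent[cell]]
            cell = parent[cell]
        return cell

    for a, b in combinations(cell_to_clone_ids, 2):
        if jaccard(cell_to_clone_ids[a], cell_to_clone_ids[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for cell in cell_to_clone_ids:
        groups.setdefault(find(cell), set()).add(cell)
    # Singletons are not clones by the definition used in the text.
    return [members for members in groups.values() if len(members) >= 2]

cells = {
    "c1": {"A"}, "c2": {"A"},
    "c3": {"B", "C"}, "c4": {"B", "C"},
    "c5": {"B"},   # shares only one of two cloneIDs with c3/c4 (Jaccard 0.5)
    "c6": {"D"},
}
clones = call_clones(cells)
print(sorted(sorted(members) for members in clones))
```

The all-pairs loop is quadratic in the number of cells, which is acceptable for a toy example but would need indexing by cloneID for tens of thousands of cells.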

To estimate the potential error associated with clone reconstruction, we quantified how often cell types that arise from different progenitors shared the same cloneID. First, we found that clones containing cortical excitatory neurons (n = 371 clones) or inhibitory neurons (n = 18 clones), which are known to have separate developmental origins30, never shared the same cloneID (Extended Data Fig. 6a). Second, among 84 clones containing 2,481 mesoderm-derived microglia or perivascular macrophages, only three clones with a total of 453 cells shared a cloneID with five neuroectoderm-derived cells (Extended Data Fig. 6b–d), and we removed these cells from the respective clones. These data suggest a low error rate of about 0.2% (five of 2,481 cells) that could be related to clone size and cell type, because only large immune clones contained neuroectoderm-derived cells, or to non-unique cloneID labeling. We found that cloneID removal from cell types that express multiple cloneIDs results in ‘lumping’ errors (Supplementary Fig. 6). This is expected because the co-expression of two or more distinct cloneIDs per cell leads to a higher combinatorial diversity15, thus reducing the error associated with clone reconstruction. Finally, there was a high correlation (Pearson’s r = 0.99) between cloneID frequency and the number of cells with a cloneID in distinct clones, indicating that there is no preferential uptake of certain barcodes among progenitor cells (Supplementary Fig. 7).

Cell types most often represented in clones were oligodendrocyte subtypes (3,703 cells, 32%), followed by immune cells (2,476 cells, 21.4%), astroependymal cells (2,443 cells, 21.1%), immature neuronal cells (2,207 cells, 19.1%), projection neuron types (708 cells, 6.1%) and interneurons (32 cells, 0.3%) (Fig. 2f). Except for one type of interneuron, TEINH19, we captured clonal information for all cell types that also contained a cloneID. Cell types containing the highest proportion of cells in clones were cortical and striatal microglia (MGL3), of which 77.5% and 60.6%, respectively, of all sampled MGL3 cells per region were represented in clones (Fig. 2g). The lowest proportions of cells represented in clones were observed for ependymal (EPEN) cells (1.3–4.9% of EPEN cells across all regions), TEGLU16 piriform pyramidal neurons (2.5%) and TEINH21 inhibitory neurons (7.8%) in the CX. In line with our previous observation, we found a moderate correlation (Pearson’s r = 0.57) between the number of cells in clones and barcode expression level for each cell type (Supplementary Fig. 8). In conclusion, we captured clonal information about most cell types in different regions of the mouse telencephalon and demonstrated that reconstruction of clonal relationships using TREX has a very low error rate.

### Regional distributions of clonally related cells

Because we isolated barcoded cells from CX, STR and HC, we asked how often clonally related cells spread across these areas. By calculating the proportions of cells across each forebrain region for each cloneID, we observed that the cells of 1,880 clones (79.7%) accumulated in a single region (Fig. 3a,b). Clonal dispersion across more than one region was less frequent: 282 clones (11.9%) spread across CX and STR and 182 clones (7.7%) across CX and HC, whereas clones rarely spread across STR and HC (nine clones, 0.4%) or all three regions (seven clones, 0.3%). This indicates that most clonally related cells show limited regional dispersion across the mouse telencephalon.
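The regional tally can be illustrated with a small sketch that assigns each clone the set of regions in which its cells were recovered. The toy clones and function names are hypothetical; the second part only checks that the clone counts quoted above add up to the 2,360 reconstructed clones and reproduce the stated percentages.

```python
from collections import Counter

def region_pattern(cell_regions):
    """Sorted tuple of the distinct regions in which a clone's cells were found."""
    return tuple(sorted(set(cell_regions)))

def tally_patterns(clone_to_regions):
    """Count clones per region pattern and report each count as a percentage."""
    counts = Counter(region_pattern(r) for r in clone_to_regions.values())
    total = sum(counts.values())
    return {p: (n, round(100 * n / total, 1)) for p, n in counts.items()}

# Toy input: clone name -> region of each recovered cell.
toy = {
    "clone_a": ["CX", "CX", "CX"],
    "clone_b": ["CX", "STR"],
    "clone_c": ["HC", "HC"],
    "clone_d": ["STR", "CX", "CX"],
}
toy_tally = tally_patterns(toy)

# Consistency check on the clone counts quoted in the text.
reported = {"single region": 1880, "CX+STR": 282, "CX+HC": 182,
            "STR+HC": 9, "CX+STR+HC": 7}
reported_total = sum(reported.values())
reported_pct = {k: round(100 * n / reported_total, 1) for k, n in reported.items()}
print(reported_total, reported_pct)
```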

To assess which cell types were associated with dispersed clones, we determined the cell type composition of clones spread across multiple forebrain regions relative to the total number of cells in clones for each cell type (Fig. 3c,d and Supplementary Fig. 9). We found that clonally related cells that crossed the CX/STR boundary often contained inhibitory neurons such as medial ganglionic eminence (MGE)-derived neurogliaform cells (TEINH16) in the CX (Fig. 3e). Inhibitory neurons shared a cloneID with medium spiny neurons (MSN1) and gray matter astrocytes (ACTE2) in the STR as well as neuronal intermediate progenitor cells (SZNBL) and oligodendrocyte subtypes in both STR and HC. These data suggest that MGE-derived cortical interneurons are generated by distinct progenitor cells and revealed that individual progenitors can give rise to both neurons and oligodendrocytes. Also, subventricular zone neural stem cells (RGSZ) in the STR often shared a cloneID with cells such as gray matter astrocytes (ACTE2), layer 2/3 excitatory neurons (TEGLU7) and all oligodendrocyte subtypes in the CX (Fig. 3f). This example of transcriptional divergence demonstrates a direct clonal relationship between E9.5 progenitor cells that generate RGSZ neural stem cells and those that produce neurons and glia cells for the other regions of the telencephalon during embryonic development.

Many cell types that were specifically found in the HC shared a cloneID with multiple other cell types in HC but rarely with other types in CX, indicating an early segregation of progenitor fields for both regions (Extended Data Fig. 7a–d). However, clones with Cajal–Retzius (CR) cells were an exception that rarely contained other cell types and often shared a cloneID with CR cells in CX (Fig. 3g). We quantified the proportions of CR cells across both regions for each cloneID and observed that 24.6% of cloneIDs accumulated in CX, 49.3% in HC and 26.1% spread across both CX and HC (Extended Data Fig. 7e,f). CR cells are among the first-born neurons critical for brain development. Our data confirm that these cells originate from three distinct sites in the brain31 and further indicate that the progenitors from disparate embryonic fields converge in their differentiation to produce transcriptionally similar cells.

The anatomical boundary between HC and STR was rarely crossed (Fig. 3h), and cell types associated with such clones were mostly oligodendrocyte types such as oligodendrocyte precursor cells (OPCs) and committed oligodendrocyte precursors (COP1). This suggests that oligodendrocytes in both HC and STR are derived from a common progenitor most likely located in the ventral forebrain, which generates OPCs that subsequently migrate widely into all parts of the telencephalon before differentiating32.

Finally, clonally related immune cells comprising microglia (MGL1 and MGL3) and perivascular macrophages (PVM1) showed a widespread regional dispersion and crossed anatomical boundaries among CX, STR and HC 1.3-fold to nine-fold more often than neuroectoderm-derived clones (Fig. 3i,j). This suggests that myeloid progenitors and their progeny undergo extensive migration to populate large areas of the forebrain.

### Fate distributions of clonally related cells

We investigated the distribution of cloneIDs across cell types by calculating the proportions of cells within each major cell class for each cloneID. We found that immune cells (n = 84 clones) consisting of microglia and perivascular macrophages constitute a separate lineage as expected (Fig. 4a). Of the remaining 2,276 neuroectoderm-derived clones, a total of 1,193 clones (52.4%) contained at least two different cell types (Fig. 4b). The remaining 1,083 neuroectoderm-derived clones contained only one of the five major cell types, and such clones were also observed among the largest clones (Supplementary Fig. 10). Although this might suggest that many lineage-restricted progenitor cells exist in the E9.5 mouse neuroepithelium, we cannot conclude that a strictly ‘uni-potential’ progenitor was present during barcoding, because only a small sample of its progeny had been isolated.

To systematically assess lineage relationships among subclasses of all cell types, we investigated the probability of recovering shared cloneIDs from all pairs of profiled cells in the mouse brain. We calculated the clonal coupling score, defined as the number of shared cloneIDs relative to randomized data20, yielding values that range from positive (related cells) to negative (unrelated cells) for each brain (Supplementary Fig. 11). To summarize the data for all brains, we focused on the 27 cell types found in clones with at least three cells per clone across all four brains and determined the pairwise correlation between coupling scores. Hierarchical clustering of the pairwise correlations revealed four distinct groups of clonally related cells corresponding to diverse cell types of the cortex, hippocampus and striatum as well as microglia from all three regions (Fig. 4c). These results corroborated our previous observations regarding the limited clonal dispersion of most neuroectoderm-derived cell types across the mouse telencephalon.
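A clonal coupling score of this kind can be sketched as a permutation test. The sketch assumes that the score is a z-score of the observed number of clones shared by two cell types against counts obtained after shuffling cell type labels; this normalization, the function names and the toy data are assumptions made for illustration.

```python
import random
from itertools import combinations
from statistics import mean, pstdev

def shared_clone_counts(clone_of_cell, type_of_cell):
    """For each pair of cell types, count the clones that contain both types."""
    types_per_clone = {}
    for cell, clone in clone_of_cell.items():
        types_per_clone.setdefault(clone, set()).add(type_of_cell[cell])
    counts = {}
    for types in types_per_clone.values():
        for pair in combinations(sorted(types), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts

def coupling_scores(clone_of_cell, type_of_cell, n_perm=500, seed=0):
    """Z-score of observed shared-clone counts against label-shuffled data."""
    rng = random.Random(seed)
    observed = shared_clone_counts(clone_of_cell, type_of_cell)
    cells = list(type_of_cell)
    labels = [type_of_cell[c] for c in cells]
    all_pairs = list(combinations(sorted(set(labels)), 2))
    null = {pair: [] for pair in all_pairs}
    for _ in range(n_perm):
        rng.shuffle(labels)  # break the link between cell type and clone
        counts = shared_clone_counts(clone_of_cell, dict(zip(cells, labels)))
        for pair in all_pairs:
            null[pair].append(counts.get(pair, 0))
    scores = {}
    for pair in all_pairs:
        sd = pstdev(null[pair])
        if sd > 0:  # skip pairs whose shuffled counts never vary
            scores[pair] = (observed.get(pair, 0) - mean(null[pair])) / sd
    return scores

# Toy data: 20 astro+neuron clones and 20 opc+oligo clones, two cells each.
clone_of_cell, type_of_cell = {}, {}
for i in range(20):
    for prefix, types in (("an", ("astro", "neuron")), ("oo", ("opc", "oligo"))):
        for cell_type in types:
            cell = f"{prefix}{i}_{cell_type}"
            clone_of_cell[cell] = f"clone_{prefix}_{i}"
            type_of_cell[cell] = cell_type
scores = coupling_scores(clone_of_cell, type_of_cell)
print(scores[("astro", "neuron")], scores[("astro", "oligo")])
```

In the toy data, astrocytes and neurons always share a clone and receive a positive score, whereas astrocytes and oligodendrocytes never do and receive a negative score.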

We observed a strong clonal coupling in the HC between neuronal and astroependymal cells (fate 1) as well as between astroependymal cells and oligodendrocytes (fate 2), indicating that these cells originate from two fate-biased pools of progenitor cells. We found that 265 clones containing 1,683 cells were biased toward fate 1 (Fig. 4d) and consisted mainly of neuronal cell types such as dentate gyrus neuroblasts (DGNBL1, 51.7%), neuronal intermediate progenitor cells (SZNBL, 11.4%) and granule neurons (DGGRC1, 1.6%) as well as astroependymal cells, including gray matter astrocytes (ACTE2, 21%) and radial glia-like cells (RGDG, 10.4%). A total of 192 clones with 682 cells were biased toward fate 2 (Fig. 4e) and contained mainly oligodendrocyte subtypes such as OPCs (37.5%), committed oligodendrocyte precursors (COP1, 34.2%) as well as astroependymal cells, including white matter astrocytes (ACTE1, 12.8%) and gray matter astrocytes (ACTE2, 5%). One population of progenitor cells, fate 1, likely corresponds to the embryonic precursors of adult neural stem cells33 that are biased to generate astroependymal cells and dentate granule neurons as early as E9.5. The second precursor cell population, fate 2, mainly contains oligodendrocyte subtypes and could represent a major source of hippocampal glia cells involved in myelin formation and maintenance.

We also investigated the cloneID distribution across cell types that were not included in the clonal coupling analysis, because they were not isolated from all four brains and/or they were not contained in clones with at least three cells per clone. Interestingly, we never observed hippocampal CA1 (TEGLU24) and CA3 (TEGLU23) excitatory neurons in the same clone that otherwise contained identical cell types (Fig. 4f,g). Because the number of clones containing at least one CA1 or CA3 neuron was small (42 clones with 77 CA1 cells and 11 clones with 14 CA3 cells), we cannot exclude that these cells share a common progenitor. However, our observations are in agreement with previous studies about the early specification of CA field identity34 and might indicate a fate specification (or at least fate bias) as early as E9.5.

We investigated the clonal relationships between microglia in the brain parenchyma (MGL1 and MGL3) and perivascular macrophages (PVM1) located at CNS borders. We found that ten of 14 clones (n = 587 cells) that contained PVM1 cells also contained one or both microglia subtypes (Fig. 4h,i). Compared to 331 MGL1 cells (56.4%) and 220 MGL3 cells (37.5%), these clones contained only 36 PVM1 cells (6.1%). Because barcode expression levels and proportion of cells in clones were similar for MGL1, MGL3 and PVM1 (Fig. 2g and Supplementary Fig. 8), this observation indicates that the common progenitor for all three cell types largely generates microglia and few perivascular macrophages. Although it has been established that microglia are derived from mesodermal progenitors35,36, it has been shown only recently that the same early embryonic precursors also generate perivascular macrophages37,38. Our results are in line with this observation and further revealed that microglia are generated in much larger numbers than perivascular macrophages from a common progenitor cell.

### Spatial profiling of transcriptomes, cell types and clones

Next, we developed Space-TREX, a method based on Spatial Transcriptomics (ST)23 that enables simultaneous clonal tracing and expression profiling of barcoded mouse brain sections in situ (Fig. 5a). We introduced immunostaining of intracellular antigens into the protocol, enabling combined profiling of spatial gene and protein expression together with clonal barcodes in the same tissue section (Fig. 5b and Extended Data Fig. 8a–d). Because ST relies on the capture of transcripts in spots with a diameter of 55 µm, most spots contain between one and ten cells with an average of about four cells (Extended Data Fig. 8e). However, not every cell in the tissue is barcoded, and, out of all spots containing an EGFP+ cell, 81% of spots contain only one barcoded cell, and the rest contain more than one barcoded cell (Extended Data Fig. 8f). Therefore, it can be assumed that a cloneID captured in a spot originates most often from a single barcoded cell, and we can reveal its identity using protein expression data collected for the same section.

We hybridized eight adjacent coronal sections from one postnatal day (P) 14 brain barcoded at E9.5 and used antibodies targeted to EGFP, NeuN and Olig2 to identify barcoded cells, neurons and oligodendrocytes, respectively (Fig. 5c). To establish a dataset containing information on spatial gene expression patterns, cell types, clones and neuroanatomical definitions, we aligned brain sections to the Allen Mouse Brain reference atlas using an integrated computational framework39,40 (Extended Data Fig. 8g–i). The entire dataset contained information on the transcriptional profiles of 28,746 spots that were distributed across all forebrain regions. We extracted a total of 1,321 cloneIDs, of which 1,079 cloneIDs were contained in 162 clones distributed across all brain regions (Fig. 5d,e and Extended Data Fig. 8j–n). The number of cells per clone in the Space-TREX data (6.7 ± 0.4, mean ± s.e.m., n = 162 clones; Fig. 5f) was significantly larger than the clone size observed in the TREX data (4.9 ± 0.3, mean ± s.e.m., n = 2,360 clones), indicating that cell loss leading to incomplete clones is reduced when using a spatial barcode readout.

In line with the TREX data, most clones showed a limited spread across all regions except for clones with cells located in white matter fiber tracts (Fig. 5g) that are known to be enriched for oligodendrocytes derived from highly migratory progenitors32. Although most clonally related cells crossed boundaries of major anatomical regions at low frequencies, intra-regional dispersion was more common, for example within cortical regions such as the amygdalar (AMY) and olfactory (OLF) areas as well as upper (UL) and deeper (DL) cortical layers (Fig. 5h). Interestingly, we observed extensive dispersion between either AMY/OLF or DL/UL, suggesting that most early progenitor cells are restricted to generate cell types of either area but undergo more widespread migration within each area.

We used cell type information available for barcoded cells and found that clones containing both neurons and oligodendrocytes show an extensive spread across multiple regions (Fig. 6a,b and Supplementary Fig. 12a,b). This mode of dispersion likely corresponds to tangential migration well described for interneurons41 that also share a common early progenitor with oligodendrocytes, although these lineage relationships are not well understood42. We also observed neuronal clones that formed radially organized clusters mainly in the AMY/OLF areas as well as the UL/DL areas of the CX (Fig. 6c–e and Supplementary Fig. 12c–e). Although a few members of these clones were more widespread, more than 80% of all clonally related cells were found in larger clusters spanning areas of around 1.75 mm × 1.75 mm. Interestingly, cells from dispersed clones were distributed across the dorsoventral axis within a single 10-µm section, whereas cells from clustered clones were spread from the most anterior to the most posterior brain section, spanning 80 µm (Fig. 6f–i). Together, these data demonstrate that Space-TREX can be used for high-throughput mapping of clonal barcodes, gene expression and cell types in situ.

## Discussion

We developed TREX and Space-TREX for simultaneous clonal tracing and gene expression profiling of dissociated mouse brain cells and tissue sections, respectively. We found that the clonal dispersion across forebrain regions is limited, and only specific cell types are associated with dispersed clones. We discovered two fate-biased progenitor cell populations that exist as early as E9.5 in the hippocampal neuroepithelium, suggesting an unexpected early segregation of precursor cells. The clonal output of one progenitor population indicates that those cells are the origin for Hopx+ precursors that continue to become adult neural stem cells in the mouse dentate gyrus33.

We unraveled unique features of myeloid-derived clones, such as their large clone sizes and widespread dispersion across multiple forebrain regions compared to neuroectoderm-derived clones. The large clone size probably reflects the massive proliferation of brain macrophages required to colonize the entire CNS, because only a small number of precursors enter the brain before closure of the blood–brain barrier around E13, restricting access to immune cells that arise later in development29. Embryonic microglia migrate long distances within regions after entering the brain43, and we show that clonally related microglia also migrate extensively across anatomical boundaries to populate large areas of the brain. Microglia expansion and dispersion are central for brain homeostasis44,45,46 but remain only partially understood in particular at the clonal level. Thus, novel tools such as TREX enable systematic studies of the underlying molecular mechanisms within the context of microglia clonality.

Using Space-TREX, we provide, to our knowledge, the first demonstration of high-density clonal tracking coupled to cell phenotyping and in situ sequencing of brain tissue. Compared to previous approaches that use complex in situ hybridization schemes and fluorescence microscopy for barcode detection11,14, Space-TREX relies on widely available reagents and DNA sequencing, thus enabling barcode readout in large tissue sections at scale47.

Currently, (Space-)TREX is limited by sparse sampling due to loss of barcoded cells after tissue dissociation (10.6% of cells recovered), isolation via fluorescence-activated cell sorting (FACS) (35–64% of sorted cells recovered) and droplet encapsulation (50% of loaded cells recovered) as well as cloneID dropout from a subset of sequenced cells (24–51% contain a cloneID), resulting in clonal information for about 0.51% of all barcoded cells initially present in the tissue (Extended Data Fig. 9). This would mean that the true clone size is 200-fold higher than the average clone size observed under our experimental conditions and that each neuroectoderm-derived clone contains about 800 cells on average, whereas each myeloid-derived clone is composed of 6,000 cells on average. Because a typical cortical clone labeled at E9.5 contains about 200 cells48, we consider our estimates an upper bound for true clone size.
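The extrapolation above can be retraced with a few lines of arithmetic. The sketch multiplies the stepwise recovery rates quoted in the text to obtain a range, then scales the observed mean clone sizes by the inverse of the reported 0.51% sampling rate; treating the steps as independent multiplicative losses is an assumption made here.

```python
# Stepwise recovery rates quoted in the text, as (low, high) fractions.
steps = {
    "tissue dissociation": (0.106, 0.106),
    "FACS": (0.35, 0.64),
    "droplet encapsulation": (0.50, 0.50),
    "cloneID detection": (0.24, 0.51),
}

low = high = 1.0
for lo_rate, hi_rate in steps.values():
    low *= lo_rate
    high *= hi_rate

reported_rate = 0.0051            # overall clonal sampling rate from the text
scale = 1.0 / reported_rate       # fold-change from observed to true clone size

neuroectoderm_true = 4.0 * scale  # observed mean of 4 cells per clone
myeloid_true = 29.5 * scale       # observed mean of 29.5 cells per clone

print(f"product of steps: {low:.2%} to {high:.2%}")
print(f"scale factor: {scale:.0f}-fold")
print(f"extrapolated sizes: {neuroectoderm_true:.0f} and {myeloid_true:.0f} cells")
```

The product of the individual steps spans roughly 0.45% to 1.7%, and the reported 0.51% lies within that range. The scale factor of about 196 rounds to the 200-fold figure, giving about 800 and 6,000 cells per clone.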

The observed cell and barcode recovery rates are in line with other approaches employing an scRNA-seq readout of genetic barcodes in various model systems and highlight a general challenge for such methods (Supplementary Table 1). Such approaches rely on sequencing a given clone with a specific clonal structure and size multiple times to provide statistically robust insights about the fate bias of progenitor cells. For example, we sampled clones from fate-biased progenitors in the HC 11–265 times, which is sufficient to also detect rare cell types with nearly 100% probability at the observed clonal sampling rate of 0.51% or less (Extended Data Fig. 10). Sparse sampling could be decreased by using a plate-based assay with higher RNA detection sensitivity49 or by employing a single-cell, high-sensitivity readout of cell types and barcodes using spatial transcriptomics.

Compared to classical fate mapping studies that rely on sparse labeling of cells in dozens to hundreds of (transgenic) animals, (Space-)TREX enables high-throughput dense reconstruction of clonal relationships using 10–30 times fewer animals (Supplementary Fig. 13). In contrast to CRISPR-based lineage tracing, our technology uses millions of diverse and compact barcodes that can be cloned as libraries, enabling straightforward barcode readout and clone reconstruction. Overall, we think that an integrated approach, such as Space-TREX, is needed to disentangle the complex relationships among cell identity, cell history and tissue anatomy that underlie the organization of both the healthy and diseased brain.

## Methods

### Plasmid and lentivirus production

LV-EF1a-H2B-EGFP was constructed by exchanging the PGK1 promoter from LV-GFP50 with an EF1a promoter (Supplementary Table 2). Reporter constructs (Extended Data Fig. 4j) were cloned by exchanging EGFP with TagBFP (Evrogen), TagRFP (Evrogen) or emiRFP670 (ref. 51). The lentivirus plasmid library was generated by inserting an amplified oligonucleotide library (Supplementary Tables 2 and 3) into LV-EF1a-H2B-EGFP using Gibson assembly52 and transformation of electrocompetent Endura cells (Lucigen). Lentivirus particles (>10^9 transducing units per milliliter) were generated by the core facility VirusTech at the Karolinska Institutet or by GEG-Tech.

### Sequencing of lentivirus preparations

Viral RNA was isolated using the NucleoSpin RNA Virus Mini Kit (Macherey-Nagel) and reverse transcribed using the SuperScript VILO cDNA Synthesis Kit (Invitrogen). The cDNA was used as template for cloneID amplification and indexing (Supplementary Tables 2 and 3). The resulting libraries were sequenced on an Illumina NextSeq (Supplementary Table 4) and aligned against a reference containing the 30-bp cloneID and flanking regions using the BWA-MEM algorithm53. A custom BASH script was used to extract unique cloneIDs and corresponding read counts.
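The alignment and the custom BASH script are not reproduced here. As a rough illustration of the counting step only, the sketch below swaps alignment for exact matching of a 30-bp cloneID between two flanking sequences; the flanks, the function name and the example reads are hypothetical, and exact matching does not tolerate sequencing errors in the way an aligner does.

```python
import re
from collections import Counter

# Hypothetical flanking sequences; the real ones depend on the vector design.
LEFT_FLANK = "GGATCC"
RIGHT_FLANK = "GAATTC"
CLONE_ID = re.compile(f"{LEFT_FLANK}([ACGT]{{30}}){RIGHT_FLANK}")

def count_clone_ids(reads):
    """Count reads per cloneID, using the first exact flank match in each read."""
    counts = Counter()
    for read in reads:
        match = CLONE_ID.search(read)
        if match:
            counts[match.group(1)] += 1
    return counts

barcode = "ACGT" * 7 + "AC"  # 30-bp example cloneID
reads = [
    "TT" + LEFT_FLANK + barcode + RIGHT_FLANK + "AA",
    LEFT_FLANK + barcode + RIGHT_FLANK,
    "NNNNNNNN",                              # no flanks, ignored
    LEFT_FLANK + "A" * 29 + RIGHT_FLANK,     # 29-bp insert, ignored
]
counts = count_clone_ids(reads)
print(counts.most_common())
```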

### Estimating the fraction of uniquely labeled cells

First, we calculated the total number of cells at the time point of injection. If Nt1 is the number of labeled cells at E11.5, Δt is the time difference in days and f is the frequency of cell divisions per day, then the number of transduced cells Nt0 is:

$$N_{t1} = N_{t0} \ast 2^{\Delta t f}\,\mathrm{hence}\,N_{t0} = \frac{N_{t1}}{2^{\Delta t f}}$$

We determined $$N_{t1} = 41,450$$ cells, Δt = 2 d and f = 2 divisions per day54, thus Nt0 ≈ 2,591 ± 220 cells, or approximately 2,600, as noted in the main text.
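With the stated values, the back-calculation can be written out explicitly. Scaling the s.d. of the E11.5 cell count reported in the main text (±3,500 cells) by the same factor is an assumption about how the ±220 was obtained:

$$N_{t0} = \frac{41{,}450}{2^{2 \cdot 2}} = \frac{41{,}450}{16} \approx 2{,}591, \qquad \frac{3{,}500}{16} \approx 219$$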

Second, we estimated the fraction of uniquely labeled cells. For a number of uniformly distributed barcodes (N) and a small number of used barcodes (k) to label progenitor cells, the fraction F of uniquely labeled cells can be approximated as:

$$F = \left( {1-\frac{1}{N}} \right)^{k - 1}$$

However, the observed distribution of barcode abundance is not perfectly uniform in our library, which means that cells are more likely to be labeled with some barcodes than with others. The expected number47,55 of non-uniquely labeled cells, E(X), is then given by:

$$E\left( X \right) = k \ast \mathop {\sum}\limits_{i = 1}^N {p_i} (1 - (1 - p_i)^{k - 1})$$

where pi is the probability of picking the cloneID i = 1…N, k is the number of infected progenitor cells and N is the total number of injected cloneIDs. We typically injected N = 0.94 × 10^6 cloneIDs and estimated k = 2,591 cells, implying E(X) ≈ 11 non-uniquely labeled cells. This corresponds to 99.6% uniquely labeled cells, as stated in the main text.
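The two estimators above can be compared numerically. The sketch below evaluates the uniform approximation with the N and k quoted in the text, and then applies the E(X) formula to a small toy library. The toy library (1,000 barcodes, 50 cells, a two-level skew) is purely hypothetical and is not the measured barcode distribution, so it does not reproduce the reported value of 11.

```python
def uniform_unique_fraction(n_barcodes, k_cells):
    """F = (1 - 1/N)^(k - 1): fraction of uniquely labeled cells for uniform barcodes."""
    return (1.0 - 1.0 / n_barcodes) ** (k_cells - 1)


def expected_non_unique(probabilities, k_cells):
    """E(X) = k * sum_i p_i * (1 - (1 - p_i)^(k - 1)) for arbitrary barcode probabilities."""
    return k_cells * sum(
        p * (1.0 - (1.0 - p) ** (k_cells - 1)) for p in probabilities
    )


# Uniform approximation with the values quoted in the text.
N = 940_000
k = 2_591
f_uniform = uniform_unique_fraction(N, k)
print(f"uniform approximation: F = {f_uniform:.4f}")

# Hypothetical toy library: half of the barcodes are 3x more abundant than the rest.
toy_n, toy_k = 1_000, 50
uniform_p = [1.0 / toy_n] * toy_n
skewed_p = [1.5 / toy_n] * (toy_n // 2) + [0.5 / toy_n] * (toy_n // 2)

e_uniform = expected_non_unique(uniform_p, toy_k)
e_skewed = expected_non_unique(skewed_p, toy_k)
print(f"toy E(X): uniform = {e_uniform:.3f}, skewed = {e_skewed:.3f}")
```

For a uniform library, E(X) reduces to k(1 − F). Any skew in barcode abundance increases the expected number of non-uniquely labeled cells, which is why the measured distribution is used for the final estimate.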

### Mice

CD-1 mice (1× P11 female; 1× P11 male; 1× P12 male; 1× P12 female; 1× P14 male; 1× P14 female) obtained from Charles River Germany were used for all experiments. Animals were housed in standard housing conditions (ambient temperature of 20–22 °C and humidity of 40–60%), with a 12-h light/dark cycle with food and water ad libitum. All experimental procedures were approved by the Stockholms Norra Djurförsöksetiska Nämnd.

### Ultrasound-guided in utero microinjection

To target the developing mouse nervous system, a modified version of a published procedure50 was used. In brief, timed pregnancies were set up overnight, and plug-positive females were identified the next morning and counted as E0.5. Pregnant females at E9.5 of gestation were anesthetized with isoflurane; uterine horns were exposed; each embryonic forebrain was injected with 0.6 µl of lentivirus; and 4–8 embryos were injected per litter. Surgical procedures were limited to 30 min to maximize survival rates.

### Immunostaining and imaging of embryonic and postnatal tissue

E11.5 mouse embryos were collected in ice-cold PBS, fixed in fresh 4% formaldehyde (FA) overnight at 4 °C, placed in 30% sucrose overnight at 4 °C, embedded in Tissue-Tek O.C.T. (Sakura) and cut into 20-µm-thick sections. Postnatal mice were sacrificed by isoflurane overdose, followed by transcardial perfusion with ice-cold PBS, followed by 4% FA. Brains were post-fixed in 4% FA overnight, and 50-µm sections were prepared using a VT1000S vibratome (Leica).

Sections were incubated with blocking/permeabilization buffer (5% donkey serum and 0.3% Triton X-100 in DPBS) and stained with antibodies against EGFP (chicken, 1:2,000, Aves Labs, AB_2307313), NeuN (rabbit, 1:500, Atlas Antibodies, AB_10602305), Sox9 (goat, 1:300, R&D Systems, AB_2194160), Sox10 (goat, 1:300, R&D Systems, AB_442208) or Iba1 (rabbit, 1:500, Wako, AB_839504) at 4 °C overnight. Sections were then washed with DPBS and incubated with fluorophore-conjugated secondary antibodies (donkey, 1:500, Jackson ImmunoResearch) against the respective species (anti-chicken Alexa Fluor 488, 703-545-155, AB_2340375; anti-rabbit Alexa Fluor 647, 711-605-152, AB_2492288; anti-goat Alexa Fluor 647, 705-605-147, AB_2340437) and DAPI (1 µg ml−1) in blocking buffer at room temperature for 1 h, followed by washing and mounting. Confocal images were captured with a laser scanning confocal microscope (LSM700, Carl Zeiss) using a Plan-Apochromat ×10/0.45 or ×20/0.8 objective. Image processing and analysis was performed using Fiji software56.

### Single-cell dissociations and flow cytometry

Mice were sacrificed with an overdose of isoflurane, followed by transcardial perfusion with ice-cold artificial cerebrospinal fluid (aCSF: 87 mM NaCl, 2.5 mM KCl, 1.25 mM NaH2PO4, 26 mM NaHCO3, 75 mM sucrose, 20 mM glucose, 2 mM CaCl2, 2 mM MgSO4). Mice were decapitated; the brain was collected in ice-cold aCSF; 1-mm coronal slices were collected using an acrylic brain matrix for mouse (World Precision Instruments); and the regions of interest were microdissected under a stereo microscope with a cooled platform. Tissue pieces were dissociated using the Papain dissociation system (Worthington Biochemical) with an enzymatic digestion step of 20–30 min, followed by manual trituration using fire-polished Pasteur pipettes. Dissociated tissue pieces were filtered through a sterile 30-µm aCSF-equilibrated Filcon strainer (BD Biosciences) into a 15-ml centrifuge tube containing 9 ml of aCSF and 0.5% BSA. The suspension was mixed well; cells were pelleted in a cooled centrifuge at 300g for 5 min; supernatant was carefully removed; and cells were resuspended in 1 ml of aCSF containing reconstituted ovomucoid protease inhibitor with BSA. A discontinuous density gradient was prepared by carefully overlaying 2 ml of undiluted albumin inhibitor solution with 1 ml of cell suspension, followed by centrifugation at 100g for 6 min at 4 °C. The supernatant was carefully removed; the cell pellet was resuspended in 1 ml of aCSF containing 0.5% BSA; and the cell suspension was transferred to a round-bottom tube (BD Biosciences) for flow cytometry. Single EGFP+ cells were sorted on a BD Influx equipped with a 140-µm nozzle and a cooling unit with a sample temperature of 4 °C and collected into a DNA LoBind tube (Eppendorf) containing aCSF with 0.5% BSA. All EGFP+ cells per sample were sorted and pelleted in a cooled centrifuge at 300g for 5 min. 
The supernatant was carefully removed; the cell pellet was resuspended in a minimal volume of aCSF; and the cell concentration was determined using a Bürker chamber. Importantly, aCSF equilibrated in 95% O2/5% CO2 was used in all steps, and cells were always kept on ice or at 4 °C except for enzymatic digestion.

### scRNA-seq

Two brains (brains 1–2) were processed using the 10x Genomics Chromium Single Cell Kit Version 2 (v2), and three brains (brains 3–5) were processed using the 10x Genomics Chromium Single Cell Kit Version 3 (v3) (Supplementary Table 4). Suspensions from barcoded brains were prepared as described above, counted, resuspended in aCSF and added to 10x Chromium RT mix. Suspensions from control brains were prepared as described above, diluted in aCSF to concentrations between 800 and 1,000 cells per microliter and added to 10x Chromium RT mix. For downstream cDNA synthesis (12 polymerase chain reaction (PCR) cycles), library preparation and sequencing, we followed the manufacturer’s instructions.

### Data normalization and cell filtering for scRNA-seq

Overall, three regions from four barcoded brains and from one control brain were sequenced using 10x Chromium v2 or v3. Because the number of cells per region for the control brain was much higher than the corresponding number of cells for any barcoded brain (Supplementary Table 4), we downsampled the control datasets to about 9,000 cells (CX), 8,000 cells (HC) and 7,000 cells (STR). The gene expression matrices obtained after running Cell Ranger count were merged by region (CX, STR and HC) using merge() in Seurat version 3 (ref. 57). All genes expressed in ~0.1% of all cells were kept, and all cells expressing 500–10,000 genes were kept in the merged data. The data were log-normalized with a scale factor of 10,000 using the NormalizeData() function, followed by linear transformation (scaling) of data. Doublet removal was done using mutually exclusive markers for various cell types (Igf2, Pf4, Hexb, Rsph1, Pdgfra, Bmp4, Mog, Clic6, Rgs5, Cldn5, Reln, Igfbpl1, Slc32a1, Slc17a7 and Aldoc). A cell cycle score was assigned to each cell, and the difference between the G2M and S phase scores was regressed out. Highly variable features were selected using FindVariableFeatures(), followed by principal component analysis and the use of significant principal components (between 10 and 30) for graph-based clustering (shared nearest neighbor graph calculation and clustering using Louvain). After determining differentially expressed genes, we manually assigned major cell classes to each cluster (astroependymal, immune, neurons, oligodendrocytes and vascular) using canonical markers. We then split cells by major cell type, performed subclustering and extensively annotated each cluster based on canonical marker genes from published data and from www.mousebrain.org. At each step, we removed (1) clusters classified with ambiguous labels and (2) outlier cells on the fringes of clusters in uniform manifold approximation and projection (UMAP) space to further eliminate doublets. 
We merged all cells into a single file together with metadata and annotations. The filtered cellIDs were exported and used as input for cloneID extraction and clone calling.
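For readers unfamiliar with the log-normalization step, the following sketch shows the per-cell computation in plain Python: counts are divided by the cell total, multiplied by the scale factor of 10,000 and transformed with log(1 + x). The gene-count filter of 500–10,000 genes is included as a second helper. The function names are illustrative and are not part of Seurat, and the bounds are assumed to be inclusive.

```python
import math


def log_normalize(counts, scale_factor=10_000):
    """Log-normalize the raw counts of one cell: log1p(count / total * scale_factor)."""
    total = sum(counts)
    return [math.log1p(c / total * scale_factor) for c in counts]


def keep_cell(counts, min_genes=500, max_genes=10_000):
    """Keep a cell if its number of detected genes lies within the stated range
    (bounds assumed inclusive)."""
    n_detected = sum(1 for c in counts if c > 0)
    return min_genes <= n_detected <= max_genes


# Toy cell with three genes (hypothetical counts).
toy_counts = [1, 1, 2]
normalized = log_normalize(toy_counts)
print([round(v, 3) for v in normalized])
```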

### Biological pathway analysis between barcoded and control samples

To investigate the effect of lentivirus transduction on cellular physiology, we analyzed 195 genes expressed during virus infection (KEGG pathway: mmu05170). We downsampled the dataset such that each cell type per condition contained an equal number of cells. We plotted expression values of non-zero expressed genes related to virus infection for single cells as heat maps grouped by condition or major cell type. For each cell type, we analyzed differentially expressed genes between both conditions (logfc.threshold ≥ 1) on normalized and variance-stabilized downsampled datasets.

### CloneID enrichment from cDNA

A nested PCR strategy was employed for enrichment of cloneIDs from full-length cDNA (Supplementary Tables 2 and 3). Each amplicon library was sequenced on a MiSeq or NovaSeq 6000 (Supplementary Table 4). We used Cell Ranger version 3.0.1 count for data processing of amplicon libraries and the TREX pipeline (see below) to extract cloneIDs.

### Extraction of cloneIDs and clone calling for scRNA-seq

Raw 10x Genomics Chromium v2 or v3 sequencing data were pre-processed with Cell Ranger version 3.0.1. As reference for read mapping, Cell Ranger was configured to use a custom reference consisting of the GRCm38 (mm10) genome and an additional sequence representing the H2B-EGFP-N transgene, in which the cloneID region was marked with ‘N’ wildcard characters. The resulting BAM file of aligned sequencing reads was then processed with TREX, our custom Python tool for cloneID extraction and clone calling. TREX uses only reads from filtered cells (see above) that align to the H2B-EGFP-N transgene. CloneIDs are recovered from those alignments that cover the masked cloneID region. If soft clipping is encountered at one of the bases adjacent to the region, the alignment is assumed to continue ungapped into the region. All cloneIDs with identical unique molecular identifiers (UMIs) that come from the same cell (have the same cellID) are collapsed to a consensus sequence. To error-correct cloneIDs, they are single-linkage clustered using a Hamming distance of at most 5 as linking criterion. In each cluster, all of its cloneIDs are replaced with the cloneID occurring most frequently in that cluster. From the resulting final cellID–cloneID combinations, those that are supported by only one UMI and one read are discarded. Also removed are cloneIDs that are supported by only one UMI and have a high frequency in another cell. We assume that those cloneIDs are contaminations.
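The error-correction step can be illustrated with a small sketch: cloneIDs are single-linkage clustered by Hamming distance, and every member of a cluster is replaced with the most frequent cloneID in that cluster. This is not the TREX implementation. The function names, the all-pairs comparison and the toy 8-bp sequences with a distance threshold of 1 are assumptions chosen for readability; the pipeline described above uses 30-bp cloneIDs and a threshold of 5.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def correct_clone_ids(counts, max_dist=5):
    """Map each cloneID to the most frequent cloneID of its single-linkage cluster.

    counts: dict mapping cloneID sequence -> number of supporting observations.
    """
    ids = list(counts)
    parent = {c: c for c in ids}

    def find(c):
        # Union-find lookup with path compression.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    # Link every pair within the distance threshold (quadratic; fine for a sketch).
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if hamming(a, b) <= max_dist:
                parent[find(a)] = find(b)

    clusters = {}
    for c in ids:
        clusters.setdefault(find(c), []).append(c)

    mapping = {}
    for members in clusters.values():
        representative = max(members, key=lambda c: counts[c])
        for c in members:
            mapping[c] = representative
    return mapping


# Toy example: two sequencing-error variants chain onto one abundant cloneID.
toy_counts = {"AAAAAAAA": 10, "AAAAAAAT": 1, "AAAAAATT": 1, "CCCCCCCC": 4}
corrected = correct_clone_ids(toy_counts, max_dist=1)
print(corrected)
```

Because the clustering is single-linkage, "AAAAAATT" joins the cluster through "AAAAAAAT" even though it differs from the most frequent sequence at two positions.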

The cleaned data are transformed into a count matrix showing UMI counts for each cloneID in each cell. This matrix is used to sort cells into clones of cells with similar cloneID combinations. In brief, the Jaccard similarity between each pair of cloneID-expressing cells was calculated using the R package proxy58. A Jaccard score of 0.7 was used as a cutoff for related cells15, and clones were defined as groups of two or more related cells.
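Clone calling from cloneID combinations can be sketched as follows. The example assumes that cloneID content is binarized to presence or absence per cell, that two cells are related when their Jaccard similarity is at least 0.7, and that clones are the connected components of the resulting graph. The grouping rule and the inclusive cutoff are assumptions made for this illustration; the text specifies only the cutoff and the minimum clone size of two cells.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets of cloneIDs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def call_clones(cell_clone_ids, cutoff=0.7):
    """Group cells into clones: connected components of the 'related cells' graph.

    cell_clone_ids: dict mapping cellID -> set of cloneIDs detected in that cell.
    Returns a list of sets of cellIDs, each with at least two cells.
    """
    cells = list(cell_clone_ids)
    neighbors = {c: set() for c in cells}
    for a, b in combinations(cells, 2):
        if jaccard(cell_clone_ids[a], cell_clone_ids[b]) >= cutoff:
            neighbors[a].add(b)
            neighbors[b].add(a)

    clones, seen = [], set()
    for start in cells:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first traversal of one component
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbors[current] - component)
        seen |= component
        if len(component) >= 2:
            clones.append(component)
    return clones


# Toy data (hypothetical cellIDs and cloneIDs).
toy_cells = {
    "cell1": {"A", "B", "C"},
    "cell2": {"A", "B", "C"},
    "cell3": {"A", "B", "C", "D"},  # Jaccard 0.75 with cell1 and cell2
    "cell4": {"E"},                 # unrelated singleton
    "cell5": {"A"},                 # Jaccard 0.33 with cell1, below the cutoff
}
toy_clones = call_clones(toy_cells)
print(toy_clones)
```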

### Calculation of clonal coupling scores

For each brain, we calculated clonal coupling scores20 considering all clones containing at least three cells per clone. We randomized the clone–cell type associations, while preserving the number of cell types related to each clone and the number of clones related to each cell type, to create 1,000 randomized datasets59. We compared the observed clonal data to randomized datasets to obtain empirical P values and z-scores indicating, for each pair of cell types, how often we expect to see the observed clonal association. To summarize the clonal coupling scores for four brains, we kept only cell types found in clones in all brains. For each brain, the Pearson correlations of z-scores between each pair of cell types were calculated, and the correlation coefficients were transformed using Fisher z-transformation and averaged to represent clonal coupling scores for all brains.
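The randomization logic can be sketched on a toy dataset in which each clone is reduced to the set of cell types it contains. The sketch uses pairwise swaps of clone–cell type associations, which keep both the number of cell types per clone and the number of clones per cell type fixed, and then derives a z-score for every pair of cell types. The swap scheme, the number of swaps, the function names and the toy data are assumptions made for illustration and may differ from the referenced procedure. The subsequent Pearson correlation step is omitted, and only the averaging of Fisher z-transformed coefficients is shown.

```python
import math
import random
from itertools import combinations
from statistics import mean, pstdev


def shared_clone_counts(clone_types):
    """For each pair of cell types, count the clones that contain both."""
    counts = {}
    for types in clone_types:
        for pair in combinations(sorted(types), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


def randomize(clone_types, n_swaps, rng):
    """Shuffle clone-cell type associations while preserving both margins."""
    sets = [set(t) for t in clone_types]
    edges = [(i, t) for i, s in enumerate(sets) for t in s]
    for _ in range(n_swaps):
        a, b = rng.sample(range(len(edges)), 2)
        (i, s), (j, t) = edges[a], edges[b]
        # Swap only if it creates no duplicate association.
        if i != j and s != t and t not in sets[i] and s not in sets[j]:
            sets[i].remove(s)
            sets[i].add(t)
            sets[j].remove(t)
            sets[j].add(s)
            edges[a], edges[b] = (i, t), (j, s)
    return sets


def coupling_z_scores(clone_types, n_random=1000, seed=0):
    """z-score of observed shared-clone counts against randomized datasets."""
    rng = random.Random(seed)
    observed = shared_clone_counts(clone_types)
    types = sorted({t for s in clone_types for t in s})
    null = {pair: [] for pair in combinations(types, 2)}
    n_edges = sum(len(s) for s in clone_types)
    for _ in range(n_random):
        counts = shared_clone_counts(randomize(clone_types, 10 * n_edges, rng))
        for pair in null:
            null[pair].append(counts.get(pair, 0))
    z = {}
    for pair, values in null.items():
        sd = pstdev(values)
        z[pair] = (observed.get(pair, 0) - mean(values)) / sd if sd > 0 else 0.0
    return z


def mean_fisher_z(correlations):
    """Average of Fisher z-transformed correlation coefficients (requires |r| < 1)."""
    return mean(math.atanh(r) for r in correlations)


# Toy clones (hypothetical): cell types observed in each clone.
toy_clones = [
    {"neuron", "astrocyte"}, {"neuron", "astrocyte"}, {"neuron", "astrocyte"},
    {"oligodendrocyte", "microglia"}, {"oligodendrocyte", "astrocyte"},
    {"neuron"}, {"microglia"},
]
z_scores = coupling_z_scores(toy_clones, n_random=200, seed=0)
for pair, z in sorted(z_scores.items()):
    print(pair, round(z, 2))
```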

### Calculation of detection probabilities at low sampling rates

We estimated the probability of sampling (without replacement) at least one cell from each cell type in each clone using a multivariate hypergeometric distribution implemented in the function dmvhyper from extraDistr60. Given the number of cells from different cell types present in a clone and the sampling rate, we can calculate

$$P_{detect\_all}\left( cell\_type\_distr, s \right) = \sum\limits_{x \in X}^{sampled\_cells} PMF_{mv\_hyper}\left( x, cell\_type\_distr, s \right)$$

where

• cell_type_distr is the number of cells from each cell type in the clone.

• s is the number of sampled cells.

• X is all possible combinations of s sampled cells for the given clone, so that we have at least one cell from each cell type.

• PMFmv_hyper is the probability mass function for the multivariate hypergeometric distribution.

Next, assume that we have N similar clones with the same number of cells distributed over the same cell types as above. If we sample cells from all the N clones, the probability of sampling at least one cell from each cell type in at least one clone is given by the binomial distribution:

$$\mathop {\sum }\limits_{n = 1}^N PMF_{binom}\left( {n,N,P_{detect\_all}\left( {cell\_type\_distr,s} \right)} \right)$$
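Both probabilities can be computed directly from binomial coefficients. The sketch below enumerates all sample compositions with at least one cell per cell type and sums the multivariate hypergeometric probabilities. It then uses the identity that the binomial sum over n = 1…N equals 1 − (1 − P)^N. It is a plain-Python illustration with hypothetical function names, not the extraDistr-based code, and it assumes that s cells are sampled from each clone.

```python
from itertools import product
from math import comb, prod


def p_detect_all(cell_type_distr, s):
    """Probability that s cells sampled without replacement from one clone
    include at least one cell of every cell type."""
    total = sum(cell_type_distr)
    if s > total or s < len(cell_type_distr):
        return 0.0
    denominator = comb(total, s)
    probability = 0.0
    # Enumerate compositions x with x_i >= 1 for every cell type and sum(x) == s.
    for x in product(*(range(1, k + 1) for k in cell_type_distr)):
        if sum(x) == s:
            ways = prod(comb(k, xi) for k, xi in zip(cell_type_distr, x))
            probability += ways / denominator
    return probability


def p_detect_in_any_clone(cell_type_distr, s, n_clones):
    """Probability that at least one of n_clones identical clones is fully detected.
    Equals the sum over n = 1..N of the binomial PMF."""
    p = p_detect_all(cell_type_distr, s)
    return 1.0 - (1.0 - p) ** n_clones


# Toy clone (hypothetical): two cell types with two cells each, two cells sampled.
p_single = p_detect_all([2, 2], 2)
p_any = p_detect_in_any_clone([2, 2], 2, n_clones=3)
print(p_single, p_any)
```

In the toy case, 4 of the 6 possible two-cell samples contain both cell types, so the single-clone probability is 2/3, and three such clones give 1 − (1/3)^3 = 26/27.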

### Tissue processing and library preparation for ST

Mice were sacrificed. Then, brains were collected in ice-cold aCSF, transferred to ice-cold Tissue-Tek O.C.T. (Sakura) and snap-frozen at −40 °C in a bath of isopentane and dry ice. Eight consecutive 10-µm sections around AP −1.65 mm from bregma were collected for processing using the 10x Genomics Visium Spatial Gene Expression Kit.

The first four sections (V9–V12) were fixed in ice-cold methanol, followed by rapid imaging (<15 min for all sections) of EGFP and transmitted light signal using an epifluorescence microscope (Axio Imager.Z2, Carl Zeiss) equipped with a Plan-Neofluar ×10/0.3 M27 objective before further processing following the manufacturer’s instructions. The remaining four sections (V13–V16) were fixed in ice-cold methanol, briefly rinsed with DPBS, incubated with DPBS containing DAPI (1 µg ml−1), FluoTag-X4 anti-GFP conjugated to Atto488 (1:200, Nanotag Biotechnologies), NeuN-Alexa 568 (rabbit, 1:400, Abcam, ab207282), Olig2-Alexa 647 (rabbit, 1:200, Abcam, ab225100) and RNaseOUT (1 U µl−1) at room temperature for 10 min. The sections were washed two times for 1 min with DPBS containing RNaseOUT (1 U µl−1) and mounted in 85% glycerol containing RNaseOUT (1 U µl−1), and images were captured for all four fluorescent channels as well as the transmitted light channel. The coverslip was removed by immersing the slide in water. Then, the slide was dried for 5 min at 37 °C and further processed following the manufacturer’s instructions, starting with the tissue permeabilization step.

### Data and image analysis for ST

For each section, the registered microscope image was used for manual alignment and tissue detection using the Visium Manual Alignment Wizard (10x Genomics), followed by running Space Ranger version 1.0.0 (Supplementary Table 4). Each dataset was separately processed in Seurat version 3 (ref. 57), and only spots that expressed at least 300 genes were kept. We used SCTransform61 for data processing, merged datasets and exported spot IDs as input for cloneID extraction and clone calling.

Fluorescent images acquired for four sections (V13–V16) were processed in R using a custom segmentation workflow that entails (1) two-dimensional (2D) fast Fourier transform convolution filtering, (2) image correction, (3) thresholding, (4) removal of speckles or other abnormal shapes and (5) watershedding to identify and label cells. The segmentation workflow was applied to each of three channels: EGFP (barcoded cells), NeuN (neurons) and Olig2 (oligodendrocytes). To find co-localizing signals across two channels, A and B, an overlap score was estimated for all pairs of nuclei (i, j) as intersect(Ai, Bj)/min(Ai, Bj) where Ai and Bj are the sets of pixels defining nuclei i and j. An overlap score of at least 50% was used to determine if the signal originated from the same nucleus. For alignment of all four sections, we used a manual image registration method implemented in the ManualAlignImages function from the STUtility package62. The raw NeuN images were masked, and tissue edges were manually rotated or shifted to fit the image and spot coordinates of images V14–V16 to the reference image V13. All capture spot coordinates from V14–V16 were transformed to align with the coordinate system of V13 using the learned transformation functions. The same transformations were applied to the coordinates of the previously segmented nuclei in V14–V16. We calculated the pairwise 2D Euclidean distances between aligned spots and nuclei and selected the cell with the shortest distance to the centroid position of each cloneID+ spot for assignment of cell type identity. For alignment of all four H&E-stained sections (V9–V12), we used an automated image registration method implemented in the AlignImages function from STUtility with image V9 as reference. All capture spot coordinates from V10–V12 were transformed to align with the coordinate system of V9, using the learned transformation function.
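The overlap score and the spot-to-nucleus assignment can be illustrated with a short sketch. It interprets the overlap score as the number of shared pixels divided by the pixel count of the smaller nucleus, which is an assumption about the notation intersect(Ai, Bj)/min(Ai, Bj). Function names and toy coordinates are hypothetical; the analysis itself was performed in R.

```python
import math


def overlap_score(a, b):
    """|A ∩ B| / min(|A|, |B|) for two nuclei given as sets of (row, col) pixels."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def nearest_nucleus(spot_xy, nuclei_xy):
    """Index of the nucleus centroid with the shortest 2D Euclidean distance to a spot."""
    return min(range(len(nuclei_xy)), key=lambda i: math.dist(spot_xy, nuclei_xy[i]))


# Toy nuclei from two channels (hypothetical pixel coordinates).
egfp_nucleus = {(0, 0), (0, 1), (1, 0), (1, 1)}
neun_nucleus = {(1, 1), (1, 2)}
score = overlap_score(egfp_nucleus, neun_nucleus)
same_nucleus = score >= 0.5  # threshold of at least 50% from the text

# Toy spot centroid and candidate nucleus centroids in the aligned coordinate system.
closest = nearest_nucleus((0.0, 0.0), [(3.0, 4.0), (1.0, 1.0), (-5.0, 0.0)])
print(score, same_nucleus, closest)
```

Normalizing by the smaller nucleus means that a small nucleus fully contained in a larger one scores 100%, regardless of the size difference between the two segmented objects.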

Registration of aligned images of brain tissue sections to the standardized Allen Mouse Brain Atlas was done using WholeBrain40. We used an extended and inverted version of the H&E target image V9 and the NeuN target image V13 with bregma coordinates AP −1.65 mm for registration of an entire brain section to the Allen Mouse Brain Atlas.

### Statistics and reproducibility

No statistical method was used to pre-determine sample size, but our sample sizes match typical numbers used in scRNA-seq and ST experiments2,6,23,63,64. For TREX, one male and one female EGFP+ animal were randomly selected from two different litters, and the control mouse was randomly selected from a third litter. For Space-TREX, one EGFP+ mouse was randomly selected from a pool of littermates. For TREX, we collected all EGFP+ cells per brain region to sample the maximum number of clonally related cells and to have enough cells for each cell type, allowing further quantitative analysis. Data distribution was assumed to be normal, but this was not formally tested. No data were excluded from the analyses. The allocation to experimental groups (barcoded versus control) could not be randomized because it was necessary to specifically isolate EGFP+ cells present only in barcoded brains. Blinding was not applicable because control and barcoded samples were of similar age, differentiable based on EGFP fluorescence, and our results were based on analysis of clonal barcodes present only in EGFP+ cells.

### Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.