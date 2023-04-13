Dandelion enables holistic scVDJ-seq analysis

As Dandelion operates on the AIRR data format, it is highly interoperable with existing AIRR tools13,23. It can serve as a bridge between the various tools for AIRR analysis and the single-cell software ecosystem, for example, scverse14,24 (Fig. 1a). Dandelion has been certified by the AIRR Software Working Group to be AIRR standards compliant.

Fig. 1: Holistic scVDJ-seq analysis pipeline. a, Schematic illustration showing that Dandelion bridges methods from single-cell V(D)J workflows, such as AIRR standards, and single-cell gene expression analysis software, and combines them with new methods of its own to create a holistic analysis pipeline. b, Schematic illustration of the Dandelion workflow. Paired single-cell gene expression (scRNA-seq) and AIR repertoire (scVDJ-seq) data are generated, followed by mapping of the sequencing reads. From the mapped results, Dandelion provides refined contig annotations with BCR mutation calling, improved γδTCR mapping and identification of multi-J mapping contigs. It also provides downstream analysis after integration with scRNA-seq results. Apart from allowing users to explore clonotype networks and V(D)J usage, Dandelion also supports building a V(D)J feature space on pseudobulked cells, which can be used for differential V(D)J usage and pseudotime inference. Additional unique features provided by Dandelion are boxed in orange.

Dandelion can be used to analyze single-cell BCR, αβTCR and γδTCR data, allowing for mutation calling, improved γδTCR annotation, analysis of productive and nonproductive V(D)J contigs and identification of unspliced J gene alignments (Fig. 1b). Dandelion performs quality control checks, clonotype calling and network generation for downstream analyses. It is designed to work with AIRR-formatted input or 10X Genomics’ cellranger vdj output. A main feature of Dandelion is the creation of a ‘V(D)J feature space’ that can be used to perform and visualize differential TCR/BCR usage across cell pseudobulks or neighborhoods and to infer pseudotime trajectories. Extended Data Fig. 1 shows a summary of features in Dandelion and other pipelines. Dandelion was previously applied to a large COVID-19 study4, which showcased its network-based repertoire diversity analysis method.

Dandelion provides a streamlined preprocessing pipeline

For optional re-annotation of contigs, Dandelion expects 10X Genomics’ cellranger vdj output files (for example, all_contig_annotations.csv and all_contig.fasta).

Similar to Change-O23, Dandelion re-annotates V(D)J contigs using igblastn25 with reference sequences contained in the international ImMunoGeneTics information system (IMGT) database26. blastn is also used to check the D and J genes separately (with the same settings as igblastn25). The additional blastn step allows us to (1) apply an e-value cutoff for D/J calls to retain only high confidence calls; (2) identify ‘multi-J mapping’ contigs (see below); and (3) recover contigs without V gene calls (removed by igblastn). We packaged this preprocessing workflow into a Singularity container to streamline the user experience and avoid the difficulties of setting up the pipeline.
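The e-value cutoff in step (1) can be illustrated with a minimal sketch. It assumes AIRR-style records in which `d_support` and `j_support` hold the alignment e-values; the column names Dandelion actually writes, and the cutoff it applies, are not stated in this section, so `EVALUE_CUTOFF` and the helper name below are hypothetical.

```python
# Hypothetical cutoff for illustration; the value used by Dandelion is not given here.
EVALUE_CUTOFF = 1e-4


def filter_low_confidence_calls(contig, cutoff=EVALUE_CUTOFF):
    """Return a copy of an AIRR-style record with weakly supported D/J calls blanked."""
    cleaned = dict(contig)
    for segment in ("d", "j"):
        support = cleaned.get(f"{segment}_support")
        # A missing support value means there was no alignment to judge.
        if support is None or support > cutoff:
            cleaned[f"{segment}_call"] = ""
    return cleaned


contigs = [
    {"sequence_id": "c1", "d_call": "TRBD1", "d_support": 2e-3,
     "j_call": "TRBJ2-1", "j_support": 1e-20},
    {"sequence_id": "c2", "d_call": "", "d_support": None,
     "j_call": "TRBJ1-5", "j_support": 0.5},
]
filtered = [filter_low_confidence_calls(c) for c in contigs]
```

Blanking the call rather than dropping the record keeps the contig available for the other checks, such as multi-J mapping and recovery of contigs without V genes.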

Nonproductive contigs do not translate into functional proteins and are often filtered out by other scVDJ-seq analysis pipelines, for example, scirpy13, scRepertoire15 and Platypus18 (Extended Data Fig. 1). In the Immcantation23 workflow, nonproductive contigs are preserved and there are specific instructions for filtering or retention during the annotation and clone definition steps. Moreover, igblastn is a V gene annotation tool25 and filters out contigs without a V gene. We found that a substantial proportion of αβTCR, γδTCR and BCR contigs were nonproductive in fetal human tissues3 and that the majority were due to absent V genes, with the exception of the TRA locus, where most were due to the presence of premature stop codons (Fig. 2a). This pattern was consistent even after excluding thymic samples to account for developing T cells (Extended Data Fig. 2a). These nonproductive contigs without V genes were captured in scVDJ-seq because the rapid amplification of 5′ complementary DNA (cDNA) ends (5′ RACE) technology used in the protocol does not require primers against V genes for targeted enrichment, in contrast to the previous multiplex PCR approach (Extended Data Fig. 2b). They likely represent products of partial or failed recombination, and we reasoned that they are still biologically meaningful, reflecting a cell’s history and origin. The Immcantation workflow diverts these contigs into a ‘failed’ file that is not typically exposed to the user. Therefore, Dandelion does not automatically filter out nonproductive contigs, and these data have utility, as discussed later.
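A per-locus tally of the kind summarized in Fig. 2a could be sketched as follows. The field names (`locus`, `productive`, `v_call`, `stop_codon`) follow the AIRR rearrangement schema; the order in which causes are checked (absent V gene first, then stop codon) is an assumption made for illustration and may differ from the rule used for the figure.

```python
from collections import Counter


def nonproductive_cause(contig):
    """Label one AIRR-style contig record with a coarse productivity status."""
    if contig["productive"]:
        return "productive"
    # Assumed precedence: a missing V gene is reported before a stop codon.
    if not contig["v_call"]:
        return "no V gene"
    if contig["stop_codon"]:
        return "premature stop codon"
    return "other"


def tally_by_locus(contigs):
    """Count status labels separately for each locus."""
    counts = {}
    for contig in contigs:
        counts.setdefault(contig["locus"], Counter())[nonproductive_cause(contig)] += 1
    return counts


contigs = [
    {"locus": "TRB", "productive": False, "v_call": "", "stop_codon": False},
    {"locus": "TRB", "productive": True, "v_call": "TRBV7-2", "stop_codon": False},
    {"locus": "TRA", "productive": False, "v_call": "TRAV12-1", "stop_codon": True},
]
tally = tally_by_locus(contigs)
```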

Fig. 2: Dandelion offers improved contig annotations. a, Left, barplot of the proportion of contigs that are productive or nonproductive in each locus. Right, barplot showing the causes of nonproductive contigs in each locus. b, Schematic illustration of the V(D)J rearrangement process and the potential cause of multi-J mapping. c, Boxplot of the proportion of contigs with multi-J mapping, in the presence (blue) or absence (orange) of V genes. Only samples with at least ten contigs are shown. Boxes capture the first to third quartiles and whiskers span a further 1.5× interquartile range on each side of the box (two-sided Wilcoxon rank-sum test). The sample sizes of all boxplots from left to right are 51, 46, 52, 55, 26, 29, 20, 29, 33, 27, 33, 29, 34 and 25. d, Top, logistic regression formula to explore factors associated with multi-J mapping. Bottom, volcano plot summarizing logistic regression results (ref. 3); y axis: −log10(BH adjusted P value); x axis: log(odds ratio). Variables that were also significant in our control/cycloheximide-treated PBMC dataset are highlighted in red (associated with increased multi-J mapping) or blue (associated with decreased multi-J mapping). e, Sequence logos covering the last 11 nucleotides at the 3′ end of the J gene (positions 1–11) and the first 10 nucleotides of the neighboring intron (positions 12–21), for J genes associated with increased (top) or decreased (bottom) multi-J mapping. ‘GTAAGT’ is a known consensus splicing motif at positions 12–17, that is, +1 to +6 in the intron. J genes associated with increased multi-J mapping were less likely to have T in position 17 (logistic regression; two-tailed P value). They were also more likely to have T in position 6 (logistic regression; two-tailed P value). f, Swarmplots of the fraction difference of sc-γδTCR contigs (n = 33) annotated by Dandelion versus 10X cellranger vdj (v6.1.2). The red dashed line marks the threshold of 0, above which Dandelion recovers more γδTCR contigs than 10X. Left, all high confidence contigs. Right, high confidence productive contigs. Data for a, c, d (bottom) and f were taken from ref. 3 and each dot represents a sample.

We also discovered that multiple J genes can be sequentially mapped onto different regions in the same mRNA contig, a phenomenon we termed ‘multi-J mapping’. The majority of the most frequent multi-J mapping contigs contained two to four neighboring J genes interspersed with introns (Supplementary Table 1). As RNA splicing, rather than DNA recombination, is the process that links the chosen J gene to the C gene, contigs with multi-J mapping are likely products of partially spliced transcripts (Fig. 2b). It is biologically plausible that the J gene nearest to the 5′ end is the intended exon that would be expressed in the mature mRNA.
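As an illustration only, flagging a multi-J mapping contig from separate J gene alignments might look like the sketch below. The input format (a list of `(gene, start, end)` hits in contig coordinates, 5′ to 3′) and the function names are hypothetical and are not taken from Dandelion itself.

```python
def distinct_j_hits(hits):
    """Keep the 5'-most hit per J gene, ordered by start position on the contig."""
    best = {}
    for gene, start, end in hits:
        if gene not in best or start < best[gene][0]:
            best[gene] = (start, end)
    return sorted((s, e, g) for g, (s, e) in best.items())


def annotate_multi_j(hits):
    """Flag a contig as multi-J and report the J gene nearest to the 5' end."""
    kept = []
    for start, end, gene in distinct_j_hits(hits):
        # Only count J genes whose alignment lies beyond the previous one.
        if not kept or start > kept[-1][1]:
            kept.append((start, end, gene))
    return {
        "multi_j": len(kept) > 1,
        "j_genes": [g for _, _, g in kept],
        "five_prime_j": kept[0][2] if kept else None,
    }


# Two neighboring J genes on one contig, plus a redundant hit to the first gene.
hits = [("TRAJ33", 310, 366), ("TRAJ32", 1040, 1100), ("TRAJ33", 312, 366)]
result = annotate_multi_j(hits)
```

Reporting the 5′-most J gene separately mirrors the reasoning above that this gene is the one expected in the mature mRNA.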

We next investigated factors that might contribute to multi-J mapping. We first noted that nonproductive contigs without V genes were more likely to have multi-J mapping (Fig. 2c). This difference could be due to nonsense-mediated decay (NMD), an RNA degradation process that is triggered by premature stop codons27. Multi-J mapping contigs containing a V gene will initiate translation from the V gene, triggering degradation by NMD due to premature stop codons in J gene introns, whereas those without a V gene cannot be translated and will therefore evade degradation by NMD. To test the contribution of NMD to multi-J mapping, we compared peripheral blood mononuclear cells (PBMCs) treated with cycloheximide, a translation inhibitor that blocks NMD, with control cells. Treatment resulted in an increase in the proportion of multi-J mapping in TCR contigs with V genes (Extended Data Fig. 2c), supporting the idea that NMD recognizes and degrades V-gene-containing multi-J mapping contigs.

Using a logistic regression model, we found that there was a significant interaction (Benjamini–Hochberg (BH) adjusted P = 7.07 × 10⁻⁴) between V gene presence and cycloheximide treatment on multi-J mapping (Fig. 2d; ref. 3; Supplementary Table 2, cycloheximide treatment, and Supplementary Table 3), supporting the above findings. However, the significant noninteracting V gene term (BH adjusted P = 5.73 × 10⁻¹⁸²) suggests that NMD may only partially account for the effect of V genes on multi-J mapping. Furthermore, we found that the known consensus motif for splicing, ‘GTAAGT’ at positions +1 to +6 of the adjacent intron28, was disrupted in J genes associated with more multi-J mapping (Fig. 2e and Supplementary Table 4). In conclusion, specific cell types, J gene identity, V gene presence and NMD are factors that may contribute to multi-J mapping (Extended Data Fig. 2d).
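For readers unfamiliar with the BH adjustment applied to these P values, the standard procedure can be sketched in a few lines of dependency-free Python. The input values are made up for illustration and are not taken from the Supplementary Tables.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that monotonicity can be enforced.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.20]  # illustrative values only
adj = bh_adjust(raw)
```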

Dandelion also performs γδTCR contig annotation. There are two existing methods for sc-γδTCR mapping: (1) 10X Genomics’ cellranger vdj (primarily tailored for αβTCR contigs) and (2) TRUST4 (ref. 29), which performs de novo contig assembly and annotation. The cellranger software can reconstruct γδTCR contigs but most versions struggle with annotating them (Supplementary Note). While TRUST4 can yield sc-TCR annotations, including γδTCR, it relies on the presence of a V gene in the contig and is thus unable to handle nonproductive contigs without V genes. For comparison, we processed 33 γδTCR libraries3 in two ways. One mapping was done with cellranger 6.1.2 to the 10X GRCh38 5.0.0 V(D)J reference, with the contigs identified by cellranger as high confidence subsequently re-annotated with Dandelion. Another mapping was done with cellranger 6.1.2 to the 5.0.0 reference modified to obtain annotated γδTCR contigs as per 10X Genomics’ workaround instructions. We saw a consistently higher recovery rate of both high confidence γδTCR contigs and high confidence productive γδTCR contigs in the mapping postprocessed with Dandelion (high confidence contigs: P = 5.39 × 10⁻⁷; high confidence productive contigs: P = 3.14 × 10⁻⁶; Wilcoxon signed-rank test; rank correlations were 1 and 0.98, respectively; Fig. 2f). While 10X Genomics has introduced some γδTCR support with cellranger 7.0.0, the results were inferior to the prior workaround from version 6 (Extended Data Fig. 2e).

Creating a V(D)J feature space

To better leverage the combined gene expression and AIR repertoire data, we introduced an analysis strategy to create a pseudobulked V(D)J feature space, transforming V(D)J data from categorical to continuous format for downstream applications (Fig. 3a). Transcriptionally similar cells are first grouped into pseudobulks, which can be based on metadata features or on partially overlapping cell neighborhoods30. For instance, cells can be pseudobulked by cell type, donor and organ to perform differential analysis across cell types while controlling for donor and organ differences. For trajectory analysis, we recommend pseudobulking cells by partially overlapping cell neighborhoods sampled from gene expression space, for example, using Milo30, to model a more continuous cell state. For each pseudobulk, we compute the fraction of cells using each of the genes in a segment (for example, TRAJ1 to TRAJ61 in the TRAJ segment). The fractions from the various segments are concatenated, forming the V(D)J matrix/space. This can then be used with conventional dimension reduction techniques such as principal component analysis (PCA) or uniform manifold approximation and projection (UMAP).
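The construction described above can be sketched without any dependencies as follows. The data structures and the function name `vdj_feature_space` are hypothetical and chosen for readability; they are not Dandelion's API, and a real analysis would more naturally use a data frame or AnnData object.

```python
from collections import Counter


def vdj_feature_space(cells, pseudobulks, segments):
    """Build one row of gene-usage fractions per pseudobulk.

    cells: cell id -> dict mapping segment name to the gene used by that cell.
    pseudobulks: pseudobulk name -> list of cell ids (memberships may overlap).
    segments: segment name -> ordered list of genes in that segment.
    """
    columns = [gene for seg in segments for gene in segments[seg]]
    matrix = {}
    for name, members in pseudobulks.items():
        row = []
        for seg, genes in segments.items():
            used = Counter(cells[c][seg] for c in members if cells[c].get(seg))
            total = sum(used.values())
            # Fractions are computed per segment, then concatenated across segments.
            row.extend(used[g] / total if total else 0.0 for g in genes)
        matrix[name] = row
    return columns, matrix


cells = {
    "c1": {"TRAV": "TRAV1-2", "TRAJ": "TRAJ33"},
    "c2": {"TRAV": "TRAV1-2", "TRAJ": "TRAJ12"},
    "c3": {"TRAV": "TRAV21", "TRAJ": "TRAJ33"},
    "c4": {"TRAV": "TRAV21", "TRAJ": "TRAJ12"},
}
pseudobulks = {"nhood_a": ["c1", "c2", "c3"], "nhood_b": ["c3", "c4"]}
segments = {"TRAV": ["TRAV1-2", "TRAV21"], "TRAJ": ["TRAJ12", "TRAJ33"]}
columns, matrix = vdj_feature_space(cells, pseudobulks, segments)
```

Each row of `matrix` is a continuous vector, so the rows can be passed directly to PCA or UMAP; cell `c3` contributes to both neighborhoods, as partially overlapping neighborhoods allow.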

Fig. 3: Creating a V(D)J feature space. a, Schematic illustration of the workflow of creating a V(D)J feature space. Step 1, cells are assigned to pseudobulks, which can be based on metadata features or on partially overlapping cell neighborhoods. Step 2, V(D)J usage frequency per pseudobulk is computed for each gene and used as input of the V(D)J feature space. Step 3, the V(D)J feature space can be visualized with conventional dimension reduction techniques such as PCA or UMAP, and it can then be used for differential V(D)J usage analysis and pseudotime inference. b, Top left, gene expression UMAP of all T cells from adult human tissues in ref. 5, colored by low-level cell type annotations. Each point represents a cell. Top middle, V(D)J usage frequency per celltype_donor pseudobulk is computed for each gene and used as input of the V(D)J feature space. Top right, UMAP of the pseudobulk V(D)J feature space of the same cells. Each point represents a cell pseudobulk. Bottom panel, top two differentially expressed TCR genes in CD4+ T cells, CD8+ T cells and MAIT cells. c, Left, UMAP of neighborhood V(D)J feature space covering DP to mature T cells with paired productive αβTCR in data from ref. 3. Each point represents a cell neighborhood, colored by the dominant cell type in each neighborhood. The point size represents neighborhood size, with connecting edges representing overlapping cell numbers between any two neighborhoods. Only edges with more than 30 overlapping cells are shown. Right top, inferred pseudotime, and branch probabilities to CD8+ T and to CD4+ T, respectively, overlaid onto the same UMAP embedding on the left. Right bottom, scatterplot of branch probability to CD8+ T against pseudotime. Each point represents a cell neighborhood, colored by the dominant cell type in each neighborhood.

To demonstrate one use of this V(D)J space, we pseudobulked adult human T cells5 by cell type and donor to explore differential usage that is consistent across donors. On the V(D)J feature space-based UMAP, pseudobulks containing mucosal-associated invariant T (MAIT) cells formed a distinct cluster, in contrast to the single-cell gene expression-based UMAP (Fig. 3b and Extended Data Fig. 3a,b). This is expected due to the semi-invariant nature of MAIT TCRs and illustrates the power of the V(D)J feature space. Although there is no clear clustering in other cell types apart from MAIT (Extended Data Fig. 3b), CD4+ T cells were distinctly separated from CD8+ T cells (Fig. 3b). Differential V(D)J usage for each cell type can be computed, for example, with nonparametric statistical tests implemented within scanpy14 (Fig. 3b and Supplementary Table 5).

Leveraging V(D)J usage in pseudotime trajectory inference

We also developed a new usage for V(D)J data by performing pseudotime inference on the cell neighborhood-based V(D)J feature space. Many pseudotime inference methods have been proposed based on transcriptome similarity31. However, current approaches remain problematic in immune cell development because the differentiation process is often interspersed with waves of proliferation, and transcriptomic convergence, for example, between NKT and NK cells can be misleading. Because the usage of V(D)J genes in AIRs changes definitively as a result of cycles of recombination and selection during lymphocyte development, the AIR repertoire acts as a natural ‘time-keeper’ for developing T and B cells. A developing T cell’s fate toward CD8 versus CD4 T cells is determined by whether its TCR interacts with antigen presented on MHC class I or class II during positive selection. Therefore, it is biologically conceivable that the TCR gives more accurate predictions on the branch probability of each T cell lineage. For this task, we chose to pseudobulk by cell neighborhoods as modeling cell states with partially overlapping cell neighborhoods has advantages over clustering into discrete groups.

We sampled cell neighborhoods from developing T cells with productive αβTCR (double positive (DP) to mature single positive CD4+/CD8+ T cells)3 on a k-nearest neighbor graph built with gene expression data using Milo30 and constructed the neighborhood V(D)J feature space (Fig. 3c and Extended Data Fig. 3c). Trajectory analysis on this V(D)J feature space was performed using Palantir32, which outputs pseudotime and branch probabilities (Fig. 3c) to each terminal state (Extended Data Fig. 3d). The inferred pseudotime progresses from proliferating to quiescent DP (DP(P) and DP(Q)) T cells, to abT(entry), which splits into CD8+ T and CD4+ T lineages. TCR usage trends can be visualized along the pseudotime trajectory (Extended Data Fig. 3e). The average pseudotime and branch probabilities can then be projected back to each cell from the neighborhoods that contain it (Fig. 4a).
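The projection from neighborhoods back to cells can be sketched as below. It assumes a plain unweighted mean over all neighborhoods that contain a given cell; whether Dandelion applies any weighting is not stated here, and the function name is hypothetical.

```python
def project_to_cells(neighborhoods, nhood_values):
    """Average a neighborhood-level value over the neighborhoods containing each cell."""
    sums, counts = {}, {}
    for name, members in neighborhoods.items():
        for cell in members:
            sums[cell] = sums.get(cell, 0.0) + nhood_values[name]
            counts[cell] = counts.get(cell, 0) + 1
    return {cell: sums[cell] / counts[cell] for cell in sums}


# Cell c2 belongs to both neighborhoods and receives the mean of their values.
neighborhoods = {"n1": ["c1", "c2"], "n2": ["c2", "c3"]}
pseudotime = {"n1": 0.2, "n2": 0.6}
cell_pseudotime = project_to_cells(neighborhoods, pseudotime)
```

The same helper applies unchanged to branch probabilities, since they are also defined per neighborhood.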

Fig. 4: Comparing pseudotime inferred from V(D)J space or gene expression (GEX) space. a, Top, pseudotime and branch probability to CD8+ T inferred from neighborhood V(D)J space in Fig. 3c, projected back to the cells, overlaid onto the same UMAP embedding as in the top left panel. Left bottom, UMAP of DP to mature T cells with paired productive αβTCR in data from ref. 3. Each point represents a cell, colored by cell types. Underneath the UMAP is a schematic showing the T cell differentiation process. Right bottom, pseudotime and branch probability to CD8+ T inferred from neighborhood GEX space, projected back to the cells, overlaid onto the same UMAP embedding as in the top left panel. b, Scatterplots of the pseudotime ordering against the average relative TRAV or TRAJ location. Each point represents a cell neighborhood. Each TRAV or TRAJ gene is encoded numerically for its relative genomic order. The x axis represents the average TRAV/TRAJ relative location for each cell neighborhood. Top, results from pseudotime inferred from neighborhood V(D)J space. Bottom, results from pseudotime inferred from neighborhood GEX space. c, Stripplot of correlation coefficients of gene expression with branch probabilities to CD8+ T within abT(entry) cells, for branch probabilities inferred from neighborhood V(D)J space and neighborhood GEX space separately. Only genes that are known CD4+/CD8+ T cell markers or TFs involved in CD8+ T/CD4+ T lineage decisions are labeled and colored. The rest of the genes are grayed out. Labeled genes that had significant (BH adjusted P < 0.05) positive correlations were colored in red, the ones with significant negative correlations were colored in blue and those without significant correlations were colored in orange.

There are two alternative tools, CoNGA21 and mvTCR22, that integrate transcriptome with TCR information. Both were created to detect clonally expanded cell types with CDR3 sequences being the input. We tested whether they could also be used to reveal developmental relationships with the same dataset above. Both CoNGA and mvTCR failed to preserve the intercellular relationships (Extended Data Fig. 4a,b). This is not surprising, as selection of different V(D)J genes underpins recombination, while CDR3 diversity can additionally be influenced by random nucleotide insertions.

V(D)J trajectory accurately orders T cell development

We next compared the results of trajectory inference on feature spaces from pseudobulked neighborhood V(D)J, pseudobulked neighborhood gene expression or single-cell gene expression.

The analysis of single-cell gene expression performed unsatisfactorily as a large proportion of CD8+ T and CD4+ T cells were misclassified with higher branch probabilities to the opposite terminal state (Extended Data Fig. 5a,b). Therefore, we mainly focused our comparison with results from pseudobulked neighborhood gene expression (GEX) space, which produced more biologically meaningful pseudotime and branch probabilities (Fig. 4a). When projected back to cells, the inferred pseudotime in the pseudobulked space better reflected the known biology of DP(P)_T to DP(Q)_T, to abT(entry) and subsequent splits into CD8+ T and CD4+ T lineages (Extended Data Fig. 5c,d). This suggests that pseudotime inference with pseudobulked cells works better than directly from single cells, potentially due to less noise compared to single-cell data.

We observed two major differences when comparing the pseudotime inferred from neighborhood V(D)J feature space versus that from neighborhood GEX space (Fig. 4a). First, the DP(Q) T cells dwelled for a longer ‘time’ in the V(D)J trajectory compared to the GEX trajectory. Second, the branching point of CD8+ T and CD4+ T cell lineages happened earlier in abT(entry) cells in the V(D)J trajectory (Extended Data Fig. 6c). To assess the fidelity of the V(D)J trajectory, we used the known fact that V–J recombination in the TRA locus happens processively33 (from middle to distal ends on the genome). We encoded the genomic order numerically for each TRAV and TRAJ gene, and the V(D)J pseudotime ordering for each DP(Q) neighborhood showed a substantially better monotonic relationship with the TRAV relative locations than the GEX pseudotime ordering did (Fig. 4b). Local correlations across adjacent neighborhoods along V(D)J pseudotime had, on average, higher absolute correlation coefficients than those along GEX pseudotime (−0.67 versus −0.43 for TRAV; Extended Data Fig. 6a). A smaller improvement was also observed for TRAJ, with the average local Pearson’s correlations improved from 0.42 to 0.50 (Extended Data Fig. 6b).
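A minimal sketch of this check is given below: genes are encoded by their relative genomic order, a usage-weighted mean location is computed per neighborhood, and its monotonic relationship with pseudotime is summarized with a rank correlation. Spearman's correlation over all neighborhoods is used here as a simple stand-in and is an assumption; it is not the local-correlation procedure behind Extended Data Fig. 6. The gene names `V1` to `V3` are placeholders.

```python
def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def mean_relative_location(usage, gene_order):
    """Usage-weighted mean of each gene's relative genomic position in [0, 1]."""
    position = {g: i / (len(gene_order) - 1) for i, g in enumerate(gene_order)}
    total = sum(usage.values())
    return sum(position[g] * f for g, f in usage.items()) / total


gene_order = ["V1", "V2", "V3"]  # placeholder genes listed in genomic order
nhood_usage = [{"V1": 0.8, "V2": 0.2}, {"V2": 1.0}, {"V2": 0.3, "V3": 0.7}]
nhood_pseudotime = [0.1, 0.5, 0.9]
locations = [mean_relative_location(u, gene_order) for u in nhood_usage]
rho = spearman(nhood_pseudotime, locations)
```

The sign of the coefficient depends on the direction in which genes are numbered along the locus, so only its absolute value is informative when comparing V(D)J and GEX pseudotime.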

CD4 versus CD8 T cell lineage commitment is a classical immunological binary lineage decision that has been intensely investigated over many years34 but remains challenging to study as the selection intermediates have been difficult to observe directly35. We examined which genes in abT(entry) cells showed expression patterns that are correlated with branch probabilities to CD8+ T versus CD4+ T lineage (Fig. 4c).

The top genes that were positively correlated with the CD8+ T cell lineage choice included CD8A and CD8B, which are markers for CD8+ T cells6. The top genes that were negatively correlated included CD40LG, which is a marker for CD4+ T helper cells6, and ITM2A (induced during positive selection and causes CD8 downregulation36). Other markers of CD4+ T cells such as CD4 (ref. 6), together with highly validated transcription factors (TFs) that are known to be involved in CD8+ T or CD4+ T lineage decisions34, including RUNX3 (refs. 37,38), ZBTB7B39,40, TOX41 and GATA3 (refs. 42,43), all displayed significant correlations in the expected directions. However, when using GEX pseudotime, the correlations were notably reduced and some (for example, TOX and RUNX3) were no longer statistically significant (Fig. 4c). For TOX, the direction of the correlation was wrongly inverted (Fig. 4c). In addition, the V(D)J pseudotime also revealed new associations for TFs such as ZNF496, MBNL2 and RORC for CD8+ T, and SATB1, STAT5A and STAT1 for CD4+ T (Extended Data Fig. 6d and Supplementary Table 6).

We also used different pseudotime inference methods to ensure the robustness of the results. Neighborhood V(D)J-based pseudotime trajectories inferred using monocle3 (ref. 44) and diffusion pseudotime45 similarly showed a better monotonic relationship with TRAV/TRAJ relative locations compared to neighborhood GEX-based pseudotime (Extended Data Fig. 7a,b). Overall, Palantir is preferred as it outputs the branch probabilities, which are useful in deciphering CD4/CD8 lineage decisions.

Taken together, we showed that V(D)J-based pseudotime inference gives more accurate DP(Q) T cell alignment and improves lineage association within abT(entry) cells. We can use this approach to recapitulate known regulators and uncover new candidate regulators underlying CD8+ T/CD4+ T fate choice.

Using nonproductive recombination as a ‘fossil record’

Based on our earlier observations of high proportions of nonproductive contigs being represented in the single-cell V(D)J data (Fig. 2a), we next explored whether this was cell-type specific. As expected, nonproductive BCR contigs were restricted to B lineage cells (Extended Data Fig. 8a,b). However, nonproductive TRB contigs were surprisingly expressed in many cell-type lineages (developing DN T cells, ILC/NK, B cells; Fig. 5a and Extended Data Fig. 8c). The majority of the nonproductive TRB contigs within ILC/NK/B cells were contigs without V gene (Extended Data Fig. 8d).

Fig. 5: Insights into lymphocyte development from nonproductive TCR. a, Boxplot of the proportion of cells with productive (blue) or nonproductive (orange) TRB in different fetal lymphocyte subsets. Each point represents a sample and data were taken from ref. 3. Only samples with at least 20 cells are shown. Boxes capture the first to third quartiles and whiskers span a further 1.5× interquartile range on each side of the box. The annotations used here were based on the version whereby the exact identity of cycling B cells was predicted to be immature B, mature B, B1 or plasma B cells using Celltypist3,5. The equivalent boxplot using the original annotations is shown in Extended Data Fig. 8a. b, Top left, schematic illustration showing the proposed development of B cells (top panel), and the relationship between ILC/NK and T cell lineages. Top right, UMAP of neighborhood V(D)J feature space covering ILC, NK and developing T cells with TRBJ in data from ref. 3. Each point represents a cell neighborhood, colored by cell types. The point size represents neighborhood size, with connecting edges representing overlapping cell numbers between any two neighborhoods. Only edges with more than 30 overlapping cells are shown. Bottom, inferred pseudotime and branch probabilities to ILC/NK and T lineage, respectively, overlaid onto the same UMAP embedding on the top right. c, Top, scatterplot of branch probability to ILC/NK lineage against pseudotime. The pseudotime was inferred from neighborhood V(D)J space shown in Fig. 5b and projected back to cells. Each point represents a cell, colored by cell types. Bottom, heatmap of TF expression across pseudotime in DN T cells. Pseudotime is equally divided into 100 bins, and the average gene expression is calculated for DN T cells with pseudotime that falls within each bin. Genes selected here are TFs that had significantly high Chatterjee’s correlation73 with pseudotime (BH adjusted P < 0.05, and correlation coefficient >0.1).

The B lineage cells with nonproductive TRB contigs included pre–pro B and B1 cells but not pro- or pre-B cells (Fig. 5a and Extended Data Fig. 8c). Pre–pro B and B1 cells expressed only nonproductive TRB but not TRG/D contigs (Extended Data Fig. 9a–c), suggesting that they share a common developmental route (Fig. 5b), bypassing pro-/pre-B cell stages. This clarifies that human fetal B1 cells can emerge through an alternative route. The conventional route is thought to progress from pre–pro, pro-, pre-, immature, to mature naive B cells46. Our observations are consistent with findings in mouse B1s, which were shown to bypass the pre-BCR selection stage47,48, which normally happens in pre-B cells to remove self-reactive B cells. This may also explain why B1 cells have BCRs with shorter noncoded/palindromic (N/P) nucleotide insertions3, due to negligible expression of DNTT in pre–pro B but much higher expression in pro- and late pro-B cells3. In addition, as pre–pro B cells are almost undetectable in adult bone marrow49, it potentially explains the difficulty of identifying B1-like cells in adult human tissues3.

Similar to DN T cells, the ILC/NK lineage also expressed nonproductive TRG/D contigs with some TRA contigs (Extended Data Fig. 9a–c). We used TRBJ frequency to construct a V(D)J feature space because all T/ILC/NK cells express TRBJ (Fig. 5b and Extended Data Fig. 10a). The inferred trajectory suggests that ILC/NK cells deviate away from T cell development between DN(early) and DN(Q) stage (Fig. 5b,c).

Previous literature on the ILC/NK lineage has also demonstrated partial recombination of TRG/D in mouse lung ILC2 (ref. 50) and of TRB/G in mouse thymic ILC2 (ref. 51), leading to the hypothesis of ‘aborted’ DNs for ILC/NK development52. Our observation of the expression of nonproductive TRB/G/D in ILC/NK cells partially supports this theory. Notably, we also observed nonproductive TRB expression in ILC/NK cells in other fetal organs, with no overt differences in frequencies between organs (Extended Data Fig. 9d). This potentially suggests that T cells and ILC/NK cells might share the same initial stage of development, and then deviate away from each other before productive TRB/TRG/TRD is made.

We observed that expression levels of genes encoding TFs (Fig. 5c) and cell surface proteins (Extended Data Fig. 10b) such as SPI1, RAG1, HHEX, TCF12, CD34, CD3D, CD8A and CD8B followed an expected pattern along the TRBJ-inferred trajectory53. At the same time, we also discovered many new genes that could redefine DN stages at a higher resolution than previously reported in the literature. We note that there were some discordances in expression patterns of selected TFs between human and mouse DN development53 (Extended Data Fig. 10c). However, this discrepancy could be due to an age mismatch (fetal human versus adult mouse) and to the fact that the mouse data were mainly derived from TF-knockout studies. Future work with detailed comparisons using paired scRNA-seq and scVDJ-seq in mice and humans of different age groups will help to clarify this.
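The two computations named in the legend of Fig. 5c, averaging expression within pseudotime bins and scoring genes with Chatterjee's correlation, can be sketched as follows. Equal-width bins are assumed from the phrase 'equally divided', and ties in pseudotime are broken by input order rather than at random, which is a simplification of the published estimator; the function names are hypothetical.

```python
def binned_means(pseudotime, expression, n_bins=100):
    """Mean expression within equal-width pseudotime bins; empty bins give None.

    Assumes pseudotime spans a non-zero range.
    """
    lo, hi = min(pseudotime), max(pseudotime)
    width = (hi - lo) / n_bins
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for t, value in zip(pseudotime, expression):
        # The maximum pseudotime is placed in the last bin.
        b = min(int((t - lo) / width), n_bins - 1)
        sums[b] += value
        counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


def chatterjee_xi(x, y):
    """Chatterjee's rank correlation of y on x; undefined if all y are equal."""
    n = len(x)
    y_by_x = [y[i] for i in sorted(range(n), key=lambda i: x[i])]
    r = [sum(1 for v in y_by_x if v <= yi) for yi in y_by_x]
    l = [sum(1 for v in y_by_x if v >= yi) for yi in y_by_x]
    num = n * sum(abs(r[i + 1] - r[i]) for i in range(n - 1))
    den = 2 * sum(li * (n - li) for li in l)
    return 1 - num / den


means = binned_means([0.0, 0.1, 0.6, 1.0], [2.0, 4.0, 1.0, 3.0], n_bins=2)
# A non-monotonic (V-shaped) trend still yields a clearly positive coefficient.
xi_v_shape = chatterjee_xi(list(range(8)), [9, 4, 1, 0, 0.5, 2, 5, 10])
```

Unlike Pearson's or Spearman's correlation, this coefficient measures how well expression is a function of pseudotime, so transient peaks along the trajectory are not penalized.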

Finally, we repeated the analysis in human fetal myeloid cells3 to ask whether the pDCs that have initiated BCR rearrangements are derived from lymphoid progenitors54,55. pDCs can be derived from both myeloid and lymphoid lineages54,55 and there is IgH D-J rearrangement in some pDCs54,56,57,58,59. We found some nonproductive BCR contigs in pDCs (both heavy and light chain, as shown in Supplementary Fig. 1a,b), in agreement with the previously reported IgH D-J rearrangement in pDCs54,56,57,58,59. However, pDCs themselves express RAG and DNTT (Supplementary Fig. 1c). The presence of nonproductive BCR contigs therefore does not necessarily indicate that pDCs are derived from lymphoid progenitors, as BCR rearrangement can be carried out by RAG in the pDCs themselves57. While it may be interesting to use our V(D)J-based trajectory to explore whether the development of pDCs overlaps with early B cell development, the current dataset is limited by cell number, as only 51 pDCs and cycling pDCs have nonproductive IGH.

In summary, the unexpected finding of expression of nonproductive TCR contigs in specific cell types has the potential to shed new light on lymphocyte development. Our analysis suggests that B1 potentially arises directly from pre–pro B cells and provides support for the ‘aborted’ DN theory for ILC/NK cell origins.