Introduction

Diversity of the adaptive immune cell receptor repertoire (AIRR) is ensured by genetic recombination of Variable (V), Joining (J), and Diversity (D) gene segments1,2. During thymic T cell development, V(D)J recombination of \(\upbeta ,\upgamma , \mathrm{and}\,\updelta\) T cell receptor (TCR) chains occurs at the CD4CD8 double-negative (DN) stage, while TCR\(\mathrm{\alpha }\) chain recombination occurs later at the CD4+CD8+ double-positive (DP) stage1,3. TCR recombination is coupled to selection checkpoints in order to restrict maturation to thymocytes bearing in-frame, functionally recombined, TCRs without self-reactive capabilities4. Despite the selection processes, a fraction of self-reactive thymocytes escape to the periphery, conferring risk of autoimmune responses5,6.

Escaped self-reactive T cells may be held in check by unconventional, agonist selected T cell populations such as regulatory T cells (Tregs)7. Agonist selected populations are of high interest due to their immune-modulating functions and therapeutic potential8,9,10,11, however, knowledge regarding thymic heterogeneity and development of agonist selected T cells remains incomplete12,13,14. In similarity to developing Tregs, thymic precursors of CD8\(\mathrm{\alpha \alpha }\)+ intraepithelial lymphocytes (IELs) have been suggested to undergo agonist selection15,16.

Previously, we have studied gene expression (GEX) in cell populations in young paediatric thymi, in order to explore the influence of the local cellular milieu for thymic agonist selection17. Since the TCR could also play a role in determining divergence towards an agonist-selected T cell fate, we here performed single cell AIRR sequencing (scAIRR-seq) to assess the TCR repertories of the same cells. The current results indicated changes in the TCR repertoire during developmental progression of pre-selection DP thymocytes. In addition, we found biases in the TCR repertoire for a thymic PDCD1+ putative CD8\(\mathrm{\alpha \alpha }\) T cell precursor population that have previously been described for ZNF683+ CD8\(\mathrm{\alpha \alpha }\) T cells. Our results add insights regarding how the TCR influence decision checkpoints during thymocyte development, in particular the decision between agonist versus conventional selection.

Methods

Study population

Thymic tissues and EDTA blood samples were collected from young paediatric donors (3 male, 2 female, age span 7 days–13.5 months) undergoing corrective heart surgery at the Department of Cardiothoracic surgery, Oslo University Hospital (Supplementary table 1). The project was approved by the Regional Ethics committee of South East Norway (REC 31516) and conducted in compliance with the Declaration of Helsinki. Written informed consent was obtained from donor parents.

Sample processing

Sample collection and processing has been described previously17. Briefly, thymic tissue was dissociated by mechanical and enzymatic (Liberase TM) treatment. For each donor (N = 5), four samples were profiled: (1) Peripheral blood mononuclear cells (PBMC), (2) thymic cells without enrichment, (3) thymic cells enriched for APCs by density gradient centrifugation, and (4) CD45-depleted thymic cells.

Single cell AIRR sequencing

Single cell sequencing was performed according to the 10× Genomics Chromium Single Cell 5′ V(D)J Reagent Kit User guide with Feature Barcode Technology for Cell Surface Protein, v1 Chemistry (10× Genomics protocol CG000186, RevD). Amplification of full length cDNA prepared from mRNA was run for 13 cycles for PBMC and unenriched samples, and 15 cycles for APC-enriched and CD45-depleted samples, before single cell TCR V(D)J enriched libraries were constructed. In brief, 2 \(\upmu\)l cDNA was used to enrich full-length V(D)J segments via PCR amplification with primers specific to the \(\mathrm{\alpha \beta }\)TCR, before enzymatic fragmentation and size selection by beads. Indexing PCR was run for 9 cycles for all samples as per manufacturer’s recommendation, and sequencing was performed on an Illumina NovaSeq S2 flow cell (read length 27 for R1, 92 for R2).

Pre-processing and filtering of single cell AIRR data

Single cell AIRR sequencing data was pre-processed using CellRanger v.3.1.0 with alignment to the refdata-cellranger-vdj-GRCh38-alts-emsebl-5.0.0 reference18,19. Most libraries reached a sequencing depth of > 5000 read pairs/cell, with two exceptions for the CD45-depleted TCR library for donor 3 (3300 read pairs/cell) and the APC-enriched TCR library for donor 2 (4353 read pairs/cell). Clonotypes were defined as by CellRanger v.3.1.0, according to exact match of the complementarity-determining region 3 (CDR3) nucleotide sequence. Alternatively, for analyses focusing on the CDR3 amino acid sequence, we defined clonotypes by exact match in CDR3 amino acid (CDR3aa) sequence.

Further analysis (except CoNGA) was performed in R v.4.1.320 using Immunarch v.0.8.021. The filtered_contig_annotation.csv files from CellRanger were loaded as paired chain data by use of repLoad() with .mode = “paired”, and barcodes annotated as thymocytes/T cells in the GEX dataset were retained. Next, CellRanger-derived clonotypes were filtered by chain combinations, permitting a) a single TRB chain or a pair of one TRA and one TRB for thymocytes, and b) a pair of one TRA and one TRB for peripheral blood T cells.

Downstream analyses in Immunarch were performed using single receptor chains, loaded by repLoad() with .mode = “single”. The resulting dataset was filtered by keeping barcodes that existed in filtered, paired chain data, and grouped by GEX data cell type annotations.

For calculation of numbers and abundances of unique CDR3aa sequences in the filtered single chain data, identical CDR3aa sequences were combined and their counts summed up. Unique CDR3aa numbers were calculated using repExplore() with .method = “volume” and .col = “aa”. The abundance of unique CDR3aa sequences was calculated using repExplore() with .method = “count” and .col = “aa”.

CDR3 lengths, clonal overlap, and gene usage in Immunarch

CDR3aa lengths were calculated for thymic TCRs by use of single chain data, after combining data from all samples for each cell type. For thymocytes, cell types were further grouped as early stage (DN(P), DN(Q), DP(P), DP(Q)), late stage (\(\alpha \beta\)T(entry), CD4+ SP, CD8+ SP) or agonist (CD8\(\mathrm{\alpha \alpha }\)(I), CD8\(\mathrm{\alpha \alpha }\)(II), T(agonist), Treg(diff), Treg) thymocytes. Identical CDR3aa sequences were combined and summed up before running repExplore() with .method = “len” and .col = “aa”. A two-tailed Welsh t-test was performed between early stage thymocytes and either late stage or agonist thymocytes, with a null hypothesis of no difference in group CDR3aa length means (H0: \({\mu }_{1}= {\mu }_{2})\), an alternative hypothesis of a difference in group CDR3aa length means (Ha: \({\mu }_{1}\ne {\mu }_{2})\), and an alpha level of 0.05, followed by Bonferroni correction for two tests.

Clonal overlap was evaluated for singe chain data, after combining data from all samples by either cell type or donor, and grouping and summing up identical CDR3aa sequences. Overlap was determined by the Morisita-Horn overlap index, which accounts for both numbers and abundances of overlapping clones22,23. For this, repOverlap() was used with .method = “morisita” and .col = “aa”.

Gene usage was determined based on single chain data combined by cell type, according to CellRanger-derived clonotype definitions. One clonotype lacking TRAJ gene information and one clonotype lacking TRBJ gene information were excluded. Weighted gene usage was calculated as proportions weighted by the abundance of each clonotype, using the Immunarch geneUsage() function with .quant = “count”, .norm = T, and .ambig = “maj”.

Overlap of the TCR dataset to McPAS-TCR

For identification of pathology-specific TCR sequences, we assessed our data for overlap to the McPAS-TCR database (accessed 02.02.2023), which contains pathology-associated TRA and TRB CDR3aa sequences with known antigen specificities24. Human sequences under the “Autoimmune” and “Pathogens” categories of McPAS-TCR were extracted, encompassing 26258 TRB sequences and 6999 TRA sequences. We then tested for overlap to the database among unique, single chain CDR3aa TRA or TRB sequences for early stage thymocytes, late stage thymocytes, agonist thymocytes, peripheral CD4 T cells (CD4 cytotoxic T cell (CTL), CD4 naive, CD4 Proliferating, CD4 T central memory (TCM), CD4 T effector memory (TEM)), and peripheral CD8 T cells (CD8 naive, CD8 TCM, CD8 TEM). Enrichment of pathology-specific sequences among cell groups relative to the database was assessed by a one-tailed Fisher’s exact t-test (H0: true odds ratio = 1, HA: true odds ratio > 1) with Bonferroni correction for the number of tested diseases and the number of tested cell type groups, based on the approach by Amoriello et al.25.

Joint TCR and transcriptomic analysis using CoNGA

We performed clonotype neighbour graph analysis (CoNGA)26, a joint analysis pipeline for TCR and GEX data in order to identify correlations between the data types. The CoNGA pipeline includes creation of separate neighbourhood graphs for clonotypes in the TCR and GEX datasets, followed by identification of clonotypes that reside within overlapping neighbourhoods in both datasets. For CoNGA analysis (commit ID b572f73e5c90aaae59e11f187553a50324f26f02) of either thymocytes or peripheral T cells, filtered, paired chain TCR data was used, and analysed together with previously reported GEX data for either thymocytes or T cells.

After log normalization and then regression in order to account for the effect of number of counts and percentage of mitochondrial genes in the GEX data, data was reduced to one representative cell per CellRanger-derived clonotype. Neighbourhood graphs were created for the GEX and TCR datasets by use of BBKNN v. 1.5.127, before UMAP dimensionality reduction28 and clustering by the Leiden algorithm29 at resolution 1.0. Differential gene expression analysis was performed by the Wilcoxon Rank Sum test by the rank_genes_group() function from Scanpy v. 1.9.1, as implemented by the CoNGA find_gex_cluster_degs() function.

Next, the CoNGA graph-versus-graph analysis was used, which calculates a “CoNGA score” for identification of clonotypes residing within overlapping neighbourhoods in GEX and TCR datasets, and combines overlapping clonotypes into “CoNGA clusters”. This was done by the CoNGA run_graph_vs_graph() function, with a minimum size for identified “CoNGA clusters” of 22 clonotypes for the thymocyte dataset, and 8 for the PBMC-derived T cell dataset, constituting at least 0.1% of clonotypes in the dataset.

Graph-versus-feature analysis was used for identification of TCR features exhibiting variation across the GEX dataset, by running the CoNGA run_graph_vs_feature() function. Reported adjusted P-values are from assessment of variation in TCR features across a GEX cluster graph, where clonotypes residing in the same GEX cluster are connected. As implemented in CoNGA in order to increase computational speed, features were first subjected to a preliminary t-test, and p-values were adjusted by multiplication for the number of performed tests. Features with multiplied p-values \(\le\) 10 were then subjected to a two-tailed Mann–Whitney U test with Bonferroni correction, at an alpha level of 0.05.

Visualisations

Visualisations were prepared as implemented in Immunarch or CoNGA, in addition to ggplot2 v.3.3.530.

Results

Sample preparation and data filtering

To study TCR repertoires of human thymus and blood, scAIRR-seq was performed. Briefly, one PBMC sample and three thymic samples were prepared for each young paediatric donor (N = 5), with thymic samples consisting of one unenriched sample, one sample enriched for APCs, and one CD45-depleted sample (Fig. 1A). Single cell TCR sequencing was performed for all 20 samples.

Figure 1
figure 1

Single cell gene expression (GEX) and adaptive immune receptor repertoire (AIRR) sequencing of young paediatric (N = 5) thymi and blood. (A) Experimental set-up. For each donor, single cell AIRR sequencing was performed for one PBMC sample and three thymic samples. Created using content from Servier Medical Art, provided by Servier, licensed under a Creative Commons Attribution 3.0 unported license, and graphics adapted from 10 × Genomics. (B) Clonotype Neighbour Graph Analysis (CoNGA) for identification of correlations between thymocyte GEX and T cell receptor (TCR) datasets. (I) Clonotypes that are highly similar in either the GEX (top row) or TCR (bottom row) datasets, based on neighbourhood graphs, are clustered together. (II) Groups of clonotypes with similar GEX and TCR profiles are identified as having low “CoNGA scores”. The “CoNGA scores” reflect the likelihood of seeing equal or greater overlap between the GEX and TCR dataset by chance, when overlap is assessed as shared neighbours between the two datasets for each clonotype. (III) Clonotypes with low “CoNGA scores” (“CoNGA hits”) are grouped into “CoNGA clusters”, bi-coloured discs indicate their GEX (left half) and TCR (right half) cluster assignment.

AIRR data was filtered to retain immune receptor chains from thymocytes/T cells based on previous annotations17, and presenting in expected chain combinations (TRB or TRA + TRB for thymocytes, TRA + TRB for peripheral T cells). This resulted in 46086 unique cell barcodes in the TCR data. The number of unique CDR3aa sequences for each chain was proportional to cell numbers and cell composition of samples, with most abundant cell types having higher numbers of unique CDR3aa sequences (Supplementary Fig. 1A). For each chain, the clonal abundance was low, with most unique CDR3aa sequences being associated with one single cell per sample (Supplementary Fig. 1B).

Joint TCR and transcriptomic analysis confirms previously identified thymocyte populations

In order to determine whether changes in TCR repertoires during the course of thymocyte developmental progression could be observed, we used two approaches for data analysis: One based on Immunarch using our previously reported annotations, and one based on the CoNGA pipeline26. The CoNGA pipeline performs a joint analysis of both GEX and AIRR data for identification of correlations between both data types to enable new discoveries not revealed by the GEX data alone.

In the CoNGA-generated GEX UMAP (Figs. 1B, 2A, B Supplementary Fig. 2A), populations previously reported by us and others were identified14, including proliferating (P) DP thymocytes (GEX cluster 10), quiescent (Q) DP thymocytes (GEX clusters 3 and 2), CD8+ SP thymocytes (GEX clusters 1 and 11), and CD4+ SP thymocytes (GEX cluster 0 among others). GEX cluster 11 contained two distinguishable subsets, where one expressed genes associated with an innate-like phenotype, in resemblance to the population termed NKT cells in our previous report. We also observed clusters resembling ZNF683+ CD8\(\mathrm{\alpha \alpha }\) cells (GEX cluster 12) and FOXP3+ Tregs (GEX cluster 14), and a cluster (GEX cluster 9) expressing PDCD1 in similarity to the thymic Type A CD8\(\alpha \alpha\) precursor15,31.

Figure 2
figure 2

Gene expression and T cell receptor (TCR) features across single cell gene expression (GEX) defined clusters. (A) Selected genes and TCR features across the thymic GEX UMAP. Colour indicates expression of selected genes (top row) and Z-score normalised and GEX-neighbourhood averaged expression of the same genes (bottom row). Blue colour indicates negative values, red colour indicates positive values. (B) Differentially expressed genes across thymic GEX clusters. Dot colour indicates expression level, and dot size indicates the fraction of clonotypes expressing the gene. (C) Selected TCR features exhibiting variation across GEX clusters in a graph-versus-feature analysis. Numerical features indicating properties of the TCR repertoire are mapped onto a GEX cluster graph where clonotypes residing in the same GEX cluster are connected. Variation in the TCR features across the GEX cluster graph is assessed first by an initial t-test, then by a one-sided Mann–Whitney U test for the features with low t-test P values. Colours indicate direction of change (left) or adjusted Mann–Whitney U test P values (right).

A shift from proximal to distal TRA gene usage occurs at the DP(Q) thymocyte stage

We next investigated potential differences in TCR gene usage between thymocyte populations. In the CoNGA pipeline, clonotypes that reside within overlapping neighbourhoods in the GEX and TCR datasets are combined into “CoNGA clusters” (GEX:TCR) (Fig. 1B, Supplementary Fig. 3). One “CoNGA cluster” (10:14) contained clonotypes from the DP(P) GEX cluster 10 and exhibited high usage of the proximal TRAV41 gene. DP(Q) thymocytes in GEX clusters 2 and 3 participated in several distinct “CoNGA clusters”, including one (3:14) biased towards TRAV41, and one (2:9) biased towards TRAV1-2 and TRAV1-1.

To further identify variation in TCR features across the thymocyte GEX dataset, we used the CoNGA graph-versus-feature analysis (Fig. 2C, Supplementary Fig. 4, Supplementary table 5). Among the assessed TCR features, CoNGA implements numerical scores representing properties of the TCR, including scores previously developed by others32,33, and novel scores by the CoNGA developers26. For instance, the “alphadist” score represents the ordinal distance between the TRAV and TRAJ gene segments when the TCRA locus is ordered by genomic position26. The “alphadist” score was significantly decreased for GEX clusters 10 (DP(P) thymocytes, Padj = 4.3 × 10–96) and 3 (DP(Q) thymocytes, Padj = 7.0 × 10–80), and significantly increased for GEX cluster 2 (DP(Q) thymocytes, Padj = 9.0 × 10–97), fitting with the gene usage patterns noted above.

In agreement, analysis by Immunarch (Fig. 3A, Supplementary Fig. 5) revealed a high frequency of TRAV41 for DP(P), in addition to a high frequency of TRAJ57 and TRAJ58. We finally noted diversity in gene usage among the innate-like population labelled as Natural Killer T (NKT) cells, in agreement with reports of a CD8+ innate-like polyclonal thymic cell population from mice34, and NK-like CD8+ T cells in humans35. In sum, DP(P) thymocytes exhibited a bias toward proximal TRA genes, while shift from proximal to distal TRA gene usage occurred at the DP(Q) thymocyte stage.

Figure 3
figure 3

Gene usage proportions and distribution of CDR3aa lengths among thymocytes. (A) TRAJ and TRAV gene usage across thymocytes. Colour indicates proportion among CellRanger-derived clonotypes weighted by clonal counts. (B) Length of unique TRA and TRB CDR3aa sequences among agonist (CD8 \(\mathrm{\alpha \alpha }\)(I), CD8\(\mathrm{\alpha \alpha }\)(II), T(agonist), Treg(diff), Treg), early (DN(P), DN(Q), DP(P), DP(Q)), and late (\(\mathrm{\alpha \beta }\)T(entry), CD4+ SP, CD8+ SP) thymocytes, dotted vertical lines indicate group means.

Early thymocytes have longer CDR3aa sequences compared to late or agonist thymocytes

We next investigated CDR3aa length across thymocytes. In the CoNGA graph-versus-feature analysis (Fig. 2C, Supplementary Fig. 4, Supplementary table 5), DP thymocytes (GEX clusters 10, 2, and 3) exhibited significantly increased CDR3aa lengths (Padj = 0.0011 for GEX cluster 10, 7.7 × 10–61 for GEX cluster 2, 2.2 × 10–48 for GEX cluster 3) compared to remaining clusters. By contrast, CDR3aa length was significantly decreased for the more mature CD8+ (GEX cluster 1, Padj = 3.0 × 10–4 and GEX cluster 11, Padj = 8.1 × 10–6) and CD4+ SP (GEX cluster 0, Padj = 3.7 × 10–7) thymocyte stages.

To determine whether the same conclusion was supported by Immunarch, we grouped unique TRA or TRB CDR3aa sequences of early stage, late stage, and agonist selected thymocytes (Supplementary table 2). We tested for differences in CDR3aa length between early stage thymocytes and either late stage or agonist thymocytes by Welsh’s t-test, under an assumption of a normal distribution (Fig. 3B, Supplementary table 3). Early stage thymocytes exhibited significantly longer CDR3aa lengths compared to late stage thymocytes (Padj = 1.7 × 10–17 for TRA, 7.1 × 10–220 for TRB) or agonist selected thymocytes (Padj = 5.2 × 10–16 for TRA, 2.9 × 10–138 for TRB). In sum, both pipelines highlight a decrease in CDR3 length after the DP(Q) stage.

PDCD1 + thymocytes exhibit previously reported ZNF683 +-associated TCR biases

We further examined the potential biases in the TCR repertoire among thymocytes developing towards CD8\(\mathrm{\alpha \alpha }\) T cells, as CoNGA identified specific biases for both thymic and PBMC ZNF683+ CD8\(\mathrm{\alpha \alpha }\) T cells in a previous report26. While a “CoNGA cluster” (12:2) was formed by clonotypes from the ZNF683+ GEX cluster 12 and TCR cluster 2, the presence of additional 1:2 and 11:2 “CoNGA clusters” could indicate similar TCR features among other CD8+ T cells (Supplementary Fig. 3). However, the graph-versus-feature analysis indicated a significantly increased score for “strength” in GEX cluster 12 (Padj = 2.5 × 10–5), a measure reflecting a CDR3aa composition mediating strong TCR interactions based on an estimated interaction potential26,32,33,36 (Fig. 2C, Supplementary Fig. 4, Supplementary table 5).

GEX cluster 12 (ZNF683+ CD8\(\mathrm{\alpha \alpha }\)) also exibited several features previously observed among thymic and peripheral ZNF683+ T cells26, including enrichement of large and positively charged amino acid residues (“volume” score, reflecting amino acid size, and “charge” score, reflecting amino acid charge26,32,33, Supplementary Fig. 4, Supplementary table 5). Still, no increase was observed in the score termed “independent of peptide:MHC” (“imhc”), which summarises the reported ZNF683+ T cell-associated TCR features as a weighted linear combination of several TCR sequence features, including “volume” and “charge” scores among others26.

By contrast, “imhc” was significantly upregulated for the PDCD1-expressing GEX cluster 9 (CD8\(\mathrm{\alpha \alpha }\) precursor, Padj = 3.3 × 10–15), and for the DP(Q) thymocytes in GEX clusters 3 (Padj = 3.2 × 10–35) and 2 (Padj = 1.18 × 10–51). Like GEX cluster 12, GEX cluster 9 was also significantly enriched for CDR3aa sequences mediating high affinity TCR interactions (”strength” score, Padj = 1.3 × 10–19) (Fig. 2C, Supplementary Fig. 4, Supplementary table 5).

While Treg progenitors mirroring the previously reported T(agonist) and Treg(diff) populations were not identified in the CoNGA analysis, potentially as an effect of the employed cluster resolution, mature Tregs could be observed in GEX cluster 14 (Figs. 1B, 2B). This cluster exhibited reduced “imhc” and “charge” scores (Fig. 2C, Supplmentary table 5), thus constrasting with the observations from the CD8\(\mathrm{\alpha \alpha }\) T cell lineage.

Overall, thymic PDCD1+ and ZNF683+ cells exhibited TCR features reported to be associated with ZNF683+ T cells. For the PDCD1+ cluster, this resulted in a significant upregulation of the summarising “imhc” score, suggesting a TCR repertoire bearing similarities to previously characterized ZNF683+ T cells.

Peripheral ZNF683 + cells resemble their thymic counterparts with respect to TCR biases

We next applied CoNGA to the peripheral T cell TCR and GEX dataset (Fig. 4A–C, Supplementary Fig. 2B, Supplementary Fig. 6, Supplementary table 6). We observed clonotypes residing within overlapping neighbourhoods in the GEX and TCR datasets, indicated by low “CoNGA scores”, among CD8+ T cells (GEX clusters 4 and 7) (Fig. 4A). GEX cluster 7 expressed CCL5 and consisted of GNLY+ and ZNF683+ subsets, with low “CoNGA scores” largely mapping to the GNLY+ subset (Fig. 4A, B). The clonotypes of GEX cluster 7 participated in two “CoNGA clusters” (7:5 and 7:14), which were marked by TRAV1-2 and TRAJ33 usage37,38 and a corresponding high score defining TCR sequences matching that of a consensus sequence for the receptor of mucosal-associated invariant T cells (“mait” score)26 (Fig. 4C, Supplementary table 6). Plotting on the GEX UMAP indicated that the mait cell-associated TCR sequences mainly related to the GNLY+-expressing subset (Supplementary Fig. 6, Supplementary table 6). Clonotypes of GEX cluster 7 exhibited enrichment for CDR3aa mediating strong interactions, indicated as a significant increase in the “strength” score (Padj = 0.024), but not in the “imhc” score indicating ZNF683+-associated features. Further, we note that the increase in “strength” was also observed for GEX cluster 4 (Padj = 4.7 × 10–7).

Figure 4
figure 4

Clonotype Neighbour Graph Analysis (CoNGA) for identification of correlations between peripheral T cell gene expression (GEX) and T cell receptor (TCR) datasets (N = 5). (A) (I) UMAP projections coloured by GEX (top row) and TCR (bottom row) clusters. (II) UMAP projections coloured by “CoNGA scores”, indicating clonotypes residing in overlapping neighbourhoods in both the GEX and TCR datasets. (III) Bi-coloured discs indicating the GEX (left half) and TCR (right half) cluster assignment of clonotypes with low “CoNGA scores” (“CoNGA hits”). (B) Raw and Z-score normalised and GEX-neighbourhood averaged (nbr) expression of selected genes and TCR features across the thymic GEX UMAP. Blue colour indicates negative values, red colour indicates positive values. (C) Gene expression and TCR features among clonotypes grouped into “CoNGA clusters”. Shown are (I) bi-coloured discs indicating GEX (left half) and TCR (right half) cluster assignment, (II) donor origin of clonotypes, (III) differentially expressed genes, (IV) frequently used TCR gene segments and amino acid sequences, (V) properties of the TCR repertoire (represented as numerical TCR feature scores, red colour indicates increased scores, blue colour indicates decrease scores), and (VI) expression of selected genes.

In sum, biases in the peripheral TCR repertoire were largely attributed to peripheral mait cells, with the ZNF683+ subset exhibiting no increase in the previously reported “imhc” score, in resemblance to the thymic ZNF683+ cells.

TRA chains exhibit a higher degree of clonal overlap compared to TRB chains

We next determined clonal overlap of TRA and TRB chains across donors and cell populations in thymus and blood, using the Morisita-Horn index as implemented in Immunarch (Fig. 5A–D). While the extent of overlap was limited, we observed a trend towards higher overlap for TRA compared to TRB, in agreement with previous reports39.

Figure 5
figure 5

Clonal overlap in TCR data. Overlap (Morisita-Horn index) of unique TRA (top row) and TRB (bottom row) CDR3aa sequences among thymocytes and peripheral T cells across either cell types (left panels) or donors (right panels).

Late stage thymocytes were enriched for pathology-specific CDR3aa TRB sequences

To identify pathology-associated CDR3aa sequences, we investigated overlap between our TCR data and CDR3aa sequences from the “autoimmune” or “pathogens” categories of the McPAS-TCR database24. TRA and TRB CDR3aa sequences were assessed separately for early stage thymocytes, agonist selected thymocytes, late stage thymocytes, peripheral CD4 T cells (CD4 CTL, CD4 naive, CD4 Proliferating, CD4 TCM, CD4 TEM), and peripheral CD8 T cells (CD8 naive, CD8 TCM, CD8 TEM) (Supplementary table 4)25. Enrichment of pathology-associated sequences was particularly pronounced for TRB sequences of late stage thymocytes, with significant (Bonferroni corrected Fisher’s exact t-test) enrichment of Influenza—(Padj = 0.013), Celiac disease—(Padj = 1.7 × 10–4), and Inflammatory bowel disease (IBD)—(Padj = 5.0 × 10–5) associated sequences, implying the presence of both pathogen—and autoimmune-associated CDR3aa sequences among post-selection thymocytes.

Discussion

In this work, we have performed scAIRR-seq on human thymus and blood from young paediatric donors, at an age prior to extensive thymic involution. We have previously reported GEX profiling of the same single cells, with cell type annotations17. Here we explored potential differences in the immune cell receptor repertoires across the identified cell populations. The use of single cell technology permitted coupling of immune receptor sequences to the transcriptional profile of each cell, facilitating assessment of TCR profiles across distinct thymic cell populations. Two distinct analytical approaches were used, where one was based on the detailed cell type annotations of the GEX dataset and the Immunarch workflow. The second approach was based on the CoNGA pipeline, which introduces a new integrative clustering using both gene expression and TCR data.

Both approaches identified dynamic changes in the TCR repertoire during thymocyte development. First, both approaches indicated expression of TRA as early as the DP(P) thymocyte stage, with a favouring of proximal TRA genes, before a shift towards distal TRA genes at the DP(Q) stage. This contrast with the established model of TRA recombination at the DP(Q) stage, and has previously been observed by others in human thymic single cell TCR sequencing data14. Previous work has revealed that premature TRA expression may occur in transgenic cell lines, resulting in expansion of CD8\(\mathrm{\alpha \alpha }\) IELs40. Early TRA expression has also been reported in non-transgenic murine thymocytes41. However, these studies reported a preference for the distal TRAV1 gene, potentially induced from the E\(\updelta\) rather than the E\(\mathrm{\alpha }\) enhancer, while we observed a preference for the proximal TRA gene. Again, our observations aligned with previous single cell human TCR data14. A preference for proximal TRA genes for more immature thymocyte populations, followed by a shift towards distal genes as development progresses, would also be in agreement with the suggested processive mechanism of repeated TRA recombination events42.

Both our analytical approaches further implied a shortening of the CDR3 region as thymocytes progress from DP to more mature stages, which potentially could be explained by biases inferred by selection events at the \(\mathrm{\alpha \beta }\)T(entry) stage43. As such, both pipelines resulted in similar observations consistent with previous literature, highlighting robustness across approaches.

For the more mature thymocytes and T cells, clonotypes that resided in overlapping neighbourhoods in the GEX and TCR datasets were largely attributed to CD8+ rather than CD4+ populations. This is in agreement with the previous CoNGA analyses of both PBMC and thymic samples26, potentially explained by a more restricted repertoire for CD8+ T cells compared to CD4+ T cells44. Alternatively, this could indicate an increased suitability for the CoNGA approach in identification of CD8+ compared to CD4+-derived TCR biases. Nevertheless, CoNGA appeared well suited for elucidation of the choice between conventional CD8+ SP selection or CD8\(\mathrm{\alpha \alpha }\) lineage differentiation.

Previously, CoNGA has identified specific TCR features associated with thymic and peripheral ZNF683+ CD8\(\mathrm{\alpha \alpha }\) T cells14,26, including CDR3aa sequences enriched for large, positively charged amino acids. However, we did not observe a significant increase in the “imhc” score, which summarises the ZNF683+-associated features, for the ZNF683+-expressing clusters in our dataset. We did, however, notice variation in several of the included TCR features among ZNF683+ cells, and we cannot exclude the possibility that the lack of a significant increase in “imhc” resulted from the limited data available.

Intriguingly, biases in TCR features were striking for a thymic PDCD1+ expressing cluster, including enrichment for large, positively charged amino acids associated with strong peptide-MHC:TCR interactions, and in this case resulting in a significantly upregulated “imhc” score. In our previous report, we observed a branch point between conventional and agonist selected thymocytes at a developmental time point prior to the SP stage, with the agonist selected lineages upregulating markers of a strong TCR response and signalling to APC subsets. The observation of ZNF683+-associated TCR biases in the PDCD1+ cluster, together with reports of a PDCD1+ precursor population for CD8\(\mathrm{\alpha \alpha }\) T cells31, further supports this branch point. However, this must be corroborated by functional studies, and additionally, the role played by the developmental timing of TRA recombination for the choice between agonist or conventional positive selection needs further elucidation. We further note that the “imhc” score was upregulated also in DP thymocyte clusters, potentially implying that the TCR features reported to be associated with ZNF683+ T cells are also broadly present among less mature thymocyte stages or thymocytes that have not yet undergone conventional selection.

Plasticity between conventional and agonist selected lineages is exemplified by the ability of conventional, mature CD4+ T cells to differentiate to induced Tregs in the periphery. Induced Tregs appear to have a reduced lineage stability compared to thymically derived, natural Tregs, an aspect that must be considered in the development of Treg-derived therapies45,46,47. Mirroring this plasticity, peripheral CD4+ T cells may also differentiate to become CD4+CD8\(\mathrm{\alpha \alpha }\) IELs48. However, a model of TCR-specific biases for the PDCD1+ thymic population, suggested as a thymic CD8\(\mathrm{\alpha \alpha }\) IEL precursor diverging prior to CD4+ lineage commitment, could imply cell-intrinsic differences between thymic-derived and peripherally induced CD8\(\mathrm{\alpha \alpha }\) IELs, with particular implications in immune-mediated diseases in the gut such as IBD9. As such, further insights into differences between IELs of different developmental origin would increase the understanding of the role of the IEL compartment in immune regulation and disease.

Altogether, our study supported that PDCD1+ thymocytes exhibit a TCR repertoire bearing similarity to previously reported ZNF683+ populations, fitting with a branch point between conventional and agonist-selected thymocyte populations and potentially implying a developmentally timed role for the TCR in human CD8\(\mathrm{\alpha \alpha }\) thymic T cell differentiation.