sciCSR infers B cell state transition and predicts class-switch recombination dynamics using single-cell transcriptomic data

Class-switch recombination (CSR) is an integral part of B cell maturation. Here we present sciCSR (pronounced ‘scissor’, single-cell inference of class-switch recombination), a computational pipeline that analyzes CSR events and dynamics of B cells from single-cell RNA sequencing (scRNA-seq) experiments. Validated on both simulated and real data, sciCSR re-analyzes scRNA-seq alignments to differentiate productive heavy-chain immunoglobulin transcripts from germline ‘sterile’ transcripts. From a snapshot of B cell scRNA-seq data, a Markov state model is built to infer the dynamics and direction of CSR. Applying sciCSR on severe acute respiratory syndrome coronavirus 2 vaccination time-course scRNA-seq data, we observe that sciCSR predicts, using data from an earlier time point in the collected time-course, the isotype distribution of B cell receptor repertoires of subsequent time points with high accuracy (cosine similarity ~0.9). Using processes specific to B cells, sciCSR identifies transitions that are often missed by conventional RNA velocity analyses and can reveal insights into the dynamics of B cell CSR during immune response.

signals beyond this dominant isotype would make predictions of future switches more noisy.To do so we resample cells from the Kim et al. dataset (Fig. ia), gradually saturating the system with the current isotype (e.g.IgG1) by randomly relabelling an increasing number of cells which has switched beyond the current isotype (e.g.IgA1 relabelled as IgG1, Fig. ib).We then fit transition models on the resampled datasets using sciCSR and compare the predicted isotype distribution against the isotype distribution measured at the subsequent timepoint.Expectedly, the more saturated the system with the current state, the less accurate the predicted isotype distribution.Specifically, predictions become noticeably problematic in cases where >95% of the cells are of one isotype (Fig. ic).Therefore, users should exercise caution when applying sciCSR to obtain predictive models from data where diversity of isotype in the population is thus limited.Supplementary Note 2. Robustness of sciCSR in studying rare populations.
Here we investigate the minimum number of cells needed for sciCSR to accurately recover transitions underlying the data.Recall from the manuscript & Methods that sciCSR provides CSR/SHM potential scores for CellRank 1 to fit Markov State Models (MSM).Intuitively, the number of available observations would affect the ability of MSM to correctly identify states and infer transitions between them.One way to answer this question by generating a synthetic dataset of size comparable to test cases presented where sciCSR successfully recovers transitions, and apply sciCSR on this synthetic dataset.By downsampling the number of cells in each state and re-apply sciCSR on this data, we can then compare the inference results to the ground-truth.

Description of synthetic dataset
We consider a simple mixture of three states: IgM, IgG1 and IgA1, and the class-switching dynamics between these states.This can be taken as the transitions between three B cell subsets: Naïve, Classical Memory, DN1, as we found previously that these B cell subsets are enriched respectively in IgM/IgG1/IgA1 expression • Cells in the IgM state were randomly assigned CSR potential of a Naïve cell from Stewart et al.; • Cells in the IgG1 state were randomly assigned CSR potential of a C-mem cell, and; • Cells in the IgA1 state were randomly assigned CSR potential of a DN1 cell.
As mentioned above, these cell subsets were characterised by enrichment of these isotypes in scRNA-seq data 2 .sciCSR was invoked with the CSR potential and the kNN graph as inputs with default parameters.The resulting fluxes were compared against the ground-truth.In this simulation, with an equal number of cells in each state we would expect that the majority (~ 50%) of the total flux to be attributed to transitions from IgM to IgA1 directly, and the remaining flux separated equally between IgM à IgG1 and IgG1 à IgA1.• For grouping cells to predict CSR, the natural choice is to use productive transcript expression to group the cells.Since the definition of productive transcripts would require at least 2 reads in scRNAseq data (1 mapping to the VDJ gene segment, 1 mapping to the coding exon for the C region) and therefore impose additional sparsity (see Supplementary Figure S3), we use the scBCR-seq data to annotate the productive isotype of each B cell, wherever such data is available.This takes advantage of the targeted enrichment of productive transcripts in the generation of scBCR-seq libraries.
• For instructing transitions, we have shown in the main text the CSR potential (Figure 4), derived using both productive and sterile transcript expression patterns from the scRNA-seq data, is able to order B cells by their maturation stages.This score can be used effectively as a pseudotemporal ordering to order the B cells and construct a transition matrix that describes the transition likelihood between single B cells.

B. Markov models and Transition Path Theory
CSR formulated as Markov models: Following from (A), the problem of predicting CSR is operationally defined as modelling the transitions between groups of B cells of each given productive isotype.We approach this by constructing a Markov chain between these isotypes, using the transition matrix constructed using CSR potential.(Alternatively, we can also group cells differently, and/or use somatic hypermutation [SHM] instead to derive the transition matrix.)These Markov chains are just like those typical of any other applications of Markovian methods: we can, for example, use the sciCSR Markov chains to simulate new sequences of states (switching events) which are consistent with the seen data generated under the same experimental condition (we used this property to compare the transition information embedded in CSR/SHM/RNA velocity, by comparing the frequency different cell states are visited in these simulated sequences, see main text Figure 6d-f).

Differences versus conventional Markov models:
Classically, in applying Markov chains to model biological sequences and phylogenies, we are interested in a maximum likelihood solution to describe the observed sequences and/or tree topologies.Whilst we can formulate the problem of modelling CSR as nominating a maximum-likelihood solution, i.e. the most likely series of switching events that is consistent with the given data, here we are interested in the dynamics of these switching events, not only the maximum-likelihood solution: given the data, we aim to extract all the possible transitions and describe their dynamics, assigning likelihood of sampling these transitions in the Markov chains.Note this is also in contrast to models of cellular dynamics in developmental biology systems, where there is normally one clear pathway towards a certain cell fate from the progenitor; here we are interested more in the range of possible pathways the system can take in CSR.
Transition Path Theory (TPT) provides a solution which allows us to analyse this ensemble of transition paths: we start at a pre-defined source state and enumerate, given the transition likelihoods between the states, possible paths which can traverse the states and reach the (again predefined) target.This allows us to calculate the frequency of transition paths that would flow between any two states: this represents the "flux" depicted in the figures as output of the sciCSR method.Because TPT considers all transitions, fluxes refer to any switching events involving the two states: for example, to switch from isotype X to Y, one can switch directly without an intermediate isotype, or switch via an intermediate Z.Moreover, the transition does not always occur successfully -it is possible to have attempted transitions which ultimately fall back to the starting state because, for example, the sterile transcription level is insufficient or AID is absent.Fluxes incorporate all successful transitions that switch from state X to Y; these 'fluxes' therefore represent a holistic, informationrich summary of the dynamics of the system, normally omitted in maximum-likelihood analyses of outputs from Markov models.

Fig. i .
Fig. i.Robustness of sciCSR prediction in systems with limited CSR activity.(a)Schematic to illustrate resampling of data to evaluate the impact of the current isotype predictions on sciCSR predictions of the isotype distribution at a future timepoint.(b) Example of resampling on the donor 07, w7.The resampling generates a series of datasets (bar plot at the bottom) with decreasing proportion of cells which are currently beyond the dominant isotype in the distribution (IgG1 in this case).sciCSR prediction of future isotype distribution for selected resampling conditions were shown as barplots (top).(c) Cosine similarity between sciCSR predictions made using the donor 07 w7 scRNA-seq data and w15 scBCR-seq isotype distribution, for the resampled datasets with different proportions of cells beyond the current isotype (IgG1).
Fig. ii below demonstrates that sciCSR inference recapitulates this expected outcome, and the inference is robust across a range of cell population sizes, up to groups of n = 10 cells.

Fig.
Fig. ii.sciCSR inference on a simulated mixture of IgM, IgG1 and IgA1 cells, with equal numbers of cells belonging to each isotype.Data were downsampled to the stated number of cells per group before inference using sciCSR.Heat colour corresponds to the proportion of flux attributed to each transition.