Date: 2023-04-03

Learning cell-specific RNA kinetics by a relay velocity model

The cellDancer algorithm is a deep learning framework to generalize the estimation of RNA velocity in both homogeneous and heterogeneous cell populations from scRNA-seq data by estimating cell-dependent transcription (α), splicing (β) and degradation (γ) rates. Cell-specific α, β and γ were predicted by an RNA velocity model that incorporated the neighbor cells (see details regarding the selection of the neighbor cells in the Methods). Specifically, we resolved the RNA velocity kinetics by estimating the reaction rates from the weights and biases of the nodes in a DNN, which is a generalized framework of velocity estimation (see a demonstration in Supplementary Note 1). To train the cellDancer DNN, we first discretized the original reaction kinetics as follows:

$$\begin{array}{rcl}\displaystyle\frac{{u\left( {t + \Delta t} \right)\, -\, u\left( t \right)}}{\Delta t} & = & \alpha \left( t \right) - \beta \left( t \right)u\left( t \right),\\ \displaystyle\frac{{s\left( {t + \Delta t} \right)\, - \,s\left( t \right)}}{{\Delta t}} & = & \beta \left( t \right)u\left( t \right) - \gamma \left( t \right)s\left( t \right),\end{array}$$

where time t is discretized and Δt is a small time step. In our model, α, β and γ are cell specific. For an individual gene in cell i, cellDancer used a DNN to predict the cell-specific rates α(t_i), β(t_i) and γ(t_i) from the unspliced and spliced mRNA abundances u(t_i) and s(t_i) of that gene in cell i and in the neighboring cells of i (Fig. 1b). Second, we extrapolated s(t_i + Δt) and u(t_i + Δt) of cell i at time t_i + Δt to infer a velocity vector that points from the current state to the future state in the gene phase portrait. We defined a loss function from each cell’s maximum cosine similarity between the predicted and observed velocity vectors, summed over all cells (Methods). Finally, the optimized rates of each cell were obtained by minimizing the loss function (Fig. 1b).
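The extrapolation and loss described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the cellDancer implementation: the function names `extrapolate` and `cosine_loss`, the step `dt`, and the per-cell form "1 minus the maximum cosine similarity" are hypothetical choices, and the rates are passed in directly rather than predicted by a DNN.

```python
import numpy as np

def extrapolate(u, s, alpha, beta, gamma, dt=0.5):
    """One forward step of the discretized kinetics for every cell."""
    du = (alpha - beta * u) * dt
    ds = (beta * u - gamma * s) * dt
    return u + du, s + ds

def cosine_loss(u, s, u_next, s_next, neighbors):
    """Per-cell loss: 1 - max cosine similarity between the predicted
    velocity and the displacement to each neighbor (assumed form)."""
    pred = np.stack([u_next - u, s_next - s], axis=1)              # (n_cells, 2)
    losses = np.zeros(len(neighbors))
    for i, nbrs in enumerate(neighbors):
        obs = np.stack([u[nbrs] - u[i], s[nbrs] - s[i]], axis=1)   # (k, 2)
        denom = np.linalg.norm(obs, axis=1) * np.linalg.norm(pred[i]) + 1e-12
        cos = obs @ pred[i] / denom
        losses[i] = 1.0 - cos.max()
    return losses

# Toy example with two cells that are each other's only neighbor.
u = np.array([1.0, 1.5])
s = np.array([1.0, 1.25])
u_next, s_next = extrapolate(u, s, alpha=2.0, beta=1.0, gamma=0.5)
total_loss = cosine_loss(u, s, u_next, s_next, [[1], [0]]).sum()
```

In this toy case the extrapolated state of the first cell coincides with the second cell, so its contribution to the loss is close to zero, whereas the second cell's only neighbor lies behind it and contributes a large loss.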

We initially evaluated the training progress of cellDancer on several well-studied genes in pancreatic endocrinogenesis and mouse hippocampus development17. We observed that cellDancer captured the transcriptional dynamics of these genes (Fig. 1c and Supplementary Fig. 1). Then, we scaled up the performance evaluation of cellDancer to 1,000 simulated mono-kinetic genes with shared β and γ values and two-step α values. The predicted parameters were highly correlated with the ground truth (R² = 0.98 for α/β and 0.93 for γ/β; Extended Data Fig. 1a). Remarkably, cellDancer identified two clusters of α values, representing the active (positive α) and repressive (α centered near 0) expression phases, on a benchmark dataset without a prior constraint of a two-step transcription rate (Extended Data Fig. 1b).

Inferring RNA velocity in multi-rate kinetics

As cellDancer provides single-cell resolution of α, β and γ, we next examined whether cellDancer could resolve multi-rate kinetic regimes. We simulated three multi-rate kinetic regimes: transcriptional boost, multi-lineage forward and multi-lineage backward genes (Extended Data Fig. 1c–e, right panels, and Methods). Transcriptional boost refers to a boost in expression induced by a change in the transcription rate; multi-lineage forward and multi-lineage backward refer to induction and repression in separate lineages, respectively. We generated 2,000 cells and 1,000 genes for each regime. We compared cellDancer with the scVelo (dynamic) and velocyto (static) algorithms and with two deep learning algorithms, DeepVelo19 and VeloVAE20. The error rates of cellDancer were significantly lower than those of scVelo, velocyto, DeepVelo and VeloVAE in all three simulated regimes (Extended Data Fig. 1c–e; P < 0.001, one-sided Wilcoxon test). Specifically, cellDancer exhibited the lowest error rates among the five methods: 13%, 3% and 9% for the simulated transcriptional boost, multi-forward branching and multi-backward branching kinetics, respectively (Supplementary Table 1). To test the effect of imbalanced cell numbers in different lineages or stages, we downsampled the cells at the stage after transcriptional boosting (Extended Data Fig. 1c) and the cells in lineage 1 (Extended Data Fig. 1d,e). The results showed that cellDancer was robust to this imbalance in the cell distribution. Next, we estimated the number of epochs required to optimize the cellDancer DNN. cellDancer converged at 25 epochs for mono-kinetic, multi-forward and multi-backward branching genes and at 100 epochs for transcriptional boost genes (Extended Data Fig. 1f–i).
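To make the transcriptional boost regime concrete, the sketch below integrates the two kinetic equations with a forward Euler scheme for a single gene whose transcription rate jumps partway through. It is only an illustration: the function name `simulate_gene`, the rate values and the switch time are hypothetical and are not the simulation settings used in the paper (see Methods for those).

```python
import numpy as np

def simulate_gene(alpha_of_t, beta, gamma, t_end=20.0, dt=0.01):
    """Forward-Euler integration of du/dt = alpha(t) - beta*u and
    ds/dt = beta*u - gamma*s, starting from u = s = 0."""
    n = int(round(t_end / dt))
    t = np.arange(n + 1) * dt
    u = np.zeros(n + 1)
    s = np.zeros(n + 1)
    for k in range(n):
        a = alpha_of_t(t[k])
        u[k + 1] = u[k] + dt * (a - beta * u[k])
        s[k + 1] = s[k] + dt * (beta * u[k] - gamma * s[k])
    return t, u, s

def boost(t):
    """Hypothetical transcriptional boost: alpha jumps from 1 to 3 at t = 10."""
    return 1.0 if t < 10.0 else 3.0

t, u, s = simulate_gene(boost, beta=1.0, gamma=0.5)
```

Before the jump the trajectory relaxes toward u = α/β and s = α/γ for the lower α; after the jump it moves toward the corresponding values for the higher α, which produces two distinct arcs in the (s, u) phase portrait.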

Delineating transcriptional boost on single-cell resolution

We compared cellDancer to the dynamical model of scVelo on the scRNA-seq experiment of mouse gastrulation erythropoiesis2 (Extended Data Fig. 2a and Fig. 2a), in which transcriptional boost genes were reported13. The vector flow in a uniform manifold approximation and projection (UMAP) embedding of the transcriptome clearly suggests that cellDancer recaptures the progression of erythroid differentiation (Fig. 2a, top), whereas scVelo’s prediction was reversed18 (Fig. 2a, bottom).

Fig. 2: Delineating gastrulation erythroid maturation and resolving transcriptional boost. a, The velocities derived from cellDancer (top) are consistent with the erythroid differentiation but opposite in scVelo dynamic model (bottom) by using all genes. b, The velocities derived from cellDancer, scVelo dynamic model, DeepVelo and VeloVAE for the transcriptional boost genes (Hba-x and Smim1) are illustrated on the phase portraits. The cells are colored according to the cell types. The box plots of α for each cell type predicted by cellDancer are included to show the boost in the α rates in the course of erythroid maturation, especially in erythroid 3. c, The velocities derived from cellDancer for gastrulation erythroid maturation using transcriptional boost genes are projected on the UMAP of the original work, demonstrating that cellDancer can infer the correct cell differentiation direction by using only the transcriptional boost genes. d, Gene-shared pseudotime on UMAP is consistent with the progression of gastrulation erythroid maturation. e, Genes that show high similarity in transcriptional changes along time are classified into eight clusters according to their transcriptional changes. The heat map describes the expression of the genes along time (rows: genes; columns: cells ordered according to the pseudotime). Genes were selected by Pearson correlation coefficient (R²) > 0.8. f, Average expression of each cluster along the pseudotime (top) and the enriched pathways for each cluster of genes (bottom) (Benjamini–Hochberg procedure, one-sided, P < 0.05). P value indicates the significance of enrichment of a pathway in Fisher’s exact test. g, In silico perturbation analysis by dynamo shows a critical role of Gata2 in hematopoiesis.

Barile et al.18 identified 89 multiple rate kinetics (MURK) genes, such as Smim1 and Hba-x, whose transcription rates are boosted in the middle of erythroid differentiation, and showed that the prediction of scVelo was severely affected by this transcriptional boost, resulting in incorrectly predicted directions. cellDancer predicted the correct changes of well-known MURK genes, such as Smim1 and Hba-x, on the phase portraits (Fig. 2b), whereas scVelo, DeepVelo and VeloVAE made incorrect predictions. Moreover, cellDancer revealed the transcriptional boost through the cell-specific α (Fig. 2b). We next tested the overall prediction of cellDancer on transcriptional boost genes. We applied cellDancer and scVelo to the 89 MURK genes and projected the inferred velocities onto the transcriptome UMAP. cellDancer recaptured the correct directional flow of differentiation using only MURK genes (Fig. 2c), whereas scVelo, DeepVelo and VeloVAE predicted the opposite direction in multiple cell types (Extended Data Fig. 2b).

Next, we demonstrated cellDancer’s capability to decipher transcriptional changes along the differentiation pseudotime. We first inferred the major trajectories during cell differentiation from the transition matrix based on the correlation of velocities among neighbor cells (Methods). Then, we estimated a universal pseudotime from the trajectories to capture each cell’s position along erythroid maturation. The pseudotime of cellDancer accurately illustrated the transcriptional changes of genes (Extended Data Fig. 2c) and the terminal state of erythroid maturation (Fig. 2d). To delineate the dynamics of transcriptional activity, we grouped genes into eight clusters based on the similarity of their transcriptional changes along pseudotime (Fig. 2e). The expression of genes in the first three clusters was high at the early stage in the hematoendothelial progenitor cells and diminished during differentiation. Gene expression in clusters 4–6 decreased more slowly than in the first three clusters and fell close to zero in the erythroid 3 subpopulation. Gene expression in clusters 7 and 8 increased during erythroid maturation. We next investigated the biological function of each gene cluster during erythroid cell differentiation. Gene Ontology (GO) analysis through DAVID21 showed that genes in the first three clusters are highly enriched in the angiogenesis and wound healing pathways. Genes in clusters 4–6 were enriched in basic cellular functions, including cell cycle, cell division, chromatin organization, RNA splicing and translation pathways. Not surprisingly, genes in clusters 7 and 8 are enriched in erythrocyte development, heme biosynthetic process, oxygen transport and cellular oxidant detoxification pathways (Fig. 2f). Finally, we applied dynamo22 to suppress in silico the expression of Gata2, a critical regulator in hematopoiesis, in blood progenitor 1. We observed diversions of hematopoietic fate after the perturbation (Fig. 2g), which is consistent with the experimental study23.
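The idea of deriving a cell-to-cell transition matrix from velocities and neighbor relationships can be sketched as follows. This is one common construction, shown under stated assumptions and not necessarily the one used in the Methods: each cell's transition weight to a neighbor is taken as a softmax over the cosine similarity between its velocity and the displacement to that neighbor. The function name `transition_matrix` and the scale `sigma` are hypothetical.

```python
import numpy as np

def transition_matrix(x, v, neighbors, sigma=0.1):
    """Row-stochastic matrix P where P[i, j] is large when neighbor j lies
    in the direction of cell i's velocity (assumed softmax-of-cosine form)."""
    n = x.shape[0]
    P = np.zeros((n, n))
    for i, nbrs in enumerate(neighbors):
        d = x[nbrs] - x[i]                                   # displacements to neighbors
        denom = np.linalg.norm(d, axis=1) * np.linalg.norm(v[i]) + 1e-12
        cos = d @ v[i] / denom
        w = np.exp(cos / sigma)
        P[i, nbrs] = w / w.sum()
    return P

# Three cells on a line, all moving to the right.
x = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
v = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
P = transition_matrix(x, v, [[1, 2], [0, 2], [0, 1]])
```

In this toy layout the middle cell transitions almost exclusively to the cell ahead of it, which is the behavior a trajectory or pseudotime inference step would build on.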

Inferring RNA velocities on each branch for branching genes

We evaluated cellDancer using data from the branching lineages in mouse hippocampus development. There are five major branching lineages in the mouse hippocampus, corresponding to dentate gyrus granule neurons, pyramidal neurons in subiculum and CA1, pyramidal neurons in CA2/3/4, oligodendrocyte precursors (OPCs) and astrocytes12. The cell velocity graph shows that cellDancer accurately inferred five major branching lineages in hippocampus development (Fig. 3a), confirming the reliable performance of cellDancer on multi-lineage populations.

Fig. 3: Identifying the branching lineage in the hippocampus development. a, The velocities derived from cellDancer for the mouse hippocampus development dataset are visualized on the pre-defined t-SNE embedding. Directions of the projected cell velocities on t-SNE are in good agreement with the reported directions. b, The phase portraits of two branching genes (Ntrk2 and Gnao1) predicted by cellDancer, scVelo dynamic mode, velocyto, DeepVelo and VeloVAE demonstrate the advantage of cellDancer in predicting the velocities of the branching genes. The RNA velocities of Ntrk2 and Gnao1 predicted by cellDancer are consistent with the expectation of hippocampus developmental progress, whereas the directions predicted by others are inconsistent in part. The cells are colored according to the cell types. c, Distribution of the minimized loss for all the genes. Those genes with low loss scores show mono-kinetic or divergent dynamics, whereas genes with high loss scores show pattern-less phase portraits. d, The GO pathway enrichment analysis using adjusted P values of Fisher’s exact test (Benjamini–Hochberg procedure, one-sided, P < 0.05) of DAVID for the 500 genes with the lowest training loss score shows that these genes are highly involved in pathways associated with nervous and brain development. e, Gene-shared pseudotime is projected on t-SNE by cellDancer, and the most probable paths are inferred by dynamo, showing the order of cell differentiation during hippocampus development. f, The phase portraits (left, cells colored according to a), the expression on t-SNE embedding (middle) and the expression pseudotime profiles (right) for the genes Dcx and Psd3. Dcx (top) and Psd3 (bottom) have distinct dynamic behaviors. Dcx is a mono-kinetic gene (left), and its expression gradually increases in neuroblasts (right). Psd3 is a branching gene (left), and its expression increases in each branching lineage at different speeds (right). 
FDR, false discovery rate; nIPC, neural intermediate progenitor cell.

We further studied the velocity inference of individual branching genes. As branching genes have different reaction rates among lineages, they have lineage-specific regulation of transcription, splicing and degradation and often play an important role in hippocampus development. For example, branching genes are vital to neurogenesis (Diaph3, Klf7 and Ncald; Extended Data Fig. 3)24,25,26 and are involved in the differentiation of the neural system (Cadm1 and Gpm6b)27,28. Branching genes are also related to neurological or neuropsychiatric disorders. For instance, mutations of Gnao1 may contribute to epilepsy, developmental delay and movement disorders in the neural system29. Aberrant Psd3 proteins are related to autism spectrum disorder and schizophrenia30. We applied cellDancer to the branching genes. Phase portraits show that cellDancer can accurately infer the velocities of branching genes on each lineage (Fig. 3b and Extended Data Fig. 3), whereas scVelo, velocyto, DeepVelo and VeloVAE predicted the correct velocities on a limited number of cells (Fig. 3b and Supplementary Fig. 2). Moreover, cell-specific α, β and γ were inferred on each branch. For instance, neurotrophic tyrosine kinase receptor type 2 (Ntrk2)31 has two major branches: the upper branch corresponds to astrocytes and OPCs, and the lower branch corresponds to dentate gyrus granule neurons and pyramidal neurons (Fig. 3b). Astrocytes and OPCs have high α and low β, resulting in high expression of unspliced Ntrk2 on the upper branch. Dentate gyrus granule neurons and pyramidal neurons have high β and low γ, resulting in high expression of spliced Ntrk2 on the lower branch (Extended Data Fig. 3).

cellDancer calculates a minimized loss function after optimizing a DNN for each gene. A small loss score indicates a good fit with the RNA velocity model. We ranked genes based on their loss function score. Top-ranking genes include both mono-kinetic and branching genes (Fig. 3c). Next, we performed GO pathway enrichment analysis through DAVID21 for the top 500 genes. The enriched pathways are associated with neurogenesis, nervous system development, neuron differentiation, synaptic signaling, chemical synaptic transmission and brain development (Fig. 3d).

We applied pseudotime analysis to infer the differentiation order of cells in hippocampus development. cellDancer automatically identified radial glia cells as a shared root state of hippocampus development (Fig. 3e), which is in good agreement with the previous study32. We also identified five terminal states without prior knowledge of the number of branches in the development process and applied dynamo to predict the most probable path of each terminal state (Fig. 3e). The pseudotime analysis of cellDancer suggests that astrocytes and OPCs are produced earlier than granule neurons and pyramidal neurons. Together, cellDancer has the capability to infer the global differentiation pseudotime of branching cell lineages.

We investigated the temporal progression of transcription during hippocampus development. We observed multiple expression patterns of individual genes on different branches. For instance, Dcx is transiently upregulated in neuroblasts and has consistently low expression in astrocytes (Fig. 3f), which is supported by previous studies showing that Dcx is transiently expressed in the early neurogenesis stage and is a widely used marker for neurogenesis33,34. By contrast, other genes associated with neurogenesis, such as Slc4a10 (ref. 35), Ncald26 and Ntrk2 (ref. 31), show increasing expression in all branches at different rates (Extended Data Fig. 4).

Vector fields analysis using cell-specific RNA velocity

cellDancer extends the bulk reaction rates (α, β and γ) to single-cell resolution in an scRNA-seq experiment. As gene expression is regulated by transcription, splicing and degradation, the reaction rates tend to be more stable than expression within a cell type during cell differentiation (Fig. 4a). Thus, we asked whether the cell-dependent reaction rates in cellDancer provide biological insights into cell identity. We applied cellDancer to infer cell-dependent α, β and γ in the endocrine development of the mouse pancreas profiled at embryonic day 15.5 (E15.5)36. Previous works reported four terminal cell types in endocrinogenesis: glucagon-producing alpha-cells, insulin-producing beta-cells, somatostatin-producing delta-cells and ghrelin-producing epsilon-cells37. UMAP of the transcriptome shows that alpha-, beta-, delta- and epsilon-cells are distributed closely (Fig. 4b). Reaction parameters tend to be more consistent than transcriptomes within a cell type. For instance, expression of Sulf2 increases in Ngn3-high endocrine progenitors and decreases in pre-endocrine cells (Fig. 4c), whereas α takes a similar positive value across Ngn3-high endocrine progenitors and is ~0 in pre-endocrine cells. Next, we investigated the overall similarity of α, β and γ in each cell type. We applied UMAP to embed α, β and γ into two dimensions. Alpha-, beta-, delta- and epsilon-cells separate into distinct groups on the UMAP of α, β and γ (Fig. 4d and Supplementary Fig. 3), suggesting that cell-specific α, β and γ can serve as an indicator of cell identity. Notably, the cycling subpopulations of ductal cells and endocrine progenitors were separated from the non-cycling ones (Fig. 4e).

Fig. 4: Deciphering cell identity with cell-specific reaction rates and analyzing gene regulation through vector fields. a, Schematic illustration shows that the α, β or γ rates of the genes may be a good indicator of the cell types rather than the expressions of the genes. b, The velocities derived from cellDancer for the pancreatic endocrinogenesis cells are visualized on the pre-defined UMAP embedding. c, Phase portraits of the gene Sulf2. The α rates of the Sulf2 gene for each cell calculated by cellDancer clearly illustrate the gene’s induction and regression phases (left). Sulf2 is in induction in the Ngn3-high embryonic progenitor (EP) cell type and in regression in the pre-endocrine cell type, whereas it is barely transcribed in other cell types (right). d,e, UMAP embedding using the cell-specific α, β and γ rates calculated by cellDancer indicates that our computed kinetics rates might be useful in assigning cell subpopulations (d) and cell identity (e). f, The velocity vector fields were learned by dynamo. The red digit 0 reflects the identified emitting fixed point. The black digits 1, 2 and 3 reflect the absorbing fixed points. g, Jacobian analysis and the gene expression of Arx and Pax4 on the UMAP space. It shows that Pax4 is downregulated by Arx in alpha-cells and that Arx is downregulated by Pax4 in beta-cells.

Furthermore, we passed the cell velocities to the established framework dynamo, which provides rich downstream analyses by learning differentiable velocity vector fields and inferring gene regulation networks. Notably, absorbing fixed points were identified in the alpha-, beta- and epsilon-cells, and an emitting fixed point was identified in the pancreas progenitor cells (Fig. 4f). To investigate alpha-cell and beta-cell fate determination, we inspected the expression of Arx and Pax4, two well-known transcription factors that determine the endocrine cell fates (the alpha and beta lineages)38. Consistent with the previous study38, we observed exclusively high expression of Arx and Pax4 in the alpha-cells and beta-cells, respectively (Fig. 4g). Then, we used dynamo to perform Jacobian analyses and detected mutual inhibition between Arx and Pax4 in the alpha-cells and beta-cells. These analyses are in line with the experimental findings39 and provide mechanistic insight into gene regulation at single-cell resolution, showing that cellDancer can be seamlessly integrated with downstream analyses, such as dynamo vector field analysis.
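How a Jacobian reveals mutual inhibition can be illustrated with a toy two-gene vector field. The sketch below does not use dynamo or any learned vector field: `toggle_field` is a hypothetical textbook toggle-switch model with made-up parameters, and `numerical_jacobian` is a plain central-difference approximation. Negative off-diagonal entries indicate that increasing one gene decreases the velocity of the other.

```python
import numpy as np

def toggle_field(x, a=2.0, n=2, k=1.0):
    """Toy mutual-inhibition vector field for two genes (hypothetical model):
    each gene is produced at a rate repressed by the other and decays linearly."""
    arx, pax4 = x
    return np.array([a / (1.0 + pax4**n) - k * arx,
                     a / (1.0 + arx**n) - k * pax4])

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian J[i, j] = d f_i / d x_j at point x."""
    x = np.asarray(x, dtype=float)
    J = np.zeros((f(x).size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        J[:, j] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return J

# Evaluate at a state with high "Arx" and low "Pax4".
J = numerical_jacobian(toggle_field, [1.5, 0.3])
```

Here `J[0, 1]` (effect of the second gene on the first) and `J[1, 0]` (effect of the first on the second) are both negative, which is the signature of mutual inhibition that a Jacobian analysis looks for.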

Revealing the turnover strategies of mRNA during cell cycle

A previous study showed that metabolic labeling technology, such as sequencing mRNA labeled with 5-ethynyl-uridine (EU) in single cells (scEU-seq), can measure the synthesis and degradation of mRNA by sequencing40. Furthermore, Qiu et al.22 showed that scEU-seq can be used to predict the dynamics of the cell cycle. To investigate whether the predicted kinetic parameters are consistent with the experimental measurements, we used metabolic labeling data (that is, scEU-seq) of RPE1-FUCCI cells at specific points during cell cycle progression as a benchmark40. We first clustered RPE1-FUCCI cells into eight groups based on cell cycle stages and calculated the average spliced and unspliced expression of cell-cycle-associated genes for which synthesis and degradation rates are also available in scEU-seq (Extended Data Fig. 5a). We applied cellDancer to predict the velocities and kinetic parameters of cell cycle genes and compared the predicted α and γ to the experimentally derived synthesis and degradation rates measured by scEU-seq40 (Extended Data Fig. 5b). Overall, the predicted α and γ are associated with the experimental measurements of mRNA synthesis and degradation (Extended Data Fig. 5b,c), especially for the highly expressed genes (Extended Data Fig. 5a). We also observed a difference between the predicted α and the scEU-seq synthesis rates in the G1 state for the low-expression genes, whose expression starts to increase at the G1 state (Extended Data Fig. 5a). Our prediction captures this increase with a relatively large α in the G1 state, whereas scEU-seq shows a low synthesis rate, which may reflect a limitation of scEU-seq for low-expression genes. Next, we predicted the velocity flow and pseudotime of cell cycle progression using cell cycle genes. cellDancer predicts the direction of transcriptome shifting and the pseudotime during the cell cycle (Extended Data Fig. 5d). Together, the cellDancer-predicted kinetic parameters reflect the mRNA turnover rates measured during the cell cycle.

We further investigated the functions of genes with different kinetic patterns. We grouped genes into seven clusters according to dynamic patterns of α and γ (Extended Data Fig. 6a). We calculated the correlation of α and γ and the average expression in each cluster (Extended Data Fig. 6b). We identified three positively correlated groups and four negatively correlated groups, indicating different turnover strategies in the clusters. Next, we investigated the functions of genes in each cluster through DAVID21 (Extended Data Fig. 6c). Overall, all clusters are associated with cell cycle pathways, including cell division, proliferation, chromatin remodeling, DNA replication and cell cycle checkpoints. We noticed that the genes in cluster F have large transcription and degradation rates in the mitosis stage, indicating a fast turnover of mRNAs. The genes in cluster F are enriched in pathways related to cell communication, including signal transduction, enzyme-linked receptor protein signaling, TGF-β receptor signaling and stress-activated protein kinase signaling, suggesting a quick communication of cells during mitosis.

To investigate the capacity of cell-specific rates to identify cell subpopulations, we first confirmed that pseudotime is continuous in the gene expression space during the cell cycle. Specifically, the G2 phase (pseudotime 0.8–1) is in proximity to the M phase (pseudotime 0–0.2) (Extended Data Fig. 6d). Then, we clustered the cells into 17 subpopulations according to the cell-specific rates (Extended Data Fig. 6d) using SCANPY41 and used hierarchical clustering to further group the subpopulations (Extended Data Fig. 6e). We found that these subpopulations clustered together in good overall agreement with cell cycle pseudotime, except for clusters 3 and 4 (a cell subpopulation at the M phase). The reaction rates of this cell subpopulation are more in line with those of clusters 1 and 2, which are at the G1 and S stages (Extended Data Fig. 6e). Next, we compared the gene expression and reaction rates of this cell subpopulation with those of the other cells. We identified 116 differentially expressed genes and 181 genes with differential transcriptional rates by comparing this subpopulation to the rest and found that only 10% of the genes with differential transcriptional rates were captured by the raw expression (Extended Data Fig. 6f). We further investigated the enriched pathways of the 163 genes that were uniquely identified by the rates through DAVID21. Those genes are enriched in cell division pathways, such as cytokinesis, cell division and mitotic metaphase congression (Extended Data Fig. 6g), suggesting that transcriptional regulation plays an important role in cell division at the M stage.

Decoding human embryonic glutamatergic neurogenesis

We further investigated RNA velocity on an scRNA-seq dataset of the developing human forebrain at 10 weeks after conception, which was used as a benchmark in previous studies12,42. We used cellDancer to predict RNA velocity on human embryonic glutamatergic neurogenesis. The velocity on the embedding space and the derived pseudotime show that cellDancer accurately recaptures the cell fate of human embryonic glutamatergic neurogenesis (Extended Data Fig. 7a,b). The velocities of genes that are vital to neural development and neurogenesis, such as ELAVL4 (ref. 43) and DCX33,34, were also correctly predicted (Extended Data Fig. 7c).

To test whether cellDancer is sensitive to the methods of neighbor cell detection, we applied cellDancer to predict velocity vector flow based on the nearest neighbors defined by the spliced RNAs or by the spliced and unspliced RNAs. Results suggest that the prediction of velocities using spliced RNAs is consistent with the prediction using spliced and unspliced RNAs (Extended Data Fig. 7a).

cellDancer has a robust and efficient performance

A high proportion of zero reads is a key feature of scRNA-seq data, one cause of which is technical dropout. We tested whether cellDancer is robust to technical dropout (Extended Data Fig. 8a). cellDancer was able to correctly predict the gene dynamics even at high dropout ratios and learned RNA velocities in noisy scRNA-seq data (Extended Data Fig. 8b).
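A simple way to emulate technical dropout in a count matrix is to zero out entries at random. The sketch below assumes independent dropout with a single ratio for every entry, which may differ from the dropout model actually used for Extended Data Fig. 8; the function name `apply_dropout` and its parameters are hypothetical.

```python
import numpy as np

def apply_dropout(counts, ratio, seed=0):
    """Return a copy of `counts` in which each entry is set to zero
    independently with probability `ratio` (assumed dropout model)."""
    rng = np.random.default_rng(seed)
    mask = rng.random(counts.shape) < ratio   # True marks a dropped entry
    dropped = counts.copy()
    dropped[mask] = 0
    return dropped

# Toy matrix: 200 cells x 50 genes, every entry initially nonzero.
counts = np.ones((200, 50))
noisy = apply_dropout(counts, ratio=0.3)
zero_fraction = (noisy == 0).mean()
```

Running a velocity method on `noisy` for increasing values of `ratio` and comparing against the result on `counts` gives a basic robustness curve.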

Next, we tested the robustness of our algorithm to different cell numbers. We gradually reduced the number of cells in the simulation dataset from 10,000 to 1,000, predicted RNA velocity at each size and compared the predictions of α/β and γ/β. The results show that our model is robust when the data are sparse (Extended Data Fig. 8c).

We tested the sensitivity of cellDancer to the stopping criteria for training the DNN. Two key parameters, ‘checkpoint’ and ‘patience’, are associated with the stopping criteria. We performed the full cellDancer analysis on the mouse hippocampus development experiment using different values of checkpoint and patience for training. cellDancer shows low sensitivity to the stopping criteria of training (Extended Data Fig. 9). Furthermore, cellDancer trains an independent DNN for each gene, which allows us to use multi-processing to speed up the computation. Overall, cellDancer has an efficient runtime (Extended Data Fig. 10).
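The role of the two stopping parameters can be illustrated with a generic early-stopping loop. The interpretation below is an assumption rather than a description of cellDancer's code: the loss is checked every `checkpoint` epochs, and training stops once `patience` consecutive checks show no improvement. The function `train_with_early_stopping` and the toy loss curve are hypothetical.

```python
def train_with_early_stopping(loss_at_epoch, max_epochs=200, checkpoint=5, patience=3):
    """Generic early stopping: evaluate the loss every `checkpoint` epochs and
    stop after `patience` consecutive evaluations without improvement."""
    best = float("inf")
    best_epoch = 0
    bad_checks = 0
    for epoch in range(1, max_epochs + 1):
        # A real training loop would run one optimization epoch here.
        if epoch % checkpoint != 0:
            continue
        loss = loss_at_epoch(epoch)
        if loss < best - 1e-8:
            best, best_epoch, bad_checks = loss, epoch, 0
        else:
            bad_checks += 1
            if bad_checks >= patience:
                break
    return best_epoch, epoch, best

def toy_loss(epoch):
    """Toy loss curve that decreases linearly and then plateaus at 0.2."""
    return max(100 - 2 * epoch, 20) / 100

best_epoch, stopped_epoch, best_loss = train_with_early_stopping(toy_loss)
```

Because each gene is fitted independently, a loop like this could be run per gene in separate worker processes, which is the property the text relies on for multi-processing.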