Trajectory-based differential expression analysis for single-cell sequencing data

Trajectory inference has radically enhanced single-cell RNA-seq research by enabling the study of dynamic changes in gene expression. Downstream of trajectory inference, it is vital to discover genes that are (i) associated with the lineages in the trajectory, or (ii) differentially expressed between lineages, to illuminate the underlying biological processes. Current data analysis procedures, however, either fail to exploit the continuous resolution provided by trajectory inference, or fail to pinpoint the exact types of differential expression. We introduce tradeSeq, a powerful generalized additive model framework based on the negative binomial distribution that allows flexible inference of both within-lineage and between-lineage differential expression. By incorporating observation-level weights, the model additionally allows to account for zero inflation. We evaluate the method on simulated datasets and on real datasets from droplet-based and full-length protocols, and show that it yields biological insights through a clear interpretation of the data.


Supplementary Methods
Eigenvalue decomposition ofΣβ g Let C correspond to the (LK) × L(L − 1)M/2 matrix that defines the linear contrasts of interest for the patternTest, i.e., every column of C corresponds to the comparison of two points for a pair of lineages. Tests for the contrasts are performed using a Wald test statistic defined as withΣβ g the estimated variance-covariance matrix of the estimated smoother coefficients. Lettingα g = β T g C andΣα g = C TΣβ g C, we can rewrite the Wald statistic as Taking the eigendecomposition ofΣα g , whereV g is an L(L − 1)M/2 × L(L − 1)M/2 matrix with columns corresponding to the L(L − 1)M/2 eigenvectors ofΣα g andΛ g the L(L − 1)M/2 × L(L − 1)M/2 diagonal matrix of eigenvalues λ i ∈ [0, 1] ofΣα g , in decreasing order. Note that, sinceΛ g is a diagonal matrix, we can simply invert its diagonal elements instead of inverting the full matrix. We determine the rank r ofΣα g by calculating the number of eigenvalues that are larger than 1e −8 λ 1 , with λ 1 corresponding to the largest eigenvalue. When the rank r ofΣα g is less than L(L − 1)M/2 (i.e., the matrix is not of full rank), we only use the first r eigenvectors fromV g , associated with the largest r eigenvalues fromΛ g . This provides an efficient computation of the test statistic and avoids singularity problems with the estimated variance-covariance matrix of the contrasts 1 .

Defining Z based on user-supplied weights
If one has user-supplied weights W = (W li ∈ [0, 1] : l ∈ {1, . . . , L}, i ∈ {1, . . . , n}) for the assignment of cells to lineages, one can construct the binary matrix Z from W as follows. First, note that the weights W may be defined differently depending on the TI method that was used to estimate them. For example, slingshot 2 defines weights based on the distance from a cell to a particular lineage; hence, the sum of the weights across all lineages for a particular cell may be greater than 1. As such, these weights cannot be interpreted as probabilities. GPfates 3 , however, does return posterior probabilities that a cell i belongs to a particular lineage l, where L l=1 W li = 1 for each i. We therefore first normalize the weights for each cell, such that, for normalized weights W * li , the sum across lineages equals one, i.e., L l=1 W * li = 1 for each cell i. Next, we assign each cell i to a lineage by sampling one observation from a multinomial distribution with L groups and probabilities W * li . The lineage assignments are then encoded in the L × n matrix Z, by setting all elements of the i th column equal to zero except for a 1 in the row corresponding to the sampled lineage for cell i.
The multinomial sampling to assign each cell to a lineage may introduce variability in the results if the models are fit multiple times, due to differing cell allocations. This is especially so if there is a high uncertainty about the lineage allocation (e.g., a cell is equally likely to belong to each of two lineages), which typically occurs around the inception of a trajectory. While we ensure computational reproducibility by setting a seed in the software implementation, the results may vary slightly over different seeds. To quantify this variability, we use the data of Paul et al. 4 and allocate cells to lineages using 10 different seeds. Since we expect the variability across different assignments to be largest at the inception of the lineage, we evaluate DE using tradeSeq's startVsEndTest. Using a global test across the two lineages, the number of DE genes at a 5% nominal FDR level varied between 1, 990 and 2, 049, with 1, 739 DE genes shared across all 10 assignments (Supplementary Figure 31). For each assignment, at least 93% of the top 1, 000 DE genes are shared with the top 1, 000 DE genes of any other assignment.

Supplementary Note 1: Computation time and memory usage benchmark
To assess time and memory requirements, scRNA-seq datasets with a bifurcating trajectory were simulated using the same framework as in the simulation study. Three datasets with 100, 1, 000, and 10, 000 cells were simulated (small, medium, and large datasets), each consisting of 5, 000 genes. For BEAM and GPfates, only the fitting and DE testing part was assessed for each method, not the trajectory inference part. All methods were ran with default options. For tradeSeq, the fitGAM function was assessed with 4 knots, as determined in the simulation study. The different tests implemented in tradeSeq were benchmarked separately (Supplementary Figure 16). Their running times are very small (always below 30 seconds), as compared to the fitGAM function, and do not increase for datasets with increasing numbers of cells.
To benchmark time requirements, the microbenchmark package was used, and each method was run 10, 2, and 2 times on, respectively, the small, medium, and large datasets. Variations in running times were very small (always under the minute), especially in comparison to between method differences. Methods that reached the 4-hour mark without finishing were killed. This is the case for ImpulseDE2 on the datasets with 10 3 and 10 4 cells, and for GPfates on the largest dataset. Memory benchmark was assesses using the Rprof function; maximum memory usage was recorded.
Results are shown in Figure 4. ImpulseDE2 is by far the slowest, taking over 3.5 hours to finish on a small dataset of 100 cells. GPfates runs in about 30s on the small dataset but scales poorly. BEAM, edgeR, and tradeSeq are quite fast and scale very well, even to large datasets, with BEAM scaling the best. In terms of memory requirements, all methods scale well to 10, 000 cells. It should also be noted that tradeSeq, ImpulseDE2, and BEAM can utilize multiple cores but were benchmarked using only one core.

Supplementary Note 2: Bulk RNA-seq time-course dataset
While in this manuscript we focus on DE analysis downstream of TI, the applicability of tradeSeq extends beyond this setting. We demonstrate this by using tradeSeq on a bulk RNA-seq time-course study from Kiselev et al. 5 , where we compare gene expression between wild type and PIK3CA H1047R cell lines upon stimulation of epidermal growth factor (EGF). Gene expression was measured for three replicates in each condition over six time-points, ranging from 0 to 300 minutes post EGF stimulation. The original analysis in the manuscript assessed DE between the cell lines for each time-point separately using DESeq2, and found 7, 486 DE genes at a 1% nominal FDR level (Benjamini-Hochberg 6 FDR-controlling procedure). We perform an analogous analysis using tradeSeq, by modeling gene expression measures as smooth functions of time and looking for differences in expression patterns with patternTest. This yields 7, 184 DE genes at a 1% nominal FDR level. Around 89% of these genes overlap with the original DE list of 7, 486 genes, demonstrating that the utility of tradeSeq goes beyond scRNA-seq applications.

Supplementary Note 4: Mouse bone marrow dataset
Dimensionality reduction. We observed that the biology of the ICA dimensionality reduction does not fully preserve the underlying biology. Indeed, a 2-dimensional visualization of the ICA dimensionality reduction (Figure 5a) shows that there is a seemingly large gap between the multipotent progenitors and the remaining cell types, and that a number of erythrocytes and granulocyte-macrophage progenitors (GMP) are misclassified as multipotent progenitors. In addition, megakaryocytes, which are thrombocyte progenitors and as such should not belong to any of the two lineages, seem to be split between the erythrocyte and leukocyte lineages. However, when applying UMAP dimensionality reduction ( Figure  5b), these issues are resolved and the underlying biology seems better preserved than with ICA.
Discovering cell type markers. The diffEndTest procedure from tradeSeq finds 2, 233 significantly differentially expressed genes at a 5% nominal FDR level,while BEAM discovers 584 genes at a 5% nominal FDR level when testing whether the association between gene expression and pseudotime depends on the lineage (Benjamini-Hochberg 6 FDR-controlling procedure). Since the identification of a larger set of DE genes does not necessarily imply more relevant biology, we select carefully constructed gene sets from de Graaf et al. 8 to perform gene set enrichment analysis (GSEA) on blood cell types. As we are comparing erythrocytes with a mixture of leukocytes, we expect gene sets related to erythrocytes to be significant. Indeed, the erythrocyte gene set is the only one to be found significant by fgsea 9 for the tradeSeq analysis (FDR adjusted p-value < .001, with normalized enrichment score of 1.49), while no significant gene sets are found for the BEAM analysis (as reference, the FDR adjusted p-value for the erythrocyte gene set is 0.58). In this case, tradeSeq is therefore better able to recover a meaningful biological signal (Supplementary Figure 24). If one assumes that the cell type labels are known for all cells in the dataset, a clusterbased comparison is possible, where the different clusters correspond to the identified cell types. We use edgeR 10 to assess differential expression between erythrocytes and neutrophils, since this comparison is most analogous to tradeSeq's diffEndTest. Only edgeR finds evidence for gene sets related to eosinophils and T-cells (FDR adjusted p-values of 0.042 and 0.049, respectively), however, the eosinophil cells were removed from this dataset prior to analysis (see Methods, subsection 'Case studies: Mouse bone marrow dataset'). The GSEA results for edgeR also provide less evidence for erythrocytes (FDR adjusted p-value of 0.043, normalized enrichment score of 1.21) as compared to the tradeSeq analysis (Supplementary Figure  24). None of the methods, however, recover evidence for the neutrophil cell types that are identified at the end of the lineage (Figure 5b; FDR adjusted p-values p tradeSeq = 0.99, p BEAM = 0.86, and p edgeR = 0.81).
Discovering progenitor population markers. It might be interesting to examine genes with significantly different expression patterns, that show little evidence for DE at the endpoints. We therefore select genes with both a high Wald test statistic (low p-value) for the patternTest and a low test statistic (high p-value) for the diffEndTest. Following the approach described in 'Case studies: Mouse bone marrow dataset' (Methods), we assign a score for each gene. Remarkably, within the top eight genes, four genes (Erp29, Irf8, Psap, and ApoE ) were previously found to be major regulators of hematopoiesis. Indeed, Irf8 has previously been identified as a major transcription factor involved in myeloid lineage commitment 4,11 . Erp29 and Psap are direct targets of the Irf8 transcription factor 4,12,13 , while ApoE regulates stem cell proliferation in atherosclerotic mice 14 and was also identified as a marker gene in the original manuscript of Paul et al. 4 . The remaining four genes include Nedd4, Srrm2, Gatm, and Acin1.

Supplementary Note 5: Mouse olfactory epithelium dataset
Within-lineage DE. The top 20 gene sets as derived by the startVsEndTest procedure across all three lineages in a ZINB-tradeSeq analysis confirms the biology of the experiment. The HBCs were primed for differentiation and the top gene sets include responses to (organic, external, and endogenous) stimuli. In addition, neurogenesis and tissue development are the first and third most significant gene sets, respectively, while the remaining list contains sets related to cell development, differentiation, and epithelium development, amongst others. Although the neuronal and microvillous cell lineages undergo mitotic division, it is worth noting that cell cycle related gene sets are absent from the startVsEndTest results, since this process occurs during differentiation, but not in the resting HBC or differentiated cell populations.
Between-lineage DE. The only relevant comparison is between the diffEndTest procedure from tradeSeq and a discrete DE test between the differentiated cell types using ZINB-edgeR as introduced in Van den Berge et al. 15 . For both methods, we use a global test to compare mean expression between all three differentiated cell types. While a ZINB-edgeR analysis discovers 1, 984 genes, the ZINB-tradeSeq analysis discovers 3, 719 genes, which include ∼ 86% of the genes also discovered by the ZINB-edgeR analysis. In order to assess the relevance of the extra 1, 994 genes discovered with ZINB-tradeSeq, we perform GSEA on this gene set (Supplementary Table 3). The top 20 significant gene sets contain relevant biological processes for the system under study, such as "regulation of multicellular organismal development", "positive regulation of biosynthetic process", and "tissue development".
We can also identify genes that drive the branching based on the earlyDETest applied around the first branching point, i.e., between knots 1 and 3 (see Supplementary Figure 29 and Figure 6d). We apply stage-wise testing 16 (see Methods) to first assess any difference across the three lineages using a global test. We discover 2, 083 genes to be DE between any of the three lineages at a 5% nominal FDR level. Among these 2, 083 genes, we then discover 634 significant genes between the neuronal and microvillous lineages, 1, 068 significant genes between the microvillous and sustentacular lineages, and 1, 312 significant genes between the neuronal and sustentacular lineages (Supplementary Figure 30). In total, 151 genes are significant in all pairwise comparisons (Supplementary Figure 30), and these genes may be potentially important regulators of the transcriptional program involved in olfactory epithelium development. In these early stages of development, one could expect transcription factors to drive the differences between the three developmental lineages. Out of all 2, 083 significant genes, we recover 84 transcription factors, as identified by TFcheckpoint 17 . Interestingly, aside from their general functionality as regulators of gene expression, the list of 84 transcription factors is enriched for gene sets related to epithelial cell differentiation, cell fate commitment, neuron differentiation, amongst other relevant gene sets (Supplementary Table 4).  Supplementary Figure 1: Mouse bone marrow dataset: The NB-GAM is robust to the number of knots k. Gaussian kernel density plot of the percentage of deviance explained by the NB-GAM applied to each of the genes in the dataset from Paul et al. 4 , with number of knots k ranging from 3 to 14. The distributions are nearly identical for the different numbers of knots, except for 3 knots, suggesting we might want to select more than 3 knots for this dataset.  The dataset is represented in low-dimensional space using Gaussian process latent variable models as implemented in GPfates. Cells are colored according to their assignment probability to the blue lineage. Trajectories inferred by GPfates are shown when (a) pseudotime is estimated by GPfates and (b) true pseudotime is provided as input to GPfates.

Supplementary Tables and Figures
Supplementary Figure 4: Cyclic simulation scenario: Selecting the optimal number of knots k using the AIC. Selecting the optimal number of knots, k ∈ {3, . . . , 10}, using the Akaike information criterion (AIC) for a random subset of 250 genes, as implemented in the evaluateK function in tradeSeq. The left panel shows boxplots (center line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range) of the differences in AIC value with respect to the gene-wise average AIC for the range of k. The middle panels show the evolution of the average AIC (second panel) and relative AIC (third panel) across k. The relative AIC is defined as the relative change with respect to the average AIC at k = 3. The barplot in the right panel shows the number of genes which achieve their lowest AIC value for a given k.
Here, only genes for which the AIC value varied substantially enough across k (i.e., range in AIC greater than 2) are considered.  The associationTest from tradeSeq has superior performance for discovering genes whose expression is associated with pseudotime, as compared to Monocle 3. When investigating differential expression between lineages of a trajectory, the patternTest of tradeSeq consistently outperforms the diffEndTest across all three TI methods, since it is capable of comparing expression across entire lineages. Panel (e) displays the average performance curves across the three bifurcating datasets where all TI methods recovered the correct topology. Here, all tradeSeq patternTest workflows, tradeSeq slingshot end, and edgeR have similar performance and all are superior to BEAM, ImpulseDE2, and GPfates. Note that the performance of tradeSeq Monocle2 end deteriorates as compared to tradeSeq slingshot end; the curve for tradeSeq GPfates end is not visible in this panel due to its low performance. For the multifurcating dataset of panel (f), tradeSeq slingshot has the highest performance, closely followed by tradeSeq Monocle2 and edgeR.
Supplementary Figure 8: Bifurcating simulation scenario: Selecting the optimal number of knots k using the AIC. Selecting the optimal number of knots, k ∈ {3, . . . , 10}, using the AIC for a random subset of 250 genes, as implemented in the evaluateK function in tradeSeq. The left panel shows boxplots (center line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range) of the differences in AIC value with respect to the gene-wise average AIC for the range of k. The middle panels show the evolution of the average AIC (second panel) and relative AIC (third panel) across k. The relative AIC is defined as the relative change with respect to the average AIC at k = 3. The barplot in the right panel shows the number of genes which achieve their lowest AIC value for a given k. Here, only genes for which the AIC value varied substantially enough across k (i.e., range in AIC greater than 2) are considered.  Figure 11: Bifurcating simulation scenario: Mean FDR-TPR performance curves for trajectory-based differential expression analysis across all 10 simulated datasets. ImpulseDE2 is not plotted since we were unable to run it on several datasets. Supplementary Figure 12: Bifurcating simulation scenario: FDP-TPR performance curves for trajectorybased differential expression analysis for each of the 10 simulated datasets. Note that the BEAM and tradeSeq Monocle2 methods are not plotted for Datasets 3, 6, and 9, since Monocle2 failed to discover a branching trajectory for these datasets. We were unable to run ImpulseDE2 on datasets 4, 5, 6, 9, and 10 due to errors. Supplementary Figure 13: Bifurcating simulation scenario: FDP-TPR performance curves for trajectorybased differential expression analysis based on the simulation ground truth. To allow a comparison with the BEAM approach, which fits smoothers using 3 knots, we fitted the tradeSeq NB-GAM once with 3 knots and once with 4 knots. We found the latter to provide an optimal fit in terms of AIC. The tradeSeq patternTest is unaffected by the number of knots, hence the performance curves overlap. tradeSeq consistently outperforms both BEAM and ImpulseDE2 in all datasets. Note that we did not include the GPfates method in this evaluation, since we were unable to provide the simulation ground truth as input to the method.
Supplementary Figure 14: Multifurcating simulation scenario: GPfates inferred trajectory on one simulated dataset. Two-dimensional representation of the dataset for the multifurcating simulation scenario using Gaussian process latent variable models, as implemented in GPfates. Cells are colored according to their assignment probability to the blue lineage.
Supplementary Figure 15: Multifurcating simulation scenario: Selecting the optimal number of knots k using the AIC. Selecting the optimal number of knots, k ∈ {3, . . . , 10}, using the AIC for a random subset of 500 genes, as implemented in the evaluateK function in tradeSeq. The left panel shows boxplots (center line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range) of the differences in AIC value with respect to the gene-wise average AIC for the range of k. The middle panels show the evolution of the average AIC (second panel) and relative AIC (third panel) across k. The relative AIC is defined as the relative change with respect to the average AIC at k = 3. The barplot in the right panel shows the number of genes which achieve their lowest AIC value for a given k. Here, only genes for which the AIC value varied substantially enough across k (i.e., range in AIC greater than 2) are considered. Datasets of increasing size (in terms of number of cells) were simulated, each consisting of 5, 000 genes. The fitGAM function of tradeSeq was ran with 4 knots. The computational time required to run the tests for all genes is benchmarked using the microbenchmark package, with 10 iterations each. patternTest and earlyDETest are slower than associationTest, diffEndTest, and startVsEndTest, but all take under 30 seconds to run. The time requirement is constant with respect to the number of cells.
Supplementary Figure 17: Adipocyte differentiation dataset: Selecting the optimal number of knots k using the AIC. Selecting the optimal number of knots, k ∈ {3, . . . , 10}, using the AIC for two random subsets (top and bottom rows represent one subset each) of 200 genes, as implemented in the evaluateK function in tradeSeq. The left panel shows boxplots (center line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range) of the differences in AIC value with respect to the gene-wise average AIC for the range of k. The middle panels show the evolution of the average AIC (second panel) and relative AIC (third panel) across k. The relative AIC is defined as the relative change with respect to the average AIC at k = 3. The barplot in the right panel shows the number of genes which achieve their lowest AIC value for a given k. Here, only genes for which the AIC value varied substantially enough across k (i.e., range in AIC greater than 2) are considered.
Supplementary Figure 18: Adipoyte differentiaion dataset: Inferred trajectory. The scRNA-seq data are plotted in 2D UMAP space, and each cell is colored according to its cluster membership as derived by k-means clustering with k = 6 clusters. The black solid line represents the trajectory as estimated by slingshot.
Supplementary Figure 19: Adipocyte differentiation dataset: Top markers for progenitor cell population.
The scRNA-seq data are plotted in 2D UMAP space, and each cell is colored according to the expression of one of six genes (the expression range is divided into 4 bins, where blue corresponds to low expression and red corresponds to high expression). The top row corresponds to two marker genes, Dpp4+ and Wnt2, from the original manuscript 7 . Other plots are top genes identified with the startVsEndTest procedure from tradeSeq.

26
Supplementary Figure 20: Adipocyte differentiation dataset: Genes upregulated in the adipocyte precursor stage and a single differentiated cell type. The scRNA-seq data are plotted in 2D UMAP space, and each cell is colored according to the expression of one of two genes (the expression range is divided into 4 bins, where blue corresponds to low expression and red corresponds to high expression).
Supplementary Figure 21: Adipocyte differentiation dataset: Genes sporadically upregulated across the entire lineage for a single differentiated cell type. The scRNA-seq data are plotted in 2D UMAP space, and each cell is colored according to the expression of one of two genes (the expression range is divided into 4 bins, where blue corresponds to low expression and red corresponds to high expression). Enrichment is determined for each method based on its respective ranking of the genes according to evidence for differential expression. Genes that are contained in the erythrocyte gene set are denoted with vertical lines at the bottom of each figure, and the green curve represents the gene enrichment score along the gene rankings. tradeSeq has the highest enrichment score, as determined by the dashed red line, since genes that are related to erythrocytes predominantly have high rankings for differential expression, while the distribution of erythrocyte genes seems more uniform with, for example, the BEAM approach.
Supplementary Figure 25: Mouse bone marrow dataset: Stability of gene clustering methods across bootstrap samples. Boxplots (center line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range) of gene clustering stability measures are shown for three clustering algorithms. The stability of a gene clustering method is evaluated using non-parametric bootstrapping of the cells (for computational reasons, we restricted this evaluation to six bootstrap samples). We consider all genes found to be significant at a 5% nominal FDR level by patternTest for the dataset of Paul et al. 4 . For each bootstrap sample, cells are sampled at random with replacement, the NB-GAM is refit using trade-Seq, and genes are clustered based on the tradeSeq fitted values, using both partitioning around medoids (PAM) and RSEC. We compare RSEC against PAM since the latter is also used in Monocle for gene clustering. For RSEC, we evaluate both clustering on the fitted values directly (method 'RSEC noDR' in the figure) as well as clustering after dimensionality reduction with principal component analysis (the default for RSEC, as implemented in clusterExperiment, with automatic determination of the number of principal components; method 'RSEC' in the figure). The stability of the clustering is evaluated by comparing the bootstrapped clusterings with the original clustering based on the full dataset using the adjusted Rand index (ARI) 19 .
Supplementary Figure 26: Olfactory epithelium dataset: Selecting the optimal number of knots k using the AIC. Selecting the optimal number of knots, k ∈ {3, . . . , 30}, using the AIC for two random subsets (top and bottom rows represent one subset each) of 1, 000 genes, as implemented in the evaluateK function in tradeSeq. The left panel shows boxplots (center line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range) of the differences in AIC value with respect to the gene-wise average AIC for the range of k. The middle panels show the evolution of the average AIC (second panel) and relative AIC (third panel) across k. The relative AIC is defined as the relative change with respect to the average AIC at k = 3. The barplot in the right panel shows the number of genes which achieve their lowest AIC value for a given k. Here, only genes for which the AIC value varied substantially enough across k (i.e., range in AIC greater than 2) are considered.
Supplementary Figure 27: Mouse olfactory epithelium dataset: Cell cycle genes in the neuronal lineage.
The figure illustrates the enrichment of cell cyle genes in lists of genes whose expression was found to be most significantly associated with the neuronal lineage according to the associationTest procedure in tradeSeq. The list of cell cycle related genes was obtained from the Mouse Genome Informatics (MGI) website at http://www.informatics.jax.org/go/term/GO:0007049. On the x-axis, genes are ordered according to their significance based on associationTest. The y-axis shows the ratio of the number of cell cycle genes among a set of top significant genes relative to the number of cell cycle genes one would expect by chance (i.e., if DE genes were randomly found/sampled). Specifically, if we let C denote the proportion of genes associated with the cell cycle according to the MGI database, then, under the hypothesis that cell cycle genes are randomly discovered as DE, the expected number of cell cycle genes in the list of top N genes is N C. The relative number that is plotted on the y-axis is then the ratio between the number of cell cycle genes discovered by tradeSeq for a given top list of size N and N C.