Abstract
In single-cell RNA sequencing (scRNA-Seq), gene expression is assessed individually for each cell, allowing the investigation of developmental processes, such as embryogenesis and cellular differentiation and regeneration, at unprecedented resolution. In such dynamic biological systems, cellular states form a continuum, e.g., for the differentiation of stem cells into mature cell types. This process is often represented via a trajectory in a reduced-dimensional representation of the scRNA-Seq dataset. While many methods have been suggested for trajectory inference, it is often unclear how to handle multiple biological groups or conditions, e.g., inferring and comparing the differentiation trajectories of wild-type and knockout stem cell populations. In this manuscript, we present condiments, a method for the inference and downstream interpretation of cell trajectories across multiple conditions. Our framework allows the interpretation of differences between conditions at the trajectory, cell population, and gene expression levels. We start by integrating datasets from multiple conditions into a single trajectory. By comparing the cells’ conditions along the trajectory’s path, we can detect large-scale changes, indicative of differential progression or fate selection. We also demonstrate how to detect subtler changes by finding genes that exhibit different behaviors between these conditions along a differentiation path.
Introduction
The emergence of RNA sequencing at the single-cell level (scRNA-Seq) has enabled a new degree of resolution in the study of cellular processes. The ability to consider biological processes as a continuous transition of cell states instead of individual discrete stages has permitted a finer and more comprehensive understanding of dynamic processes such as embryogenesis and cellular differentiation. Trajectory inference (TI) was one of the first applications that leveraged this continuum^{1} and a considerable number of methods have been proposed since then^{2,3,4}. Saelens et al.^{5} offer an extensive overview and comparison of such methods. Analysis of scRNA-Seq datasets using a curated database reveals that about half of all datasets were used for trajectory inference^{6}. At its core, TI represents a dynamic process as a directed graph. Distinct paths along this graph are called lineages. Individual cells are then projected onto these lineages and – given a root state (either user-provided or automatically detected) – their distance along each path is called pseudotime. In this setting, developmental processes are often represented in a tree structure, while cell cycles are represented as a loop. Following TI, other methods have been proposed to investigate differential expression (DE) along or between lineages, either as parts of TI methods^{3,7} or as separate modules that can be combined to create a full pipeline^{8}.
Spearheaded by technological and laboratory developments allowing increased throughput of scRNA-Seq studies, recent experiments study dynamic systems affected by different experimental perturbations or biological conditions. This includes, for example, situations where a biological process is studied both under a normal (or control) condition and under an intervention such as a treatment^{9,10,11} or a genetic modification^{12}. In other instances, one may want to contrast healthy versus diseased^{13} cells or even more than two conditions^{14}. In steady-state systems, a reasonable approach consists of clustering the cells into biologically relevant cell identities and assessing differential abundance, i.e., imbalances between conditions, for each cluster. However, in dynamic systems, clustering the cells and assessing condition imbalances for each cluster ignores the continuity of the cellular state transition process, making these approaches suboptimal. For example, borrowing from the field of mass cytometry^{15}, milo^{16} and DAseq^{17} rely on a common low-dimensional representation of all observations and define data-driven local neighborhoods in which they test for differences in composition. Each of these methods shows clear improvements in performance over cluster-based methods, which assess imbalances between conditions for each cell state/type cluster, and provides a more principled approach that better reflects the nature of the system.
However, many studies with multiple conditions actually involve processes that can be described by a trajectory. Utilizing this underlying biology could increase both the interpretability of the results and the ability to detect true and meaningful changes between conditions. In this manuscript, we present the condiments workflow, a general framework to analyze dynamic processes under multiple conditions that leverages the concept of a trajectory structure. While condiments has a more specific focus than milo or DAseq, it compensates for this by improving the quality of the differential abundance assessment with better-performing tests and simplifying and enhancing the biological interpretation by breaking down the comparison into several smaller and pertinent questions. Our proposed analysis workflow is divided into three steps. In Step 1, condiments considers the trajectory inference question, assessing whether the dynamic process is fundamentally different between conditions, which we call differential topology. In Step 2, it tests for global differences between conditions, both along lineages – differential progression – and between lineages – differential fate selection. Lastly, in Step 3, it estimates gene expression profiles similarly to Van den Berge et al.^{8} and tests whether gene expression patterns differ between conditions along lineages, therefore extending the scope of differential expression and improving on cluster-based methods such as MAST^{18}, scde^{18}, or pagoda2^{19}.
In this manuscript, we first present the condiments workflow, by detailing the underlying statistical model and providing an explanation and intuition for each step. We then benchmark condiments against more general methods that test for differential abundance to showcase how leveraging the existence of a trajectory improves the assessment of differential abundance. Finally, we demonstrate the flexibility and improved interpretability of the condiments workflow in three case studies that span a variety of biological settings and topologies.
Results
General model and workflow
To help visualize the workflow, we first simulated toy datasets that illustrate different scenarios that we will later encounter in our case studies. In Fig. 1, we have four examples, one per column, with two conditions each: a control condition and a knockout (KO) condition. The first example represents a setting where there are no differences at all between conditions. As we can see in the reduced-dimensional representation (first row), the orange and blue cells are similarly distributed. Therefore, we observe no differences, whether differential topology, differential progression, or differential fate selection. In the second and third examples, cells differentiate along the same structure so we have no differential topology. In the second example, however, we can see that there are nearly no orange cells at the beginning of the trajectory and that cells in the KO condition progress faster: this is an example of differential progression. In the third example, there are nearly no orange cells after the bifurcation in the bottom lineage and cells in the KO condition select the top lineage preferentially compared to the control: this is an example of differential fate selection. Note that, since lineages are not distinguishable before they bifurcate, we only see differential progression in the bottom lineage since orange cells only progress along that lineage until the bifurcation. In the fourth example, we can see that the top lineage is altered in the KO condition: this is an example of differential topology and we would infer separate trajectories. We can still compare the lineages and, after correcting for the differential topology, we see no differential progression or differential fate selection. Case studies described in a later section will give real-world examples of similar settings.
We now describe the data structure and statistical model. All mathematical symbols used in this section are recapitulated in Table 1. We observe gene expression measures for J genes in n cells, resulting in a J × n count matrix Y. For each cell i, we also know its condition label c(i) ∈ {1, …, C} (e.g., “treatment” or “control”, “knockout” or “wild-type”). We assume that, for each condition c, there is an underlying developmental structure \({\mathcal{T}}_c\), or trajectory, that possesses a set of L_{c} lineages.
For a given cell i with condition c(i), its position along the developmental path \({\mathcal{T}}_{c(i)}\) is defined by two vectors: a vector of L_{c(i)} pseudotimes T_{i}, which represent how far the cell has progressed along each lineage; and a unit-norm vector of L_{c(i)} weights W_{i} (∣∣W_{i}∣∣_{1} = 1), which represents how likely it is that the cell belongs to each lineage. That is, for each cell i, there is one pseudotime and one weight per lineage, with:

\[ T_i \,|\, c(i) = c \;\sim\; G_c, \qquad W_i \,|\, c(i) = c \;\sim\; H_c \qquad (1) \]
The (multivariate) cumulative distribution functions (CDF) G_{c} and H_{c} are condition-specific and we make limited assumptions on their properties (see the Methods section for details). Using this notation, we can properly define a trajectory inference (TI) method as a function that takes as input Y – and potentially other arguments (e.g., cell cluster labels) – and returns estimates of L_{c}, T, and W through the estimate for \({\mathcal{T}}_c\). That is, methods such as slingshot^{2} or monocle3^{20} use the count matrix to estimate the number of lineages, as well as the pseudotime and lineage assignments (or weights) of each cell.
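To make the notation concrete, here is a minimal sketch (in Python, with hypothetical values) of what a TI method returns under this model: a pseudotime matrix T and a weight matrix W, one row per cell and one column per lineage, with each row of W summing to one.

```python
import numpy as np

# Hypothetical TI output for n = 5 cells and L = 2 lineages,
# mirroring the T (pseudotimes) and W (weights) notation in the text.
T = np.array([[0.1, 0.1],   # pseudotime of each cell along each lineage
              [0.8, 0.4],
              [1.5, 0.4],
              [0.4, 1.2],
              [0.4, 2.0]])
W = np.array([[0.5, 0.5],   # lineage weights: how likely each cell is
              [0.9, 0.1],   # to belong to each lineage
              [1.0, 0.0],
              [0.2, 0.8],
              [0.0, 1.0]])

# The unit-norm constraint ||W_i||_1 = 1 holds for every cell i.
row_sums = W.sum(axis=1)
```

Cells before a bifurcation (first row) carry intermediate weights on several lineages, while fully committed cells (third and fifth rows) put all their weight on one lineage.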
The first question to ask in our workflow (Step 1) is: Should we fit a common trajectory to all cells regardless of their condition? Or are the developmental trajectories too dissimilar between conditions? To demonstrate what this means, consider two extremes. For a dataset that consists of a mix of bone marrow stem cells and epithelial stem cells, using tissue as our condition, it is obvious that the developmental trajectories of the two conditions are not identical and should be estimated separately. On the other hand, if we consider a dataset where only a few genes are differentially expressed between conditions, the impact on the developmental process will be minimal and it is sensible to estimate a single common trajectory.
We favor fitting a common trajectory for several reasons. Firstly, fitting a common trajectory is a more stable procedure since more cells are used to infer the trajectory. Secondly, our workflow still provides ways to test for differences between conditions even if a common trajectory is inferred. In particular, fitting a common trajectory between conditions does not require that cells of distinct conditions differentiate similarly along that trajectory. Finally, fitting different trajectories greatly complicates downstream analyses since we may need to map between distinct developmental structures in order to be able to compare them (e.g., each lineage in the first trajectory must match exactly one lineage in the second trajectory). Therefore, our workflow recommends fitting a common trajectory if the differences between conditions are small enough.
To quantify what small enough is, we rely on two approaches. The first is a visual diagnostic tool called imbalance score. It requires as input a reduced-dimensional representation X of the data Y and the condition labels. Each cell is assigned a score that measures the imbalance between the local and global distributions of condition labels. Similarly to Dann et al.^{16,21}, the neighborhood of a cell is defined using a k-nearest-neighbor graph on X, which allows the method to scale very well to large values of n. Cell-level scores are then locally scaled using smoothers in the reduced-dimensional space (see the Methods section).
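A simplified version of such a score can be sketched as follows. This toy implementation (hypothetical, not the condiments code) compares, for each cell, the local k-nearest-neighbour frequency of its condition label with the global frequency, omitting the final smoothing step described in the Methods section.

```python
import numpy as np
from scipy.spatial import cKDTree

def imbalance_score(X, conditions, k=10):
    """Toy imbalance score: absolute difference between the local
    (k-nearest-neighbour) and global frequencies of each cell's
    condition label. The published score additionally rescales the
    cell-level scores with local smoothers."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(conditions)
    _, nn = cKDTree(X).query(X, k=k)           # k nearest cells (incl. self)
    scores = np.empty(len(labels))
    for lab in np.unique(labels):
        is_lab = (labels == lab).astype(float)
        local = is_lab[nn].mean(axis=1)        # local frequency of `lab`
        mask = labels == lab
        scores[mask] = np.abs(local - is_lab.mean())[mask]
    return scores
```

Regions where one condition dominates its neighbourhood receive high scores; well-mixed regions score near zero.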
However, a visual representation of the scores may not always be sufficient to decide whether or not to fit a common trajectory in less obvious cases. Therefore, we introduce a more quantitative approach, the topologyTest. This test assesses whether we can reject the following null hypothesis:

\[ H_0: {\mathcal{T}}_1 = \cdots = {\mathcal{T}}_C \qquad (2) \]
We can test hypothesis (2) using the topologyTest, which involves comparing the actual distribution of pseudotimes for condition-specific trajectories to their distribution when condition labels are permuted (see the Methods section for details). Since we want to favor fitting a common trajectory and we only want to discover differences that are not only statistically significant but also biologically relevant, the tests typically include a minimum magnitude requirement for considering the difference between distributions to be significant (similar to a minimum log-fold-change for assessing DE). Examples of results for the topologyTest can be seen in Fig. 1. In the first column, there is no differential topology between the blue and orange conditions, while in the fourth column, the top lineage is different between the two conditions.
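The permutation logic can be sketched generically. In the snippet below, `stat_fn` is a placeholder for a statistic contrasting conditions; in the actual topologyTest, that statistic measures how far condition-specific pseudotimes diverge from those of the common trajectory.

```python
import numpy as np

def permutation_pvalue(stat_fn, data, labels, n_perm=500, seed=0):
    """Permutation-test skeleton: compare an observed between-condition
    statistic with its distribution under shuffled condition labels."""
    rng = np.random.default_rng(seed)
    obs = stat_fn(data, labels)
    perm = np.array([stat_fn(data, rng.permutation(labels))
                     for _ in range(n_perm)])
    # add-one correction so the p-value is never exactly zero
    return (1 + np.sum(perm >= obs)) / (1 + n_perm)

def mean_shift(x, lab):
    """Toy statistic: absolute difference in mean 'pseudotime'."""
    return abs(x[lab == 0].mean() - x[lab == 1].mean())
```

A large observed statistic relative to the permutation distribution yields a small p-value, i.e., evidence against a common topology.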
In practice, the topologyTest requires maintaining a mapping between each of the trajectories, both between conditions and between permutations (see the Methods section, where we define a mapping precisely). Trajectory inference remains a semi-supervised task that generally cannot be fully automated. In particular, the number of estimated lineages might change between different permutations for a given condition, precluding a mapping. As such, the topologyTest is only compatible with certain TI methods that allow for the specification of an underlying skeleton structure, including slingshot and TSCAN^{2,4}. More details and practical implementation considerations are discussed in the Methods section.
For Steps 2 and 3, we are no longer limited to specific TI methods. Any TI method can be used as input. We can then ask whether cells from different conditions behave similarly as they progress along the trajectory.
The workflow then focuses on differences between conditions at the trajectory level (Step 2) and asks: What are the global differences between conditions? To facilitate the interpretation of the results, we break this into two separate subquestions.
Although the topology might be common, cells from different conditions might progress at different rates along the trajectory (Step 2a). For example, a treatment might limit the cells’ differentiation potential compared to the control, or instead speed it up. In the first case, one would expect to have more cells at the early stages and fewer at the terminal state, when comparing treatment and control. In the second column of Fig. 1, we can see another example: there are more orange cells at the latest stages of the trajectory, compared to the blue cells. Using our statistical framework, testing for differential progression amounts to testing equality of the pseudotime distributions between conditions:

\[ H_0: G_1 = \cdots = G_C \qquad (3) \]
This test can also be conducted at the individual-lineage level, by comparing univariate distributions.
In order to assess the null hypotheses in the progressionTest, we rely on nonparametric tests to compare two or more distributions, e.g., the Kolmogorov-Smirnov test^{22} or the classifier test^{23}. More details and practical implementation considerations are discussed in the Methods section.
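For instance, with a single lineage and two conditions, the Kolmogorov-Smirnov variant of the test boils down to a standard two-sample comparison of pseudotime distributions. The snippet below uses hypothetical pseudotimes; the condiments implementation adds refinements such as weighting cells by lineage assignment.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
# Hypothetical pseudotimes along one lineage: knockout cells progress
# faster, shifting their distribution towards larger values.
pt_ctrl = rng.gamma(shape=2.0, scale=1.0, size=300)
pt_ko = rng.gamma(shape=2.0, scale=1.5, size=300)

# Two-sided two-sample Kolmogorov-Smirnov test of H0: G_ctrl = G_ko.
stat, pval = ks_2samp(pt_ctrl, pt_ko)
```

A small p-value indicates differential progression along that lineage.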
Although the topology might be common, cells in different conditions might also differentiate in varying proportions between the lineages (Step 2b). For example, an intervention might lead to cells selecting one lineage over another, compared to the control condition, or might alter survival rates of differentiated cells between two end states. In the third column of Fig. 1, we can see another example: there are nearly no orange cells at the later stage of the bottom lineage of the trajectory, compared to the blue cells. Cells in the orange condition are more likely to end up in the top lineage. In all examples above, the weight distribution will be different between the control and treatment. Assessing differential fate selection at the global level amounts to testing, in our statistical framework, the null hypothesis of equal weight distributions between conditions:

\[ H_0: H_1 = \cdots = H_C \qquad (4) \]
The above null hypotheses can again be tested by relying on nonparametric test statistics. We also discuss specific details and practical implementation in the Methods section. This test can also be conducted for a single pair of lineages \((l, l^{\prime})\).
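To build intuition, here is a deliberately coarse sketch of the idea: hard-assign each cell to its highest-weight lineage and cross-tabulate fate against condition. The actual fateSelectionTest compares the full weight distributions non-parametrically rather than using hard assignments.

```python
import numpy as np
from scipy.stats import chi2_contingency

def fate_table(weights, conditions):
    """Cross-tabulate hard fate choices (argmax of lineage weights)
    against condition labels."""
    fate = np.argmax(weights, axis=1)
    return np.array([[np.sum((conditions == c) & (fate == l))
                      for l in range(weights.shape[1])]
                     for c in np.unique(conditions)])

# Hypothetical weights: control splits evenly between two lineages,
# while the knockout favours lineage 1.
rng = np.random.default_rng(0)
w1 = rng.uniform(size=400)                 # weight on lineage 1
w1[200:] = np.clip(w1[200:] + 0.4, 0, 1)   # shift the knockout cells
weights = np.column_stack([w1, 1 - w1])
conditions = np.array(["ctrl"] * 200 + ["ko"] * 200)

table = fate_table(weights, conditions)
chi2_stat, pval, dof, expected = chi2_contingency(table)
```

A small chi-squared p-value here signals that fate proportions differ between conditions, the coarse analogue of rejecting the null of equal weight distributions.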
The progressionTest and fateSelectionTest are closely linked and will therefore often return similar results, as Fig. 1 shows. However, they do answer somewhat different questions. In particular, looking at single-lineage (progressionTest) and lineage-pair (fateSelectionTest) test statistics will allow for a better understanding of the global differences between conditions. Differential fate selection does not necessarily imply differential progression and vice versa. The simulations will show some examples of this.
Steps 1 and 2 focus on differences at a global level (i.e., aggregated over all genes) and will detect large changes between conditions. However, such major changes are usually ultimately driven by underlying differences in gene expression patterns (Step 3). Furthermore, even in the absence of global differences, conditions might still have a more subtle impact at the gene level. In the third step, we therefore compare gene expression patterns between conditions for each of the lineages.
Following the tradeSeq manuscript by Van den Berge et al.^{8}, we consider a general and flexible model for gene expression, where the gene expression measure Y_{ji} for gene j in cell i is modeled with the negative binomial generalized additive model (NB-GAM) described in Equation (11). We extend the tradeSeq model by additionally estimating condition-specific average gene expression profiles for each gene. We therefore rely on lineage-specific, gene-specific, and condition-specific smoothers, s_{jlc}.
With this notation, we can introduce the conditionTest, which, for a given gene j, tests the null hypothesis that these smoothers are identical across conditions:

\[ H_0: s_{jl1} = \cdots = s_{jlC} \quad \text{for all lineages } l \qquad (5) \]
As in tradeSeq, we rely on the Wald test to test H_{0} in terms of the smoothers’ regression coefficients. We can also use the fitted smoothers to visualize conditionspecific gene expression or cluster genes according to their expression patterns.
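The core of such a Wald test can be sketched as follows, assuming the difference between two conditions' smoother coefficient vectors and its covariance matrix have already been estimated (both are hypothetical inputs here; tradeSeq extracts them from the fitted NB-GAM).

```python
import numpy as np
from scipy.stats import chi2

def wald_test(beta_diff, cov_diff):
    """Wald statistic for H0: the difference between two smoother
    coefficient vectors is zero; asymptotically chi-squared with
    df = len(beta_diff) under the null."""
    beta_diff = np.asarray(beta_diff, dtype=float)
    stat = float(beta_diff @ np.linalg.solve(cov_diff, beta_diff))
    pval = float(chi2.sf(stat, df=len(beta_diff)))
    return stat, pval
```

When the coefficient difference is zero, the statistic is zero and the p-value is one; large standardized differences produce small p-values, flagging the gene as differentially expressed between conditions.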
Simulations
We generate multiple trajectories using the dyngen simulation framework provided by Cannoodt et al.^{24}. Within this framework, it is possible to knock out a specific gene. Here, we knock out a master regulator that drives differentiation into the second lineage. The strength of this knockout can be controlled via a multiplier parameter m: If m = 1, there is no effect; if m = 0, the knockout is total; if 0 < m < 1, we have partial knockout; if m > 1, the master regulator is overexpressed and cells differentiate much faster along the second lineage.
Four types of datasets are generated: a) simple branching trajectories (two lineages, e.g., Fig. 2a) of 3,500 cells, with equal parts wild-type and knockout; b) trajectories with two consecutive branchings (and thus three lineages, e.g., Fig. 2b) of 3,500 cells, with equal parts wild-type and knockout; c) branching trajectories (two lineages) of 5,000 cells with three conditions, wild-type, knockout with multiplier m, and induction with multiplier 1/m (Fig. 2c); d) more complex trajectories (five lineages, e.g., Fig. 2d) of 3,500 cells, with equal parts wild-type and knockout.
Since the simulation framework cannot generate trajectories with distinct topologies for the different conditions, we start the condiments workflow at Step 2. We use either slingshot or monocle3 upstream of condiments. We compare the progressionTest and fateSelectionTest to methods that also do not rely on clustering, but instead take into account the continuum of differentiation. milo^{16} and DAseq^{17} both define local neighborhoods using k-nearest-neighbor graphs. They then look at differences of condition proportions in these neighborhoods to test for what they call differential abundance. These methods return multiple tests per dataset (i.e., one per neighborhood), so we adjust for multiple hypothesis testing using the Benjamini-Hochberg procedure for control of the false discovery rate (FDR)^{25}. By applying milo, DAseq, and condiments on the simulated datasets, we can compare the results of the tests versus the values of m: we count a true positive when a test rejects the null and m ≠ 1, and a true negative when the test fails to reject the null and m = 1. Note that since monocle3 performs hard assignments of cells to lineages, the weight distributions are degenerate, and thus we cannot run the fateSelectionTest downstream of this method.
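The Benjamini-Hochberg adjustment used here is the standard step-up procedure; a minimal sketch:

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up FDR control)."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    # raw BH quantities p_(i) * n / i for the sorted p-values
    ranked = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity from the largest p-value down
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adj, 0, 1)
    return out
```

Rejecting all hypotheses whose adjusted p-value falls below 5% controls the FDR at that nominal level.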
We compare the methods’ ability to detect correct differences between conditions using five measures: the true negative rate (TNR), positive predictive value (PPV), true positive rate (TPR), negative predictive value (NPV), and F1-score, when controlling the FDR at a nominal level of 5%. More details on the simulation scenarios and performance measures can be found in the Methods section. Results are displayed in Fig. 2e.
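These five measures derive directly from confusion-matrix counts; a minimal sketch with hypothetical counts:

```python
def confusion_metrics(tp, fp, tn, fn):
    """TNR, PPV, TPR, NPV and F1-score from confusion-matrix counts."""
    tnr = tn / (tn + fp)              # true negative rate (specificity)
    ppv = tp / (tp + fp)              # positive predictive value (precision)
    tpr = tp / (tp + fn)              # true positive rate (power / recall)
    npv = tn / (tn + fn)              # negative predictive value
    f1 = 2 * ppv * tpr / (ppv + tpr)  # harmonic mean of PPV and TPR
    return {"TNR": tnr, "PPV": ppv, "TPR": tpr, "NPV": npv, "F1": f1}

# Hypothetical counts: 8 true positives, 2 false positives,
# 18 true negatives, 2 false negatives.
metrics = confusion_metrics(tp=8, fp=2, tn=18, fn=2)
```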
On all simulations, all methods display strong results for the TNR and PPV. However, the performances for the TPR (power), NPV, and F1-score vary quite widely. On the datasets with two or three lineages, the progressionTest, downstream of either slingshot or monocle3, performs the best, followed by the fateSelectionTest, DAseq, and milo. On the third simulation setting with three conditions, we cannot benchmark DAseq since its testing framework is restricted to two conditions. DAseq slightly outperforms the progressionTest on the dataset with 5 lineages, although the results are quite similar. On those datasets, the knockout only affects the tip of lineage 5 but not the proportion of cells that select that lineage. Therefore, the fact that the fateSelectionTest has no power (0.01) is expected. This demonstrates the increased flexibility and interpretability provided by the condiments workflow, while still outperforming competitors.
The dyngen framework does not generate trajectories with differential topology; therefore, the correct workflow would be to fit a common trajectory. We can nonetheless study the impact of fitting separate trajectories on the rest of the workflow. We therefore consider two settings: correctly fitting a common trajectory or incorrectly fitting separate trajectories, to represent the two possible outcomes of Step 1. We then compare the performance of the progressionTest and fateSelectionTest in the two settings. The test performances are identical for the progressionTest, and only slightly impacted in the fateSelectionTest case, when looking at a receiver operating characteristic (ROC) plot (see Fig. S5). The lower performance when incorrectly fitting separate trajectories may stem from the fact that we tend to have more false positives, which can be expected: fitting separate trajectories when it is not needed amounts to overfitting the data.
Even if the correct decision is taken after Step 1, the trajectory inference may still be imperfect. We therefore assess the performance of the progressionTest as we add increasing noise to the true simulated pseudotimes and lineage assignment values. Since the dyngen framework assumes hard lineage assignments, we cannot test the fateSelectionTest. With up to 60% of cells improperly assigned (noise level of 0.3) and 200% noise on pseudotime (see the Methods section for details), the progressionTest still correctly rejects the null at the 5% nominal significance level, showing that moderate errors in the trajectory inference have a small impact on the test outcome. See Fig. S4.
This is also reflected in the previous simulations of Fig. 2. While slingshot correctly identifies the proper number of lineages 98% of the time, monocle3 only does so 54% of the time. This is probably because the former allows for the specification of both start and endpoints, while the latter only allows for a start point. Nonetheless, the performance of the progressionTest following monocle3 is nearly identical to that downstream of slingshot.
We consider four real datasets as case studies for the application of the condiments workflow. Table 2 gives an overview of these datasets and summary results. These case studies aim to demonstrate the versatility and usefulness of the condiments workflow, as well as showcase how to interpret and use the tests in practice. The results for the TGFβ dataset and the Fibrosis dataset are presented below, while the results for the TCDD and KRAS datasets are presented in Supplementary Results. Preprocessing for all four datasets is described in Supplementary Methods^{26,27,28}.
TGFβ dataset
McFaline-Figueroa et al.^{9} studied the epithelial-to-mesenchymal transition (EMT), where cells migrate from the epithelium (inner part of the tissue culture dish) to the mesenchyme (outer part of the tissue culture dish) during development. The developmental process is therefore both temporal and spatial. Moreover, the authors studied this system under two settings: a mock (control) condition and a condition under activation of transforming growth factor β (TGFβ).
After preprocessing, normalization, and integration (see Supplementary Methods S1.3), we have a dataset of 9,268 cells, of which 5,207 are mock and 4,241 are TGFβ-activated. The dataset is represented in reduced dimension using UMAP^{29} (Fig. 3a). Adding the spatial label of the cells (Fig. 3b) shows that the reduced-dimensional representation of the gene expression data captures the differentiation process.
We can then run the condiments workflow, beginning with the differential topology question. The imbalance score of each cell is computed and displayed in Fig. 3c. Although some regions do display strong imbalance in conditions, there is no specific pattern along the developmental path. This is confirmed when we run the topologyTest, which has a nominal p-value of 0.38. We clearly fail to reject the null hypothesis and we consequently fit a common trajectory to both conditions using slingshot with the two spatial labels as clusters. The resulting single-lineage trajectory is shown in Fig. 3d.
Next, we ask whether the TGFβ treatment impacts the differentiation speed. The developmental stage of each cell is estimated using its pseudotime. Plotting the per-condition kernel density estimates of the pseudotime distributions in Fig. 3e reveals a strong treatment effect. The pseudotime distribution for the mock cells is trimodal, likely reflecting initial, intermediary, and terminal states. By contrast, the initial mode is not present in the TGFβ condition, and the second is skewed towards higher pseudotime values. This is very consistent with the fact that the treatment is a growth factor that would stimulate differentiation, as shown in the original publication and confirmed in the literature on EMT^{30}. Testing for equality of the two distributions with the progressionTest confirms the visual interpretation. The nominal p-value associated with the test is smaller than 2.2 × 10^{−16} and we reject the null that the distributions are identical. Since the trajectory is limited to one lineage, the fateSelectionTest is not applicable.
Then, we proceed to identifying genes whose expression patterns differ between the mock and TGFβ conditions. After gene filtering, we fit smoothers to 10,549 genes, relying on the model described in Equation (11). We test whether the smoothers are significantly different between conditions using the conditionTest. Testing against a log-fold-change threshold of 2, we find 1,993 genes that are dynamically differentially expressed between the two conditions when controlling the false discovery rate at a nominal level of 5%. Figure 3f shows the two genes with the highest Wald test statistic. The first gene, LAMC2, was also found to be differentially expressed in the original publication and has been shown to regulate EMT^{31}. The second gene, TGFBI or TGFβ-induced gene, is not surprising, and was also labelled as differentially expressed in the original publication. The DE genes confirm known biology around TGFβ signaling and EMT. For example, it is known that TGFβ signaling occurs through a complex of TGFβ receptor 1 (TGFBR1) and TGFβ receptor 2 (TGFBR2)^{32}, and both genes are found to be significantly upregulated after TGFβ treatment (Fig. S10). Looking at all 1,906 DE genes, we can cluster and display their expression patterns along the lineage for both conditions (Fig. 3g) and identify several groups of genes that have different patterns between the two conditions. These groups of genes may be further explored to better understand TGFβ-induced EMT. For example, genes that are differentially expressed later in the EMT may be further downstream targets of TGFβ signaling.
Finally, we perform a gene set enrichment analysis on all the genes that are differentially expressed between the conditions using fgsea^{33}. We find two significant gene sets that differentiate treatment and control: biological adhesion and locomotion. This is in full concordance with the biology: the TGFβ treatment accelerates the migration and differentiation of cells on the tissue culture dish. Indeed, the TGFβ-accelerated EMT requires epithelial cells to disband intercellular junctions and subsequently migrate to the mesenchyme.
We also reproduce the analysis using monocle3 as a trajectory inference method. We cannot perform Step 1 with monocle3. We use the results from Step 1 with slingshot and fit a common trajectory: monocle3 also finds a single-lineage developmental path (Fig. 4a). We can then test for differential progression and find the same result as with slingshot: we have clear differential progression (p-value < 2.2 × 10^{−16}). Likewise, we find that the Wald statistics from the conditionTest are very highly correlated (Pearson correlation coefficient of 0.97) when performed downstream of either slingshot or monocle3. Finally, using the same FDR nominal level as before, we find 1,613 DE genes downstream of monocle3, 92% of which were also deemed DE downstream of slingshot. Therefore, changing the trajectory inference method has no strong impact on the downstream analysis.
We can also compare the results produced by DAseq and milo on this dataset. Both methods find similar regions of differential abundance (Fig. 4b, c), but the results are harder to interpret: they are not linked to the trajectory, so the biological interpretation of the statistical results is more challenging.
Fibrosis dataset
Habermann et al.^{34} investigated chronic interstitial lung diseases (ILD) by sequencing 10 non-fibrotic control lungs and 20 ILD lungs. In its later stage, the disease progresses to pulmonary fibrosis, associated with epithelial tissue remodelling. The original paper uses scRNA-Seq to investigate how this remodelling plays out among many cell types. Here, we focus on a subset of the data comprised of 14,462 cells, with 5,405 control and 9,057 ILD cells. These data underlie Fig. 3 of the original paper and focus on two cell lineages, starting from, respectively, AT2 and SCGB3A2+ cells, which each differentiate independently into AT1 cells.
We use the preprocessed and normalized data from the original paper to obtain the reduced-dimensional representation depicted in Fig. 5a (colored by disease status) and Fig. 5b (colored by cell type as defined in the original work). The reduced-dimensional representation captures the differentiation process, with both AT2 cells and SCGB3A2+ cells differentiating into AT1 cells through an intermediate ‘transitional AT2’ stage. Moreover, the SCGB3A2+ cell type contains nearly no control cells.
To dive more deeply into the differences between control and ILD cells, we run the condiments workflow. The imbalance score, displayed in Fig. 5c, strongly validates the initial observation: the AT2-to-AT1 lineage has relatively low imbalance scores, while the SCGB3A2+ cluster shows strong imbalance. Interestingly, the transitional AT2 cluster shows no particular local imbalance, suggesting that it is indeed a common stage for both lineages, as the original paper demonstrated.
We then run the topologyTest downstream of slingshot (see the Methods section). On the full dataset, the test’s p-value is 3.0 × 10^{−12}: estimating the lineages from each condition separately produces trajectories that are far more divergent than what can be produced by permutation. This is expected, as only 3% of the cells in the SCGB3A2+ cluster come from the control group. When the trajectory is estimated with only the control cells, the SCGB3A2+ lineage is spuriously drawn towards the AT2 cells (see Fig. S11).
However, when running the topologyTest after removing the SCGB3A2+ cells (and therefore only focusing on the second lineage), we fail to reject the null (p-value of 1). This common lineage can thus be estimated using both conditions, while the first is only present in the ILD condition. We therefore fit a common trajectory to the full dataset, and we find the two lineages depicted in Fig. 5d.
We can then move to Step 2 of the condiments workflow. The fateSelectionTest gives clear results: its p-value is ≤ 2.2 × 10^{−16}. When we look at the distribution of lineage weights in Fig. 5e, we can see three clear peaks: at 0 (the cell does not belong to that lineage), 0.5 (the cell belongs to both lineages), and 1 (the cell only belongs to that lineage). Because the sum of weights over the two lineages equals 1 for each cell, the distribution of the weights along Lineage 2 is symmetrical to the distribution along Lineage 1. We can clearly see that, for the control condition, some cells belong to both lineages (weights of 0.5), but most cells belong only to Lineage 2 (weights of 0 for Lineage 1 and of 1 for Lineage 2). The fateSelectionTest provides a principled way of assessing this. Note that the test is named based on a setting where cells differentiate from a single root state into multiple endpoints. Here, since we have the reverse (multiple root states progressing to a common endpoint), it would be more appropriate to speak of differential origins. Looking at the distributions of pseudotimes reinforces the observation that few control cells belong to Lineage 1. On the first lineage, most of the control cells are at the beginning of the lineage, while the ILD cells are more evenly distributed, with a strong concentration at the end (see Fig. 5f). On the second (shared) lineage, there is some differential progression, but much less. Accordingly, the progressionTest’s p-value is ≤ 2.2 × 10^{−16}.
We can then proceed to Step 3 and look for genes with different expression patterns between conditions, within a lineage. We fit the smoothers to 10,100 genes. Testing against a log-fold-change threshold of 2, we find only 3 genes that are differentially expressed between conditions, according to the conditionTest, when controlling the FDR at a nominal level of 5%. This result might seem surprising at first, but it is expected. Indeed, if we had the reverse scenario, with a single root point and a branching into two lineages, we would expect many genes to be differentially expressed between conditions, thus driving the global differences we observed in Step 2. Here, however, all global differences come from the fact that there are only a handful of control cells in the SCGB3A2+ stage. Once we account for this, cells differentiate similarly. In the disease state, the AT2-to-AT1 differentiation path is unchanged compared to the normal state; the only difference is that AT1 cells can also originate from a new cell state (SCGB3A2+).
To compare the two possible origins of AT1 cells, we can look for genes that have strong dynamic differences between lineages, regardless of condition. Using the patternTest from tradeSeq, we find 1,589 DE genes when controlling the FDR at a nominal level of 5%. The gene with the strongest difference between the two lineages is SCGB3A2 (Fig. S12a). This gene was also listed in the original paper, as well as three other genes: SFTPC, ABCA3, and AGER (see Fig. S12b–d). While our workflow confirms all four candidate genes as DE, we additionally identify many more to be further investigated. To explore the relevance of these DE genes, we perform a gene set enrichment analysis on all genes that are differentially expressed between the lineages. The top enriched gene sets include adhesion, defense response, cell proliferation, epithelium development and morphogenesis, as well as immune response (Supplementary Table S1); these biological processes are all relevant to the system being studied.
Discussion
In this manuscript, we have introduced condiments, a full workflow to analyze dynamic systems under multiple conditions. By separating the analysis into several steps, condiments offers a flexible framework with increased interpretability. Indeed, we follow a natural progression through a top-down approach, by first studying overall between-condition differences in trajectories with the topologyTest, then differences in abundance at the trajectory level with the progressionTest and fateSelectionTest, and finally gene-level differences in expression with the conditionTest.
As demonstrated in the simulation studies, taking into account the dynamic nature of biological systems via the trajectory representation enables condiments to better detect true changes between conditions. The flexibility offered by our implementation, which provides multiple nonparametric tests for comparing distributions, also allows us to investigate a wide array of scenarios. This is evident in the four case studies presented in the manuscript. Indeed, in the TGFβ case study we have a developmental system under treatment and control conditions, while in the Fibrosis case study we compare a normal system and its disease counterpart. In the TCDD case study, the continuum does not represent a developmental process but rather spatial separation. In the KRAS case study, the conditions do not reflect different treatments but instead different cancer models. This shows that condiments can be used to analyze a wide range of datasets.
Taking batch effects into account can be difficult, particularly as the different conditions often also represent different batches. Indeed, some interventions cannot be delivered on a cell-by-cell basis, and this creates unavoidable confounding between batches and conditions. Normalization and integration of the datasets must therefore be done without eliminating the underlying biological signal. This balance can be hard to strike, as discussed in Zhao et al.^{17}. Proper experimental design – such as having several batches per condition – or limiting batch effects as much as possible – for example, sequencing a mix of conditions together – can help lessen this impact. In settings where there is no confounding of batches and conditions, condiments offers additional tools to better perform batch correction. Still, some amount of confounding is sometimes inherent to the nature of the problem under study.
Normalization can still remove most of the batch effects before performing trajectory inference. If the normalization is insufficient, this might lead to rejecting the null in the topologyTest, as seen in the KRAS case study. In that setting, it is still possible to perform the downstream analysis: fitting one trajectory per condition is in fact a way to account for the remaining batch effects and therefore solve the issue for Step 2. Finally, when fitting smoothers to gene expression measures in Step 3, it is possible to add covariates such as batch, if batches are not fully confounded with conditions.
The tests used in the workflow (e.g., Kolmogorov–Smirnov test) assume that the pseudotime and weight vectors are known and independent random variables for each cell. However, this is not the case; they are estimated using TI methods that use all samples to infer the trajectory, and each estimate inherently carries some uncertainty. Here, we ignore this dependence, as is done in other differential abundance methods, which assume that the reduced-dimensional coordinates are observed independent random variables even though they are estimated using the full dataset. We stress that, rather than attaching strong probabilistic interpretations to p-values (which, as in most RNA-Seq applications, would involve a variety of hard-to-verify assumptions and would not necessarily add much value to the analysis), we view the p-values produced by the condiments workflow as useful numerical summaries for guiding the decision to fit a common trajectory or condition-specific trajectories, for exploring trajectories across conditions, and for identifying genes for further inspection. Other methods have been presented that can generate p-values with a probabilistic interpretation^{35,36}. However, current methods are restricted to only one lineage, or scale poorly beyond a thousand cells, showing the need for further work in this domain.
Splitting the data into two groups, where the first is used to estimate the trajectory and the second is used for pseudotime and weight estimation, could, in theory, alleviate the dependence issue, at the cost of smaller sample sizes. However, this would ignore the fact that, in practice, users perform exploratory steps using the full data before performing the final integration, dimensionality reduction, and trajectory inference. Moreover, results on simulations show that all methods considered retain excellent control of the false discovery rate despite the violation of the independence assumptions. This issue of “double-dipping” therefore seems to have a limited impact in practice.
The two issues raised in the previous paragraphs highlight the need for independent benchmarking. Simulation frameworks such as dyngen^{24} are crucial. They also need to be complemented by real-world case studies, which will become easier as more and more datasets that study dynamic systems under multiple conditions are published. condiments has thus been developed to be a general and flexible workflow that will be of use to researchers asking complex and ever-changing questions.
Methods
Tests for equality of distributions
Consider a set of n i.i.d. observations, X, with X_{i} ~ P_{1}, and a second set of m i.i.d. observations, Y, with Y_{j} ~ P_{2}, independent from X. For example, in our setting, X and Y may represent estimated pseudotimes for cells from two different conditions. We limit ourselves to the case where X and Y are random vectors of the same dimension d.
The general goal is to test the null hypothesis that X and Y have the same distribution, i.e., H_{0}: P_{1} = P_{2}.
The two-sample and weighted Kolmogorov–Smirnov test
Consider the case where X and Y are scalar random variables (i.e., d = 1). The associated empirical cumulative distribution functions (ECDFs) are denoted, respectively, by F_{1,n} and F_{2,m}. The univariate case occurs, for example, when there is only one lineage in the trajectory(ies), so that the pseudotime estimates are scalars.
In this setting, one can test H_{0} using the standard Kolmogorov–Smirnov test^{22}, with test statistic defined as:
$${D}_{n,m}\equiv \mathop{\sup }\limits_{x}\left|{F}_{1,n}(x)-{F}_{2,m}(x)\right|.$$
The rejection region at nominal level α is
$$\left\{{D}_{n,m}\ge \sqrt{-\frac{1}{2}\log \frac{\alpha }{2}\times \frac{n+m}{n\times m}}\right\}.$$
That is, we reject the null hypothesis at the α-level if and only if D_{n,m} exceeds this threshold.
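As a concrete check of this rejection rule, the two-sample KS statistic and the closed-form threshold can be computed directly. The following is a minimal sketch on simulated pseudotimes using `scipy`; the draws and variable names are illustrative, not part of condiments:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)

# Illustrative pseudotimes for two conditions (not real data).
x = rng.normal(5.0, 1.0, size=300)   # condition 1
y = rng.normal(5.8, 1.0, size=200)   # condition 2 progresses further

# Two-sample KS statistic D_{n,m} and its asymptotic p-value.
res = ks_2samp(x, y)

# Closed-form rejection threshold at level alpha:
# reject iff D >= sqrt(-1/2 * log(alpha/2) * (n + m) / (n * m)).
n, m, alpha = len(x), len(y), 0.05
threshold = np.sqrt(-0.5 * np.log(alpha / 2) * (n + m) / (n * m))

print(res.statistic >= threshold, res.pvalue < alpha)
```

With these simulated shifts, both criteria agree and the null of identical progression is rejected.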
We can also consider a more general setting where we have weights w_{1,i} ∈ [0, 1] and w_{2,j} ∈ [0, 1] for each of the observations, i = 1, …, n and j = 1, …, m. In trajectory inference, the weights may denote the probability that a cell belongs to a particular lineage in the trajectory. Following Monahan^{37}, we modify the Kolmogorov–Smirnov test in two ways. Firstly, the empirical cumulative distribution functions are modified to account for the weights:
$${F}_{1,n}^{w}(x)\equiv \frac{\mathop{\sum }\nolimits_{i=1}^{n}{w}_{1,i}{\mathbb{1}}\{{X}_{i}\le x\}}{\mathop{\sum }\nolimits_{i=1}^{n}{w}_{1,i}},\qquad {F}_{2,m}^{w}(x)\equiv \frac{\mathop{\sum }\nolimits_{j=1}^{m}{w}_{2,j}{\mathbb{1}}\{{Y}_{j}\le x\}}{\mathop{\sum }\nolimits_{j=1}^{m}{w}_{2,j}}.$$
Secondly, the definition of D_{n,m} is unchanged, but the significance threshold is updated, that is, the rejection region is
where
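One natural form of the weighted modification, in which each observation contributes its lineage weight to a normalized ECDF, can be sketched as follows. This is a minimal Python illustration only; Monahan's adjusted rejection threshold is omitted:

```python
import numpy as np

def weighted_ecdf(x, w, grid):
    """ECDF in which each observation contributes its weight,
    normalized by the total weight (one natural reading of the
    modification described above; illustrative only)."""
    x, w = np.asarray(x), np.asarray(w)
    return np.array([w[x <= g].sum() for g in grid]) / w.sum()

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 400)    # pseudotimes, condition 1
y = rng.uniform(0, 10, 400)    # pseudotimes, condition 2
wx = rng.uniform(0, 1, 400)    # lineage weights in [0, 1]
wy = rng.uniform(0, 1, 400)

# D_{n,m} computed from the weight-normalized ECDFs.
grid = np.sort(np.concatenate([x, y]))
D = np.max(np.abs(weighted_ecdf(x, wx, grid) - weighted_ecdf(y, wy, grid)))
print(round(float(D), 3))
```

Since both sets of pseudotimes are drawn from the same distribution here, the resulting statistic is small.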
The multivariate classifier test
Suppose that we have a binary classifier δ( ⋅ ), which could be, for example, a multinomial regression or SVM classifier. This classifier is a function from the support of X and Y into {1, 2}. The data are first split into a learning and a test set, such that the test set contains n_{test} observations, equally drawn from each population, i.e., there are n_{test}/2 observations X^{(test)} from X and n_{test}/2 observations Y^{(test)} from Y. Next, the classifier is trained on the learning set. We denote by \(Acc\equiv \left|\{i:\delta ({{\bf{X}}}_{i}^{({\rm{test}})})=1\}\cup \{\,j:\delta ({{\bf{Y}}}_{j}^{({\rm{test}})})=2\}\right|\) the number of correct assignments made by the classifier on the test set.
If n = m, under the null hypothesis of identical distributions, no classifier will be able to perform better on the test set than a random assignment would, i.e., where the predicted label is a Bernoulli(1/2) random variable. Therefore, testing the equality of the distributions of X and Y can be formulated as testing
Under the null hypothesis, the distribution of Acc is:
$$Acc \sim {\rm{Binomial}}\left({n}_{{\rm{test}}},\frac{1}{2}\right).$$
As detailed in Lopez-Paz and Oquab^{23}, one can use the classifier to devise a test that guarantees control of the Type I error rate.
In practice, we make no assumptions about the way in which the distributions we want to compare might differ, which means that the classifier needs to be quite flexible. Following Lopez-Paz and Oquab^{23}, we choose to use either a k-nearest-neighbor (kNN) classifier or a Random Forests classifier^{38}, since such classifiers are fast and flexible. To avoid issues with class imbalance, we first downsample the distribution with the largest number of samples, so that each distribution has the same number of observations, i.e., we have \(n^{\prime} =\min (m,n)\) observations in each condition (or class). A fraction (by default 30%, user-defined) is kept as test data, so that \({n}_{{\rm{test}}}=0.3\times n^{\prime}\). We then train the classifier on the learning data, selecting the tuning parameters through cross-validation on that learning set. Finally, we predict the labels on the test set and compute the accuracy of the classifier on that test set. This yields our classifier test statistic.
It is interesting to note that the classifier test is valid regardless of the classifier chosen. However, the choice of classifier will have an obvious impact on the power of the test.
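The classifier test can be sketched end-to-end in a few lines. The following is a toy Python illustration with a fixed-k nearest-neighbor classifier and a one-sided binomial p-value; condiments additionally tunes hyperparameters by cross-validation and offers Random Forests:

```python
import math
import random

def knn_two_sample_test(x, y, test_frac=0.3, k=5, seed=0):
    """Toy classifier two-sample test: hold out a balanced test set,
    classify with k-NN, and compare the observed accuracy to its
    Binomial(n_test, 1/2) null (k fixed here; condiments tunes it)."""
    rng = random.Random(seed)
    n = min(len(x), len(y))                      # downsample larger class
    xs, ys = rng.sample(list(x), n), rng.sample(list(y), n)
    n_te = int(test_frac * n)
    test = [(v, 1) for v in xs[:n_te]] + [(v, 2) for v in ys[:n_te]]
    train = [(v, 1) for v in xs[n_te:]] + [(v, 2) for v in ys[n_te:]]

    def predict(v):
        nearest = sorted(train, key=lambda p: abs(p[0] - v))[:k]
        ones = sum(1 for _, lab in nearest if lab == 1)
        return 1 if ones > k / 2 else 2          # majority vote

    correct = sum(predict(v) == lab for v, lab in test)
    n_test = len(test)
    # p-value: P(Binomial(n_test, 1/2) >= observed number correct)
    pval = sum(math.comb(n_test, j)
               for j in range(correct, n_test + 1)) / 2.0 ** n_test
    return correct / n_test, pval

rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(200)]   # condition 1
y = [rng.gauss(2, 1) for _ in range(200)]   # condition 2, shifted
acc, pval = knn_two_sample_test(x, y)
print(acc > 0.6, pval < 0.01)
```

With well-separated toy distributions, the accuracy clearly exceeds the chance level of 1/2 and the null is rejected.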
Other multivariate methods
Although we have found that the classifier test performs best in practice, there are many methods that test for the equality of two multivariate distributions. We have implemented a few such methods in condiments, in case users would like to try them: the two-sample kernel test^{39}, the permutation test relying on the Wasserstein distance^{40,41}, and the distinct method^{42} (see descriptions in the Supplementary Methods S1.1).
Extending the setting to more than two distributions
Consider C ≥ 2 sets of samples, such that, for c ∈ {1, …, C}, we have n_{c} i.i.d. observations X^{(c)} with \({{\bf{X}}}_{i}^{(c)} \sim {{\bf{P}}}_{c}\). We want to test the null hypothesis:
$${H}_{0}:{{\bf{P}}}_{1}={{\bf{P}}}_{2}=\cdots ={{\bf{P}}}_{C}.$$
While extensions of the Kolmogorov–Smirnov test^{43} and the two-sample kernel test^{44} have been proposed, we choose to focus only on the framework that is most easily extended to C conditions, namely, the classifier test. Indeed, the C-condition classifier test requires choosing a multiple-class classifier instead of a binary classifier (which is the case for the kNN classifier and Random Forests), selecting n_{test}/C observations for each class in the test set, and testing:
Under the null hypothesis, the distribution of Acc is:
$$Acc \sim {\rm{Binomial}}\left({n}_{{\rm{test}}},\frac{1}{C}\right).$$
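Under this null, the p-value is simply the upper tail of the binomial distribution with success probability 1/C, by analogy with the two-class case. A minimal sketch (the helper name is illustrative, not the condiments API):

```python
from math import comb

def classifier_test_pvalue(correct, n_test, n_classes=2):
    """Upper tail of Binomial(n_test, 1/C) at the observed number of
    correct test-set assignments: the classifier-test p-value under
    the null of C identical distributions (illustrative helper)."""
    p0 = 1.0 / n_classes
    return sum(comb(n_test, j) * p0 ** j * (1 - p0) ** (n_test - j)
               for j in range(correct, n_test + 1))

# 50 correct out of 90 is far above the chance level of 30 for C = 3.
print(classifier_test_pvalue(50, 90, n_classes=3) < 0.001)
```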
Extending the setting by considering an effect size
The null hypothesis of the (weighted) Kolmogorov–Smirnov test is H_{0} : P_{1} = P_{2}. We can modify this null hypothesis by considering an effect size threshold t, such that \({H}_{0}(t):\mathop{\sup }\nolimits_{x}\left|{P}_{1}(x)-{P}_{2}(x)\right|\le t\). The test statistic is then modified as:
and the remainder of the testing procedure is left unchanged.
Similarly, the null and alternative hypotheses of the classifier test can be modified to test against an effect size threshold t as follows
General statistical model for the trajectories
Consider a set of condition labels c ∈ {1, …, C} (e.g., “treatment” or “control”, “knockout” or “wild-type”). For each condition, there is a given topology/trajectory \({{\mathcal{T}}}_{c}\) that underlies the developmental process. This topology generally takes the form of a tree, with a starting state that then differentiates along one or more lineages; but one can also have a circular graph, e.g., for the cell cycle. In general, a trajectory is defined as a directed graph.
We denote by L_{c} the number of unique paths – or lineages – in the trajectory \({{\mathcal{T}}}_{c}\). For example, for a tree structure, paths go from the root node (stem cell type) to the leaf nodes (differentiated cell types). For a cell cycle, any node can be used as the start. A cell i from condition c(i) is characterized by the following random variables:
Note that the distribution functions are conditionspecific. We further make the following assumptions:

All G_{c} and H_{c} distributions are continuous;

The support of G_{c} is bounded in \({{\mathbb{R}}}^{{L}_{c}}\) for each c;

The support of H_{c} is \({[0,1]}^{{L}_{c}}\) for each c.
The gene expression model will be discussed below, in the Testing for differential expression section.
Many algorithms have been developed to infer lineages from singlecell data^{5}. Most algorithms provide a binary indicator of lineage assignment, that is, the W_{i} vectors are composed of 0s and 1s, so that a cell either belongs to a lineage or it does not. (Note that when cells fall along a lineage prior to a branching event, this vector may include multiple 1s, violating our constraint that the W_{i} have unit norm. In such cases, we normalize the weights to sum to 1).
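The renormalization described above is a one-line operation. For instance, with hypothetical binary assignment weights for three cells and two lineages:

```python
import numpy as np

# Hypothetical binary lineage-assignment weights from a TI method:
# a cell located before the branching point belongs to both lineages.
W = np.array([[1.0, 1.0],   # undecided progenitor cell
              [1.0, 0.0],   # cell committed to lineage 1
              [0.0, 1.0]])  # cell committed to lineage 2

# Renormalize each cell's weights so that they sum to 1.
W_norm = W / W.sum(axis=1, keepdims=True)
print(W_norm[0].tolist())   # [0.5, 0.5]
```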
Mapping between trajectories
Many of the tests that we introduce below assume that the cells from different conditions follow “similar” trajectories. In practice, this means that we either have a common trajectory for all conditions or that there is a possible mapping from one lineage to another. The term “mapping” is more rigorously defined as follows.
Definition 1
The trajectories \(\{{{{{{{{{\mathcal{T}}}}}}}}}_{\!\!c}:c\in \{1,\ldots,C\}\}\) have a mapping if and only if ∀ (c_{1}, c_{2}) ∈ {1, …, C}^{2}, \({{{{{{{{\mathcal{T}}}}}}}}}_{{c}_{1}}\) and \({{{{{{{{\mathcal{T}}}}}}}}}_{{c}_{2}}\) are isomorphic.
If there is a mapping, this implies in particular that the number of lineages L_{c} per trajectory \({{\mathcal{T}}}_{c}\) is the same across all conditions c, and we call this common value L. Since a graph is always isomorphic to itself, a common trajectory is a special case of a situation where there is a mapping.
Definition 2
The trajectories \(\{{{{{{{{{\mathcal{T}}}}}}}}}_{\!\!c}:c\in \{1,\ldots,C\}\}\) have a partial mapping if and only if ∀ (c_{1}, c_{2}) ∈ {1, …, C}^{2}, there exist subgraphs \({{{{{{{{{\mathcal{T}}}}}}}}}^{{\prime} }}_{{c}_{1}}\subset {{{{{{{{\mathcal{T}}}}}}}}}_{{c}_{1}}\) and \({{{{{{{{{\mathcal{T}}}}}}}}}^{{\prime} }}_{{c}_{2}}\subset {{{{{{{{\mathcal{T}}}}}}}}}_{{c}_{2}}\) that are isomorphic.
Essentially, this means that the changes induced by the various conditions do not disturb the topology of the original trajectory too much. The above mathematical definitions aim to formalize what “too much” means. Indeed, if the conditions lead to very drastic changes, it will be quite obvious that the trajectories are different, and comparing them will mostly be either non-informative or will not require a complex framework. We aim to build a test that retains reasonable power in more subtle cases.
Imbalance score
Consider a set of n cells, with associated condition labels c(i) ∈ {1, …, C} and coordinate vectors X_{i} in d dimensions, usually corresponding to a reduceddimensional representation of the expression data obtained via PCA or UMAP^{29,45}, i = 1, …, n.
Let \({\bf{p}}={\{{p}_{c}\}}_{c\in \{1,\ldots,C\}}\) denote the “global” distribution of cell conditions, where p_{c} is the overall proportion of cells with label c in the sample of size n. The imbalance score of a cell reflects the deviation of the “local” distribution of conditions in a neighborhood of that cell from the global distribution p. Specifically, for each cell i, we compute its k-nearest-neighbor graph using the Euclidean distance in the reduced-dimensional space. We therefore have a set of k neighbors and a set of associated neighbor condition labels c_{i,κ} for κ ∈ {1, …, k}. We then assign to the cell a z-score, based on the multinomial test statistic \(P({\{{c}_{i,\kappa }\}}_{\kappa \in \{1,\ldots,k\}},{\bf{p}})\), as defined in supplementary section S1.2. Finally, we smooth the z-scores in the reduced-dimensional space by fitting s cubic splines for each dimension. The fitted values for each of the cells are the imbalance scores. Thus, the imbalance scores rely on two user-defined parameters, k and s. We set default values of 10 for both parameters. However, since this is meant to be an exploratory tool, we encourage users to try different values for these tuning parameters and observe the effect on the imbalance scores to better understand their data.
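This construction can be sketched for two conditions as follows. This is a simplified Python illustration only: the multinomial test statistic is replaced by a binomial z-score, and the final spline smoothing is omitted:

```python
import numpy as np

def imbalance_zscores(coords, labels, k=10):
    """Simplified imbalance score for two conditions: compare the local
    fraction of condition-1 cells among each cell's k nearest neighbors
    to the global fraction, expressed as a binomial z-score.
    (condiments uses a multinomial statistic for C conditions and then
    smooths the scores with cubic splines; both are omitted here.)"""
    coords = np.asarray(coords, dtype=float)
    is_c1 = np.asarray(labels) == 1
    p = is_c1.mean()                            # global proportion
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]     # k nearest neighbors
    local = is_c1[nn].mean(axis=1)              # local proportion
    return (local - p) / np.sqrt(p * (1 - p) / k)

rng = np.random.default_rng(3)
# Toy embedding in which condition 2 occupies a separate region.
coords = np.vstack([rng.normal([0, 0], 1, (100, 2)),
                    rng.normal([4, 0], 1, (100, 2))])
labels = np.array([1] * 100 + [2] * 100)

z = imbalance_zscores(coords, labels)
print(z[:100].mean() > 2, z[100:].mean() < -2)
```

In this toy embedding, cells in each cluster have strongly imbalanced neighborhoods, which the signed z-scores capture.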
Differential topology
The imbalance score only provides an exploratory visual assessment of local imbalances in the distribution of cell conditions. However, we need a more global and formal way to test for differences in topology between conditionspecific trajectories. That is, we wish to test the null hypothesis
In practice, in order to test H_{0}, we have a set of cells i with condition labels c(i). Not all trajectory inference methods define \({{{{{{{\mathcal{T}}}}}}}}\) in the same way, but they all do estimate pseudotimes. We can thus estimate the pseudotimes of each cell when fitting a trajectory for each condition. We then want to compare this distribution of pseudotimes to a null distribution of pseudotimes corresponding to a common topology. To generate this null distribution, we use permutations in the following manner:

a.
Estimate T_{i} for each i by inferring one trajectory per condition, using any trajectory inference method.

b.
Randomly permute the condition labels c(i) to obtain new labels \(c{(i)}^{\prime}\), and re-estimate \({{\bf{T}}^{\prime}}_{i}\) for each i.

c.
Repeat the permutation r times (by default, r = 100).
Under the null hypothesis of a common trajectory, the n vectors T_{i} should therefore be drawn from the same distribution as the r × n vectors \({{\bf{T}}^{\prime}}_{i}\). We can test this using the weighted Kolmogorov–Smirnov test (if L = 1), the kernel two-sample test (if C = 2), or the classifier test (any C). This is the topologyTest.
The aforementioned tests require that the samples be independent between the distributions under comparison. However, here, the two distributions correspond to different pseudotime estimates for the same cells so the samples are not independent. Even within distributions, the independence assumption is violated, as the pseudotimes are estimated using trajectory inference methods that rely on all samples. Moreover, within the \({{{{{{{{\bf{{T}}}}}}}^{{\prime} }}}}_{i}\), we have r pseudotime estimates of each cell.
The first two violations of the assumptions are hard to avoid and are further addressed in the Discussion section. However, we can eliminate the third one by simply taking the average \({{{{{{{{\bf{{T}}}}}}}^{{\prime} }}}}_{i}\) for each cell. We then compare two distributions each with n samples. Both options (with and without averaging) are implemented in the condiments R package, but the default is the average.
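The permutation scheme, with the default averaging of the r permutation estimates, can be sketched as follows. This is a toy illustration in which a trivial, hypothetical stand-in plays the role of the trajectory inference method (condiments uses slingshot or TSCAN):

```python
import numpy as np
from scipy.stats import ks_2samp

def infer_pseudotime(embedding):
    """Stand-in for a TI method: pseudotime is the position along the
    first embedding axis, measured from the group's own minimum.
    (Purely illustrative; not how slingshot or TSCAN work.)"""
    return embedding[:, 0] - embedding[:, 0].min()

rng = np.random.default_rng(7)
# Toy one-lineage embedding shared by two conditions (the null is true).
emb = np.column_stack([rng.uniform(0, 10, 300), rng.normal(0, 0.3, 300)])
cond = rng.integers(0, 2, 300)

# a. Fit one trajectory per condition to get the observed T_i.
t_obs = np.empty(300)
for c in (0, 1):
    t_obs[cond == c] = infer_pseudotime(emb[cond == c])

# b.-c. Permute the labels r times, refit, and average the r estimates
# T'_i per cell (the package default) to build the null distribution.
r = 100
t_perm = np.zeros(300)
for _ in range(r):
    perm = rng.permutation(cond)
    for c in (0, 1):
        t_perm[perm == c] += infer_pseudotime(emb[perm == c])
t_perm /= r

# Compare the two distributions; here L = 1, so a KS-type test applies.
print(ks_2samp(t_obs, t_perm).pvalue > 0.05)
```

Since both conditions truly share one trajectory in this toy example, the observed and permutation pseudotime distributions match and the null is not rejected.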
Under the null, there should exist a mapping between trajectories, both within conditions and between permutations. However, in practice, most trajectory inference methods will be too unstable to allow for automatic mapping between the runs. Indeed, they might find a different number of lineages for some runs. Moreover, even if the number of lineages and graph structure remained the same across all permutations, mapping between permutations would further violate the independence assumption, since the condition labels would need to be used.
Therefore, for now, the topologyTest is limited to two trajectory inference methods, slingshot^{2} and TSCAN^{4}, for which a set graph structure can be prespecified. Both methods rely on constructing a minimum spanning tree (MST) on the centroids of input clusters in a reduced-dimensional space to model branching lineages. In TSCAN, a cell’s pseudotime along a lineage is determined by its projection onto a particular branch of the tree, and its weight of assignment is determined by its distance from the branch. slingshot additionally fits simultaneous principal curves: a cell’s pseudotime along a lineage is determined by its projection onto a particular curve, and its weight of assignment is determined by its distance from the curve.
We therefore construct the MST on the full dataset (i.e., using all the cells regardless of their condition label), based on user-defined cluster labels. Then, we keep the same graph structure as input to either TI method: the nodes are the centers of the clusters, but restricted to cells of a given condition. This way, the path and graph structure are preserved. Note, however, that there is no guarantee that the graph remains the MST when it is used for TI on a subset of cells.
In the examples from Fig. S1, the skeleton of the trajectory is represented by a series of nodes and edges. In examples S1b–d, the knockout (KO) has no impact on this skeleton compared to the wild-type (WT). In example S1e, the knockout modifies the skeleton, in that the locations of the nodes change. However, the adjacency matrix does not change and the two skeletons represent isomorphic graphs, so the skeleton structure is preserved.
A common skeleton structure can also be used if the null hypothesis of the topologyTest is rejected: even when separate trajectories must be fitted for each condition, the common skeleton provides a mapping between their lineages, so the next steps of the workflow can be conducted as if we had failed to reject the null hypothesis, as done in Fig. S1e. The KRAS case study (supplementary section S2.8) presents an example of this. In cases where no common skeleton structure exists, such as Fig. S1f, no automatic mapping exists; differential abundance can still be assessed, but requires a manual mapping, and differential expression can still be conducted as well.
Testing for differential progression
The progressionTest requires that a (partial) mapping exist between trajectories, or that a common trajectory be fitted across conditions. If the mapping is only partial, we restrict ourselves to the mappable parts of the trajectories (i.e., subgraphs).
For a given lineage l, we want to test the null hypothesis that the pseudotimes along the lineage are identically distributed between conditions, which we call identical progression. Specifically, we want to test that the l^{th} components G_{lc} of the distribution functions G_{c} are identical across conditions
We can also test for global differences across all lineages, that is,
If C = 2, all tests introduced at the beginning of the Methods section can be used to test the hypothesis in Equation (7). If C > 2, we need to rely on the classifier test.
If L = 1, the hypotheses in Equations (7) and (8) are identical. However, for L > 1, the functions G_{c} are not univariate distributions and this requires different testing procedures.
For L > 1, we can use observational weights for each cell corresponding to the probability that it belongs to each individual lineage. Two settings are possible.

Test the null hypothesis in Equation (7) for each lineage using the Kolmogorov–Smirnov test and perform a global test using the classifier test or the kernel two-sample test.

Test the null hypothesis in Equation (7) for each lineage using the Kolmogorov–Smirnov test and combine the p-values p_{l} for each lineage l using Stouffer’s Z-score method^{46}, where each lineage is associated with observational weights \({W}_{l}=\mathop{\sum }\nolimits_{i=1}^{n}{{\bf{W}}}_{i}[l]\), with W_{i}[l] the l^{th} coordinate of the vector W_{i}. Denoting by Φ the standard normal distribution function and by \({Z}_{l}={\Phi }^{-1}(1-{p}_{l})\) the Z-score associated with p_{l}, the nominal p-value associated with the global test is then
$${p}_{glob}\equiv 1-\Phi \left(\frac{\mathop{\sum }\nolimits_{l=1}^{L}{W}_{l}{Z}_{l}}{\sqrt{\mathop{\sum }\nolimits_{l=1}^{L}{W}_{l}^{2}}}\right).$$
Note that the second setting violates the assumption of Stouffer’s Z-score method, since the p-values are not independent. However, this violation does not seem to matter in practice, and this test outperforms the others, so we set it as the default.
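The weighted Stouffer combination can be sketched in a few lines. This is a minimal Python version; `weighted_stouffer` is an illustrative helper, not the condiments API:

```python
import numpy as np
from scipy.stats import norm

def weighted_stouffer(pvals, weights):
    """Weighted Stouffer combination: transform each per-lineage
    p-value to a Z-score, take the weight-normalized sum, and map the
    combined Z-score back to a single p-value (illustrative helper)."""
    z = norm.isf(np.asarray(pvals))          # Z_l = Phi^{-1}(1 - p_l)
    w = np.asarray(weights, dtype=float)     # W_l = total lineage weight
    z_glob = (w * z).sum() / np.sqrt((w ** 2).sum())
    return float(norm.sf(z_glob))

# Lineage 1 carries most of the cells and a small p-value,
# so it dominates the combination.
print(weighted_stouffer([0.001, 0.40], [300.0, 50.0]) < 0.01)
```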
Testing for differential fate selection
The fateSelectionTest requires that a (partial) mapping exist between trajectories. If the mapping is only partial, we restrict ourselves to the mappable parts of the trajectories.
For a given pair of lineages \(\{l,{l}^{{\prime} }\}\), we want to test the null hypothesis that the cells differentiate between l and \({l}^{{\prime} }\) in the same way between all conditions, which we call identical fate selection. Specifically, we want to test that the l^{th} and \({l}^{{\prime} th}\) components of the distribution functions H_{c} for the weights are the same across conditions
We can also test for a global difference across all pairs of lineages, that is,
Since all variables are multivariate, we cannot use the Kolmogorov–Smirnov test. By default, this test relies on the classifier test with Random Forests as the classifier.
Testing for differential expression
The gene expression model does not require a mapping or even a partial mapping between trajectories. Indeed, it can work equally well with a common trajectory, different trajectories, or even partially shared trajectories where some lineages can be mapped between the trajectories for various conditions and others cannot. To reflect this, we consider all L_{tot} lineages together. We introduce a new weight vector for each cell:
where \({\mathcal{M}}({{\bf{W}}}_{i})\) is a binary (or one-hot) encoding representation of a multinomial distribution with proportions W_{i}, as in tradeSeq, i.e., \({\mathbb{P}}({Z}_{ilc(i)}=1)={{\bf{W}}}_{i}[l]\) if \(l\in {{\mathcal{T}}}_{c(i)}\), where \(l\in {{\mathcal{T}}}_{c(i)}\) is an abuse of notation denoting that lineage l belongs to the trajectory of condition c(i).
Likewise, we modify the pseudotime vectors to have length L_{tot}, such that
We adapt the model from Van den Berge et al.^{8} to allow for conditionspecific expression. For a given gene j, the expression measure Y_{ji} for that gene in cell i can be modeled as:
where the mean μ_{ji} of the negative binomial distribution is linked to the additive predictor η_{ji} using a logarithmic link function. The U matrix represents an additional design matrix, corresponding, for example, to a batch effect, and N_{i} represents the sequencing depth of cell i, i.e., \({N}_{i}={\sum }_{j}{Y}_{ji}\).
The model relies on lineagespecific, genespecific, and conditionspecific smoothers s_{jlc}, which are linear combinations of K cubic basis functions, \({s}_{jlc}(t)=\mathop{\sum }\nolimits_{k=1}^{K}{b}_{k}(t){\beta }_{jlck}\).
With this notation, we can introduce the conditionTest, which, for a given gene j, tests the null hypothesis that the smoothers are identical across conditions:
We fit the model using the mgcv package^{47} and test the null hypothesis using a Wald test for each gene. Note that, although the gene expression model can be fitted without any mapping, the conditionTest is only defined for lineages that are mapped across at least two conditions.
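For a given gene, the Wald test compares condition-specific smoother coefficients through a contrast matrix. A minimal numerical sketch, with hypothetical coefficients and covariance (not the mgcv/tradeSeq machinery):

```python
import numpy as np
from scipy.stats import chi2

def wald_test(beta, Sigma, C):
    """Wald statistic for H0: C @ beta = 0 (beta: estimated smoother
    coefficients, Sigma: their covariance, C: contrast matrix)."""
    est = C @ beta
    stat = float(est @ np.linalg.solve(C @ Sigma @ C.T, est))
    df = C.shape[0]
    return stat, chi2.sf(stat, df)

# Hypothetical gene: K = 3 basis coefficients per condition, stacked as
# (condition 1, condition 2); Sigma is a made-up coefficient covariance.
beta = np.array([1.0, 2.0, 1.5, 1.1, 2.6, 1.4])
Sigma = 0.01 * np.eye(6)
# Contrast: beta_k(cond 1) - beta_k(cond 2) = 0 for every k.
C = np.hstack([np.eye(3), -np.eye(3)])
W, pval = wald_test(beta, Sigma, C)
print(W, pval)  # W = 19.0 on 3 degrees of freedom
```

Under the null hypothesis of identical smoothers, the statistic follows a chi-squared distribution with degrees of freedom equal to the number of contrasts.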
Furthermore, rather than attaching strong probabilistic interpretations to p-values (which, as in most RNA-seq applications, would involve a variety of hard-to-verify assumptions and would not necessarily add much value to the analysis), we view the p-values produced by the condiments workflow simply as useful numerical summaries for exploring trajectories across conditions and identifying genes for further inspection.
Simulation study
The simulation study relies on the dyngen framework of Cannoodt et al.^{24} and all datasets are simulated as follows. 1/ A common trajectory is generated, with an underlying gene network that drives differentiation along the trajectory. 2/ A set of N_{WT} cells belonging to the wild-type condition (i.e., with no modification of the gene network) is generated. 3/ One master regulator that drives differentiation into one of the lineages is perturbed, by multiplying the wild-type expression rate of that gene by a factor m. If m = 1, there is no effect; if m > 1, the gene is overexpressed; and if m < 1, the gene is underexpressed, with m = 0 amounting to a total knockout. 4/ A set of N_{KO} = N_{WT} knockout cells is generated using the common trajectory with the modified gene network. 5/ A common reduced-dimensional representation is computed using multidimensional scaling.
We generate four types of datasets, over a range of values of m: a simple trajectory with L = 2 lineages and C = 2 conditions (WT and KO), denoted by \({\mathcal{T}}_{1}\); a trajectory with two consecutive branchings, with L = 3 lineages and C = 2 conditions (WT and KO), denoted by \({\mathcal{T}}_{2}\); a simple trajectory with L = 2 lineages and C = 3 conditions (WT, KO, and UP), denoted by \({\mathcal{T}}_{3}\) (for the latter, dyngen Steps 3-4/ are repeated twice, with value m for KO and 1/m for UP); and a trajectory with two consecutive branchings, with L = 5 lineages and C = 2 conditions (WT and KO), denoted by \({\mathcal{T}}_{4}\).
For \({\mathcal{T}}_{1}\) and \({\mathcal{T}}_{2}\), we use values of m ∈ {0.5, 0.8, 0.9, 0.95, 1, 1/0.95, 1/0.9, 1/0.8, 1/0.5}, such that, at the extremes, the KO cells fully ignore some lineages. The values 0.95 and 1/0.95 represent the settings closest to no condition effect (m = 1) for which the effect was still detected by some tests. For \({\mathcal{T}}_{3}\), since the simulation is symmetric in m, we pick m ∈ {0.5, 0.8, 0.9, 0.95, 1}. For \({\mathcal{T}}_{4}\), we use values of m ∈ {0.1, 0.2, 0.3, 0.5, 1, 2, 3, 5, 10}.
We generate one large dataset per value of m and per trajectory type. We use these large datasets to generate smaller ones of size n, by sampling 10% of the cells from each condition 50 times and applying the various tests to the smaller datasets. The reason for first generating a large dataset and then smaller ones by subsampling, instead of generating small ones directly, is computational: generating the datasets is time-consuming, and the part that scales with N_{WT} can be parallelized. Hence, with dyngen, it is almost as fast to generate a large dataset as a small one. We pick N_{WT} = 20,000 (for the large dataset) and thus n = 2000.
Since we generate many datasets with a true effect (m ≠ 1) but only one null dataset, N_{WT} for m = 1 is doubled to 40,000. To keep the subsamples comparable, the fraction of cells sampled is decreased to 5%, so that n = 2000, and we perform 100 subsamplings. Table 3 recapitulates these settings.
To run the condiments workflow, we first estimate the trajectories using slingshot with the clusters provided by dyngen. Then, we run the progressionTest and the fateSelectionTest with default arguments.
We compare condiments to two other methods, milo^{16} and DAseq^{17}, which both look at differences in condition proportions within local neighborhoods, using k-nearest-neighbor graphs to define this locality. milo uses a negative binomial GLM to compare counts for each neighborhood, while DAseq uses a logistic classifier test. Both methods therefore test for differential abundance in multiple regions. To account for multiple testing, we adjust the p-values using the Benjamini-Hochberg^{25} FDR-controlling procedure.
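For reference, the Benjamini-Hochberg adjustment can be sketched in a few lines (this mirrors R's p.adjust(method = "BH"); the p-values below are made up):

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values: p_(i) * m / i, with a running
    minimum from the largest p-value down to enforce monotonicity."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    adj = np.minimum.accumulate(ranked[::-1])[::-1].clip(max=1.0)
    out = np.empty(m)
    out[order] = adj
    return out

adj = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.9])
print(adj)
```

With these inputs, only the two smallest adjusted p-values fall below the 0.05 cutoff, so rejecting those tests controls the FDR at the 5% level.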
We select an adjusted p-value cutoff of 0.05, which amounts to controlling the FDR at nominal level 5%. For a given test, we can look at its results on all simulated datasets for all values of m. The number of true positives (TP) is the number of simulated datasets where m ≠ 1 and the adjusted p-value is less than 0.05; the number of true negatives (TN) is the number of simulated datasets where m = 1 and the adjusted p-value is greater than or equal to 0.05; the number of false positives (FP) is the number of simulated datasets where m = 1 and the adjusted p-value is less than 0.05; and the number of false negatives (FN) is the number of simulated datasets where m ≠ 1 and the adjusted p-value is greater than or equal to 0.05. We then examine five measures built on these four counts:
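A small sketch of how these counts are tallied at the 0.05 cutoff (illustrative adjusted p-values and multipliers; only two common examples of derived measures, sensitivity and the type I error rate, are shown, since the full set of five is defined above):

```python
import numpy as np

def confusion_counts(pvals_adj, m_values, alpha=0.05):
    """Count TP/TN/FP/FN over simulated datasets: a 'positive' is an
    adjusted p-value below alpha; ground truth is whether m != 1."""
    pos = np.asarray(pvals_adj) < alpha
    effect = np.asarray(m_values) != 1
    TP = int(np.sum(pos & effect))
    TN = int(np.sum(~pos & ~effect))
    FP = int(np.sum(pos & ~effect))
    FN = int(np.sum(~pos & effect))
    return TP, TN, FP, FN

# Hypothetical results: four datasets with an effect, two null datasets.
padj = [0.01, 0.03, 0.20, 0.04, 0.60, 0.02]
m = [0.5, 0.8, 0.95, 2.0, 1.0, 1.0]
TP, TN, FP, FN = confusion_counts(padj, m)
tpr = TP / (TP + FN)  # power / sensitivity
fpr = FP / (FP + TN)  # type I error rate
print(TP, TN, FP, FN, tpr, fpr)
```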
We also sought to generate incorrect inferences as follows. We generate a dataset of type \({\mathcal{T}}_{1}\) with a multiplier effect m = 0.85. We use the true pseudotimes and lineage assignments, as defined by dyngen, and add increasing noise along two dimensions: pseudotime and lineage assignment. For the latter, we randomly select a proportion p of cells and change their lineage assignments; p = 0 means that we use the true lineage assignments, while p = 0.5 means that the lineage assignments are fully random. For the former, we add random Gaussian noise such that, for a cell i with pseudotime \({\mathbf{T}}_{i}\), we return a new pseudotime \({\tilde{\mathbf{T}}}_{i}={\mathbf{T}}_{i}\times {\mathcal{N}}_{2}({\mathbb{I}}_{2},sd\times {\mathbb{I}}_{2})\), i.e., each pseudotime coordinate is multiplied by independent Gaussian noise with mean 1 and standard deviation sd. We choose p ∈ {0, 0.1, 0.2, 0.3, 0.4, 0.5} and sd ∈ {0, 0.1, …, 4}. For each value of p and sd, we perform 100 runs and select the median p-value.
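The noise-injection scheme can be sketched as follows (a Python illustration with hypothetical data; in the study, the true pseudotimes and assignments come from dyngen):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sd = 1000, 0.2, 0.5

# Hypothetical ground truth: pseudotimes over two lineages and a binary
# lineage assignment per cell.
T = rng.uniform(0.0, 10.0, size=(n, 2))
lineage = rng.integers(0, 2, size=n)

# Lineage-assignment noise: change the assignment of a random proportion p
# of cells; at p = 0.5, half the labels are flipped, so the assignments
# carry no information about the truth.
idx = rng.choice(n, size=int(p * n), replace=False)
noisy_lineage = lineage.copy()
noisy_lineage[idx] = 1 - noisy_lineage[idx]
frac_changed = float(np.mean(noisy_lineage != lineage))

# Pseudotime noise: multiply each coordinate by independent Gaussian noise
# with mean 1 and standard deviation sd.
T_noisy = T * rng.normal(1.0, sd, size=T.shape)

print(frac_changed, T_noisy.shape)
```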
Version of tools used
The following versions were used:

condiments: 1.4.0
tradeSeq: 1.10.0
slingshot: 2.4.0
monocle3: 1.2.9
DAseq: 1.0.0
miloR: 1.4.0
dyngen: 1.0.3
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The results from this paper can be fully reproduced by following along the vignettes at https://hectorrdb.github.io/condimentsPaper. Code to reproduce the datasets used for the simulation study, as well as processed versions of all four datasets used in the case studies, augmented by metadata, are available at https://github.com/HectorRDB/condimentsPaper/tree/main/data. Functions to recreate the processed versions, using raw counts obtained from GEO (TGFB dataset: GSE114687, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE114687; TCDD dataset: GSE148339, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE148339; KRAS dataset: GSE137912, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE137912; Fibrosis dataset: GSE135893, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE135893) are also provided.
Code availability
The condiments workflow is available as an R package from GitHub (https://github.com/HectorRDB/condiments^{48}) and through the Bioconductor Project (https://www.bioconductor.org/packages/release/bioc/html/condiments.html). All the methods to test for equality of two (or k) distributions were implemented for use by others in an R package called Ecume, available through CRAN (https://CRAN.R-project.org) and documented at https://hectorrdb.github.io/Ecume.
References
Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381–386 (2014).
Street, K. et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genom. 19, https://doi.org/10.1186/s12864-018-4772-0 (2018).
Lönnberg, T. et al. Single-cell RNA-seq and computational analysis using temporal mixture modeling resolves TH1/TFH fate bifurcation in malaria. Sci. Immunol. 2 (2017).
Ji, Z. & Ji, H. TSCAN: pseudotime reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res. 44, e117 (2016).
Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 37, 547–554 (2019).
Svensson, V., da Veiga Beltrame, E. & Pachter, L. A curated database reveals trends in single-cell transcriptomics. Database: J. Biol. Databases Curation https://doi.org/10.1093/database/baaa073 (2020).
Qiu, X. et al. Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods 14, 979–982 (2017).
Van den Berge, K. et al. Trajectory-based differential expression analysis for single-cell sequencing data. Nat. Commun. 11, 1201 (2020).
McFaline-Figueroa, J. L. et al. A pooled single-cell genetic screen identifies regulatory checkpoints in the continuum of the epithelial-to-mesenchymal transition. Nat. Genet. 51, 1389–1398 (2019).
Nault, R., Fader, K. A., Bhattacharya, S. & Zacharewski, T. R. Single-nuclei RNA sequencing assessment of the hepatic effects of 2,3,7,8-tetrachlorodibenzo-p-dioxin. CMGH 11, 147–159 (2021).
Wells, K. L. et al. Combined transient ablation and single-cell RNA-sequencing reveals the development of medullary thymic epithelial cells. eLife 9, e60188 (2020).
Parisian, A. D. et al. SMARCB1 loss interacts with neuronal differentiation state to block maturation and impact cell stability. Genes Dev. 34, 1316–1329 (2020).
Ramachandran, P. et al. Resolving the fibrotic niche of human liver cirrhosis at single-cell level. Nature 575, 512–518 (2019).
BaranGale, J. et al. Ageing compromises mouse thymus function and remodels epithelial cell differentiation. eLife 9, 1–71 (2020).
Lun, A. T. L., Richard, A. C. & Marioni, J. C. Testing for differential abundance in mass cytometry data. Nat. Methods 14, 707–709 (2017).
Dann, E., Henderson, N. C., Teichmann, S. A., Morgan, M. D. & Marioni, J. C. Differential abundance testing on single-cell data using k-nearest neighbor graphs. Nat. Biotechnol. 40, 245–253 (2022).
Zhao, J. et al. Detection of differentially abundant cell subpopulations in scRNA-seq data. Proc. Natl. Acad. Sci. 118, e2100293118 (2021).
Kharchenko, P. V., Silberstein, L. & Scadden, D. T. Bayesian approach to single-cell differential expression analysis. Nat. Methods 11, 740–742 (2014).
Barkas, N., Petukhov, V., Kharchenko, P. & Biederstedt, E. pagoda2: Single Cell Analysis and Differential Expression. R package version 1.0.5, https://CRAN.R-project.org/package=pagoda2 (2021).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e21 (2019).
Burkhardt, D. B. et al. Quantifying the effect of experimental perturbations at single-cell resolution. Nat. Biotechnol. 39, 619–629 (2021).
Smirnov, N. V. On the estimation of the discrepancy between empirical curves of distribution for two independent samples. Bull. Math. Univ. Moscou 2, 3–14 (1939).
Lopez-Paz, D. & Oquab, M. Revisiting classifier two-sample tests. Preprint at http://arxiv.org/abs/1610.06545 (2016).
Cannoodt, R., Saelens, W., Deconinck, L. & Saeys, Y. Spearheading future omics analyses using dyngen, a multi-modal simulator of single cells. Nat. Commun. 12, 1–9 (2021).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B (Methodological) 57, 289–300 (1995).
Hafemeister, C. & Satija, R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 20, 296 (2019).
Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J.-P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun. 9, 284 (2018).
Halpern, K. B. et al. Single-cell spatial reconstruction reveals global division of labour in the mammalian liver. Nature 542, 1–5 (2017).
Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2019).
Xu, J., Lamouille, S. & Derynck, R. TGF-β-induced epithelial to mesenchymal transition. Cell Res. 19, 156–172 (2009).
Pei, Y.-F., Liu, J., Cheng, J., Wu, W.-D. & Liu, X.-Q. Silencing of LAMC2 reverses epithelial-mesenchymal transition and inhibits angiogenesis in cholangiocarcinoma via inactivation of the epidermal growth factor receptor signaling pathway. Am. J. Pathol. 189, 1637–1653 (2019).
Feng, X.-H. & Derynck, R. Specificity and versatility in TGF-β signaling through Smads. Annu. Rev. Cell Dev. Biol. 21 (2005). https://www.annualreviews.org/doi/abs/10.1146/annurev.cellbio.21.022404.142018.
Korotkevich, G., Sukhov, V. & Sergushichev, A. Fast gene set enrichment analysis. Preprint at bioRxiv https://doi.org/10.1101/060012 (2016).
Habermann, A. C. et al. Single-cell RNA sequencing reveals profibrotic roles of distinct epithelial and mesenchymal lineages in pulmonary fibrosis. Sci. Adv. 6, https://doi.org/10.1126/sciadv.aba1972 (2020).
Hou, W. et al. A statistical framework for differential pseudotime analysis with multiple single-cell RNA-seq samples. Nat. Commun. 14, 7286 (2023).
Song, D. & Li, J. J. PseudotimeDE: inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data. Genome Biol. 22, 1–25 (2021).
Monahan, J. F. Numerical Methods of Statistics 2nd edn (Cambridge University Press, 2011). https://doi.org/10.1017/CBO9780511977176.
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Gretton, A., Borgwardt, K., Rasch, M. J., Schölkopf, B. & Smola, A. A kernel two-sample test. J. Mach. Learn. Res. 13, 723–773 (2012).
Wasserstein, L. N. Markov processes over denumerable products of spaces describing large systems of automata. Problems Inform. Trans. 5, 47–52 (1969).
Dobrushin, R. L. Prescribing a system of random variables by conditional distributions. Theory Probab. Appl. 15, 458–486 (1970).
Tiberi, S., Crowell, H. L., Samartsidis, P., Weber, L. M. & Robinson, M. D. distinct: a novel approach to differential distribution analyses. Ann. Appl. Stat. 17, 1681–1700 (2023).
Kiefer, J. K-sample analogues of the Kolmogorov-Smirnov and Cramér-v. Mises tests. Ann. Math. Stat. 30, 420–447 (1959).
Balogoun, A. S. K., Nkiet, G. M. & Ogouyandjou, C. K-sample problem based on generalized maximum mean discrepancy. Preprint at http://arxiv.org/abs/1811.09103 (2018).
McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at http://arxiv.org/abs/1802.03426 (2018).
Stouffer, S. A. Adjustment During Army Life (Princeton University Press, 1949).
Wood, S. N. Mixed GAM Computation Vehicle with Automatic Smoothness Estimation (CRC Press, 2019). https://doi.org/10.1201/9781315370279.
de Bézieux, H. R. & Turaga, N. Hectorrdb/condiments: Release for publication, December 2023. https://doi.org/10.5281/zenodo.10359879.
Xue, J. Y. et al. Rapid nonuniform adaptation to conformationspecific KRAS(G12C) inhibition. Nature 577, 421–425 (2020).
Acknowledgements
This work was supported by NIH grant R01DC007235. KVdB was a postdoctoral fellow of the Belgian American Educational Foundation (BAEF) and was supported by the Research Foundation Flanders (FWO), grants 1246220N, G062219N, and V411821N.
Author information
Authors and Affiliations
Contributions
H.R.B. and K.V.D.B. developed the methods under S.D.'s supervision. H.R.B. ran the experiments and designed the simulation studies, with directions from K.S. and S.D. H.R.B. wrote the first draft of the manuscript, and H.R.B., K.V.D.B., K.S. and S.D. provided comments and helped in the revisions.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Roux de Bézieux, H., Van den Berge, K., Street, K. et al. Trajectory inference across multiple conditions with condiments. Nat. Commun. 15, 833 (2024). https://doi.org/10.1038/s41467-024-44823-0
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41467-024-44823-0