Introduction

Analyzing the trajectories of cellular differentiation holds promise for key questions across biology, from how lineages diverge during embryonic development to how cell types destabilize with age or are perturbed in disease. Single-cell measurement technologies like single-cell RNA-sequencing (scRNA-seq)1,2, single-cell ATAC-seq3, and CRISPR-based lineage tracing4,5,6 have opened new windows on these processes, but it remains challenging to analyze dynamic changes in cell state and cell lineage over time because the measurements are destructive: cells must be lysed before information about their state or lineage can be recovered. In response, there has been a flurry of work on designing methods to infer developmental trajectories from static snapshots of cell state7,8,9,10, including our own efforts11. While initial efforts have shed some light on important biological questions relating to embryonic development7,12, hematopoiesis13, and induced pluripotent stem cell reprogramming11, the field of trajectory inference is still in its infancy.

One of the most significant deficiencies of existing trajectory inference methods is that they are not designed to incorporate the rich information from lineage tracing. Technologies for reconstructing cellular lineage trees have seen tremendous recent advances, fueled by the CRISPR-Cas9 genome editing technology5,6,14. While developmental biologists have long used various methods to tag cells and trace the lineage of their descendants, newer approaches make it possible to recover more complex lineage relationships, including the full lineage tree of a population of cells4,5,6. These technologies employ CRISPR-Cas9 to continuously mutate an array of synthetic DNA barcodes, which are incorporated into the chromosomes so that they are inherited by daughter cells and can be further mutated over the course of development. By analyzing the pattern of mutations in the barcodes, one can reconstruct a lineage tree describing shared ancestry within a population of cells. Recent advances allow the DNA barcodes to be expressed as transcripts and recovered together with the rest of the transcriptome in scRNA-seq5,6. Alternative methods use somatic mutations in mitochondria15 to recover similar information without needing to introduce DNA barcodes.

Each of these technologies enables simultaneous collection of data on cell state and cell lineage. The two types of information are complementary: state measurements can be used for improved inference of lineage trees16, while the lineage tree itself is intimately related to trajectories. However, high-resolution lineage tracing, even if informed by gene expression data as in16, does not obviate the need for trajectory inference because the state of the ancestral cells remains unknown. While the problems of reconstructing lineage trees and inferring trajectories have attracted substantial attention individually17,18, there is much to be gained from combining these two complementary perspectives19.

Here, we propose an integrated mathematical framework for inferring developmental trajectories from snapshots of both cell lineage and cell state. Our framework, called LineageOT, is broadly applicable to lineage tracing time courses, where populations of cells are profiled with both scRNA-seq and lineage tracing at various time points along a developmental process (Fig. 1). As a proof of concept, we test our methodology on a time-course of C. elegans embryonic development (Figs. 2, 3), collected with scRNA-seq20. Because the lineage tree of C. elegans is known21, we have an objective measure of performance. We find that our method significantly improves trajectory inference both on this dataset and on simulated examples where algorithms without lineage information cannot completely recover the correct trajectories (Fig. 4). Our results show a path towards realizing the substantial benefits of lineage tracing19,22 in applications across developmental biology.

Fig. 1: Schematic of the LineageOT model and inference procedure.
figure 1

a A lineage tree embedded in two dimensional gene expression space. As cells change state over time, they trace out paths. Branches in the tree correspond to cell divisions, giving rise to four cells at the measurement time (red circles). Each cell has a barcode to track its lineage. Starting from the ancestral barcode sequence AAAA, mutations are indicated with a red star on the lineage tree and the change to the sequence is shown in red. b Embedded lineage trees from two independent realizations of the developmental process measured at times t1 (blue) and t2 (red). c The setup from (b) is shown in a 3d plot with lineage trees visualized in the vertical dimension. For each time point, we observe cell states (dots) and also the lineage tree, but not the lineage tree embedded in state space. d A purely state-based algorithm would fail to recover the correct trajectories in this example. Green lines connect cells at t2 to their nearest neighbor at t1. Dashed lines indicate erroneous connections. e, f The LineageOT procedure consists of two steps. e Adjust cells at time t2 (purple arrows), based on lineage information. Cells with shared lineage are moved closer together, towards an estimate of ancestral state (solid dots). f Infer a coupling (green lines) connecting the adjusted cells from time t2 (red) to cells from time t1 (blue). This corrects the mistake made in (d).

Fig. 2: Complex trajectories in C. elegans development.
figure 2

a UMAP of 81286 C. elegans cells from20, using coordinates provided by Packer et al. Color indicates estimated time since fertilization following the colorbar in (b). b In the boxed region from (a), multiple developmental trajectories in the hypodermis converge to the same UMAP coordinates, suggesting a convergence in gene expression.

Fig. 3: When tested on lineage-labeled C. elegans data, LineageOT outperforms optimal transport with no lineage information.
figure 3

a Relative accuracy of optimal transport (OT) and LineageOT on the 5123 cells with complete lineage annotations. Errors were normalized by dividing by the error of the noninformative independent coupling. b The error in predicting ancestor states, like the error for predicting descendant states (Fig. S523), is lower for most cells with LineageOT. Here each point represents one cell from the 270 min time point, which was coupled to the 210 min time point. The red line marks equal error for both methods. For each method in both (a, b) and (f, g), we chose the entropy parameters that gave the minimum error from parameter scans like those in (c, d). LineageOT consistently improves on Waddington-OT for reasonable values of the entropy parameter, both in ancestor error (c) and descendant error (d), shown here for the 210–270 min couplings. e UMAP visualization of the cells from the 210 (blue) and 270 min (red) time points. f, g Here, in the same UMAP, cells are colored by the ancestor (f) or descendant (g) error from Waddington-OT minus the same error from LineageOT. Blue indicates better performance by LineageOT, red better performance by Waddington-OT. The cells from 210 min and 270 min in (f) and (g), respectively, are gray, as the corresponding error metric does not apply to them.

Fig. 4: LineageOT matches the performance of Waddington-OT for simple trajectories and exceeds it for complex trajectories.
figure 4

ac For a simple bifurcation, optimal transport alone works well and adding lineage information makes little difference. a We simulated a cluster of cells at an early time point splitting into two clusters at a later time point. Green lines connect ancestors in blue to descendants in red in (a, d, g, k). The ancestor errors (b) and descendant errors (c) are similar for optimal transport (OT, orange) and LineageOT (blue) with any entropy parameter, even when LineageOT is given an imperfect tree fitted to simulated barcodes (green). df For a convergent trajectory, LineageOT significantly improves ancestor prediction with no loss of accuracy in descendant prediction, even with an imperfectly fitted lineage tree. d Here we simulated two early clusters that each split; later, two of the resulting clusters merge together. Using LineageOT reduces error substantially for ancestor prediction (e) and slightly for descendant prediction (f). gi The improvement due to lineage information when trajectories converge does not require nearby unconverged clusters. Here we see qualitatively similar improvement for two early clusters whose distributions of descendant cells almost entirely overlap. km With sufficient time between samples, clusters of cells may move closer to early time point cells that are not their ancestors. k In this simulation, after two early clusters each split, two of the late clusters are closer to non-ancestral cells than to their true ancestors. Optimal transport couples clusters incorrectly, leading to high error for predicting both ancestors (l), and descendants (m). LineageOT corrects the errors in this example by averaging with other clusters that are mapped correctly.

Results

A unified framework for lineage tracing and trajectory inference

We develop a mathematical framework for analyzing scRNA-seq time courses equipped with a lineage tree at each time point. We formulate the goal of trajectory inference in terms of recovering the embedding of these lineage trees, defined as follows. As a population of cells develops, each cell traces out trajectories in a high-dimensional vector space of cellular states (e.g., gene expression space). Cell divisions create branching paths, and the trajectories of related cells coincide up to the point when their ancestry diverges (Fig. 1a). For example, if all the cells share a common ancestor, then the trajectories will all originate from a common point. This collection of branching paths forms what we call the embedded lineage tree for the population. Note the emphasis on “embedded”—without this modifier, the term lineage tree refers to the coordinate-free tree structure, where all information about the embedded state of each ancestral node is lost, like those in Fig. 1c.

Single-cell measurement technologies allow us to sample from a population and measure cell states together with barcodes that enable recovery of the lineage tree any point in time (Fig. 1a). However, because the measurements are destructive, we cannot directly chart the embedded lineage tree at multiple time points. One can, however, leverage the reproducibility of development and collect samples from separate populations at different time points (Fig. 1b, c). For example, one can prepare two independent populations of cells and collect samples from the first population at time t1 and samples from the second population at time t2. The key question is then: which cell from the first sample would have given rise to each cell from the second sample, if these were two views of the same population?

Importantly, this cannot be solved in general from the topology of the lineage trees alone. Both biological variability and simple subsampling of cells in a tissue could cause the lineage tree at t2 to be topologically distinct from an extension of the lineage tree at t1, making it impossible to directly patch one tree onto the other in a biologically meaningful way. We sketch a hypothetical example with sampling in Fig. S123. Instead, lineage information must be used together with gene expression data in a combined approach.

We have recently demonstrated11 that a classical mathematical tool called optimal transport24,25,26 can be applied to infer state couplings (Methods 1) from a scRNA-seq time-course, without any information about cell lineage. This method, called Waddington-OT, connects cells sampled at time t1 to their putative descendants at time t2 by minimizing the total distance traveled by all cells. It also includes entropic regularization with a tunable regularization parameter to model the inherent stochasticity in developmental trajectories and allows for variable rates of growth across cells by adjusting the distributions at times t1 and t2 based on estimates of division rates. The inferred connections approximate the frequency of transitions between regions of cell-state space, i.e., the couplings of the developmental process. Correctly-recovered couplings encode information about the probability of each differentiation pathway and the genes associated with fate specification.

Our present notion of an embedded lineage tree refines the notion of a coupling from11. While Waddington-OT aims for a state coupling describing all possible ancestries of a hypothetical cell with a given state, our embedded lineage tree gives rise to a lineage-resolved coupling predicting the ancestry of the particular cells we observed. The distinction is significant in situations where cells can arrive at a particular state from different ancestral states (Methods 2). Lineage tracing helps resolve such ambiguities: without lineage tracing, we must assume that cells with similar states have similar ancestral states; with lineage tracing, we instead assume that cells with similar lineage have similar ancestral states.

We apply optimal transport to recover lineage couplings, considered as approximations to embedded lineage trees, from scRNA-seq time courses equipped with an unembedded lineage tree at each time point. We refer to these datasets as scRNA-lineage time courses. In practice, the unembedded trees can be reconstructed from mutations accumulated in DNA barcodes over the course of development (Fig. 1a), or some lineage information might be known in advance (as in C. elegans development). Lentiviral barcoding13 would also provide usable, though low-resolution, lineage trees through tracing clones: cells from each clone would form leaves of a separate sub-tree, with an ancestral node at the time of barcoding. Here, we do not focus on how the unembedded lineage trees are estimated—our method assumes these are calculated separately, for example by neighbor joining27, and given as input to the algorithm. However, we do demonstrate in simulations below that our method is robust to errors in the estimated lineage tree.

Our method applies two key steps to recover the lineage coupling spanning a pair of time points t1, t2.

  1. 1.

    We first leverage the lineage tree to adjust the positions of the cells at time t2 (Fig. 1e).

  2. 2.

    We then connect them to their ancestors at time t1 (Fig. 1f) using entropically regularized optimal transport.

The adjustment in the first step can be interpreted as sharing information between closely related cells in order to construct a rough initial estimate of the ancestral states at the earlier time t1. The rationale behind the second step is based on Schrödinger’s discovery that entropically regularized optimal transport gives the maximum-likelihood coupling of diffusing particles28,29. This gives a rigorous interpretation of our methodology, as we explain below.

The core problem involves a single pair of time points, t1 and t2, where we are given cells x1, …, xn sampled at time t1, and cells y1, …, ym sampled at time t2 together with an estimate of their lineage tree. We assume that these data are sampled from trajectories generated by diffusion plus drift through Waddington’s landscape (i.e., a stochastic differential equation as described in Methods 2).

Since diffusion dominates drift on short time scales, we can estimate the ancestral state of yi at time t1 by assuming the dynamics are driven by pure diffusion. However, conditional on the lineage tree, the cells are not diffusing independently. Intuitively, cells with similar lineage should diffuse back toward one another to reach a common ancestral state. The difference in cell state across each of the edges of the lineage tree is given by an independent Gaussian random variable with variance proportional to the time-span along the edge (Methods 4). This implies that the ancestral state at time t1 for each yi is normally distributed with mean and variance that can be calculated from the lineage tree (Methods 4). Because the ancestral states of each yi are normally distributed, optimal transport will give the maximum-likelihood matching to the observed ancestors x1, …, xn, when we use an entropy parameter proportional to the inferred variance of ancestral states (Methods 4). This matching, or lineage-resolved coupling, summarizes our knowledge of the ancestral states of cells from t2 and the hypothetical descendant states of cells from t1, providing a window onto the embedded lineage tree of each time point.

As described in the original Waddington-OT paper11, couplings of the kind fit by LineageOT contain trajectory information covering the entire period between the earliest and latest sampling times. Given a long time course, each method fits a coupling between each consecutive pair of time points ti and ti+1; concatenating this sequence of pairwise couplings gives connections across longer periods. In between measured time points, trajectories can be estimated with geodesic interpolation. Moreover, in cases where the lineage tree at t2 is a direct extension of the lineage tree at t1, the correct coupling provides precisely the matching required to patch the tree topologies.

Subsequent analyses on the cell-cell couplings will reveal aspects of the underlying biological processes, such as transcription factors associated with any transition or the timing of fate specification. These are the same set of questions that other trajectory inference methods attempt to answer; the key question in comparing approaches is their relative accuracy. For LineageOT, the foundation for reliable downstream analyses is accurate recovery of couplings between two time points, which we demonstrate in the following sections.

LineageOT outperforms Waddington-OT on a lineage-resolved time-course of C. elegans embryonic development

We sought to test our method by applying it to a scRNA-lineage time-course. While CRISPR-based lineage tracing30,31 offers tremendous potential for generating scRNA-lineage time courses, this type of dataset has not yet been published. We reasoned, however, that we could create a scRNA-lineage time-course from an ordinary, non-barcoded, scRNA-seq time-course of C. elegans embryonic development20 because the lineage tree is entirely known21. The known lineage tree allows us to create a ground-truth reference to directly evaluate the accuracy of fitted trajectories, making this a uniquely appropriate dataset for testing LineageOT.

Packer et al. sampled 86,024 cells with 10× from loosely synchronized embryos spanning the first 800 min of C. elegans embryonic development. As visible in a UMAP embedding (Fig. 2a), the differentiation process is complex. In addition to the many branchings, the gene expression of distinct transient cell types converges for several tissues, including the hypodermis (Figs. 2b, S8 of20) and IL1/IL2 neurons (Fig. 4a of20). Such convergences cause difficulties for state-based trajectory inference, because cells with similar measured state have different histories.

Because the precise timing of each embryo is not known, Packer et al. estimated the developmental time of each cell by correlating gene expression levels with data from a previous bulk RNA time course20,32. They then divided the cells into groups with similar estimated developmental times. We treat the six groups of cells between 130 and 450 min post fertilization as discrete time points along a scRNA-seq time-course, using the end of each group’s time interval as the group’s time of sampling.

To obtain the scRNA-lineage time-course required for LineageOT, we needed to incorporate lineage information at each time point. Using both known marker genes and UMAP trajectories, Packer et al. annotated their dataset with cells’ location on the reference C. elegans lineage. The resulting tree can be used in LineageOT in the same way as a tree recovered from lineage tracing, with the potentially important difference that lineage labels are ultimately based on state measurements. Because lineage information is most helpful where it is not redundant with cell state data, that difference most likely biases this evaluation against LineageOT. For example, convergent trajectories that LineageOT might separate with lineage tracing independent of cell state may not be distinguishable.

While the majority of cells (54%) in the Packer et al. dataset are annotated with lineage information, the lineage of many of the annotated cells is not completely specified: some symmetric lineages are not distinguished (e.g., cells whose true lineage is ABprp or ABplp are all labeled as ABpxp). We explored three different strategies to get around this problem of incomplete lineage information. We first simply filtered out all cells with imperfect lineage annotation. This leaves us with only 5123 cells but with no ambiguity in the lineage tree. Second, we restricted attention to the well-annotated ABpxp sublineage, which contains 7087 cells (entirely distinct from the 5123 cells above), and we treated the lineages ABprp and ABplp as if they were identical. Third, we filtered out only cells completely lacking lineage annotation. For cells with incomplete annotations, we imputed a precise lineage label by randomly selecting from the options consistent with the partial annotation. For each approach, we also removed a small number of cells (<5%) whose assigned sampling time was before their birth time according to the reference lineage tree. These three strategies yield three scRNA-lineage datasets (visualized in Fig. S223) which we analyze separately. The results we describe below are broadly similar for each of the three strategies (Figs. 3, S3, S4 in23).

With each strategy, we applied both Waddington-OT11 and LineageOT to infer developmental trajectories and compare their performance. We provide both methods with ground-truth growth rates (Methods 7), and compute state couplings and lineage couplings connecting each pair of time points. The input cell states are the first 50 coordinates from principal components analysis of normalized, log-transformed counts for the 46,159 cells with partial lineage annotations, corrected for background counts as in20. For LineageOT, the input lineage trees come from mapping cell lineage annotations onto the known C. elegans lineage tree. This additional information, given to LineageOT and not Waddington-OT, is precisely the information now measurable with lineage tracing.

We compare each fitted coupling to a ground-truth lineage coupling computed directly from the lineage-annotated data. This ground truth is constructed by connecting each early cell to all late cells labeled as being its descendants. Creating a coupling in this way would not be possible in other organisms without cell annotations from a known, invariant lineage tree. While previous work18 has measured the success of trajectory inference by reducing to discrete branching representations, we directly check whether the predicted ancestors and descendants are similar in state to the true ancestors and descendants, respectively (Methods 6). These are two separate error metrics: the ancestor prediction error and the descendant prediction error. As an alternative, performance can be evaluated by the probability ancestor-descendant pairs from the ground truth are correctly linked. Though LineageOT does improve on Waddington-OT by that metric (Fig. S623), we prefer the ancestor and descendant error metrics because they do not require assuming the lineage trees from the two time points match.

In all our tests, LineageOT has consistently lower error for both ancestor and descendant prediction at reasonable levels of entropy (Figs. 3a, S3, S4 in 23), including after concatenating couplings across more than two time points (Fig. S723). LineageOT systematically predicts better for the majority of cells (Fig. 3b). The degree of improvement depends on the choice of entropic regularization parameter and the strategy for getting complete lineage annotations (Figs. 3c, d, S3, S4 in23), but there is no entropy choice for which LineageOT performs significantly worse. The increased accuracy comes from effectively using the information in the lineage tree.

Lineage-informed trajectory inference outperforms state-based trajectory inference on complex trajectories

We next explored the performance of LineageOT on simulated data, with the goal of characterizing some of the settings where lineage-based trajectory inference can significantly outperform state-based trajectory inference. We found that lineage information is most helpful in resolving convergent trajectories, where similar cells arise from different ancestral states. Moreover, we found that LineageOT is robust to imperfections in the lineage tree. Below we present four simulations illustrating these concepts.

In each simulation, we generate an embedded lineage tree by allowing an initial population of cells to follow a vector field with diffusion and also to divide (Methods 8). Each cell has a lineage barcode that randomly mutates and is inherited by the cell’s descendants. We sample populations of cells at two time points, compute couplings with Waddington-OT and LineageOT, and compare to the ground-truth coupling from the simulation, using the ancestor and descendant prediction errors we described above. We also test the robustness of LineageOT by giving the algorithm either (a) a lineage tree constructed from the simulated barcodes using a heuristic algorithm called neighbor-joining27 (Methods 5) or (b) the ground-truth lineage tree. For comparison, in the supplement we present the results of applying PAGA10a well-regarded trajectory inference method that only uses state information, to our simulated data.

Our first example, Simulation 1, is a simple bifurcation of a single progenitor cell type into two descendant cell types (Fig. 4a). This is one of the simplest trajectory structures to recover and one where ordinary state-based inference already does well. Given a sufficiently accurate tree, LineageOT performs marginally better at ancestor prediction (Fig. 4b) and marginally worse at descendant prediction (Fig. 4c). In hindsight, this is not surprising. The lineage tree, rather than providing substantial new information, just reaffirms the natural assumption that cells in the same cluster are a bit more closely related.

Inferring whether a single differentiated cell type came from multiple lineages is a common problem33 and one of the standard goals of lineage tracing methods22. These convergent trajectories are difficult for state-based trajectory inference, which cannot distinguish the different ancestries of cells with similar measured states. In Simulation 2, we simulate two clusters that each split; after the split, two of the resulting clusters merge together (Fig. 4d). Now lineage information is important: LineageOT can separate cells in the convergent cluster by ancestry, while state-based methods cannot. Incorporating lineage information leads to substantially better prediction of ancestors than purely state-based optimal transport (Fig. 4e), without undermining descendant prediction (Fig. 4f).

We included two unmerged clusters at the late time in Simulation 2 to illustrate how lineage information can resolve ambiguity: cells whose ancestry is unclear from expression alone should be coupled similarly to their close relatives in the lineage tree with unambiguous ancestry. The existence of separate unconverged clusters is not, however, necessary for separating cells by ancestry. In Simulation 3, we show two clusters that converge to a single final cell type (Fig. 4g). The distributions of descendants from each early cluster overlap too much for state-based methods like Waddington-OT to accurately infer ancestors. Despite the overlap, the descendant distributions remain sufficiently distinct for LineageOT to have nearly perfect assignment of late cells to early clusters, and thereby low ancestor error (Fig. 4h) with no loss of accuracy in descendant prediction (Fig. 4i).

Our fourth example illustrates that lineage information can go beyond resolving ambiguity and even correct mistakes from state-based inference. For Simulation 4, we consider two clusters that split so that two of the late-time clusters end up closer to early cells that are not their ancestors (Fig. 4k). Optimal transport fails in this case, mapping entire clusters to the wrong set of ancestors. The failure is not due to any mistake in the algorithm: any method that uses only state information could not correctly infer the trajectory from this data, as shown for PAGA in Fig. S823. LineageOT, on the other hand, can use the shared ancestry to match clusters correctly, leading to significantly better prediction of both ancestors and descendants (Fig. 4l–m).

In Simulation 4, increasing the temporal resolution by sampling the system in between the two time points could allow optimal transport or other state-based methods to accurately describe the trajectories. In real biological systems with this type of curled trajectory, lineage tracing may limit the need for many expensively-sampled time points, though it does add other costs like integrating the barcode editing technology into the genome. For convergent trajectories like Simulation 2 or 3, adding more time points without lineage information would not be enough: in that setting, lineage tracing is necessary for correct trajectory inference.

Discussion

Analyzing the trajectories cells traverse during differentiation is crucial for understanding development and for harnessing the potential of stem cell therapies. However, general-purpose techniques for directly measuring trajectories of cellular differentiation remain elusive. In certain biological contexts, such as hematopoeisis13 where cells are grown in suspension, daughter cells can be split and measured at different time points. These direct connections between time points allow for improved trajectory inference34. Such specialized techniques, however, are not applicable to systems where cells are adherent, because splitting daughter cells would perturb the developmental process. Trajectory inference from independent snapshots remains the most promising approach for understanding the genetic and epigenetic forces driving development in diverse biological contexts.

We present a general-purpose method for inferring developmental trajectories from scRNA-seq time courses equipped with lineage information each time point. Lineage tracing techniques are progressing from early demonstrations of the technology30,31 through elaboration of the potential value of the data22 and on toward future widespread use. We envision that scRNA-lineage time courses will soon replace traditional scRNA-seq time courses, because adding lineage information enables a far more powerful form of trajectory inference.

We demonstrate that LineageOT outperforms Waddington-OT on a time-course of C. elegans development (Figs. 3, S3, S4 in23), and we illustrate through simulation that LineageOT can accurately recover complex trajectory structures that are impossible to recover from measurements of cell state alone (Fig. 4d–i). Effectively using the lineage information experimentally accessible within each time point intuitively ought to improve inference accuracy across time points. LineageOT realizes that implicit potential.

Lineage trees are particularly helpful for untangling convergent trajectories, where cells arrive at a particular state from multiple ancestries. This occurs, for example, in the development of the lymphatic endothelium33, macrophage development from embryonic or monocyte-derived progenitors35, mouse gut endoderm36, and several tissues in C. elegans (20, Figs. 4a, S8, S17). While finer temporal resolution might allow state-based trajectory inference to succeed in some of these examples, LineageOT can achieve higher accuracy with fewer time points. The couplings we recover enable a direct, rigorous approach to answer biological questions about ancestor-descendant relation in developmental processes and predict regulators that govern those transitions, as demonstrated by11.

Our algorithm is derived from a flexible mathematical framework that can be adapted to include future methodological advances. Most immediately, novel methods for inferring a lineage tree from any kind of experiment, or from prior knowledge, can be used directly in the LineageOT pipeline. To leverage this to its fullest extent, one could incorporate an explicit quantification of uncertainty in the lineage tree. Furthermore, there could be significant advantages to simultaneously inferring the lineage tree together with the trajectories, rather than first fitting the tree and subsequently recovering a coupling. Finally, it might be possible to incorporate additional information, beyond cell state and cell lineage. For example, measurements of RNA velocity37 could be incorporated into our framework of estimating ancestor or descendant states and then coupling across time points. As with LineageOT, the resulting algorithm would apply optimal transport with a modified cost function.

All of these improvements would build on the key observation that lineage tracing allows us to share information across closely related cells. State-based trajectory inference relies exclusively on the assumption that each descendant considered individually should be close in state to its ancestor. As we have demonstrated, expanding that assumption to consider related cells together allows for more powerful trajectory inference that can recover more complicated trajectories without relying on the restrictive assumption that cells with similar states having similar ancestry. LineageOT analyses of future cell state and lineage time courses collected with current technologies will provide a more accurate window on the intricate processes of development.

Methods

Developmental stochastic processes and state couplings

A developmental stochastic process is a mathematical description of a population of cells developing over time, where a single cell is represented by a point in a high-dimensional vector space \({{{{{{{{{\mathcal{X}}}}}}}}}}\) of cellular states and a population of cells is represented by a probability distribution \({\mathbb{P}}\) on \({{{{{{{{{\mathcal{X}}}}}}}}}}\). In deriving our method, we do not assume any particular state space: a vector \({{{{{{{{{\bf{x}}}}}}}}}}\,\in\, {{{{{{{{{\mathcal{X}}}}}}}}}}\) could contain a cell’s raw gene expression, coordinates from principal components analysis of expression as used in our C. elegans case study, or other state measurements like chromatin accessibility.

When we profile the population with scRNA-seq, we model the resulting data as a set of random samples from \({\mathbb{P}}\). In the context of development, a time-varying distribution \({{\mathbb{P}}}_{t}\) represents the cells alive at time t, and the data from a scRNA-seq time-course consists of samples from \({{\mathbb{P}}}_{t}\) collected at various times t1, t2, …, tN. The crucial point is that the random samples from different time points are independent in the probabilistic sense, because each time point is typically collected from a separate biological sample.

This brings us to the second key concept of a developmental stochastic process: the notion of a coupling connecting a pair of time points. We distinguish between two kinds of couplings: state couplings (defined here) and lineage couplings (defined in the next section). Intuitively, the state coupling connecting time t1 to t2 specifies relationships between ancestral states at t1 and descendant states at t2. Mathematically, it is a joint probability distribution over pairs of cell states (x, y), with x and y corresponding to cells alive at t1 and t2 respectively. Conditioning on cell-state x at time t1 gives a distribution over possible descendant states y at time t2. In other words, while \({{\mathbb{P}}}_{t}\) simply describes the states of cells that exist at each time point, the state couplings specify the trajectories that give rise to the changes we observe in the population. The state couplings contain information lost in a scRNA-seq time-course: the measurements are destructive so we cannot simultaneously measure the state of a cell and the state of its ancestors or descendants.

Forward and backward lineage couplings

Even state couplings, however, still omit some of the information from a specific experiment or realization of the stochastic process. A cell j at time t2 has a true history, which may differ from the average history of cells with state equal to yj. Lineage information makes it possible to recover the history of j in particular. The history can again be described by a coupling, this time thought of as a coupling of specific sampled cells rather than of states. We name this a lineage-resolved coupling (or “lineage coupling” for short).

One reason the distinction matters is that our descriptions of cell state are incomplete. Gene expression profiles, for example, are only one easily measured part of the cell state. Cells with similar current gene expression but different history could in principle differ in other aspects of their current state. Investigating that possibility requires separating the cells by ancestry even when their current state measurements are similar.

Above, we introduced a developmental stochastic process from the perspective of time-dependent probability distributions \({{\mathbb{P}}}_{t}\) connected by state couplings. From a complementary perspective, we can consider the time evolution of individual cells rather than distributions.

Each cell state at time t is a point x(t) in the state space \({{{{{{{{{\mathcal{X}}}}}}}}}}\). Over time, cells follow some true path through \({{{{{{{{{\mathcal{X}}}}}}}}}}\) according to a stochastic differential equation combining diffusion and drift:

$$\,{{\mbox{d}}}\,{{{{{{{{{{\bf{X}}}}}}}}}}}_{t}\,=\,{{{{{{{{{\bf{v}}}}}}}}}}({{{{{{{{{{\bf{X}}}}}}}}}}}_{t})\,{{\mbox{d}}}t\,+\,\sqrt{2D}{{\mbox{d}}}\,{{{{{{{{{{\bf{B}}}}}}}}}}}_{t}$$
(1)

where v denotes a velocity field and Bt denotes standard Brownian motion scaled by the diffusion constant D. On short time scales, diffusion, which is \(O(\sqrt{\,{{\mbox{d}}}\,t})\), dominates drift, which is O(dt).

In this model, an experiment samples a set of cell paths {xi(t)} from a distribution \({{{{{{{{{\mathcal{P}}}}}}}}}}\) over the space of paths \([0,1]\,\to\, {{{{{{{{{\mathcal{X}}}}}}}}}}\). Importantly, these paths are not observed in full; we only see xi = xi(t1) for the one measurement time t1. In a time-course experiment, in addition to measuring a set of cell states {xi} at time t1 we also measure the states {yj} of a second set of identically prepared cells at time t2.

We then want to couple the early and late distributions in order to trace cells forward and backward in time. As described above, a coupling γ is a joint distribution over pairs; for a lineage coupling, these are pairs of cells (i, j) rather than pairs of states (x, y). When {xi} and {yj} are discrete sets, as they are here, γ is a matrix whose entries sum to 1. The true lineage coupling connects each cell j from the late-time point to its unobserved ancestor at the early time point. Though that true coupling is experimentally inaccessible, we can attempt to couple early or late cells respectively to cells similar in state to their hypothetical descendants or true ancestors.

The forward and backward questions are in principle different for lineage couplings. We could seek either a coupling γF such that \({\gamma }_{i,:}^{F}\), considered as a distribution on the {yj}, is approximately the true distribution of the descendants of cell i; or we could seek a coupling γB such that \({\gamma }_{:,j}^{B}\), considered as a distribution on the {xi}, is approximately the true distribution of the ancestors of cell j. For one cell, that true ancestor distribution will be a single point mass.

Optimal transport as maximum-likelihood estimate

For both the forwards and backwards problems, entropic optimal transport can be understood as the maximum-likelihood coupling between an infinite population of cells started with the distribution of {xi} and conditioned to end up with the distribution of {yj}. If the likelihood of a cell at x ending at y is \(p({{{{{{{{{\bf{y}}}}}}}}}}| {{{{{{{{{\bf{x}}}}}}}}}})\,=\,{e}^{-\frac{c({{{{{{{{{\bf{x}}}}}}}}}},{{{{{{{{{\bf{y}}}}}}}}}})}{\epsilon }}\), maximizing the log-likelihood \({{{{{{{{\mathrm{log}}}}}}}}}\,(p({{{{{{{{{\bf{y}}}}}}}}}}| {{{{{{{{{\bf{x}}}}}}}}}},\gamma ))\) leads to

$${\gamma }^{ML}\,=\,\mathop{{{{{{{{{{\rm{arg}}}}}}}}}}\ {{{{{{{{{\rm{min}}}}}}}}}}}\limits_{\gamma }\ \mathop{\sum}\limits_{ij}{\gamma }_{ij}c({{{{{{{{{{\bf{x}}}}}}}}}}}_{i},{{{{{{{{{{\bf{y}}}}}}}}}}}_{j})\,-\,\epsilon H(\gamma )$$
(2)

where \(H(\gamma )\,=\,-{\sum }_{ij}{\gamma }_{ij}{{{{{{{{\mathrm{log}}}}}}}}}\,({\gamma }_{ij})\) is the entropy of γ. This is precisely the objective function for optimal transport with cost c(x, y) and entropy parameter ϵ.

If the times t1 and t2 are sufficiently close together, the dynamics of x between t1 and t2 are approximately purely diffusive, so that \({{{{{{{{{\bf{x}}}}}}}}}}({t}_{2})\,-\,{{{{{{{{{\bf{x}}}}}}}}}}({t}_{1}) \,\sim\, {{{{{{{{{\mathcal{N}}}}}}}}}}(0,D({t}_{2}\,-\,{t}_{1})I)\). This then translates to a quadratic optimal transport cost

$$c({{{{{{{{{{\bf{x}}}}}}}}}}}_{i},{{{{{{{{{{\bf{y}}}}}}}}}}}_{j})\,=\,\parallel {{{{{{{{{{\bf{x}}}}}}}}}}}_{i}({t}_{1})\,-\,{{{{{{{{{{\bf{y}}}}}}}}}}}_{j}({t}_{1}){\parallel }^{2}$$
(3)

and entropy parameter

$$\epsilon \,=\,D({t}_{2}\,-\,{t}_{1}).$$
(4)

Because the likelihood is symmetric there is no difference between estimating forwards and estimating backwards. Other assumptions about the dynamics of the cells, such as might come from RNA velocity, could be incorporated here. Our goal with LineageOT is to find an appropriate replacement for the likelihood using the lineage information and use that as a cost for optimal transport.

Ancestor inference with lineage information

A complete lineage tree \({{{{{{{{{\mathcal{T}}}}}}}}}}\) for {yj} encodes the time \({t}_{{j}_{1},{j}_{2}}\) of the most recent common ancestor of each pair of cells \(\{{{{{{{{{{{\bf{y}}}}}}}}}}}_{{j}_{1}},{{{{{{{{{{\bf{y}}}}}}}}}}}_{{j}_{2}}\}\). In terms of paths \({{{{{{{{{{\bf{y}}}}}}}}}}}_{{j}_{1}}(t)\) and \({{{{{{{{{{\bf{y}}}}}}}}}}}_{{j}_{2}}(t)\) of the cells, a common ancestor at time \({t}_{{j}_{1},{j}_{2}}\) implies that

$$\forall t\le {t}_{{j}_{1},{j}_{2}},\quad {{{{{{{{{{\bf{y}}}}}}}}}}}_{{j}_{1}}(t)\,=\,{{{{{{{{{{\bf{y}}}}}}}}}}}_{{j}_{2}}(t).$$
(5)

This gives no direct information about the unknown {yj(t1)}; instead, it tells us something about the correlations among {yj(t1)}. For LineageOT, we follow the same maximum-likelihood derivation that leads to entropic optimal transport but replace the distribution of x conditional on yj with the distribution of x conditional on the full sample {yj} and the lineage tree \({{{{{{{{{\mathcal{T}}}}}}}}}}\):

$${\gamma }^{lineage}\,=\,\mathop{{{{{{{{{{\rm{arg}}}}}}}}}}\ {{{{{{{{{\rm{max}}}}}}}}}}}\limits_{\gamma }\ {{{{{{{{\mathrm{log}}}}}}}}}\,(p({{{{{{{{{\bf{x}}}}}}}}}}| \{{{{{{{{{{\bf{y}}}}}}}}}}\},\gamma ,{{{{{{{{{\mathcal{T}}}}}}}}}})).$$
(6)

Like Waddington-OT, LineageOT is derived from the diffusive model where the differences in cell state over time are Gaussian. In that model, conditional on the connectivity of \({{{{{{{{{\mathcal{T}}}}}}}}}}\) the cell-state values at the nodes (i.e., the common ancestors of the leaves) are sampled from a Gaussian graphical model on \({{{{{{{{{\mathcal{T}}}}}}}}}}\). We can then additionally condition on the observed values {yj(t2)} to find the posterior density \(p({{{{{{{{{{\bf{y}}}}}}}}}}}_{j}({t}_{1})| \{{{{{{{{{{{\bf{y}}}}}}}}}}}_{k}({t}_{2})\},{{{{{{{{{\mathcal{T}}}}}}}}}})\). This density will be Gaussian with each mean \({\bar{{{{{{{\bf{y}}}}}}}}}_{j}({t}_{1})\) equal to a weighted average of the values {yj(t2)}. We then use an entropically regularized optimal transport coupling between \(\{{{{{{{{{{{\bf{\bar{y}}}}}}}}}}}}_{j}({t}_{1})\}\) and {xi(t1)} to approximate the backwards coupling γB.

Specifically, LineageOT implements the following procedure:

  1. 1.

    Fit a lineage tree estimate \(\hat{{{{{{{{{{\mathcal{T}}}}}}}}}}}\) for {yj(t2)} including the estimated time of division of each most recent common ancestor, for example via neighbor joining on CRISPR barcodes.

  2. 2.

    Add nodes {yj(t1)} for the ancestor of each time t2 cell at time t1 to \(\hat{{{{{{{{{{\mathcal{T}}}}}}}}}}}\). Some cells may share an ancestor here.

  3. 3.

    Pick a reference cell y0(t2). The difference in state of other nodes of \(\hat{{{{{{{{{{\mathcal{T}}}}}}}}}}}\) with respect to this reference (i.e., yv − y0(t2)) is assumed to be normally distributed with mean zero; the precision matrix has entries

    $${{{\Lambda }}}_{uv}\,=\,\frac{1}{D| {t}_{u}\,-\,{t}_{v}| }{{{{{{{{{\bf{1}}}}}}}}}}\left[(u,v)\,\in\, \hat{{{{{{{{{{\mathcal{T}}}}}}}}}}}\right].$$
    (7)
  4. 4.

    Condition on the values yj(t2) for the set \({{{{{{{{{\mathcal{O}}}}}}}}}}\) of observed nodes. The conditional means for yv − y0(t2) in the set \({{{{{{{{{\mathcal{U}}}}}}}}}}\,=\,{{{{{{{{{{\mathcal{O}}}}}}}}}}}^{c}\) of unobserved nodes can then be found using the appropriately truncated precision matrix: for each gene g, with 1 the vector of \(| {{{{{{{{{\mathcal{O}}}}}}}}}}|\) ones,

    $${\mu }_{{{{{{{{{{\mathcal{U}}}}}}}}}}}^{g}\,=\,{{{\Lambda }}}_{{{{{{{{{{\mathcal{U}}}}}}}}}}{{{{{{{{{\mathcal{U}}}}}}}}}}}^{-1}{{{\Lambda }}}_{{{{{{{{{{\mathcal{U}}}}}}}}}}{{{{{{{{{\mathcal{O}}}}}}}}}}}({{{{{{{{{{\bf{y}}}}}}}}}}}_{{{{{{{{{{\mathcal{O}}}}}}}}}}}^{g}\,-\,{y}_{0}^{g}{{{{{{{{{\bf{1}}}}}}}}}}).$$
    (8)
  5. 5.

    Compute the entropic optimal transport coupling between {xi} and {yj} with cost

    $$c(i,j)\,=\,\frac{\parallel {{{{{{{{{{\bf{x}}}}}}}}}}}_{i}\,-\,{\mu }_{{{{{{{{{{{\bf{y}}}}}}}}}}}_{j}({t}_{1})}{\parallel }^{2}}{{\sigma }_{{{{{{{{{{{\bf{y}}}}}}}}}}}_{j}({t}_{1})}^{2}},$$
    (9)

    where \({\mu }_{{{{{{{{{{{\bf{y}}}}}}}}}}}_{j}({t}_{1})}\) and \({\sigma }_{{{{{{{{{{{\bf{y}}}}}}}}}}}_{j}({t}_{1})}^{2}\) are the conditional mean and variance respectively for the ancestor of each t2 cell at time t1.

In practice, despite being designed for ancestor prediction rather than descendant prediction, LineageOT outperforms entropic optimal transport on both tasks for all but the simplest trajectories.

Fitting a lineage tree

To apply LineageOT, we need to infer a lineage tree that will define the structural equation model. We do not optimize this step, instead relying on a heuristic algorithm called neighbor joining27. Neighbor joining starts from pairwise lineage distance estimates, which can be estimated in CRISPR-based barcoding approaches using the Hamming distances between observed barcodes30. The fitted tree will not be perfect, and indeed simulations with currently plausible experimental parameters find significant errors in the inferred tree topology38. As our own simulations demonstrate, however, an imperfectly inferred tree can still substantially improve trajectory inference. Moreover, the source of the tree does not matter: a lineage tree based on detailed prior biological knowledge, as is available for C. elegans, can be used directly in LineageOT.

For LineageOT, we need not only the tree topology but also the time elapsed along each edge of the tree. The raw lineage distances computed from Hamming distances, however, give very noisy estimates of the edge times. We therefore correct the distances using the fact that all cells were sampled at the same time; this means that all leaves of the tree must have the same total distance to the root. Minimizing the mean squared error to the Hamming distance estimates subject to this constraint is a quadratic program that can be solved with standard convex optimization techniques and significantly improves the estimated lineage distances (Fig. S923).

Error metrics

While we only produce one estimated coupling for both ancestor and descendant prediction, we separate out the two questions in evaluation. Given a true coupling γ*, we define the descendant prediction error \({{{{{{{{{{\mathcal{L}}}}}}}}}}}^{D}(\gamma )\) for a fitted coupling γ with the same marginal over {xi} as the mean squared optimal transport distance between γi,: and \({\gamma }_{i,:}^{* }\) considered as distributions over {yj}:

$${{{{{{{{{{\mathcal{L}}}}}}}}}}}^{D}(\gamma )\,=\,\mathop{\sum}\limits_{i}{W}_{2}^{2}\left({\gamma }_{i,:}^{* },{\gamma }_{i,:}\right)$$
(10)

where W2(μ, ν) denotes the optimal transport distance between distributions μ and ν with quadratic cost, also called the Wasserstein-2 distance. Symmetrically, we define the ancestor prediction error \({{{{{{{{{{\mathcal{L}}}}}}}}}}}^{A}(\gamma )\) for a fitted coupling γ with the same marginal over {yj} as the mean squared optimal transport distance between γ:,j and \({\gamma }_{:,j}^{* }\) considered as distributions over {xi}:

$${{{{{{{{{{\mathcal{L}}}}}}}}}}}^{A}(\gamma )\,=\,\mathop{\sum}\limits_{j}{W}_{2}^{2}\left({\gamma }_{:,j}^{* },{\gamma }_{:,j}\right).$$
(11)

C. elegans ground truth and growth rates

Our ground-truth coupling γ* for the C. elegans time-course is the forward coupling based on the lineage labels: we set \({\gamma }_{ij}^{* }\,=\,{(| \{{{{{{{{{{{\bf{x}}}}}}}}}}}_{i}\}| {n}_{d,i})}^{-1}\) if i is an ancestor of j and \({\gamma }_{ij}^{* }\,=\,0\) otherwise, where nd,i is the number of descendants of cell i in {j}. This forward coupling has a uniform marginal over {xi} but not over {yj}. For simplicity, rather than using soft marginal constraints with estimated growth rates as Waddington-OT does, we use the true marginals of γ* for all fitted couplings. Knowledge of the true marginals should help Waddington-OT and LineageOT approximately equally without significantly affecting the comparison between them.

Simulations

For our simulations, we construct a vector field to recreate a biologically plausible trajectory structure. Cells follow the vector field with diffusion and occasional cell division; the time between cell divisions is nearly constant, with normally distributed variability with a small variance (so that all sampled cell lifetimes are positive). Changing this variance in cell division times or setting it to 0 does not significantly affect our results. Meaningful dynamics occur in either two dimensions (for Simulations 1, 2, and 3) or three dimensions (for Simulation 4). The vector field is always constant in the first dimension, making x1 a proxy for time since the start of the simulation. In the remaining nontrivial one or two dimensions, we set the vector field to be the negative gradient of a potential with minima at locations we would like to have clusters, with the minima changing with x1. We simulate in three dimensions in all cases; for Simulations 1, 2, and 3, in the third dimension cells diffuse with no mean velocity. Thus, for example, the simple bifurcation of Simulation 1 follows the flow field

$${{{{{{{{{\bf{v}}}}}}}}}}(x)\,=\,\left(\begin{array}{l}{v}_{1}(x)\\ {v}_{2}(x)\\ {v}_{3}(x)\end{array}\right)\,=\,\left(\begin{array}{l}1\\ -{x}_{2}^{3}\,+\,{x}_{1}{x}_{2}\\ 0\end{array}\right).$$
(12)

Initially, with x1 < 0, x2 = 0 is the only stable value of x2; later, with x1 > 0, there are two stable states \({x}_{2}\,=\,\pm \sqrt{{x}_{1}}\). For the remaining simulations, which involve more complex piecewise-smooth flow fields, we refer readers to our code: https://github.com/aforr/LineageOT.

Each cell has a lineage barcode that randomly mutates and is inherited by the cell’s descendants. The global mutation rate r is set so that the expected proportion of sites in a barcode of length that are unmutated at the time of sampling t2, equal to \({e}^{-\frac{r{t}_{2}}{\ell }}\), is close to 0.5 and so relatively far from both 0 and 1. The rate is then neither so slow that little lineage information is recorded so fast that barcodes are saturated before the sampling time. This choice was inspired by similar numbers in experimental data from30.

For each vector field, we simulate a single embedded lineage tree measured at two time points and compute the couplings inferred by Waddington-OT, LineageOT given the true lineage tree, and LineageOT given a lineage tree fitted to the simulated barcodes. Because the simulated division rates are uniform across cells, we set the marginals for each fitted coupling to be uniform rather than inputting the true marginals as we did for the C. elegans evaluations. The fitted couplings are compared to the true coupling with the same ancestor and descendant prediction errors we used for C. elegans.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.