
# Lineage marker synchrony in hematopoietic genealogies refutes the PU.1/GATA1 toggle switch paradigm

## Abstract

Molecular regulation of cell fate decisions underlies health and disease. To identify molecules that are active or regulated during a decision, and not before or after, the decision time point is crucial. However, cell fate markers are usually delayed and the time of decision is therefore unknown. Fortunately, dividing cells induce temporal correlations in their progeny, which allow for retrospective inference of the decision time point. We present a computational method to infer decision time points from correlated marker signals in genealogies and apply it to differentiating hematopoietic stem cells. We find that myeloid lineage decisions happen generations before lineage marker onsets. Inferred decision time points are in agreement with data from colony assay experiments. The levels of the myeloid transcription factor PU.1 do not change during the predicted lineage decision event but only long after it, indicating that the PU.1/GATA1 toggle switch paradigm cannot explain the initiation of early myeloid lineage choice.

## Introduction

Tightly controlled and correctly timed cell fate decisions are crucial for the development and maintenance of any healthy organism. Understanding their molecular control is therefore essential for basic biological research and the development of future therapies. However, the identification of the exact time point when a cell fate decision happens is often impossible, since the emergence of an observable signal is usually delayed from the fate decision itself. If we can only observe the delayed signal, but not the actual decision-making process, factors that influence the decision remain unidentified.

Consider a cellular process where an unobservable event (e.g., a cell fate decision) leads to an observable phenotypic signal (e.g., a morphological change or the onset of a lineage marker) with a delay in time. From just observing the signal in non-dividing cells, one cannot infer the true time point of the unobserved event as the delay is typically unknown (Fig. 1a). However, if cells divide during the delay, this induces correlated signals in related cells (e.g., two sisters or four cousins, etc.). These correlated signals carry information about the length of the delay and hence about the timing of the unobserved event: for example, a delay of 1–2 generations causes correlated readouts in sister and cousin cells and suggests a decision in the mother or grandmother generation (see Fig. 1b).

Due to recent advances in time-lapse imaging and single-cell tracking1,2,3,4, it is now possible to obtain large genealogies of single cells and observe correlated signals. For example in yeast, sister cells switch gene expression of a simple regulatory circuit in a correlated fashion5. In mammalian hematopoiesis, differentiation is typically read out via the expression of a lineage specific differentiation marker6,7, e.g., the CD16/32 membrane receptor in the myeloid branch of hematopoiesis (Fig. 2a). However, these markers report the lineage decision only indirectly, because their expression is a delayed downstream consequence of a former unobserved event (Fig. 2b). Here, we parametrize the decision process and the marker delay in a computational model that combines a memory-less decision process and stochastic gene expression mimicking marker delay with graphical models and dynamic programming to cope with the computational complexity of genealogies. This allows us to calculate the probabilities of different decision scenarios (called hidden trees, Fig. 2c) and determine the most likely time point of the unobserved lineage decision.

We apply this method to a dataset of differentiating hematopoietic stem cell genealogies with annotated lineage marker onsets and find that myeloid/megakaryocytic-erythroid lineage decision happens several generations earlier than reported by lineage markers and that the dynamics of PU.1 during the lineage decision is inconsistent with a PU.1-based toggle switch driving the lineage decision.

## Results

### Prediction of an early cell fate decision in hematopoiesis

One hematopoietic lineage decision is the choice of hematopoietic stem and progenitor cells (HSPCs) between the megakaryocytic-erythroid (MegE) and the granulocyte-macrophage (GM) lineage7. The mutually exclusive expression of the transcription factors PU.1 and GATA1 in mature GM and MegE cells, respectively (see e.g.,8 for an overview), and their mutual binding and cross-antagonism inspired toggle switch models that predict transcription factor dynamics before and during this decision9,10,11,12,13,14. These models assume the switch to one of the cross-antagonistic transcription factors to precede and induce GM vs. MegE lineage choice, and serve as the de facto paradigm of binary cell fate choice on a molecular level15. However, since the exact timing of GM vs. MegE lineage choice remains unknown it is impossible to quantify the dynamics of PU.1 and GATA1 immediately before and during the actual lineage decision making.

To identify the time windows of this HSPC lineage decision making, and to compare it to the dynamics of PU.1 and GATA1 regulation, we used a dataset of sorted murine HSPCs, where endogenous PU.1 and GATA1 proteins are tagged with yellow and red fluorescent proteins, respectively (see Hoppe et al.7 for experimental details). Over 10,000 single cells have been tracked and quantified, generating cellular genealogies up to 12 generations deep. Definite GM lineage commitment is detected via CD16/32 onset using in-culture antibody staining (i.e., a fluorescent CD16/32 antibody is present in the medium and accumulates on cells that express CD16/32 on the membrane16,17, Fig. 2a). MegE lineage commitment is read out via GATA1–mCherry upregulation. We analyze 54 GM-fated and 20 MegE-fated genealogies from three independent experiments (see Fig. 3a, b).

To infer the time point of lineage choice, i.e., the time when an HSPC loses multipotency and commits towards the GM- or MegE-lineage, we fit our computational model to the observed genealogies by maximizing the likelihood of the data with respect to the model parameters (see Methods). For each individual genealogy (from now on also called “tree”), we calculate its likelihood given model parameters by summing over all possible scenarios of differentiation, decomposing each tree into subtrees, and calculating the probability for each subtree using a graphical model (see Methods and Fig. 2d). After extensive testing on synthetic data (Supplementary Notes 1–2), we use our computational method to predict the most likely time point of lineage choice based only on the temporal correlations in related cells of CD16/32 or GATA1 onsets, respectively. The model decomposes the onset distribution into a differentiation probability (Fig. 3c) and a lineage decision marker delay distribution (Fig. 3d) to fit the observed marker onset distributions (Fig. 3e) and the observed correlation patterns. As shown in Fig. 3c, the estimated differentiation probability is exponentially decreasing with time. The majority (74%) of predicted lineage decisions happen already in the first or second generation of the genealogies (Supplementary Figure 1). While the tracked generations are only relative to the start of the experiment, HSPCs had just been freshly sorted and had been kept at 4 °C from harvesting of bone marrow until shortly before the start of the imaging experiment, thus most likely preventing cellular decision making during HSPC preparation. Such early differentiation is surprising as the established lineage markers CD16/32 and GATA1 can only be detected after many days in culture7. Notably, the predictions of lineage decision time points are unchanged for moderate measurement noise (up to one cell cycle length) in the annotated onsets (see Supplementary Note 5).
The delay between the unexpected early differentiation and the onsets of the lineage marker was on average 78 h for GM and 54 h for MegE (Fig. 3d), with cell cycle lengths of 12 ± 5 h (for further characterization of the delay process and its induced correlations, see Supplementary Note 6).

Interestingly, the differentiation probability distributions are almost identical for GM and MegE genealogies (Fig. 3c) even though they were estimated independently from different trees. This suggests a mechanism where a process common to both lineages determines the timing of differentiation, while the dynamics of lineage marker expression as a consequence of differentiation are distinct for both lineages.

Next, we validate our finding of early differentiation events using data from independent colony assay experiments of sorted HSPCs, performed in the same experimental conditions (Hoppe et al.7). These colony assays allow us to read out the number of pure GM-, pure MegE-, and mixed GMMegE (containing all lineages) colonies formed from single HSPCs after 10 days of culture. While the differentiation distribution Φ(t) cannot be measured directly, it leaves a distinct fingerprint in these frequencies: if lineage decisions happen early, and thus in few cells within the colony, mostly pure GM or pure MegE colonies will emerge, and GMMegE colonies will be rare. In contrast, if decisions happen late and thus independently in many cells within the colony, mostly GMMegE colonies will emerge and pure GM or pure MegE colonies will be rare. This intuition can be formalized in a mathematical branching process model (see Methods and Marr et al.18), which predicts the proportions of GM, MegE, and GMMegE colonies for a given differentiation probability.

When supplied with the differentiation probability Φ(t) in Fig. 3c (estimated with our tree inference algorithm from time lapse data), the branching process model faithfully predicts the experimentally observed colony assay frequencies (see Fig. 3f, g). In particular, we are able to correctly predict the large frequency of observed GMMegE colonies (60 ± 7%), even though GMMegE genealogies were not used to estimate the differentiation probability with our tree inference algorithm (Fig. 3f, g). Note that GMMegE genealogies are rare in the time lapse dataset of Hoppe et al.7, due to the tracking strategy applied, where trees are often only partially tracked. The few observed ones (see Supplementary Figure 18) are indeed consistent with early differentiation events.

### PU.1 dynamics at the predicted time point of lineage choice

To investigate if PU.1 and GATA1 are the determinants or only a consequence of these HSPC lineage decisions, we analyzed the dynamics of endogenous PU.1 levels quantified from PU.1-eYFP fluorescence for each tracked cell and time point in the genealogies (Fig. 4a; for details, see Hoppe et al.7). In a typical branch of an HSPC genealogy, both the number of PU.1-eYFP proteins and the PU.1-eYFP concentration (intensity/cell area) rise before CD16/32 onset (Fig. 4a). This indeed matches the expectations from a toggle switch model including PU.19,10,11: initially balanced, the switch tilts in favor of PU.1, which gets upregulated and leads to commitment towards the GM lineage, signified by delayed CD16/32 expression. However, it is impossible to tell a priori if PU.1 upregulation induces the lineage decision or if it is a downstream consequence of an earlier lineage choice.

To that end, we use the predicted lineage choice time point from our model and compare it to the time point of PU.1 upregulation. We find no significant difference in PU.1-eYFP production, quantified by estimating the slope of the PU.1-eYFP concentration (red lines in Fig. 4a), in cell generations before, at, or directly after the identified lineage choice time point (p = 0.25 and p = 0.15, rank-sum test, see Fig. 4b). In contrast, PU.1-eYFP production is significantly higher in later cells with CD16/32 onset (Fig. 4b). Similarly, when inferring lineage choice time points in 20 MegE-fated genealogies based on correlated onsets of GATA1 expression, we find that PU.1-eYFP production does not change in cells before, at, and directly after the predicted time point (Fig. 4c). These results are robust across three independent experiments (see Supplementary Note 4).

Now we compare these findings to a model where a toggle switch involving PU.1 drives cell differentiation. We implemented a popular toggle switch model that is thought to drive binary lineage decisions and is composed of two mutually repressing transcription factors (Fig. 4d inset; see Supplementary Note 3 for model details)9,10,11,12,13,14,19. This model exhibits three stable states (Fig. 4d): The state where both proteins are expressed at similar levels is associated with a progenitor cell. In the two other states, one of the two proteins is strongly upregulated, thereby repressing the other, representing two mutually exclusive differentiated lineages. Differentiation initiation occurs via noise-driven transitions from the progenitor to one of the differentiated states. Using Gillespie’s algorithm20, we simulate genealogies from this toggle switch model starting from single cells in the undifferentiated state. Eventually this cell or its progeny will leave the progenitor state and proceed to one of the differentiated states, turning on marker expression. We now assume the underlying transcription factor dynamics to be unobserved and use our method to infer the putative differentiation time points solely from the correlated marker onsets in this synthetic dataset. We find that in our synthetic dataset, the time point of predicted differentiation is identical to the time point where the toggle switch tilts (Fig. 4e). Here, the initial balance between the two factors is broken: one is upregulated while the other is downregulated in the predicted cells. Quantifying PU.1 production as in Fig. 4b, c, we find a significant change between cell generations before and at the predicted differentiation time point, both for cells heading towards the GM lineage (PU.1 upregulation, Fig. 4f) and towards the MegE lineage (PU.1 downregulation, Fig. 4g). Similar results are found for different parameterizations and more complex models of the toggle switch (Supplementary Figures 12, 13).
Our method is thus able to correctly predict differentiation events driven by a genetic toggle switch (Fig. 4e) and detect the cells where the involved transcription factors are differentially regulated (Fig. 4f, g) on synthetic data.
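As a rough illustration of how Gillespie's algorithm generates such stochastic transcription factor trajectories, the sketch below simulates a generic pair of mutually repressing genes in a single, non-dividing cell. It is not the model of Supplementary Note 3: the Hill-type repression, the absence of self-activation (so this toy version is at most bistable rather than tristable), and all parameter values (`beta`, `k`, `n`, `gamma`) are assumptions chosen for illustration only.

```python
import random


def hill_repression(x, k, n):
    """Fraction of maximal production left when the repressor has x copies."""
    return 1.0 / (1.0 + (x / k) ** n)


def simulate_toggle(t_end, a=0, b=0, beta=20.0, k=10.0, n=2, gamma=1.0, seed=0):
    """Gillespie simulation of two mutually repressing proteins A and B.

    All rate constants are hypothetical. Returns a list of (time, a, b).
    """
    rng = random.Random(seed)
    t = 0.0
    trajectory = [(t, a, b)]
    while True:
        rates = [
            beta * hill_repression(b, k, n),  # production of A, repressed by B
            gamma * a,                        # degradation of A
            beta * hill_repression(a, k, n),  # production of B, repressed by A
            gamma * b,                        # degradation of B
        ]
        total = sum(rates)
        t += rng.expovariate(total)           # waiting time to the next reaction
        if t > t_end:
            break
        r = rng.random() * total              # pick a reaction proportional to its rate
        if r < rates[0]:
            a += 1
        elif r < rates[0] + rates[1]:
            a -= 1
        elif r < rates[0] + rates[1] + rates[2]:
            b += 1
        elif b > 0:                           # guard against floating-point edge cases
            b -= 1
        trajectory.append((t, a, b))
    return trajectory
```

Simulating genealogies would additionally require dividing cells at the end of each cell cycle and distributing the proteins between the daughters, which this sketch omits.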

This approach shows that the experimentally observed marker onsets and PU.1 dynamics are inconsistent with a toggle switch involving PU.1, which initiates the lineage choice: If PU.1 was directly involved in the GM/MegE lineage decision, we would detect up- or downregulation in cells at the predicted time of lineage choice (compare Fig. 4b, f, as well as Fig. 4c, g). Thus, while PU.1’s importance in the execution of GM/MegE programs is undoubted21 (and demonstrated by knockout experiments22,23), it is not the initiator of lineage choice but rather an effector that locks down the chosen lineage.

## Discussion

The analysis of tree-structured data has a long history in the field of phylogenetics24,25. Here, the main challenge is to reconstruct a single unobserved sequence evolution tree using a stochastic model of nucleotide substitution26 and observed sequences at the leaves of the tree. In contrast, we directly observe the stem cell genealogies, estimate model parameters from multiple trees, and use a complex stochastic model which makes parameter inference challenging. Due to the data structure and the lack of available tools, genealogies typically have been studied using summary statistics18,27,28,29,30. Modeling has only rarely been used to gain mechanistic understanding about the observations31,32,33. These approaches rely on a simple Markov model of state changes (akin to our differentiation process) and assume that this state change is readily observable. The key difference in our approach is that we allow for a delayed observation of the underlying state change, detectable only several generations later. These long-range correlations are not accounted for by previous models.

Our model of differentiation and delay only approximates the underlying biological process. For example, we assumed that the differentiation rate is only time-dependent, whereas differentiation is likely to depend on other external factors, e.g., spatial interactions between cells and their microenvironment34. However, in our experiments, the high motility of blood progenitors results in fast mixing of cells and the impact of spatial interactions is presumably small (see Supplementary Figure 3). As time-lapse microscopy allows observing the spatial arrangement of cells, those effects can in principle be incorporated into the differentiation rate35.

We modeled the marker delay as a simple stochastic gene expression process due to a lack of knowledge about internal processes. Typical gene expression parameters36 and reasonable detection limits would only allow for short delays in the range of hours. Correlations across multiple generations (as shown in Fig. 1a), however, cannot be explained by simple mechanisms, but are more likely caused by cascades in the underlying gene regulatory network that trigger differentiation. Our model can appropriately approximate such complex delay processes, e.g., cascades of genes (see Supplementary Notes 2–3).

A delay between the time point of lineage decision and the onset of lineage specific markers was expected, and some supporting data has been published. Paul et al.37 recently found a population of cells with GM-like transcriptional profiles but without CD16/32 expression. In our setting, this corresponds to cells that are located downstream of a differentiation decision in a genealogy but are still negative for CD16/32 (gray cells in Fig. 3a). Our data suggest that the delay until marker onset is much longer than expected. Furthermore, we could show that the PU.1/GATA1 toggle switch model, a paradigm of lineage choice in hematopoiesis, is inconsistent with the observed marker onsets and inferred delays; the data show that PU.1 expression only changes significantly several generations after the inferred time of lineage choice. This is in line with the interpretation of Velten et al.38 suggesting that the differentiation–tree model of hematopoiesis with binary lineage decisions at branching points should be revised. In addition, it fits the finding that lineage choice can be predicted before marker onset based on cell morphology and movement30. In contrast to Kueh et al.39, who report a cell cycle elongation upon PU.1 upregulation in an LMPP-like population, we see a decrease in cell cycle lengths from the first generation to the second, and a stabilization afterwards at around 12 h (see Supplementary Figure 2). Importantly, the cell cycle distributions are similar for GM- and MegE-annotated genealogies (see Supplementary Figure 2). The prolonged cell cycle in generations 0 and 1 is most likely a result of stem cells gradually getting activated and starting to cycle when exposed to the media conditions of the experiment.

Finally, it is highly interesting, and as yet without any explanation, how such a long delay between lineage choice and marker onset can be encoded in eukaryotic cells. In bacteria, Levine et al.40 demonstrated how a system of feedback loops could induce delayed cell fate decisions over several generations. However, it is unknown if similar mechanisms could account for the much longer delays on the order of several days, as estimated from our data.

Provided its extendibility and generality, we are confident that our method is applicable to a wide range of cellular decision problems. For example, it has been described that treatment of differentiating embryonic stem cells leads to a highly synchronized, delayed lineage choice days later17,41. Along the same line, reprogramming somatic cells into iPS cells is believed to be a stochastic process42,43, and e.g. analyzing the timing of reprogramming44 might give insight into this complex procedure. Similarly, it is thought that tumorigenesis is the result of stochastic state transitions between cancer stem cells and non-tumorigenic cells while metastases are generated when cells randomly undergo an epithelial–mesenchymal transition, detach from the tumor and spread the cancer into other body parts45. Here, our method could be used to trace back to the tumor- or metastasis-initiating cells in suitable time-lapse in vitro experiments in order to investigate what triggered these initial events.

## Methods

### Model assumptions

We introduce a computational method that, based on observed correlations, estimates a delay to obtain the true time point of the unobserved decision. Although our method is generally applicable to any decision process and associated delay that leads to correlated outcomes in tree-structured data, we focused on cellular differentiation. Time lapse microscopy combined with cell tracking and fluorescence signal quantification delivers genealogies of single cells with fate annotation typically read out via surface markers or cell morphology46,47,48,49 (Fig. 2a). Each genealogy starts with a single stem cell at t = t0 (the start of time lapse microscopy). During the experiment, the cell divides and gives rise to two daughter cells. These cells will later also divide, giving rise to further progeny. At time points t > t0 the onsets of lineage markers are observed (green in Fig. 2a).

We propose that observed correlations in marker expression emerge because of a delay between the unobserved differentiation time point and the observed marker onset. According to this generic model, an observed tree T can be explained by several scenarios that we call “hidden trees” $\mathcal{H}_T$ (Fig. 2c). In order to infer the true time point of the lineage choice, one has to assign probabilities to these alternatives and predict the most likely hidden tree given the observed data.

Therefore, we propose a simple model of lineage choice and delay based on two assumptions:

(i) Lineage choice is independent between cells: No internal information is passed from mother to daughter cell that has influence on the timing. Thus, the probability to differentiate must only depend on factors that are not inherited during cell division. In the following, we will assume that the probability to differentiate is a function of time (see below).

(ii) The delay between lineage choice and marker onsets originates from a gene expression process that starts after the differentiation decision. The marker onset is detected once the amount of marker proteins in the cell crosses a certain threshold x* (Fig. 2c). If the cell divides before the protein amount exceeds the detection limit, its daughter cells inherit the marker expression from their mother. As daughter cells inherit the state of their mother, they become correlated with respect to marker onset; if one daughter reaches the detection limit, the other daughter will likely do the same. Because gene expression is intrinsically stochastic, the dynamics of both cells will not be exactly identical50.

### Differentiation process

We define a rate λ(t) so that λ(t)dt is the probability that the lineage decision occurs in the interval [t, t + dt] in a single cell, given that it has not occurred yet in the interval [0, t). Note that in survival analysis, λ is called the hazard rate51.

Next, we define the overall distribution of decision times ϕ(t), that is, the probability density to observe a decision at time t (known as event density in survival analysis). Both quantities are related via (see Supplementary Methods):

$\phi(t) = \lambda(t) \exp\left(-\int_0^t d\tau\, \lambda(\tau)\right).$

For example, if λ(t) = λ is constant, the above equation yields ϕ(t) = λ exp(−λt), which is the probability density of an exponential distribution. Without loss of generality, but motivated by experimental observation18, we assume that the differentiation rate is a linear function of time such that

$\lambda(t) = a_0 + a_1 t.$
(1)

This represents a first order approximation to a potentially complex but unknown differentiation rate. It allows more flexibility than a zeroth-order approximation (λ(t) = a0) and is sufficient to encompass mechanistic models of lineage choice (see Results). From now on, we denote the parameters of the differentiation process as θ = (a0, a1) and write ϕ(t|θ) to make the dependence on the parameters explicit.
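For the linear rate of Eq. 1, the integral in the exponent has the closed form a0·t + a1·t²/2, so ϕ(t|θ) can be evaluated directly. The following minimal sketch shows this; the function names are ours and the parameter values in the usage note are purely illustrative, not the fitted estimates.

```python
import math


def hazard(t, a0, a1):
    """Linear differentiation rate lambda(t) = a0 + a1 * t (Eq. 1)."""
    return a0 + a1 * t


def survival(t, a0, a1):
    """Probability that no decision has occurred in [0, t)."""
    return math.exp(-(a0 * t + 0.5 * a1 * t * t))


def decision_density(t, a0, a1):
    """Density phi(t) = lambda(t) * exp(-integral_0^t lambda(tau) dtau).

    Only meaningful where the hazard is non-negative; for a1 < 0 this
    restricts t to t <= -a0 / a1.
    """
    return hazard(t, a0, a1) * survival(t, a0, a1)
```

With `a1 = 0` the density reduces to the exponential case, e.g. `decision_density(2.0, 0.5, 0.0)` equals `0.5 * math.exp(-1.0)`.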

### Delay process

We model the marker delay as a stochastic gene expression process. Combining transcription and translation for simplicity, we obtain a birth–death process with two reactions, one producing a protein with rate α and the other removing a protein with rate γ (for details, see Supplementary Methods). We are only interested in the dynamics of the system until the protein numbers exceed the detection threshold x*, where we assume that the marker can be observed. The delay process is characterized by the first passage time distribution $\psi_{x_0}(t)$, that is, the probability that the protein number crosses the threshold x* for the first time at time t starting with x0 proteins initially, and the propagator $P_{x \to x'}(t)$, the probability to start in state x and arrive at state x' after time t. Both $\psi_{x_0}(t)$ and $P_{x \to x'}(t)$ depend on the parameters η = (α, γ, x*) of the underlying model, but we have dropped this dependence for readability. We obtain $\psi_{x_0}(t)$ and $P_{x \to x'}(t)$ by numerically solving the Master Equation of the associated stochastic process (see Supplementary Methods).
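To make the first passage time concrete, the sketch below integrates the Master Equation of a birth–death process with an absorbing state at x* and reads off the probability flux into that state. It rests on several assumptions that are not taken from the Supplementary Methods: removal is taken to act per molecule (total rate γ·x), the threshold is treated as reached when the count hits x*, and a plain explicit Euler scheme with a hypothetical step size `dt` stands in for whatever solver was actually used.

```python
def first_passage_density(alpha, gamma, x_star, x0=0, t_max=50.0, dt=0.005):
    """Approximate the first-passage density to x_star on a time grid.

    Returns (times, psi). p[x] is the probability of having x proteins
    without having reached x_star so far. Requires 0 <= x0 < x_star and
    dt * (alpha + gamma * x_star) well below 1 for stability.
    """
    p = [0.0] * x_star
    p[x0] = 1.0
    times, psi = [], []
    for step in range(round(t_max / dt)):
        times.append(step * dt)
        psi.append(alpha * p[x_star - 1])      # flux into the absorbing state
        new_p = p[:]
        for x in range(x_star):
            new_p[x] -= dt * (alpha + gamma * x) * p[x]   # probability leaving x
            if x + 1 < x_star:
                new_p[x + 1] += dt * alpha * p[x]         # production: x -> x + 1
            if x > 0:
                new_p[x - 1] += dt * gamma * x * p[x]     # removal: x -> x - 1
        p = new_p
    return times, psi
```

As a sanity check, without removal (γ = 0) the first passage time is a sum of x* exponential waiting times, so its mean should be close to x*/α.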

### Statistical inference

Our goal is to estimate the parameters (θ, η) of the model from observed genealogies in order to predict lineage choice in a given tree. To that end, we derive the likelihood L(T|θ, η) of an observed tree T given the parameters, which is then optimized to find the maximum likelihood estimates $\hat{\theta}, \hat{\eta}$.

The entire process of differentiation and marker delay on genealogies has the Markov property; given the internal state in terms of ϕ and ψ of some cell i at time t, the subtree induced by this cell is independent of all other cells in the entire tree. This allows us to divide the problem into smaller subproblems, where we enumerate on a per cell basis all possibilities of differentiation events in an observed tree, which we termed “hidden trees” (see Fig. 1c).

The likelihood of the observed tree T given parameters θ and η is the sum of likelihoods of the hidden trees H, because these are competing alternatives (Fig. 2d):

$L(T \mid \theta, \eta) = \sum_{H \in \mathcal{H}_T} L(H \mid \theta, \eta)$
(2)

To derive the likelihood of a single hidden tree H, we partition the hidden tree into various subtrees Di induced by the differentiating cells and a single tree U that only contains undifferentiated cells (Fig. 2d). Due to the Markov property, the likelihood factorizes:

$L(H \mid \theta, \eta) = L(U \mid \theta) \prod_i L(D_i \mid \theta, \eta)$
(3)

Note that the parameters θ also appear in the likelihoods for Di as the root of these subtrees is still undifferentiated for some unknown time (Supplementary Figure 6). The first term is readily computed from the decision process (Eq. 1) as the process generating it has no memory and factorizes across cells in U. The terms L(Di|θ, η) are more difficult to obtain, as the delay process has memory and hence the individual cells of the subtree cannot be treated independently. Also, one has to account for the unknown time interval where the root of the subtree is still undifferentiated (see Supplementary Figure 6). We represent each tree Di as a factor graph (Fig. 2d and Supplementary Methods). The factor graph models the dynamics of the delay process on the tree structure, whose state is only known at the leaves of the tree, where an onset is observed. We use message passing to integrate out all unobserved variables in the graph and thereby obtain the likelihood L(Di|θ, η)52. The sum over H in Eq. 2 consists of a large number of terms (it is double exponential in the number of cells53), hence an explicit summation is prohibitive for larger trees. However, the sum can efficiently be evaluated using dynamic programming (see Supplementary Methods).
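The idea behind evaluating the sum in Eq. 2 without enumerating every hidden tree can be illustrated on a heavily simplified toy model. In the sketch below, each cell carries one weight for "the decision happens in this cell" (`w_diff`, standing in for a whole term L(Di|θ, η)) and one for "this cell stays undecided" (`w_undiff`). This collapses the factor-graph computation into a single number per cell and ignores the constraint imposed by observed marker onsets, so it is not the actual algorithm of the Supplementary Methods; the tree encoding and all names are assumptions for illustration.

```python
from itertools import product

# A tree is a nested tuple: (name, left, right) for a dividing cell,
# (name,) for a cell without tracked daughters.


def sum_over_hidden_trees(node, w_diff, w_undiff):
    """Sum of the weights of all hidden trees below `node`, by recursion."""
    stays_undecided = w_undiff[node[0]]
    for child in node[1:]:
        # Daughter subtrees are independent given an undecided mother.
        stays_undecided *= sum_over_hidden_trees(child, w_diff, w_undiff)
    # Either the decision happens in this cell, or it is deferred to the progeny.
    return w_diff[node[0]] + stays_undecided


def enumerate_hidden_trees(node):
    """Yield every hidden tree explicitly as (deciding cells, undecided cells)."""
    yield [node[0]], []
    per_child = [list(enumerate_hidden_trees(child)) for child in node[1:]]
    for combo in product(*per_child):
        deciding = [c for dec, _ in combo for c in dec]
        undecided = [node[0]] + [c for _, und in combo for c in und]
        yield deciding, undecided


def brute_force_sum(node, w_diff, w_undiff):
    """Reference implementation: add up the weight of each enumerated hidden tree."""
    total = 0.0
    for deciding, undecided in enumerate_hidden_trees(node):
        weight = 1.0
        for cell in deciding:
            weight *= w_diff[cell]
        for cell in undecided:
            weight *= w_undiff[cell]
        total += weight
    return total
```

In this toy setting a complete binary tree with seven cells already has 26 hidden trees (2 per leaf, 1 + 2² = 5 per two-generation subtree, 1 + 5² = 26 in total), while the recursion visits each cell only once.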

Using Eqs. 2 and 3, we can now perform maximum likelihood estimation of the underlying model parameters θ, η given a set of observed trees T1, …, Tn:

$\hat{\theta}, \hat{\eta} = \operatorname{argmax}_{\theta, \eta} \sum_{i=1}^{n} \log L(T_i \mid \theta, \eta).$
(4)

To solve the above optimization problem, we apply a standard multiple-restart (Latin Hypercube54) optimization routine. Having learned the parameters $\hat{\theta}, \hat{\eta}$ via Eq. 4, we predict differentiation times and cells in the genealogies. For an observed tree T, we select the most likely hidden tree $\hat{H}$ from the set of all possible hidden trees according to

$\hat{H} = \operatorname{argmax}_{H \in \mathcal{H}_T} L(H \mid \hat{\theta}, \hat{\eta})$
(5)

$\hat{H}$ is calculated recursively to avoid enumerating the entire set $\mathcal{H}_T$ (Supplementary Methods) and is used to predict which cells most likely differentiated. Note that one can additionally obtain the k-most likely hidden trees and their corresponding likelihoods (see Supplementary Figure 7).
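A recursive search for the most likely hidden tree can be sketched on a simplified toy model in which every cell has one weight for deciding in that cell (`w_diff`) and one for remaining undecided (`w_undiff`). These per-cell weights, the nested-tuple tree encoding, and the function name are assumptions for illustration; the actual recursion operates on the subtree likelihoods described in the Supplementary Methods.

```python
def best_hidden_tree(node, w_diff, w_undiff):
    """Return (weight, deciding_cells) of the highest-weight hidden tree.

    A tree is a nested tuple: (name, left, right) or (name,) for a cell
    without tracked daughters.
    """
    name = node[0]
    deferred_weight = w_undiff[name]
    deferred_cells = []
    for child in node[1:]:
        # The best choice in each daughter subtree can be made independently.
        weight, cells = best_hidden_tree(child, w_diff, w_undiff)
        deferred_weight *= weight
        deferred_cells += cells
    if w_diff[name] >= deferred_weight:
        return w_diff[name], [name]          # deciding here beats deferring
    return deferred_weight, deferred_cells   # decision is deferred to the progeny
```

Replacing the sum over alternatives by a maximum in this way is the usual max-product counterpart of a sum-product recursion.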

### Branching process model for colony assays

To validate the estimated differentiation rate λ (Fig. 3c), we utilize colony assay data of single sorted HSPCs obtained in the same experimental conditions as the genealogies. Single HSPCs are sorted into separate microwells and form colonies over ten days. These colonies are classified into three categories: GM-colonies, which contain only granulocytes and monocytes; MegE-colonies, which contain only megakaryocytes and erythrocytes; and GMMegE colonies, which contain cells from both the GM and the MegE lineage. The relative frequencies and confidence intervals of these three colony types over the course of ten days are reported in Extended Data Fig. 6c of Hoppe et al.7. We consider only the data from day ten where the colony assay frequencies have stabilized.

Intuitively, the colony assay frequencies depend on the differentiation rate: early lineage choice will increase the fraction of homogeneous (GM or MegE) colonies whereas late lineage choice will give rise to mostly GMMegE colonies. This intuition can be formalized into a mathematical model (see Marr et al.18 for details). The model has two parameters: the differentiation rate λ(t) (related to the differentiation probability Φ(t), see Supplementary Note 1) and the lineage probabilities pGM and pMegE (pGM + pMegE = 1) for a single cell to pick either one or the other lineage upon differentiation. With these two parameters, one can derive recursive equations for the probability of observing a GM, MegE, and GMMegE colony as a function of cell generations. The frequency of a GM-colony after N generations is:

$F_{GM}(N) = f_{GM}(N, 1)$

where

$f_{GM}(N, i) = \begin{cases} \lambda_i\, p_{GM} + (1 - \lambda_i)\, f_{GM}(N - 1, i + 1)^2 & N > 0 \\ 0 & N = 0 \end{cases}$

The recursion is to be understood as follows: to yield a homogeneous GM colony, either the founding cell of the colony must differentiate and choose the GM lineage (first term of the sum), or the founding cell does not differentiate, but both its daughters in turn form homogeneous GM colonies (second term in the sum). The variable i in fGM(N, i) is a mere bookkeeping device that keeps track of the current generation (due to a generation-dependent λ). Note that in the case of constant λ(i) = λ, the result from Marr et al.18 is obtained. An analogous formula applies for FMegE(N) and by definition, FGMMegE(N) = 1 − FGM(N) − FMegE(N).
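The recursion translates almost literally into code. In the minimal sketch below, `lam` is any function returning the generation-wise differentiation probability λ(i); the function names and the numbers in the usage note are illustrative and not the estimated values.

```python
def homogeneous_fraction(n_generations, lam, p_lineage, i=1):
    """f(N, i): probability that a cell in generation i founds a colony of a
    single lineage within N further generations.

    lam(i) is the probability to differentiate in generation i and
    p_lineage the probability to pick the lineage of interest.
    """
    if n_generations == 0:
        return 0.0
    lam_i = lam(i)
    daughters = homogeneous_fraction(n_generations - 1, lam, p_lineage, i + 1)
    # Differentiate now into the lineage, or defer and require that both
    # daughters found homogeneous colonies of that lineage.
    return lam_i * p_lineage + (1.0 - lam_i) * daughters ** 2


def colony_fractions(n_generations, lam, p_gm):
    """Return (F_GM, F_MegE, F_GMMegE) after n_generations."""
    f_gm = homogeneous_fraction(n_generations, lam, p_gm)
    f_mege = homogeneous_fraction(n_generations, lam, 1.0 - p_gm)
    return f_gm, f_mege, 1.0 - f_gm - f_mege
```

For example, with a constant λ(i) = 0.5 and pGM = 0.6, two generations give FGM = 0.3 + 0.5 · 0.3² = 0.345 and FMegE = 0.2 + 0.5 · 0.2² = 0.22. Evaluating `colony_fractions` over a range of `p_gm` values traces out a one-parameter curve of predicted colony frequencies.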

As the colony assay model operates in discrete time (cell generations), we discretize the continuous differentiation rate λ(t) obtained from the genealogies as follows:

$\lambda_i = E_{s_i, c_i}\left[1 - e^{-\int_{s_i}^{s_i + c_i} \lambda(\tau)\, d\tau}\right]$

where we take the expectation with respect to the birth time si of a cell in generation i and the cell cycle time ci of a cell in generation i. The expression inside the expectation is the probability to differentiate in the time interval [si, si + ci]. Here, we account for the fact that cells in generation 1 and 2 tend to have longer cell cycles than cells in subsequent generations. Hence the hazard in generation 1 and 2 is increased due to the prolonged cell cycle. The distribution of si, ci is readily estimated from the tracked genealogies (see Supplementary Figure 2).
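For the linear rate λ(t) = a0 + a1·t, the integral over [si, si + ci] equals a0·ci + a1·(si·ci + ci²/2), so the expectation can be approximated by averaging over observed (si, ci) pairs. The sketch below assumes such pairs are available as a plain list; the function name and the example numbers are hypothetical.

```python
import math


def generation_hazard(samples, a0, a1):
    """Estimate lambda_i as the sample mean of the per-cell probability to
    differentiate during the cell's lifetime.

    samples: list of (s, c) with birth time s and cell cycle length c of
    cells in generation i; a0, a1: parameters of lambda(t) = a0 + a1 * t.
    """
    total = 0.0
    for s, c in samples:
        # Closed-form integral of a0 + a1 * tau from s to s + c.
        integrated_rate = a0 * c + a1 * (s * c + 0.5 * c * c)
        total += 1.0 - math.exp(-integrated_rate)
    return total / len(samples)
```

With a constant rate (`a1 = 0`) and a single cell of cycle length 12 h, `generation_hazard([(0.0, 12.0)], 0.01, 0.0)` gives `1 - math.exp(-0.12)`, showing directly how a longer cell cycle raises the per-generation probability.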

Using the extracted generation-wise differentiation rate in the colony assay model, the only remaining free parameter is pGM (since pMegE = 1 − pGM). Note that we cannot use the fraction of GM and MegE genealogies in the time-lapse dataset as a surrogate for pGM: the genealogies are not guaranteed to have been tracked in an unbiased manner, so their proportions do not reflect the true underlying lineage probabilities.

A parameter sweep of pGM (with λ(t) fixed) traces a curve in the (FGM, FMegE, FGemM) space (see Fig. 3f), with each point on the curve corresponding to a particular choice of pGM. As the curve intersects the confidence interval of the observed colony assay frequencies, the model is capable of explaining the observed colony assay frequencies. The predicted frequencies of the best fit match the observed frequencies (Fig. 3f, g).
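A sweep of this kind can be sketched in a few lines of Python. The grid search and the least-squares criterion used to pick the best fit are assumptions made for this illustration (the fitting criterion is not specified here), and the rates and "observed" frequencies are synthetic.

```python
from typing import List, Sequence, Tuple

Frequencies = Tuple[float, float, float]  # (F_GM, F_MegE, F_GemM)


def colony_frequencies(rates: Sequence[float], p_gm: float) -> Frequencies:
    """Colony frequencies after len(rates) generations (recursion above)."""

    def homogeneous(p: float) -> float:
        f = 0.0  # base case f(0, i) = 0
        for lam in reversed(rates):
            f = lam * p + (1.0 - lam) * f ** 2
        return f

    f_gm, f_mege = homogeneous(p_gm), homogeneous(1.0 - p_gm)
    return f_gm, f_mege, 1.0 - f_gm - f_mege


def sweep_p_gm(
    rates: Sequence[float], n_grid: int = 101
) -> List[Tuple[float, Frequencies]]:
    """Curve in (F_GM, F_MegE, F_GemM) space traced by p_GM in [0, 1]."""
    grid = [k / (n_grid - 1) for k in range(n_grid)]
    return [(p, colony_frequencies(rates, p)) for p in grid]


def best_fit(
    rates: Sequence[float], observed: Frequencies, n_grid: int = 101
) -> Tuple[float, Frequencies]:
    """Grid point with the smallest squared distance to the observation
    (least squares is an assumption of this sketch)."""

    def squared_error(item: Tuple[float, Frequencies]) -> float:
        return sum((a - b) ** 2 for a, b in zip(item[1], observed))

    return min(sweep_p_gm(rates, n_grid), key=squared_error)


if __name__ == "__main__":
    rates = [0.2, 0.3, 0.4]  # hypothetical generation-wise rates
    observed = colony_frequencies(rates, 0.3)  # synthetic "observation"
    print(best_fit(rates, observed))
```

In this synthetic setting the sweep recovers the value of pGM that generated the data; with real colony assay frequencies one would additionally check whether the curve passes through their confidence region, as described above.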

### Data availability

The datasets analysed during the current study are available from the corresponding author on request.

### Code availability

An implementation of the computational method is available at https://github.com/QSCD/tree-inference.

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## References

1. Skylaki, S., Hilsenbeck, O. & Schroeder, T. Challenges in long-term imaging and quantification of single-cell dynamics. Nat. Biotechnol. 34, 1137–1144 (2016).
2. Hilsenbeck, O. et al. Software tools for single-cell tracking and quantification of cellular and molecular properties. Nat. Biotechnol. 34, 703–706 (2016).
3. Etzrodt, M., Endele, M. & Schroeder, T. Quantitative single-cell approaches to stem cell research. Cell Stem Cell 15, 546–558 (2014).
4. Etzrodt, M. & Schroeder, T. Illuminating stem cell transcription factor dynamics: long-term single-cell imaging of fluorescent protein fusions. Curr. Opin. Cell Biol. 49, 77–83 (2017).
5. Kaufmann, B. B., Yang, Q., Mettetal, J. T. & van Oudenaarden, A. Heritable stochastic switching revealed by single-cell genealogy. PLoS Biol. 5, e239 (2007).
6. Rieger, M. A., Hoppe, P. S., Smejkal, B., Eitelhuber, A. C. & Schroeder, T. Hematopoietic cytokines can instruct lineage choice. Science 325, 217–218 (2009).
7. Hoppe, P. S. et al. Early myeloid lineage choice is not initiated by random PU.1 to GATA1 protein ratios. Nature 535, 299–302 (2016).
8. Krumsiek, J., Marr, C., Schroeder, T. & Theis, F. J. Hierarchical differentiation of myeloid progenitors is encoded in the transcription factor network. PLoS ONE 6, e22649 (2011).
9. Huang, S., Guo, Y.-P., May, G. & Enver, T. Bifurcation dynamics in lineage-commitment in bipotent progenitor cells. Dev. Biol. 305, 695–713 (2007).
10. Roeder, I. & Glauche, I. Towards an understanding of lineage specification in hematopoietic stem cells: a mathematical model for the interaction of transcription factors GATA-1 and PU.1. J. Theor. Biol. 241, 852–865 (2006).
11. Strasser, M. K., Theis, F. J. & Marr, C. Stability and multiattractor dynamics of a toggle switch based on a two-stage model of stochastic gene expression. Biophys. J. 102, 19–29 (2012).
12. Potoyan, D. A. & Wolynes, P. G. Dichotomous noise models of gene switches. J. Chem. Phys. 143, 195101 (2015).
13. Bokes, P., King, J. R. & Loose, M. A bistable genetic switch which does not require high co-operativity at the promoter: a two-timescale model for the PU.1-GATA-1 interaction. Math. Med. Biol. 26, 117–132 (2009).
14. Duff, C., Smith-Miles, K., Lopes, L. & Tian, T. Mathematical modelling of stem cell differentiation: the PU.1-GATA-1 interaction. J. Math. Biol. 64, 449–468 (2012).
15. Graf, T. & Enver, T. Forcing cells to change lineages. Nature 462, 587–594 (2009).
16. Eilken, H. et al. Continuous long-term detection of live cell surface markers by ‘in culture’ antibody staining. Nat. Protoc. Exch. https://doi.org/10.1038/protex.2011.205 (2011).
17. Eilken, H. M., Nishikawa, S.-I. & Schroeder, T. Continuous single-cell imaging of blood generation from haemogenic endothelium. Nature 457, 896–900 (2009).
18. Marr, C., Strasser, M. K., Schwarzfischer, M., Schroeder, T. & Theis, F. J. Multi-scale modeling of GMP differentiation based on single-cell genealogies. FEBS J. 279, 3488–3500 (2012).
19. Orkin, S. H. & Zon, L. I. Hematopoiesis: an evolving paradigm for stem cell biology. Cell 132, 631–644 (2008).
20. Gillespie, D. T. A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. J. Comput. Phys. 22, 403–434 (1976).
21. McIvor, Z. et al. Transient expression of PU.1 commits multipotent progenitors to a myeloid fate whereas continued expression favors macrophage over granulocyte differentiation. Exp. Hematol. 31, 39–47 (2003).
22. Pevny, L. et al. Erythroid differentiation in chimaeric mice blocked by a targeted mutation in the gene for transcription factor GATA-1. Nature 349, 257–260 (1991).
23. Scott, E. W., Simon, M. C., Anastasi, J. & Singh, H. Requirement of transcription factor PU.1 in the development of multiple hematopoietic lineages. Science 265, 1573–1577 (1994).
24. Gupta, P. B. et al. Stochastic state transitions give rise to phenotypic equilibrium in populations of cancer cells. Cell 146, 633–644 (2011).
25. Yang, Z. & Rannala, B. Molecular phylogenetics: principles and practice. Nat. Rev. Genet. 13, 303–314 (2012).
26. Hasegawa, M., Kishino, H. & Yano, T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22, 160–174 (1985).
27. Glauche, I., Lorenz, R., Hasenclever, D. & Roeder, I. A novel view on stem cell development: analysing the shape of cellular genealogies. Cell Prolif. 42, 248–263 (2009).
28. Hawkins, E. D., Markham, J. F., McGuinness, L. P. & Hodgkin, P. D. A single-cell pedigree analysis of alternative stochastic lymphocyte fates. Proc. Natl Acad. Sci. USA 106, 13457–13462 (2009).
29. Duffy, K., Wellard, C. & Markham, J. F. Activation-induced B cell fates are selected by intracellular stochastic competition. Science 335, 338–341 (2012).
30. Buggenthin, F. et al. Prospective identification of hematopoietic lineage choice by deep learning. Nat. Methods 14, 403–406 (2017).
31. Niederberger, T. et al. Factor graph analysis of live cell imaging data reveals mechanisms of cell fate decisions. Bioinformatics 31, 1816–1823 (2015).
32. Olariu, V. et al. Modified variational Bayes EM estimation of hidden Markov tree model of cell lineages. Bioinformatics 25, 2824–2830 (2009).
33. Hormoz, S. et al. Inferring cell-state transition dynamics from lineage trees and endpoint single-cell measurements. Cell Syst. 3, 419–433.e8 (2016).
34. Wang, L. D. & Wagers, A. J. Dynamic niches in the origination and differentiation of haematopoietic stem cells. Nat. Rev. Mol. Cell Biol. 12, 643–655 (2011).
35. Strasser, M. K., Feigelman, J., Theis, F. J. & Marr, C. Inference of spatiotemporal effects on cellular state transitions from time-lapse microscopy. BMC Syst. Biol. 9, 1–17 (2015).
36. Schwanhausser, B. et al. Corrigendum: global quantification of mammalian gene expression control. Nature 495, 126–127 (2013).
37. Paul, F. et al. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell 163, 1663–1677 (2015).
38. Velten, L. et al. Human haematopoietic stem cell lineage commitment is a continuous process. Nat. Cell Biol. 19, 271–281 (2017).
39. Kueh, H. Y., Champhekhar, A., Nutt, S. L., Elowitz, M. B. & Rothenberg, E. V. Positive feedback between PU.1 and the cell cycle controls myeloid differentiation. Science 341, 670–673 (2013).
40. Levine, J. H., Fontes, M. E., Dworkin, J. & Elowitz, M. B. Pulsed feedback defers cellular differentiation. PLoS Biol. 10, e1001252 (2012).
41. Ditadi, A. et al. Human definitive haemogenic endothelium and arterial vascular endothelium represent distinct lineages. Nat. Cell Biol. 17, 580–591 (2015).
42. Hanna, J. H. et al. Direct cell reprogramming is a stochastic process amenable to acceleration. Nature 462, 595–601 (2009).
43. Buganim, Y. et al. Single-cell expression analyses during cellular reprogramming reveal an early stochastic and a late hierarchic phase. Cell 150, 1209–1222 (2012).
44. Morris, R., Sancho-Martinez, I., Sharpee, T. O. & Izpisua Belmonte, J. C. Mathematical approaches to modeling development and reprogramming. Proc. Natl Acad. Sci. USA 111, 5076–5082 (2014).
45. Magee, J. A., Piskounova, E. & Morrison, S. J. Cancer stem cells: impact, heterogeneity, and uncertainty. Cancer Cell 21, 283–296 (2012).
46. Filipczyk, A. et al. Network plasticity of pluripotency transcription factors in embryonic stem cells. Nat. Cell Biol. 17, 1235–1246 (2015).
47. Hocine, S., Raymond, P. & Zenklusen, D. Single-molecule analysis of gene expression using two-color RNA labeling in live yeast. Nat. Methods 10, 119–121 (2012).
48. Dietrich, J. et al. Venus trap in the mouse embryo reveals distinct molecular dynamics underlying specification of first embryonic lineages. EMBO Rep. 16, 1005–1021 (2015).
49. Coutu, D. L. & Schroeder, T. Probing cellular processes by long-term live imaging: historic problems and current solutions. J. Cell Sci. 126, 3805–3815 (2013).
50. Kaern, M., Elston, T. C., Blake, W. J. & Collins, J. J. Stochasticity in gene expression: from theories to phenotypes. Nat. Rev. Genet. 6, 451–464 (2005).
51. Lee, E. & Go, O. Survival analysis in public health research. Annu. Rev. Public Health 18, 105–134 (1997).
52. Bishop, C. M. Pattern Recognition and Machine Learning (Springer-Verlag, New York, 2006).
53. Aho, A. & Sloane, N. Some doubly exponential sequences. In The Fibonacci Quarterly (Bell Laboratories, Murray Hill, NJ, 1973).
54. McKay, M., Beckman, R. & Conover, W. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21, 239–245 (1979).

## Acknowledgements

We thank Florian Büttner, Felix Buggenthin, and Jan Hasenauer for helpful discussions on the manuscript and Rene Schoeffel for computational support. This work was supported by the German Science Foundation DFG (project “Inference of Differentiation Decision Times from Blood Stem Cell Genealogies” to C.M. and SPP 1356 to F.J.T.) and by the SNF to T.S.

## Author information

### Author notes

• Michael K. Strasser

Present address: Institute for Systems Biology, 401 Terry Ave N, Seattle, WA, 98109, USA

### Affiliations

1. Institute of Computational Biology, Helmholtz Zentrum München, 85764, Neuherberg, Germany

   • Michael K. Strasser, Fabian J. Theis & Carsten Marr

2. Department of Biosystems Science and Engineering (D-BSSE), ETH Zurich, 4058, Basel, Switzerland

   • Philipp S. Hoppe, Dirk Loeffler, Konstantinos D. Kokkaliaris & Timm Schroeder

3. Department of Mathematics, Technische Universität München, 85748, Garching, Germany

   • Fabian J. Theis
### Contributions

T.S. and F.J.T. conceived the project. M.K.S. implemented the models and algorithms and analyzed the data with C.M. P.S.H. conducted the experiments with D.L. and K.D.K., provided experimental data and analyzed the time-lapse data. T.S. planned and supervised generation of experimental data. F.J.T. contributed to model design. T.S. and F.J.T. contributed to discussion of the method and the results and contributed to the manuscript. M.K.S. and C.M. designed the study and wrote the manuscript with T.S.

### Competing interests

The authors declare no competing interests.

### Corresponding authors

Correspondence to Timm Schroeder or Carsten Marr.

## Electronic supplementary material

### DOI

https://doi.org/10.1038/s41467-018-05037-3