A lineage tree-based hidden Markov model quantifies cellular heterogeneity and plasticity

Mohammadi, Farnaz; Visagan, Shakthi; Gross, Sean M.; Karginov, Luka; Lagarde, J. C.; Heiser, Laura M.; Meyer, Aaron S.

doi:10.1038/s42003-022-04208-9

Download PDF

Article
Open access
Published: 17 November 2022

A lineage tree-based hidden Markov model quantifies cellular heterogeneity and plasticity

Communications Biology volume 5, Article number: 1258 (2022) Cite this article

2548 Accesses
1 Citations
11 Altmetric
Metrics details

Subjects

Abstract

Individual cells can assume a variety of molecular and phenotypic states and recent studies indicate that cells can rapidly adapt in response to therapeutic stress. Such phenotypic plasticity may confer resistance, but also presents opportunities to identify molecular programs that could be targeted for therapeutic benefit. Approaches to quantify tumor-drug responses typically focus on snapshot, population-level measurements. While informative, these methods lack lineage and temporal information, which are particularly critical for understanding dynamic processes such as cell state switching. As new technologies have become available to measure lineage relationships, modeling approaches will be needed to identify the forms of cell-to-cell heterogeneity present in these data. Here we apply a lineage tree-based adaptation of a hidden Markov model that employs single cell lineages as input to learn the characteristic patterns of phenotypic heterogeneity and state transitions. In benchmarking studies, we demonstrated that the model successfully classifies cells within experimentally-tractable dataset sizes. As an application, we analyzed experimental measurements in cancer and non-cancer cell populations under various treatments. We find evidence of multiple phenotypically distinct states, with considerable heterogeneity and unique drug responses. In total, this framework allows for the flexible modeling of single cell heterogeneity across lineages to quantify, understand, and control cell state switching.

Dissecting transition cells from single-cell transcriptome data through multiscale stochastic dynamics

Article Open access 23 September 2021

Stochastic modeling of phenotypic switching and chemoresistance in cancer cell populations

Article Open access 26 July 2019

CoSpar identifies early cell fate biases from single-cell transcriptomic and lineage information

Article 21 February 2022

Introduction

Chemotherapy and targeted therapies selectively eliminate fast-proliferating or oncogene-addicted cells and are among the primary treatments for cancer. However, long-term therapeutic efficacy is inevitably limited by widespread intratumoral heterogeneity^1,2. Cell-to-cell variability in drug response can originate from cell-intrinsic factors—such as genomic alterations, epigenetic mechanisms like changes in chromatin state³, and variable protein levels^4,5—or cell-extrinsic factors such as spatial variability in the surrounding vasculature and environmental stressors^6,7,8. Moreover, cell plasticity, where cells adopt new characteristics such as those of other cell types, is observed in cancer cells, and can affect their sensitivity to therapy⁹.

Large-scale profiling studies can find molecular features that associate with drug response using population-level samples^10,11. These associations, while valuable, can miss the contribution of cell-to-cell heterogeneity, and especially stochastic changes in individual cell states that compound to effects on overall tumor drug response^3,12,13. The most common methods for quantifying drug response are metrics of tumor cell population expansion or contraction^14,15,16,17. Recent research has made efforts to track phenotypic measurements of fitness at the single-cell level^18,19, however, even single-cell measurements are typically performed with snapshots that subsequently miss the role of individual cells in the overall population response²⁰. Though population heterogeneity is usually defined through molecular measurements, studies that have explicitly linked molecular and phenotypic variation have been able to identify mechanisms that underly cell-to-cell variation that would otherwise remain hidden²¹, and studies starting with phenotypic analysis have generally found that phenotypic variability arises from a small number of molecular factors leading to the phenotypic variation^4,22,23.

Measurements accompanied by lineage relationships are uniquely valuable for studying inherited phenotypes within families of individuals. This value is evident in linkage studies wherein relatives are used to identify or refine the genetic determinants of disease^24,25,26. Notably, linkage studies can identify genetic determinants with greater power than even large association studies because relatives essentially serve as internal controls²⁷. Linkage studies also start with the phenotype of individuals, rather than grouping based on molecular differences, ensuring discoveries are phenotypically consequential. While the inherited factors are different between cells (e.g., proteins, RNA) and people (DNA), such approaches are likely to be similarly useful with populations of cells. Recently, constructing phylogenic trees of cancer cells using lineage tracing and single-cell sequencing has helped to characterize the directionality of metastatic seeding, though these methods are limited to tracking slow processes such as mutational differences²⁸. Lineage-resolved data has also demonstrated value in uncovering cell-to-cell heterogeneity due to transient differences outside of cancer^22,23. Therefore, tools to analyze and explore these data will be critical to uncovering new forms and sources of cell-to-cell variation.

Hidden Markov models (HMMs) provide an efficient strategy to infer discrete states from measurements when a series of co-dependent observations are made. An example of this is their widespread use in time series analysis, where each measurement is dependent on those that came before^29,30. Recognizing this co-dependence allows HMMs to make accurate inferences even in the presence of extremely noisy measurements since each neighboring measurement provides accumulating evidence³¹. These models derive their relative simplicity by assuming a Markov process, meaning that the current behavior of a system can be assumed to be independent of its earlier history should its current state be known. This assumption naturally applies in many contexts. In the case of cells, this assumption aptly captures cell inheritance because daughter cells inherit both molecular signals and their environment from their predecessor. Indeed, several recent examples of cell-to-cell inheritance mechanisms can be represented as a Markov process through linear chains or cycles of states^12,22,23. HMMs have been adapted to lineage trees (tHMMs) so that each measurement across the tree can similarly provide accumulating evidence for a prediction. Just like with time-series data, these models can provide very accurate predictions despite noisy measurements and limited information by recognizing the co-dependence between related measurements^32,33. tHMMs have been used in a multitude of applications, from image classification to comparative genomics^34,35. These models have been fit to lineages collected from stem cells and bacteria colonies, but have always required custom implementations^36,37. Improvements in cell tracking and high-throughput imaging promise to make tHMM models valuable techniques for studying the plasticity of heterogeneous cell populations. However, widespread use of these models still depends on more easily usable implementations, examples of successful tHMM-based discoveries, and standards for experimental application.

Here, we develop an extensible implementation of tHMMs with a defined interface for integrating diverse types of measurements on cell lineage trees. This model allows us to quantify the dynamics and phenotypic features of drug response heterogeneity. We leveraged information about the relationships between cells to analyze the cell cycle responses of populations of breast cancer cells to a panel of therapies, and how normal breast cells respond to growth factor treatment. Single-cell measurements of the cell cycle revealed extensive variation not captured by population-level measurement. Using the tHMM model, we inferred the number of phenotypically distinct subpopulations, the characteristics of those subpopulations, the transition probabilities from one state to another, and each cell’s expected state. We also confirmed that the tHMM model could use patterns of inheritance to predict cell behavior. This work, therefore, provides a flexible phenotype-driven route to discovering cell-to-cell variation in drug response, demonstrates an overall strategy for quantifying the dynamics of cell heterogeneity, and implements a very general software tool for the widespread use of tHMM models.

Results

Lineage information provides unique information about the source and structure of cell-to-cell heterogeneity

Single cells grow and then divide into two daughter cells, eventually forming a binary genealogical tree, also known as a lineage tree. We collected single-cell measurements in the form of lineage trees to track these relationships. The life cycle of each cell before division includes G1, S, G2, and M phases that must pass one after another. To illustrate the unique value of lineage measurements in analyzing intra-tumoral and drug response heterogeneity, we collected cell fates (whether cells ultimately divide or die) alongside either cell lifetimes (MCF10A) or individual cycle phase durations (AU565). Two random subsets of the tracked lineages of the breast cancer cell line AU565 are plotted in Fig. 1a. The single cell lineages reveal striking variation in cell cycle phase durations and cell division dynamics despite coming from the same sample. Population-level measurements would be unable to identify this difference as the starting and ending cell numbers are the same. Measurements that record or reflect the history of cells (e.g., CFSE staining, Luria-Delbruck experiment) can help to identify these variations within cell populations but must make assumptions about the dynamics of heterogeneity^13,23. Lineage measurements, by contrast, provide sufficiently rich temporal information to quantify the specific structure of the phenotypic heterogeneity.

**Fig. 1: Total cell number is insufficient to distinguish the structure of heterogeneous populations.**

As further exploration of the cell tracking data, we randomly sampled lineages from gemcitabine-treated AU565 cells (Fig. 1b). Gemcitabine is a chemotherapy agent that disrupts DNA replication and results in the extension of and apoptosis in S phase³⁸. We found that the S/G2 phase lengths in treated cells were noticeably extended compared to untreated cells, slowing population growth. There was generally striking variation between lineages of a single condition, including anywhere from zero to three cell divisions, but tightly shared behavior among cells and their relatives in each lineage. These observations demonstrate some of the unique advantages of collecting lineage-based measurements.

A lineage tree-based hidden Markov model infers the state of cells given measurements on lineage trees

Given the unique insights that single-cell measurements on lineage trees can provide, we implemented a strategy for classifying cells based on their phenotype and lineage relationships. We used a tree-based hidden Markov model (tHMM) to fit a set of measurements made across a lineage tree (Fig. 2a). Like a typical hidden Markov model, a tHMM can infer the hidden discrete “states” of cells given a series of measurements where a state is defined by specific phenotype distributions. The inference of these states takes place using an iterative strategy wherein the states of each cell are predicted by the phenotype of both the cell and its relatives in a lineage (expectation step), and then each distribution of phenotypes is fit to match the cells within that state (maximization step) (Fig. 2b). This expectation-maximization (EM) process repeats until convergence.

After fitting, the model can provide a variety of information (Fig. 2c). First, it infers the starting and transition probabilities of each state. Second, the distribution of cells’ phenotypes in each state are estimated and can be compared to distinguish how cells of each state behave. For instance, if we use the growth rates of cells as their phenotype, we may observe a subpopulation of cells with shorter times to division, and another with longer times. Moreover, the state of each individual cell can be predicted from the fit data or new measurements. Finally, the model provides a likelihood of each cell’s observations and therefore the data overall. This last quantity can be used, for example, to estimate the number of distinguishable cell states. When implementing these processes, we ensured that a cell’s measurements were defined through a modular interface, allowing many other forms of data to be easily integrated, such as cell morphology or molecular measurements.

Experiments of finite time necessitate corrections for experimental censorship

Modeling the duration of each cell’s lifetime is complicated by the influence of experimental parameters. Specifically, cells measured at the beginning or end of an experiment persist beyond the experiment’s duration and so, while we observe these cells, we do not know their exact lifetimes. Data censorship occurs when a measurement is systematically affected by an undesired influence. For instance, in our case, phase durations are censored because the experiment started after cells had already begun their initial cell cycle phase or the experiment ended before they had completed their last phase. Previously, this has been addressed by removing incompletely observed cells²². However, doing so results in a systematic bias, where longer-lived cells are preferentially eliminated. On the other hand, ignoring the truncation of these values also creates bias by creating an upper bound on the cells’ lifetimes (Fig. 3b, c).

**Fig. 3: Experiments of finite time necessitate data censorship corrections.**

To correct for this effect in our model, we marked cells that encountered the start or end bounds of the experiment. When estimating the properties of these cells’ lifetime we instead used a censored estimator or the survival function of the distribution³⁹. Because the labels are interchangeable in our classification, we used the adjusted rand index⁴⁰, a similarity measure that can serve as an accuracy measure for clustering results. Using synthetic data, we verified that this correction resulted in accurate phenotype estimations (Fig. 3d, Supplementary Figs. 3, 10). Thus, accounting for cells that outlive the bounds of the experiment through a censored estimator removes the contribution of this experimental confounder.

Synthetic lineage benchmarks show a tHMM can accurately infer population behavior

To evaluate how accurately a tHMM model could infer the behavior of multi-state cell populations, we used synthetic populations of cells in a wide variety of configurations, such as various populations sizes, numbers of states, and abundance of the states. In each case, we determined that the tHMM model could accurately infer the hidden states and parameters of a population given at least 100 cells. This synthetic data included uncensored (Supplementary Figs. 1, 2, 8, 9; Supplementary Tables 1, 2) or censored (Supplementary Figs. 4, 10, 3, 15; Supplementary Tables 1–3) situations. Synthetic data were created by lengthening the simulated experiment time, in effect creating deeper lineages, or by increasing the number of initial cells to have a greater number of lineages, increasing the experiment’s breadth. In addition to varying the number of cells in a population, we benchmarked populations with varied cell state percentages (Supplementary Figs. 4, 5) and varied the degree of phenotypic differences between states (Supplementary Figs. 6, 7; Fig. 5). This benchmarking consistently showed that the tHMM model would provide accurate results across a range of circumstances, and generally provided accurate results with datasets consisting of at least 10 lineages, 100 cells overall, and 10 cells from each state.

More specifically, one of the benchmarking studies we performed was with data matching our measurements of AU565, where G1 and S/G2 phase durations were represented by a gamma distribution, and their corresponding cell fate represented by a Bernoulli distribution (Fig. 4). The choice of the gamma distribution for cell cycle phase was inspired by a previous study⁴¹ and verified by evaluating a variety of distributions; the gamma distribution fit the cell lifetime data best. Although the tHMM model was fit with no information about the true underlying parameters of the simulated cells, it distinguished the pre-assigned two underlying cell states’ phenotypes (Fig. 4b–d) and member cells with >95% accuracy (Fig. 4e). The Wasserstein distance metric was used to quantify the difference between the true and estimated cell cycle phase duration distributions to show the accuracy of parameter estimation (Fig. 4d). On the population level, the difference between the true and estimated transition probabilities, as calculated by the sum of squared difference, was less than 0.1 for 100 cells or more. Starting probabilities were compared to their corresponding true values using the Euclidean distance and showed less than a 0.2 error for populations with 10 lineages or more (Fig. 4f, g). Thus, we are confident that with similar experimental data, we should derive accurate results.

**Fig. 4: Model performance on censored lineages of two states with increasing breadth and depth.**

Lineage information improves cell state identification with heritable phenotypes

Cells of even very distinct molecular states can have partly overlapping phenotypes due to non-heritable variation. Therefore, we sought to evaluate how different two states need to be for us to accurately identify them as distinct (Fig. 5a). We varied the G1 phase duration of two states from identical to very distinct (Fig. 5b) and quantified the state assignment accuracy of our model (Fig. 5c). While the phenotypic observation of a given state had to be different for our model to accurately assign cells, even moderately overlapping phenotypes (Wasserstein distance of ~20) could be distinguished by using the lineage relationships of cells. As a baseline comparison, we analytically identified the optimal classifier in the absence of lineage information (see Methods). The tHMM consistently outperformed this approach (Fig. 5c). The model performance in censored and uncensored populations was similar (Supplementary Figs. 6, 7). This shows that lineage relationships can be used to identify cell states with partially overlapping phenotypes more accurately.

**Fig. 5: Model performance versus the difference between states.**

Likelihood-based model selection can effectively identify the number of distinct states

One does not usually know the number of distinct cell states within a population. Further, the number of distinct states may vary depending upon the environmental context of the cells, particularly for phenotypic measurements^42,43. To test whether we could infer the number of phenotypically distinct states, we performed model selection using the Bayesian information criterion (BIC) while varying the number of states in synthetic data (Fig. 6). We normalized the BIC values such that zero corresponds to the state with the highest likelihood. The synthetic populations included approximately 250 to 650 cells with known cell phase fate and phase lengths (Supplementary Table 3). The inferred number of cell states was consistently correct for both uncensored (Fig. 6a–d) and censored lineages (Fig. 6e–h). This indicated that model selection can help to identify the appropriate number of cell states for a set of measurements.

**Fig. 6: Model selection effectively identifies the number of distinct states in synthetic data.**

tHMM infers several distinct subpopulations in experimental drug response data

As an application of our model, we used phenotypic measurements from two cell lines. With the first, AU565, we measured of the G1 and S/G2 phase durations and terminal cell fates of cells in a control condition and when treated with 3 concentrations of gemcitabine or lapatinib. For the second, MCF10A, we measured the overall cell lifetimes and terminal fates of cells treated with PBS or single concentrations of the growth factors EGF, HGF, or OSM. Cells were imaged every 30 minutes and then tracked over time to assemble lineage relationships. The lapatinib and gemcitabine-treated AU565 populations (including control) contained a total of 5290 and 4537 cells, respectively. The MCF10A population contained 1306 cells. Lineages included 1–5 generations of cells. The model was fit to each experiment’s data across all conditions, enforcing that the initial and transition probabilities are shared across concentrations but allowing the phenotype distributions to vary. We enforced a unidirectional phenotypic shift with drug concentration in AU565 cells, reflecting the expectation of a dose-response effect on cell phenotype within each state. The cell fate parameters were estimated without constraints. We assumed the number of states is shared across drug concentrations in AU565 cells and across growth factor treatments in MCF10A cells. To determine the number of cell states, we compared models of 1–7 states using the BIC, where the lowest BIC value across numbers of states indicates the most optimal model correcting for complexity (Fig. 7a–c). The data for each compound indicated the presence of multiple inherited states.

**Fig. 7: BIC-based model selection infers the number of phenotypically distinct states.**

To verify the model’s predictive ability, we additionally implemented a cross-validation scheme for the lineage data. Briefly, roughly 20% of the cells were chosen at random and then masked from the fitting process. The model parameters were estimated using only the unmasked cells, though all cells received state assignments through use of their relatives. At the end, the log-likelihood of the masked cells’ observations were evaluated using the fit model. We tested this cross-validation approach by creating synthetic cell populations of 2–5 true states with conditions matching the experimental data. For each scenario, we were able to identify the correct number of states based on which gave the highest log-likelihood (Supplementary Fig. 16a–c, f, g, Supplementary Table 3). Cross-validating the experimental data again confirmed the 4 and 5 phenotypic states within the lapatinib and gemcitabine data, respectively (Supplementary Fig. 16d, e). It also directly demonstrated that the inclusion of multiple states enables the tHMM model to predict unseen data, and that this prediction is dependent on inheritance; a no-inheritance model, in which all transitions were equally likely, performed relatively poorly (Supplementary Fig. 16d, e).

Lapatinib response is defined by both stable and interconverting states

We fit the lapatinib-treated data to the model with 4 states based on our BIC-based model selection, confirmed by cross-validation (Fig. 7a, Supplementary Fig. 16f). Fitting revealed states of widely varying persistence over generations, from less than a 0.01 probability of remaining in state 2 to a 0.94 probability of remaining in states 1 and 3 (Fig. 8a). Interestingly, states 2 and 4 formed a cycle wherein the most probable transition was between the two (Fig. 8a, Supplementary Fig. 11).

**Fig. 8: Lapatinib response is defined by phenotypically-distinct stable and interconverting states.**

Examining the phenotypes of each state revealed distinct drug responses. Lapatinib is an EGFR/HER2 inhibitor that induces cell cycle arrest in G1 phase⁴⁴. Every state displayed a dose-dependent increase in G1 phase lifetime with lapatinib treatment, and G1 effects were more pronounced as compared to those involving S/G2 (Fig. 8b–i, Supplementary Fig. 11). While the probability of survival at the end of the cell cycle phase decreased at higher concentrations, very few cell death events were observed (Fig. 8h, i, Supplementary Fig. 11). Consequently, the chances of cell death likely have high uncertainty at higher concentrations of lapatinib. States 2 and 4 were highly arrested in both G1 and S/G2 phase; in contrast, states 1 and 3 experienced little arrest in G1 and no arrest in S/G2 phase (Fig. 8f, g). Thus, cell states seemed to be primarily distinguished based on the degree of lapatinib response. The cycle between states 2 and 4 seems to reflect the observation that cells more highly arrested in G1 than G2/S give rise to cells that spend longer in G2/S than G1, and vice versa (Fig. 8, Supplementary Fig. 11).

Gemcitabine-treated populations are clustered into phase-specific responses

Gemcitabine is a chemotherapy agent that induces cell cycle arrest and apoptosis in S phase by disrupting DNA repair. The AU565 cells were treated with 5, 10, and 30 nM of gemcitabine; model selection, confirmed by cross-validation, inferred 5 states in the population (Fig. 7b, Supplementary Fig. 16a/g). Examining the 5-state fit revealed relatively stable states 1, 3, and 4 (Fig. 9a–e, Supplementary Fig. 12). States 2 and 5 formed a cycle with high rates of interconversion.

**Fig. 9: State-specific inferences of the gemcitabine-treated data.**

Gemcitabine modulated both G1 and S/G2 cell cycle phases and these effects were variable across the five identified states (Fig. 9f–i). State 5 showed S/G2-specific arrest and always resulted in cell death at the highest concentration (Fig. 9g, i, Supplementary Fig. 12). Meanwhile, cells in state 4 grew almost normally, with some cell death in G1 at the highest concentration (Fig. 9f–i). At the highest concentration, state 3 represents the cells arrested at S/G2 that have not divided even once, and state 5 is the representative of almost all cells undergoing cell death at S/G2 (Fig. 8i, Supplementary Fig. 12).

Lastly, we wished to explore whether the phenotypic heterogeneity we observed was limited to cancer cells or cytotoxic drug treatment. To determine this, we tracked non-tumorigenic MCF10A breast cells. These cells are normally grown in the presence of epidermal growth factor (EGF); we compared this condition to growth factor withdrawal (PBS) or rescue with hepatocyte growth factor (HGF) or oncostatin M (OSM)⁴⁵. Each growth factor consistently promoted proliferation on a population level compared to the PBS control, though with considerable inter- and intra-lineage variation (Supplementary Fig. 13). BIC-based model selection inferred the presence of 3 distinct states (Fig. 7c). Inspecting the model revealed generally more dynamic transitions between states as compared to the AU565 experiments (Supplementary Fig. 14). Due to the lack of growth factors, most cells arrested in the PBS condition; few observations of either division or death events is the reason for the division probability being 0.5 (Supplementary Fig. 14g). State 1 was distinct in being relatively less responsive to HGF and OSM treatments (Supplementary Fig. 14a/f), while state 1 displayed higher rates of cell death overall (Supplementary Fig. 14a/g). In total, the tHMM model was effective in identifying subsets of cells with divergent phenotypic responses to drug treatment alongside the relationships between cells in the population.

Discussion

Heterogeneity and plasticity in cancer cells enables them to adapt in response to therapy. Even in the absence of genetic mutations, other heritable variation serves as a substrate for selection^46,47. In this paper, we introduced a tree-based hidden Markov model that clusters single cells from heterogeneous populations based solely on their phenotypic traits and relationships. Model benchmarking showed that it can provide accurate results using feasible experimental designs. Of particular importance, the tHMM could recognize subpopulations even at lower frequencies (Supplementary Figs. 6, 7). Comparing the model to more standard clustering, the tHMM showed that lineage information helps to identify cell states more accurately (Fig. 5). Using cross-validation, we were able to show that accounting for cell inheritance allowed the model to accurately predict unseen observations (Supplementary Fig. 16). Several critical advancements in the current work are a modular interface for using tHMM models with various phenotypes (Fig. 2), proper censorship handling (Fig. 3), strategies for model evaluation (Fig. 7, Supplementary Fig. 16), and demonstrating that such a model can be applied to study cancer heterogeneity at baseline and in response to perturbation.

We used single-cell lineage tracking data of AU565 cancer cells treated with lapatinib and gemcitabine as a demonstration of the model. G1 and S/G2 cell cycle phase durations and cell fate measurements were used as relevant cell phenotypes to quantify the anti-cancer effects of these drugs. We were able to identify 4 and 5 distinct subpopulations within the lapatinib and gemcitabine-treated data, respectively (Fig. 7). The phenotypic features of each state were quantified in parallel (Figs. 8, 9). Lapatinib is known to inhibit cell proliferation by inhibiting Akt/mTOR pathway activity, which is a key regulator of G1 phase progression⁴⁸. Similarly, our analysis in the lapatinib-treated population indicated that cells, regardless of their state, experienced a prolonged G1 phase, but individual states varied in their susceptibility. In gemcitabine-treated cells, we observed that most states were highly heritable, with more varied phenotypic effects. This included cells that became arrested in S/G2 and underwent apoptosis (state 5), cells that were selectively arrested in G1 (state 1), and cells that hardly responded to drug treatment at all (state 4; Fig. 9). While gemcitabine canonically works by inducing cell arrest in G2/S, previous work has characterized its effects on G1 phase by separating the effects on both cell cycle phases⁴⁹. They similarly identified that G1 arrest was associated with cell death, which is also evident in cells of state 3 where G1 arrest is seen alongside cell death in both phases (Fig. 9f–i). Our results would further suggest that those cells with G1 effects are molecularly and heritably distinct from those that are arrested in S/G2. MCF10A cells with growth factor-induced proliferation showed a very distinct pattern of variation, suggesting that the phenotypic cell states identified by the model reflect a confluence of cell features and treatment conditions (Supplementary Figs. 13, 14).

We present several lines of evidence supporting the accuracy of the model and the existence of heritable cell states. First, across a diverse array of benchmarking experiments, we show that the model can derive accurate conclusions from synthetic data with properties like those we observed in the experimental measurements (e.g., Fig. 4). Through an informatic model selection scheme, we find statistical evidence for the existence of multiple states (Figs. 6, 7). Examining these cell states, we find patterns consistent with the biological mechanisms of the compounds we used to alter cell proliferation (Figs. 8, 9). Reassuringly, we were able to confirm that the abundance of cell states was consistent across experimental replicates, ruling out the possibility that state differences arose from day-to-day variation between experiments. Finally, we showed that the model could more effectively predict the behavior of unseen cells with the inclusion of multiple cell states, and that this prediction is dependent on allowing inheritance between cell generations (Supplementary Fig. 16). While we have considered the use of experimental control conditions, it is important to keep in mind that the variation observed here arises both through external perturbation and natural variation within the population. Consequently, we have not been able to identify a context in which one might expect to not observe multiple states, supporting the general usefulness of our approach. While experiments in which distinct cell lines are mixed can help to validate methods in which cell relationships are inferred, such as pseudotime methods⁵⁰, the cell relationships are not modeled here because they are explicitly known through the measured lineage relationships. Ultimately, experiments uncovering molecular markers and mechanisms of these cell states will provide the best independent validation for their biological significance.

Modeling advancements will further improve on our approach. Cells may express a continuum of, rather than discrete, phenotypic states²². If this is the case, a continuous latent variable model would lead to a refined view of the population-level heterogeneity. A discrete model like the one used here should, however, still provide an accurate estimate by breaking up the continuous state-space into discrete steps. Continuous latent variable models also have additional challenges in implementation and interpretation^51,52. Careful handling of each states’ phenotypic distributions might also improve the model’s accuracy and power to identify distinct states. For example, the eventual fate of cells and their cell cycle durations are likely correlated which could be handled through a multivariate distribution accounting for this covariance⁵³. This becomes even more important with the incorporation of other phenotypic information such as migration, cell shape, or other features, all of which are likely to be correlated to some extent.

Experimental advancements will improve the utility and accuracy of single-cell analysis using lineage information. Currently our experimental data is limited to 96 hours, covering up to five generations of cells. However, traits such as resistance may develop over more generations and longer timescales^13,54,55. Longer data collection becomes challenging due to factors such as phototoxicity and cell stress⁵⁶. Improved imaging modalities and experimental platforms might allow for longer tracking experiments, with reduced phototoxicity, in more physiologically representative environments such as engineered 3D extracellular matrix^57,58. Currently, the model is agnostic as to whether the heterogeneity it identifies is pre-existing or induced by drug treatment. Collecting data in which cells are tracked before and after drug treatment, and after a wash-out, would help to link pre- and post-exposure cell phenotypes⁵⁹.

While we have identified states that represent phenotypically distinct subpopulations of cells, we currently cannot comment on the molecular factors leading to these phenotypes. Molecular barcoding has been a popular approach for identifying subpopulations of cells with genetic predispositions toward unique phenotypes, but we do not expect it would identify the same subpopulations as we do here⁵⁵. Unlike in barcoding experiments, we do not see a bottleneck in the clonality of cells that survive treatment, and rapid interconversion between states should corrupt the relationship between ancestor phenotype and descendent molecular state^3,12. However, we expect that single-cell molecular analyses, such as single-cell tracking tied with transcriptional profiling of the same cells at the end of the experiment, should allow us to align molecular and phenotypic states in the same populations of cells⁶⁰. Such experiments would also provide a common baseline by which to link lineage-based phenotypic analysis and various snapshot measurements of the same cell population. In this way it should be able to pinpoint the underlying molecular mechanisms driving distinct phenotypic responses.

In total, the pipeline developed here provides a unique approach for understanding the structure of dynamic, heterogeneous tumor populations. By capturing the dynamics of state transitions, it links single-cell phenotypes to overall population behavior. Incorporating molecular measurements, and a broader set of drug interventions, will then also help to identify means of modulating state and overall population behavior. Ultimately, we expect this integrative view will help to identify treatments alone and in combination that allow for population-level control by affecting the growth of and transitions between individual cell subpopulations.

Methods

Experimental cell lineage data

Stable cell line creation, drug treatments, and tracking of AU565 and MCF10A cells were performed as described in Gross et al.⁶¹ and Gross et al.⁴⁵, respectively. Briefly, AU565 cells were co-transfected with a transposase plasmid (Addgene #34879) and a donor plasmid that drove expression of a nuclear-localized mCherry, puromycin resistance, and a fragment of HDHB fused to the clover fluorescent protein, which was used to track progression through the cell cycle⁶². Cells stably expressing the nuclear and cell cycle reporter were selected for 7 days with 0.75 µg/ml puromycin. The phase of the cells is determined based on whether the amount of fluorescence is greater within nucleus or the cytoplasm⁶². As a result, the reporter signal is invariant to changes in exposure and background. To track drug responses AU565 reporter cells were plated into 24-well plates with fluorobrite media containing 10% FBS, glutamine, and penicillin-streptomycin. 24 hours later fresh media containing escalating doses of lapatinib and gemcitabine was added. MCF10A cells were cultured in growth media (DMEM/F12, 5% horse serum, 20 ng/ml cholera toxin, 10 μg/ml insulin, and 1% Pen/Strep), grown to 50–80% confluency, and detached with 0.05% trypsin-EDTA. 7 hours after seeding 75000 cells, they were washed with PBS and the experiment media (DMEM/F12, 5% horse serum, 0.5 μg/ml hydrocortisone, 100 ng/ml cholera toxin, and 1% Pen/Strep) was added to the 8 well-plates which was followed by 18 hours of incubation. Afterward, cells were treated with growth factors 10 ng/ml EGF, 40 ng/ml HGF, and 10 ng/ml OSM in fresh experiment media. After drug addition, plates were placed in the IncuCyte S3 and four image locations per treatment were imaged every 30 minutes. AU565 were imaged for 96 hours and MCF10A cells for 48 hours. After half the experiment times, fresh media and drugs/growth factors were added. Cell lineages from the IncuCyte images were manually tracked in FIJI⁶³ to record cell division, death, and the transition from G1 to S/G2 phase (in AU565). AU565 cells are non-motile and fewer than 4% of cells were within one cell length of the image boundary, ensuring minimal sampling bias from the microscopy field of view. Three biological replicates were collected and combined in the final data set. To verify that results did not reflect batch effects, we checked that state assignments were not enriched or depleted within a replicate.

Lineage tree-based hidden Markov model

The core assumption of a Markov chain is that the next state and current observations are only dependent on the current state. Proof of the expressions below involving cell state assignment (expectation step), including the upward recursion, downward recursion, and Viterbi algorithms, can be found in Durand³³. All other model elements, including the emissions distribution fitting, model evaluation strategies, and censorship corrections were developed in this study.

Basic model structure

The initial probabilities of a cell being in state k are represented by the vector π that sums to 1:

$$\begin{array}{cc}{\pi }_{k}=P\left({z}_{1}=k\right), & k\in \left\{1,\ldots ,K\right\}\end{array}$$

(1)

where z indicates the state and K is the total number of states. The probability of state i transitioning to state j is represented by the K × K matrix, T, in which each row sums to 1:

$$\begin{array}{cc}{T}_{i,j}=T\big({z}_{i}\to {z}_{j}\big)=P\big({z}_{j}{{{{{\rm{|}}}}}}{z}_{i}\big), & i,j\in \left\{1,\ldots ,K\right\}\end{array}$$

(2)

The emission likelihood matrix, EL, is based on the cell observations. It is defined as the probability of an observation conditioned on the cell being in a specific state:

$${EL}\left(n,k\right)=P\left({x}_{n}=x{{{{{\rm{|}}}}}}{z}_{n}=k\right)$$

(3)

where x_n is the observation for cell number n, with a total of N cells in a lineage. Separate observations were assumed to be independent; for instance, cell fate is assumed to be independent from the duration of each cell phase. This facilitates calculating the likelihood of observations, because we can multiply the likelihood of all observations together for the overall likelihood.

Assigning cell states (expectation step)

Upward recursion

An upward-downward algorithm for calculating the probabilities in hidden Markov chains was proposed by Erphaim and Merhav⁶⁴ which suffered from underflow. This problem was originally solved by Levinson⁶⁵, where they adopted a heuristic-based scaling, and then was improved by Devijver⁶⁶ where they introduced smooth probabilities. Durand³³, however, revised this approach for hidden Markov trees to avoid underflow when calculating P(Z|X) probability matrices. To explain we need the following definitions:

p(n) is the parent cell of cell n, and c(n) is the children of cell n.

$\bar{X}$ is the observation of the whole tree and ${\bar{X}}_{a}$ is a subtree of $\bar{X}$ which is rooted at cell a.

$\bar{Z}$ is the complete hidden state tree.

${\bar{X}}_{a/b}$ is the subtree rooted at a except for the subtree rooted at cell b, if ${\bar{X}}_{b}$ is a subtree of ${\bar{X}}_{a}$.

For the state prediction we start by calculating the marginal state distribution (MSD) matrix. MSD is an N × K matrix that for each cell is marginalizing the transition probability over all possible current states by traversing from root to leaf cells:

$$M{SD}\left(n,k\right)=P\left({z}_{n}=k\right)=\mathop{\sum}\limits _{i}P\left({z}_{n}=k{{{{{\rm{|}}}}}}{z}_{n-1}=i\right)\times P\left({z}_{n-1}=i\right)$$

(4)

During upward recursion, the flow of upward probabilities is calculated from leaf cells to the root cells generation by generation. First, for leaf cells, the probabilities (β) are calculated by:

$${\beta }_{n}\left(k\right)=P\left({z}_{n}=k{{{{{\rm{|}}}}}}{X}_{n}={x}_{n}\right)=\frac{{EL}\left(n,k\right)\times {MSD}\left(n,k\right)}{N{F}_{l}\left(n\right)}$$

(5)

in which X_n is the leaf cell’s observation, and NF (Normalizing Factor) is an N × 1 matrix that is the marginal observation distribution. Since ${\sum }_{k}{\beta }_{n}(k)=1$, we find the NF for leaf cells using:

$$N{F}_{l}(n)=\mathop{\sum}\limits _{k}EL(n,k)\times MSD(n,k)=P({X}_{n}={x}_{n})$$

(6)

For non-leaf cells the values are given by:

$${\beta }_{n}(k)=P({z}_{n}=k|{\bar{X}}_{n}={\bar{x}}_{n})=\frac{EL(n,k)\times MSD(n,k)\times {\prod }_{v\in c(n)}{\beta }_{n,v}(k)}{N{F}_{nl}(n)}$$

(7)

where we calculate the non-leaf NF using:

$$N{F}_{nl}(n)=\mathop{\sum}\limits _{k}\left[EL(n,k)\times MSD(n,k)\mathop{\prod}\limits_{v\in c(n)}{\beta }_{n,v}(k)\right]$$

(8)

and linking β between parent-daughter cells is given by:

$${\beta }_{p(n),n}(k)=P({\bar{X}}_{n}={\bar{x}}_{n}|{z}_{p(n)}=k)=\mathop{\sum}\limits _{j}\frac{{\beta }_{n}(j)\times {T}_{k,j}}{MSD(n,j)}$$

(9)

By recursing from leaf to root cells, the β and NF matrices are calculated as upward recursion. The NF matrix gives a convenient expression for the observation log-likelihoods. For each root cell we have:

$$\begin{array}{cc}P(\bar{X}=\bar{x})=\mathop{\prod}\limits _{n}\frac{P({\bar{X}}_{n}={\bar{x}}_{n})}{{\prod }_{v\in c(n)}P({\bar{X}}_{v}={\bar{x}}_{v})}=\mathop{\sum}\limits_{n}NF(n) & n\in \{1,\ldots ,N\}\end{array}$$

(10)

The overall model log-likelihood is given by the sum over root cells:

$$logP(\bar{X}=\bar{x})=\mathop{\sum}\limits_{n}\log NF(n)$$

(11)

Downward recursion

For computing downward recursion, we need the following definition for each root cells:

$${\gamma }_{1}(k)=P({z}_{1}=k|{\bar{X}}_{1}={\bar{x}}_{1})={\beta }_{1}(k)$$

(12)

The other cells follow in an N × K matrix by writing the conditional probabilities as the summation over the joint probabilities of parent-daughter cell:

$${\gamma }_{n}(k)=P({z}_{n}=k|{\bar{X}}_{1}={\bar{x}}_{1})=\frac{{\beta }_{n}(k)}{MSD(n,k)}\mathop{\sum}\limits_{i}\frac{{T}_{i,k}{\gamma }_{p(n)}(i)}{{\beta }_{p(n),n}(i)}$$

(13)

Viterbi algorithm

Given a sequence of observations in a hidden Markov chain, the Viterbi algorithm is commonly used to find the most likely sequence of states. Equivalently, here it returns the most likely sequence of states of the cells in a lineage tree using upward and downward recursion³³.

The algorithm follows an upward recursion from leaf to root cells. We define δ, an N × K matrix:

$${\delta }_{n}\left(k\right)=\mathop{{\max }}\limits_{{\mathop{z}\limits^{-}}_{c\left(n\right)}}\big\{P\big({\bar{X}}_{n}={\bar{x}}_{n},{\bar{Z}}_{c\left(n\right)}={\bar{z}}_{c\left(n\right)}{{{{{\rm{|}}}}}}{z}_{n}=k\big)\big\}$$

(14)

and the links between parent-daughter cells as:

$${\delta }_{p\left(n\right),n}\left(k\right)=\mathop{{\max }}\limits_{{\mathop{z}\limits^{-}}_{n}}\big\{P\big({\bar{X}}_{n}={\bar{x}}_{n},{\bar{Z}}_{n}={\bar{z}}_{n}{{{{{\rm{|}}}}}}{z}_{p\left(n\right)}=k\big)\big\}=\mathop{{\max }}\limits_{k}\big\{{\delta }_{n}\left(k\right){T}_{k,k}\big\}$$

(15)

We initialize from the leaf cells as:

$${\delta }_{n}\left(k\right)=P\left({X}_{n}={x}_{n}{{{{{\rm{|}}}}}}{z}_{n}=k\right)={EL}\left(n,k\right)$$

(16)

and for non-leaf cells use:

$${\delta }_{n}(k)=\left[\mathop{\prod}\limits_{v\in c(n)}{\delta }_{n,v}(k)\right]\times EL(n,k)$$

(17)

The probability of the optimal state tree corresponding to the observations tree, assuming root cell is noted as cell 1, is then given by:

$${Z}^{* }=\mathop{{\max }}\limits_{k}\left\{{\delta }_{1}\left(k\right){\pi }_{k}\right\}$$

(18)

which arises from maximization over the conditional emission likelihood (EL) probabilities by factoring out the root cells as the outer maximizing step over all possible states.

Fitting the cell phenotypes (maximization step)

In the maximization step, we find the maximum likelihood of the hidden Markov model distribution parameters. We estimate the initial probabilities, the transition probability matrix, and the parameters of the observation distributions. The maximum likelihood estimation of the initial probabilities can be found from each state’s representation among the root cells:

$${\pi }_{k}^{* }={\gamma }_{1}\left(k\right)$$

(19)

Similarly, the transition probability matrix is estimated by calculating the prevalence of each transition across the lineage trees:

$${T}_{i,j}^{* }=\frac{{\sum }_{n=1}^{N-1}{\xi }_{n}(i,j)}{{\sum }_{n=1}^{N-1}{\gamma }_{n}(i)}$$

(20)

where

$${\xi }_{n}(i,j)={\left(\frac{{\gamma }_{p(n)}(i)}{\frac{{\beta }_{n}(i)}{{MSD}(n,i)}T(i,j)}\right)}^{T}\times \frac{{\beta }_{n}(j)}{{MSD}(n,j)}$$

(21)

Estimating emissions distribution parameters

In the current study, we used two emissions distributions; first, a Bernoulli distribution for the probability of each cell fate, either at the end of each cell cycle phase or at the end of cell’s lifetime; second, a gamma distribution for the durations of each cell cycle phase or overall cell lifetime. To estimate the distribution parameters after finding the cell state assignments, we calculated their maximum likelihood estimation weighted by their proportional assignment to that state. The initial and transition probabilities were shared across drug concentrations.

For estimating the Bernoulli distribution parameter for cell fate, we simply found the state assignment-weighted sample mean of the observations. To estimate the gamma distribution parameters, we fit all concentrations of each drug simultaneously and assumed that increasing drug concentration had a unidirectional effect on the observed phenotype within each state. This was implemented, using sequential least-squares programming (SLSQP)⁶⁷, through a linear constraint on the scaling parameter of the gamma distributions between concentrations so that higher concentrations had equal or greater average durations. The gamma distribution likelihood fitting is a convex optimization problem, indicating that local optimization can arrive at the globally optimal solution. Linear constraints do not change this property, and we confirmed fitting with different starting points arrived at the same solution. We used censored estimators to handle the effect of time censorship (explained below) in the duration distribution fitting. This was done by fitting uncensored and censored observations to the complete and survival distributions, respectively, and using the accumulated log-likelihood to estimate the distribution parameters.

Baum-Welch

Since both the hidden states and model parameters are unknown, we applied expectation-maximization (EM), known as the Baum-Welch algorithm in the case of HMMs, to find both the model parameters and cell states.

The expectation-maximization algorithm consists of two steps: expectation and maximization. During expectation, the probabilities of all cells being in specific states are calculated, such that for every cell and every state we have $P\left({z}_{n}=k{{{{{\rm{|}}}}}}{X}_{n}\right)$ and $P\left({z}_{n}=k,{z}_{n+1}=l{{{{{\rm{|}}}}}}{X}_{n}\right)$. The expectation step is calculated by the upward and downward recursion algorithms described above. In the maximization step, described above, the distribution parameters of each state, the initial (π) probabilities, and the transition probability (T) matrices are estimated, given the state assignments of each cell.

The expectation-maximization algorithm is initialized by randomly assigning the cells to states using a Dirichlet distribution. During fitting we iteratively switch between the expectation and maximization steps and then calculate the likelihood. If the likelihood improves less than a set threshold, we take that to indicate convergence.

Model evaluation

To find the most likely number of states corresponding to the observations, the Bayesian Information Criterion (BIC) was used⁶⁸. The BIC requires the number of degrees of freedom, which we calculate using the number of independent parameters. Our model estimates a k element initial probability vector, a k × k transition matrix, and a k × m matrix of state-wise parameters where k is the number of states and m is the number of parameters associated with observation distributions. For the phase-specific observation distributions we have a total of 6 parameters, including 2 Bernoulli parameters and 2 pairs of shape and scale parameters for the gamma distribution. Since the row-sums for transition and initial probability matrices must be 1, these values are not entirely independent. From distribution analysis of the phase lengths, we realized the shape parameter of the gamma distribution remains constant over different conditions, while the scale parameter changes. Therefore, the shape parameter was shared between the populations treated with 4 different concentrations of the same compound. Each condition, therefore, introduced 2 free parameters (1 Bernoulli parameter and 1 scale parameter). For the MCF10A experiments, terminal fates and cell cycle durations were also assumed to be Bernoulli- and gamma-distributed, respectively. The shape of cell lifetime was similarly shared among the four conditions (PBS, EGF, HGF, and OSM).

The Wasserstein or Kantorovich–Rubinstein metric is a measure of distance between two distributions. This metric was used to determine the difference between state emissions⁶⁹. An analytical solution, the absolute value of the difference in distribution means, was used for the gamma distribution.

Model benchmarking

We used emission distributions to represent the phenotypic characteristics of the cells within the lineages. To create our synthetic data, we considered two possible options as our set of observations throughout an experiment. In one case, we modeled the overall cell fate and cell lifetime; in the second, we modeled the phase-specific fate and duration. In both, we used a Bernoulli distribution for the fate outcomes and a gamma distribution for durations. The state assignment accuracy was calculated using the Rand Index⁴⁰. The difference between true and estimated probability matrices was assessed using the Frobenius norm, or the sum of each element squared.

Synthetic lineage data generation

We generated synthetic lineage trees with K discrete states and N total number of cells for benchmarking. Lineages were composed of two primary data structures: the state and emissions trees. The state tree was randomly seeded with a root cell determined by the starting probabilities, then expanded by randomly sampling transitions based on the transition probability matrix. The lineages were extended by either increasing the number of initial cells, resulting in a greater number of lineages (breadth), or by lengthening the experiment time resulting in each lineage containing more cells (depth). After creating the tree of states with the desired number of cells, the emission tree is built upon it. Emissions were randomly sampled from the distributions for each cell’s state. Finally, the effects of the emissions were applied to the tree when necessary. If any cells died, their progeny were marked as unobserved by making their emissions equal to NaN (Not a Number). If applicable, the effects of finite-duration experiments were also applied. Cells existing outside of the experiment duration were marked as unobserved, and those crossing the bounds of an experiment were marked as censored with duration clipped by the experiment.

Time censorship

Our phenotypic measurements include the cell fate (progression or cell death) and duration. These measurements are made for each cell cycle phase (G1 or S/G2) in the case of AU565 cells and for the entire lifetime for MCF10A cells. These measurements can contain incomplete information due to the bounds of an experiment. For instance, it is unknown when initial cells present at the start of the experiment began their cell cycle. The same is true of the cells present at the end of the experiment because we do not observe their end. Hence, a cell’s lifetime and/or fate may be partially observed. To ensure our synthetic data is a close reflection of experimental data, we incorporated this effect in our synthetic data. Cells with lifetimes that extend beyond the end of the experiment were marked as censored for the lifetime estimation.

Cell overall lifetime observations

The parameters are reflective of the cell phenotypes we observed with 5 nM lapatinib treatment. Supplementary Figs. 1–5 are based on these parameters. Each figure includes 100 trials.

Transition probability matrix: $T=\left[\begin{array}{cc}0.9 & 0.1\\ 0.1 & 0.9\end{array}\right]$

The initial probability vector is then calculated as the stationary distribution of states from transition probability matrix, satisfying $\pi =\pi * T$.

In this case, we have: $\pi =\left[\begin{array}{c}0.5\\ 0.5\end{array}\right]$

The same T and π were used for phase-specific emissions.

In Supplementary Table 1, “Bern p” refers to the Bernoulli parameter, the cell division probability at the end of its lifetime which is equal to 1—the probability of cell death at the end of its lifetime. “Shape” and “Scale” refer to the gamma distribution parameters. The cells’ lifetimes were fit to gamma distributions.

Cell cycle phase-specific observations

The synthetic data used in Figs. 3, 4, Supplementary Figs. 8–10 were created based on the following parameters. These parameters are based on estimations from AU565 cells treated with 5 nM lapatinib. Each figure includes 100 trials.

In Supplementary Table 2, “G1 bern” and “S/G2 bern” are the cell division probabilities at the end of G1 and S/G2 phase, respectively. The “G1 shape” and “G1 scale” are the gamma distribution parameters of G1 phase lengths. “S/G2 shape” and “S/G2 scale” are the gamma distribution parameters of S/G2 phase lengths.

To benchmark the model with 5 states, we simulated 25–500 lineages, each with up to 31 cells, to create a population with 5 states. Like with the experimental data, we assumed the experiment ends after 96 hours and censored the cells’ observations accordingly. The model parameters, including the transition probabilities and initial probabilities are listed below. The analysis results are shown in Supplementary Fig. 14.

$${{{T}}}=\left[\begin{array}{ccccc}{{{{0.6}}}} & {{{{0.1}}}} & {{{{0.1}}}} & {{{{0.1}}}} & {{{{0.1}}}}\\ {{{{0.05}}}} & {{{{0.8}}}} & {{{{0.05}}}} & {{{{0.05}}}} & {{{{0.05}}}}\\ {{{{0.01}}}} & {{{{0.1}}}} & {{{{0.7}}}} & {{{{0.09}}}} & {{{{0.1}}}}\\ {{{{0.1}}}} & {{{{0.1}}}} & {{{{0.05}}}} & {{{{0.7}}}} & {{{{0.05}}}}\\ {{{{0.1}}}} & {{{{0.1}}}} & {{{{0.05}}}} & {{{{0.05}}}} & {{{{0.7}}}}\end{array}\right]$$

(22)

$${{{{\pi }}}}=\left[\begin{array}{c}{{{{0.13}}}}\\ {{{{0.33}}}}\\ {{{{0.16}}}}\\ {{{{0.18}}}}\\ {{{{0.18}}}}\end{array}\right]$$

(23)

Figure 6 uses the first 4 states of Supplementary Table 3 as the parameter set for the emissions matrix to simulate varying state numbers in the BIC calculation.

Varying emission differences

To create synthetic data with subpopulations of varying dissimilarity (Fig. 5), we use the phase-specific parameters, with the values for the G1 phase gamma scale parameter for state 1 varying over [4, 20]. This results in an increase in the Wasserstein distance between the two cell states, allowing us to measure state assignment accuracy for different dissimilarity amounts between the two states. Likewise, for Supplementary Figs. 6, 7, we simulated the overall cell lifetime and varied the gamma distribution scale parameter from 1 to 8 for state 1.

Optimal baseline classifier

To compare the tHMM with a classifier that ignores heritability, we manually calculated the optimal classification boundary between the gamma distributions for state 1 and state 2. The best choice of classification boundary between two gamma distributions is the point at which the likelihood of the random variable, x, is equal between the two distributions:

$$p\left(x\big|{G}\left({{k}}_{1},{\theta }_{1}\right)\right)=p\left(x\big|{G}\left({{k}}_{2},{\theta }_{2}\right)\right)$$

(24)

where k₁, θ₁, k₂, and θ₂ are the shape and scale parameters of the gamma distribution corresponding to state 1 and 2, respectively. The shape parameter was shared between the two distributions. Consequently, this can be simplified to:

$$x=\frac{k\,{{{{{\rm{ln}}}}}}\frac{{\theta }_{2}}{{\theta }_{1}}}{\frac{1}{{\theta }_{1}}-\frac{1}{{\theta }_{2}}}$$

(25)

We assigned the classification labels to the observations using this classification boundary, which formed the baseline accuracy shown in Fig. 5c. As states 1 and 2 are identical at the very first point, we used the distribution mean (k × θ) as the threshold.

Cross-validation

To split the lineage data into train and test sets, we randomly selected 20% of cells from each condition and masked their observations such that they would not contribute to the fitting process. This was performed by setting the log-likelihood of the masked cells’ observations to be uniformly zero for all the states. During the Baum-Welch fitting, the algorithm estimates the parameters using only the training cells. However, during the expectation step, the state of masked cells is still inferred via information about their relatives. After the fitting converges, we calculate the log-likelihood of the test cells’ observations given their state assignments. This is accumulated into an overall likelihood of the held-out observations given the tHMM state assignments and fit.

To test this cross-validation scheme’s ability to determine the optimum number of states for a cell population, we created synthetic populations with 2–5 true states. States 1–n were used, where n is the number of true states, to generate data that is like the experimental data. The state observation distributions shown in Supplementary Table 2. The transition probabilities were generated by adding 0.1 elementwise to the identity matrix and then normalizing it. The initial probabilities for all states were equal. Fitting was performed with models including 1–7 states. The optimum number of states was taken to be the smallest number of states at which the log-likelihood plateaus.

Lowess trendline

Locally Weighted Scatterplot Smoothing (Lowess) was used to provide the trendlines in the figures with repeated model runs.

Statistics and reproducibility

The experiments were repeated in three independent biological replicates and yielded similar results.

Data availability

The experimental lineage data for AU565 and MCF10A cell lines can be found at https://github.com/meyer-lab/tHMM and https://doi.org/10.5281/zenodo.7195355 The synthetic data from which we plotted Figs. 3c, d, 4b–g, 5c–7 uses the code in the file named after the corresponding figure number. Data used in Figs. 7a, b, 8, 9 uses the AU565 cell line experimental lineage data, and Fig. 7c and 10 use the MCF10A lineage data. The cell lines used in this study (AU565, MCF10A) can be made available upon request.

Code availability

All analysis were implemented in Python v3.9 and can be found at https://github.com/meyer-lab/tHMM. The repository can also be found at Zenodo⁷⁰.

References

Di Maio, M. et al. Chemotherapy-induced neutropenia and treatment efficacy in advanced non-small-cell lung cancer: a pooled analysis of three randomised trials. Lancet Oncol. 6, 669–677 (2005).
Article PubMed Google Scholar
De Roock, W. et al. Effects of KRAS, BRAF, NRAS, and PIK3CA mutations on the efficacy of cetuximab plus chemotherapy in chemotherapy-refractory metastatic colorectal cancer: a retrospective consortium analysis. Lancet Oncol 11, 753–762 (2010).
Article PubMed Google Scholar
Sharma, S. V. et al. A Chromatin-Mediated Reversible Drug-Tolerant State in Cancer Cell Subpopulations. Cell 141, 69–80 (2010).
Article CAS PubMed PubMed Central Google Scholar
Spencer, S. L., Gaudet, S., Albeck, J. G., Burke, J. M. & Sorger, P. K. Non-genetic origins of cell-to-cell variability in TRAIL-induced apoptosis. Nature 459, 428–432 (2009).
Article CAS PubMed PubMed Central Google Scholar
Sigal, A. et al. Variability and memory of protein levels in human cells. Nature 444, 643–646 (2006).
Article CAS PubMed Google Scholar
Feinberg, A. P., Koldobskiy, M. A. & Göndör, A. Epigenetic modulators, modifiers and mediators in cancer aetiology and progression. Nat. Rev. Genet. 17, 284–299 (2016).
Article CAS PubMed PubMed Central Google Scholar
Falkenberg, K. J. & Johnstone, R. W. Histone deacetylases and their inhibitors in cancer, neurological diseases and immune disorders. Nat. Rev. Drug Discov. 13, 673–691 (2014).
Article CAS PubMed Google Scholar
Inde, Z. & Dixon, S. J. The impact of non-genetic heterogeneity on cancer cell death. Crit. Rev. Biochem. Mol. Biol. 53, 99–114 (2018).
Article CAS PubMed Google Scholar
Pisco, A. O. & Huang, S. Non-genetic cancer cell plasticity and therapy-induced stemness in tumour relapse: ‘What does not kill me strengthens me’. Br. J. Cancer 112, 1725–1732 (2015).
Article CAS PubMed PubMed Central Google Scholar
Ghandi, M. et al. Next-generation characterization of the Cancer Cell Line Encyclopedia. Nature 569, 503–508 (2019).
Article CAS PubMed PubMed Central Google Scholar
Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nat. 483, 603–607 (2012).
Article CAS Google Scholar
Gupta, P. B. et al. Stochastic State Transitions Give Rise to Phenotypic Equilibrium in Populations of Cancer Cells. Cell 146, 633–644 (2011).
Article CAS PubMed Google Scholar
Shaffer, S. M. et al. Rare cell variability and drug-induced reprogramming as a mode of cancer drug resistance. Nature 546, 431–435 (2017).
Article CAS PubMed PubMed Central Google Scholar
Gett, A. V., Sallusto, F., Lanzavecchia, A. & Geginat, J. T cell fitness determined by signal strength. Nat. Immunol. 4, 355–360 (2003).
Article CAS PubMed Google Scholar
Arai, T. et al. Tumor Doubling Time and Prognosis in Lung Cancer Patients: Evaluation from Chest Films and Clinical Follow-up Study. Jpn. J. Clin. Oncol. 24, 199–204 (1994).
CAS PubMed Google Scholar
Bourhis, J. et al. Potential doubling time and clinical outcome in head and neck squamous cell carcinoma treated with 70 GY in 7 weeks. Int. J. Radiat. Oncol. 35, 471–476 (1996).
Article CAS Google Scholar
Yachida, S. et al. Distant metastasis occurs late during the genetic evolution of pancreatic cancer. Nature 467, 1114–1117 (2010).
Article CAS PubMed PubMed Central Google Scholar
Huang, D. et al. High-Speed Live-Cell Interferometry: A New Method for Quantifying Tumor Drug Resistance and Heterogeneity. Anal. Chem. 90, 3299–3306 (2018).
Article CAS PubMed Google Scholar
Tyson, D. R., Garbett, S. P., Frick, P. L. & Quaranta, V. Fractional proliferation: A method to deconvolve cell population dynamics from single-cell data. Nat. Methods https://doi.org/10.1038/nmeth.2138 (2012).
Article PubMed PubMed Central Google Scholar
O’Connor, J. P. B. et al. Imaging intratumor heterogeneity: role in therapy response, resistance, and clinical outcome. Clin. Cancer Res 21, 249–257 (2015).
Article PubMed Google Scholar
Chen, K. et al. Phenotypically supervised single-cell sequencing parses within-cell-type heterogeneity. iScience 24, 101991 (2020).
Article PubMed PubMed Central Google Scholar
Kuchen, E. E., Becker, N. B., Claudino, N. & Höfer, T. Hidden long-range memories of growth and cycle speed correlate cell cycles in lineage trees. Elife 9, e51002 (2020).
Article CAS PubMed PubMed Central Google Scholar
Mitchell, S., Roy, K., Zangle, T. A. & Hoffmann, A. Nongenetic origins of cell-to-cell variability in B lymphocyte proliferation. Proc. Natl. Acad. Sci. USA. 115, E2888–E2897 (2018).
Article CAS PubMed PubMed Central Google Scholar
Young, A. I., Benonisdottir, S., Przeworski, M. & Kong, A. Deconstructing the sources of genotype-phenotype associations in humans. Science 365, 1396–1400 (2019).
Article CAS PubMed PubMed Central Google Scholar
Brumpton, B. et al. Within-family studies for Mendelian randomization: avoiding dynastic, assortative mating, and population stratification biases. https://doi.org/10.1101/602516 (2019).
Concannon, P. et al. Genome-wide scan for linkage to type 1 diabetes in 2,496 multiplex families from the Type 1 Diabetes Genetics Consortium. Diabetes 58, 1018–1022 (2009).
Article CAS PubMed PubMed Central Google Scholar
Concannon, P., Rich, S. S. & Nepom, G. T. Genetics of Type 1A Diabetes. N. Engl. J. Med. 360, 1646–1654 (2009).
Article CAS PubMed Google Scholar
Quinn, J. J. et al. Single-cell lineages reveal the rates, routes, and drivers of metastasis in cancer xenografts. Science 371, eabc1944 (2021).
Article CAS PubMed PubMed Central Google Scholar
Choo, K. H., Tong, J. C. & Zhang, L. Recent applications of Hidden Markov Models in computational biology. Genomics. Proteomics Bioinformatics 2, 84–96 (2004).
Article CAS PubMed PubMed Central Google Scholar
Rabiner, L. R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 257–286 (1989).
Article Google Scholar
Yanagawa, M. et al. Single-molecule diffusion-based estimation of ligand effects on G protein–coupled receptors. Sci. Signal. 11, eaao1917 (2018).
Crouse, M. S., Nowak, R. D. & Baraniuk, R. G. Wavelet-based statistical signal processing using hidden Markov models. IEEE Trans. Signal Process 46, 886–902 (1998).
Article Google Scholar
Durand, J. B., Gonçalvès, P. & Guédon, Y. Computational methods for hidden Markov tree models-An application to wavelet trees. IEEE Trans. Signal Process 52, 2551–2560 (2004).
Article Google Scholar
Choi, H. & Baraniuk, R. G. Multiscale image segmentation using wavelet-domain hidden Markov models. IEEE Trans. Image Process 10, 1309–1321 (2001).
Article CAS PubMed Google Scholar
Bykova, N. A., Favorov, A. V. & Mironov, A. A. Hidden Markov models for evolution and comparative genomics analysis. PLoS One 8, e65012–e65012 (2013).
Article CAS PubMed PubMed Central Google Scholar
Olariu, V. et al. Modified variational Bayes EM estimation of hidden Markov tree model of cell lineages. Bioinformatics 25, 2824–2830 (2009).
Article CAS PubMed Google Scholar
Nakashima, S., Sughiyama, Y. & Kobayashi, T. J. Lineage EM algorithm for inferring latent states from cellular lineage trees. Bioinformatics 36, 2829–2838 (2020).
Article CAS PubMed Google Scholar
Lund, B., Kristjansen, P. E. G. & Hansen, H. H. Clinical and preclinical activity of 2′,2′- difluorodeoxycytidine (gemcitabine). Cancer Treat. Rev. https://doi.org/10.1016/0305-7372(93)90026-N (1993).
Article PubMed Google Scholar
Bolstad, B. M. Comparing some iterative methods of parameter estimation for censored gamma data. (1998).
Rand, W. M. Objective Criteria for the Evaluation of Clustering Methods. J. Am. Stat. Assoc. 66, 846–850 (1971).
Article Google Scholar
Chao, H. X. et al. Evidence that the human cell cycle is a series of uncoupled, memoryless phases. Mol. Syst. Biol. 15, e8604–e8604 (2019).
Article PubMed PubMed Central Google Scholar
Lee, J. A. et al. Microbial phenotypic heterogeneity in response to a metabolic toxin: Continuous, dynamically shifting distribution of formaldehyde tolerance in Methylobacterium extorquens populations. PLoS Genet 15, e1008458–e1008458 (2019).
Article PubMed PubMed Central Google Scholar
van Boxtel, C., van Heerden, J. H., Nordholt, N., Schmidt, P. & Bruggeman, F. J. Taking chances and making mistakes: non-genetic phenotypic heterogeneity and its consequences for surviving in dynamic environments. J. R. Soc. Interface 14, 20170141 (2017).
Article PubMed PubMed Central Google Scholar
Tang, L., Wang, Y., Strom, A., Gustafsson, J. A. & Guan, X. Lapatinib induces p27Kip1-dependent G1 arrest through both transcriptional and post-translational mechanisms. Cell Cycle https://doi.org/10.4161/cc.25728 (2013).
Article PubMed PubMed Central Google Scholar
Gross, S. M. et al. A LINCS microenvironment perturbation resource for integrative assessment of ligand-mediated molecular and phenotypic responses. https://doi.org/10.1101/2021.08.06.455429 (2021).
Brock, A., Chang, H. & Huang, S. Non-genetic heterogeneity—a mutation-independent driving force for the somatic evolution of tumours. Nat. Rev. Genet. 10, 336–342 (2009).
Article CAS PubMed Google Scholar
Fan, Y. & Meyer, T. Molecular control of cell density-mediated exit to quiescence. Cell Rep. 36, 109436 (2021).
Article CAS PubMed PubMed Central Google Scholar
Zhu, X. et al. Autophagy stimulates apoptosis in HER2-overexpressing breast cancers treated by lapatinib. J. Cell. Biochem. 114, 2643–2653 (2013).
Article CAS PubMed Google Scholar
Johnson, T. I. et al. Quantifying cell cycle-dependent drug sensitivities in cancer using a high throughput synchronisation and screening approach. EBioMedicine 68, 103396 (2021).
Article CAS PubMed PubMed Central Google Scholar
Campbell, K. R. & Yau, C. Uncovering pseudotemporal trajectories with covariates from single cell and bulk expression data. Nat. Commun. 9, 1–12 (2018). 2018 91.
Article Google Scholar
Lartillot, N. A phylogenetic Kalman filter for ancestral trait reconstruction using molecular data. Bioinformatics 30, 488–496 (2013).
Article PubMed Google Scholar
Ding, J. et al. Reconstructing differentiation networks and their regulation from time series single-cell expression data. Genome Res. 28, 383–395 (2018).
Article CAS PubMed Central Google Scholar
Tentner, A. R. et al. Combined experimental and computational analysis of DNA damage signaling reveals context-dependent roles for Erk in apoptosis and G1/S arrest after genotoxic stress. Mol. Syst. Biol. 8, 568 (2012).
Article PubMed PubMed Central Google Scholar
Hata, A. N. et al. Tumor cells can follow distinct evolutionary paths to become resistant to epidermal growth factor receptor inhibition. Nat. Med. 22, 262–269 (2016).
Article CAS PubMed PubMed Central Google Scholar
Bhang, H. C. et al. Studying clonal dynamics in response to cancer therapy using high-complexity barcoding. Nat. Med. 21, 440–448 (2015).
Article CAS PubMed Google Scholar
Burke, R. T. & Orth, J. D. Through the Looking Glass: Time-lapse Microscopy and Longitudinal Tracking of Single Cells to Study Anti-cancer Therapeutics. J. Vis. Exp. 14, 53994 (2016).
Han, K. et al. CRISPR screens in cancer spheroids identify 3D growth-specific vulnerabilities. Nature 580, 136–141 (2020).
Article CAS PubMed PubMed Central Google Scholar
Schwartz, A. D. et al. A biomaterial screening approach reveals microenvironmental mechanisms of drug resistance. Integr. Biol. (Camb). 9, 912–924 (2017).
Article CAS PubMed PubMed Central Google Scholar
Emert, B. L. et al. Variability within rare cell states enables multiple paths toward drug resistance. Nat. Biotechnol. 39, 865–876 (2021).
Article CAS PubMed PubMed Central Google Scholar
Papalexi, E. & Satija, R. Single-cell RNA sequencing to explore immune cell heterogeneity. Nat. Rev. Immunol. 18, 35–45 (2017).
Article PubMed Google Scholar
Gross, S. M. et al. Analysis and modeling of cancer drug responses using cell cycle phase-specific rate effects. bioRxiv 2020.07.24.219907 (2021).
SL, S. et al. The proliferation-quiescence decision is controlled by a bifurcation in CDK2 activity at mitotic exit. Cell 155, 369 (2013).
Article Google Scholar
Meijering, E., Dzyubachyk, O. & Smal, I. Methods for cell and particle tracking. Methods in Enzymology https://doi.org/10.1016/B978-0-12-391857-4.00009-4 (2012).
Article PubMed Google Scholar
Ephraim, Y. & Merhav, N. Hidden Markov processes. IEEE Trans. Inf. Theory 48, 1518–1569 (2002).
Article Google Scholar
Levinson, S. E., Rabiner, L. R. & Sondhi, M. M. An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition. Bell Syst. Tech. J 62, 1035–1074 (1983).
Article Google Scholar
Devijver, P. A. Baum’s forward-backward algorithm revisited. Pattern Recognit. Lett. 3, 369–373 (1985).
Article Google Scholar
Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020). 2020 173.
Article CAS PubMed PubMed Central Google Scholar
Neath, A. A. & Cavanaugh, J. E. The Bayesian information criterion: background, derivation, and applications. WIREs Comput. Stat. 4, 199–203 (2011).
Article Google Scholar
Vallender, S. S. Calculation of the Wasserstein Distance Between Probability Distributions on the Line. Theory Probab. & Its Appl 18, 784–786 (1974).
Article Google Scholar
Mohammadi, F., Visagan, S., Lagarde, J. & Meyer, A. S. Meyer-lab/tHMM. https://doi.org/10.5281/ZENODO.7195355 (2022).

Download references

Acknowledgements

This work was supported by the Jayne Koskinas Ted Giovanis Foundation for Health and Policy, NIH U01-CA215709 (A.S.M.), NIH U54-CA209988 (L.M.H.), NIH U54-HG008100 (L.M.H.). The authors thank Scott Taylor for his critical feedback that helped to improve the manuscript. The authors thank Ali Farhat, Adam Weiner, and Nikan Namiri for early exploratory work.

Author information

Authors and Affiliations

Department of Bioengineering, University of California, Los Angeles, CA, USA
Farnaz Mohammadi, Shakthi Visagan, J. C. Lagarde & Aaron S. Meyer
Department of Biomedical Engineering, Oregon Health & Science University, Portland, OR, USA
Sean M. Gross & Laura M. Heiser
Department of Bioengineering, University of Illinois, Urbana Champaign, IL, USA
Luka Karginov
Department of Bioinformatics, University of California, Los Angeles, CA, USA
Aaron S. Meyer
Jonsson Comprehensive Cancer Center, University of California, Los Angeles, CA, USA
Aaron S. Meyer
Eli and Edythe Broad Center of Regenerative Medicine and Stem Cell Research, University of California, Los Angeles, CA, USA
Aaron S. Meyer

Authors

Farnaz Mohammadi
View author publications
You can also search for this author in PubMed Google Scholar
Shakthi Visagan
View author publications
You can also search for this author in PubMed Google Scholar
Sean M. Gross
View author publications
You can also search for this author in PubMed Google Scholar
Luka Karginov
View author publications
You can also search for this author in PubMed Google Scholar
J. C. Lagarde
View author publications
You can also search for this author in PubMed Google Scholar
Laura M. Heiser
View author publications
You can also search for this author in PubMed Google Scholar
Aaron S. Meyer
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

A.S.M. and L.M.H. conceived of the study; A.S.M. conceived of the model; A.S.M, F.M., S.V. designed model; A.S.M., F.M., J.L., L.K., S.V. performed computational experiments; S.M.G. performed the experiments; F.M., J.L., L.K., S.M.G. conducted data analysis; A.S.M. and L.M.H. supervised the research; all authors wrote the paper.

Corresponding author

Correspondence to Aaron S. Meyer.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Communications Biology thanks Barbara Bravi and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: George Inglis. This article has been peer reviewed as part of Springer Nature’s Guided Open Access initiative.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Transparent Peer Review File

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Mohammadi, F., Visagan, S., Gross, S.M. et al. A lineage tree-based hidden Markov model quantifies cellular heterogeneity and plasticity. Commun Biol 5, 1258 (2022). https://doi.org/10.1038/s42003-022-04208-9

Download citation

Received: 09 August 2021
Accepted: 01 November 2022
Published: 17 November 2022
DOI: https://doi.org/10.1038/s42003-022-04208-9

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.