Abstract
The ability to quantify differentiation potential of single cells is a task of critical importance. Here we demonstrate, using over 7,000 single-cell RNA-Seq profiles, that differentiation potency of a single cell can be approximated by computing the signalling promiscuity, or entropy, of a cell’s transcriptome in the context of an interaction network, without the need for feature selection. We show that signalling entropy provides a more accurate and robust potency estimate than other entropy-based measures, driven in part by a subtle positive correlation between the transcriptome and connectome. Signalling entropy identifies known cell subpopulations of varying potency and drug resistant cancer stem-cell phenotypes, including those derived from circulating tumour cells. It further reveals that expression heterogeneity within single-cell populations is regulated. In summary, signalling entropy allows in silico estimation of the differentiation potency and plasticity of single cells and bulk samples, providing a means to identify normal and cancer stem-cell phenotypes.
Similar content being viewed by others
Introduction
One of the most important tasks in single-cell RNA-sequencing studies is the identification and quantification of ‘intercellular transcriptomic heterogeneity’, that is, variation between the transcriptomes of single cells that is of biological relevance1,2,3,4. Although some of the observed intercellular transcriptomic variation represents stochastic noise, a substantial component has been shown to be of functional importance1,5,6,7,8. Very often, this biologically relevant heterogeneity can be attributed to cells occupying states of different potency or plasticity. Thus, quantification of differentiation potency, or more generally functional plasticity, at the single-cell level is of paramount importance. However, currently there is no concrete theoretical and computational model for estimating such plasticity at the single-cell level.
Here we make significant progress towards addressing this challenge. We propose a very general model for estimating cellular plasticity. A key feature of this model is the computation of signalling entropy9, which quantifies the degree of uncertainty, or promiscuity, of a cell’s gene expression levels in the context of a cellular interaction network. In effect, signalling entropy uses the transcriptomic profile of a cell to quantify the relative activation levels of its molecular pathways, and more generally that of biological processes, as defined over an a priori specified protein interaction network. We show that signalling entropy provides an excellent and robust proxy to the differentiation potential of a cell in Waddington’s epigenetic landscape10, and further provides a framework in which to understand the overall differentiation potency and transcriptomic heterogeneity of a cell population in terms of single-cell potencies. Attesting to its general nature and broad applicability, we compute and validate signalling entropy in over 7,000 single cells of variable degrees of differentiation potency and phenotypic plasticity, including time-course differentiation data, neoplastic cells and circulating tumour cells (CTCs). This extends entropy concepts that we have previously demonstrated to work on bulk tissue data9,11,12,13 to the single-cell level. On the basis of signalling entropy, we develop a novel algorithm called single-cell entropy (SCENT), which can be used to identify and quantify biologically relevant expression heterogeneity in single-cell populations, as well as to reconstruct cell-lineage trajectories from time-course data. In this regard, SCENT differs substantially from other single-cell algorithms like Monocle14, MPath15, SCUBA16, Diffusion Pseudotime17 or StemID18, in that it uses single-cell entropy to independently order single cells in pseudo-time (that is, differentiation potency), without the need for feature selection or clustering.
Results
The signalling entropy framework
A pluripotent cell (by definition endowed with the capacity to differentiate into effectively all major cell-lineages) does not express a preference for any particular lineage, thus requiring a similar basal activity of all lineage-specifying transcription factors9,19. Viewing a cell’s choice to commit to a particular lineage as a probabilistic process, pluripotency can therefore be characterized by a state of high uncertainty, or entropy, because all lineage choices are equally likely (Fig. 1a). In contrast, for a differentiated cell, or for a cell committed to a particular lineage, signalling uncertainty/entropy is reduced, as this requires activation of a specific signalling pathway reflecting that lineage choice (Fig. 1a). Thus, a measure of global signalling entropy, if computable, could provide us with a relatively good proxy of a cell’s overall differentiation potential. Here we propose that differentiation potential can be estimated in silico by integrating a cell’s transcriptomic profile with a high quality protein–protein interaction (PPI) network to define a cell-specific probabilistic signalling process (in effect, a random walk) on the network (Methods). Mathematically, this random walk is described by a stochastic matrix whose entries reflect the relative interaction probabilities. Underlying the construction of these probabilities is the assumption that two genes, which can interact at the protein level, are more likely to do so if both are highly expressed (Fig. 1a, Methods). Given this stochastic matrix, global signalling entropy is then computed as the entropy rate (abbreviated as SR) of this probabilistic signalling process on the network20 (Fig. 1b, Methods), and can be thought of as quantifying the overall level of signalling promiscuity of biological processes within the network. In effect, this quantifies the efficiency, or speed, with which signalling can diffuse over the whole network, and therefore measures the number of separate biological processes which are in some sense ‘active’. Since a committed, or differentiated cell, preferentially activates and deactivates specific processes (pathways) in the network, the expectation is that this would manifest itself as a lower entropy rate since signalling cannot diffuse to the regions of the network describing inactive processes.
Signalling entropy approximates differentiation potency
To test that signalling entropy correlates with differentiation potency, we first estimated it for 1,018 single-cell RNA-Seq profiles generated by Chu et al.21, which included pluripotent human embryonic stem cells (hESCs) and hESC-derived progenitor cells representing the three main germ layers (endoderm, mesoderm and ectoderm) (‘Chu et al. set’, Supplementary Table 1, Methods). In detail, these were 374 cells from two hESC lines (H1 & H9), 173 neural progenitor cells (NPCs), 138 definite endoderm progenitors (DEPs), 105 endothelial cells (ECs) representing mesoderm derivatives, as well as 69 trophoblast cells (TB) and 148 human foreskin fibroblasts (HFFs). Confirming our hypothesis, pluripotent hESCs attained the highest signalling entropy values, followed by multipotent cells (NPCs, DEPs), and with less multipotent HFFs, TBs and ECs attaining the lowest values (Fig. 2a). Differences were highly statistically significant, with DEPs exhibiting significantly lower entropy values than hESCs (Wilcoxon rank-sum P<1e−50) (Fig. 2a). Likewise, TBs exhibited lower entropy than hESCs (P<1e−50), but higher than HFFs (P<1e−7) (Fig. 2a). Importantly, signalling entropy correlated very strongly with a pluripotency score obtained using a previously published pluripotency gene expression signature22 (Spearman correlation=0.91, P<1e−500, Fig. 2b, Methods). In all, signalling entropy provided a highly accurate discriminator of pluripotency versus non-pluripotency at the single-cell level (AUC=0.96, Wilcoxon test P<1e−300, Fig. 2c). We note that in contrast with pluripotency expression signatures, this strong association with pluripotency was obtained without the need for any feature selection or training.
To further test the general validity and robustness of signalling entropy we computed it for scRNA-Seq profiles of 3,256 non-malignant cells derived from the microenvironment of 19 melanomas (Melanoma set23, Supplementary Table 1). Cells profiled included T-cells, B-cells, natural killer (NK) cells, macrophages, fully differentiated ECs and cancer-associated fibroblasts (CAFs). For a given cell-type and individual, variation between single cells was substantial and similar to the variation seen between individuals (Supplementary Fig. 1). Mean entropy values however, were generally stable, showing little inter-individual variation, except for T-cells from 4 out of 15 patients, which exhibited a distinctively different distribution (Supplementary Fig. 1). To assess overall trends, we pooled the single-cell entropy data from all patients together, which confirmed that all lymphocytes (T-cells, B-cells and NK cells) had similar average signalling entropy values (Fig. 2d). Intra-tumour macrophages, which are derived from monocytes, exhibited a marginally higher signalling entropy (Fig. 2d). The highest signalling entropy values were attained by ECs and CAFs (Fig. 2d), consistent with their known high phenotypic plasticity24,25,26,27. Importantly, the entropy values for all of these non-malignant differentiated cell-types were distinctively lower compared to those of hESCs and progenitor cells from Chu et al. (Fig. 2a,d), consistent with the fact that hESCs and progenitors have much higher differentiation potency. To test this formally, we compared hESCs, mesoderm progenitors, and terminally differentiated cells within the mesoderm lineage (which included all ECs and lymphocytes), which revealed a consistent decrease in signalling entropy between all three potency states (Wilcoxon rank test P<1e−50, Fig. 2e). Of note, signalling entropy could discriminate progenitor and differentiated cells better than the score derived from the pluripotency gene expression signature22, attesting to its increased robustness as a general measure of differentiation potency (Fig. 2f, Supplementary Fig. 2).
Next, we assessed signalling entropy in the context of a time-course differentiation experiment, whereby hESCs were induced to differentiate into DEPs via the mesoendoderm intermediate28. scRNA-Seq for a total of 758 single cells, obtained at six time points, including origin, 12, 24, 36, 72 and 96 h post induction were available (Methods)28. We observed that single-cell entropies exhibited a particular large decrease only after 72 h (Fig. 2g), consistent with previous knowledge that differentiation into definite endoderm occurs around 3–4 days after induction28. To demonstrate the validity of signalling entropy in another species, we next considered a scRNA-Seq data of cells sampled at different embryonic stages in the development of the mouse lung epithelium29 (‘Treutlein set’, Supplementary Table 1, Methods). Signalling entropy decreased continuously until adulthood in line with a gradual increase in differentiation (Fig. 2h). Moreover, at embryonic day 18, it could discriminate alveolar type cells from a recently discovered bipotent progenitor subgroup29, albeit with marginal significance due to small cell numbers (Supplementary Fig. 3A).
To demonstrate the critical importance of the interaction network, we recomputed signalling entropy in the Chu and Treutlein data sets after randomly reshuffling gene expression values over the network (100 and 1,000 permutations, respectively). As expected, upon reshuffling, signalling entropy lost its power to discriminate pluripotent from non-pluripotent cells (Fig. 2i), and did not exhibit a consistent decrease with developmental stage in Treutlein’s set (Supplementary Fig. 3B).
Robustness to choice of PPI network and NGS platform
Given the importance of the PPI network, it is therefore equally important to verify that signalling entropy is robust to the choice of network. Results were largely unchanged using a different version of a PPI network (Supplementary Fig. 4). To test the robustness of signalling entropy across independent studies, we analysed scRNA-Seq data for an independent set of single-cell hESCs derived from the primary outgrowth of the inner cell mass (‘hESC set’30, Supplementary Table 1). Obtained signalling entropy values were most similar to those of single cells derived from the H1 and H9 hESC lines, confirming the robustness of signalling entropy across different studies and next-generation sequencing platforms (Fig. 2j, Supplementary Table 1).
Comparison of signalling entropy to StemID and SLICE
To further highlight the importance of the PPI network, we decided to compare Signalling Entropy to two other entropy-based potency measures, proposed as part of the StemID18 and SLICE31 algorithms, which we note do not use any network information. To provide an objective evaluation, we compared the entropy measures of single cells from well-separated differentiation stages, or by comparing start and end points in time-course differentiation experiments, as these cells ought to differ substantially in terms of potency. Adopting this strategy in four scRNA-Seq and one bulk RNA-Seq data set, we observed that signalling entropy was able to provide high discriminative power in each data set (Table 1). In contrast, we did not find StemID and SLICE to be as accurate or robust (Table 1).
Correlation with potency is independent of cell-cycle phase
A major source of variation in scRNA-Seq data is cell-cycle phase23,32. We explored the relation between signalling entropy and cell-cycle phase in a large scRNA-Seq data set encompassing 3,256 non-malignant and 1,257 cancer cells derived from the microenvironment of melanomas (Melanoma set23, Supplementary Table 1). A cycling score for both G1-S and G2-M phases and for each cell was obtained using a validated procedure23,32,33, and compared to signalling entropy, which revealed a strong yet highly non-linear correlation (Supplementary Fig. 5). Specifically, we observed that cells with a low signalling entropy were never found in either the G1-S or G2-M phase (Supplementary Fig. 5). In contrast, cells with high signalling entropy could be found in either a cycling or non-cycling phase. These results are consistent with the view that cycling cells must increase expression of promiscuous signalling proteins and hence exhibit an increased signalling entropy. Thus, we next asked if signalling entropy correlates with potency when restricting to non-cycling cells. Using the Chu et al. data set, we observed that, although discrimination accuracies were reduced upon correction for cell-cycle phase, signalling entropy could still accurately classify pluripotent from non-pluripotent cell-types (AUC>0.9, P<1e−5, Supplementary Fig. 6, Supplementary Table 2). Consistent with this (and now using both cycling and non-cycling cells), the correlation between signalling entropy and potency remained significant when adjusted for cell-cycle scores (Supplementary Table 2).
Correlation of expression with degree partly drives potency
To gain further biological insight into signalling entropy, we derived an approximation for signalling entropy in terms of the three-way correlation between the transcriptome, connectome and local signalling entropies (Methods). This approximation implies that if, on average, network hubs are more highly expressed than low-degree nodes and if they exhibit an increase in their local signalling entropy, then this should generally lead to a more efficient distribution of signalling over the network, and hence to an increased global signalling entropy12. We thus posited that in cells with a demand for high phenotypic plasticity (for example, pluripotent cells), hubs tend to be overexpressed and exhibit increased signalling promiscuity. Using scRNA-Seq data from Chu et al.21, we were able to confirm a weak (Pearson correlation of ∼0.2) but significant (P<1e−50) positive correlation of differential gene expression (between hESCs and multipotent cells) with connectivity (Supplementary Fig. 7A). Importantly, the differential local signalling entropy between hESCs and multipotent cells correlated more strongly with connectivity (Pearson correlation of ∼0.64, P<1e−100, Supplementary Fig. 7A), thus confirming the notion that the increased SR in pluripotent cells is also driven by a more distributed signalling (that is, increased local entropy) at network hubs. To demonstrate that the Pearson correlation between transcriptome and connectome can be used to approximate signalling entropy (SR), we computed it for all 1,018 single cells in Chu et al., obtaining an excellent agreement with SR (R2=0.96, Supplementary Fig. 7B), and hence also with potency (Supplementary Fig. 7C). However, we stress that this Pearson correlation approximation is not a substitute for SR, since the definition of SR includes the local signalling entropies (Fig. 1b), from which important biological information can be extracted. To demonstrate this, we ranked genes in the network according to their differential local signalling entropy (Methods) and performed gene set enrichment analysis (GSEA)34 on the genes exhibiting the most significant increases in local entropy between pluripotent (hESCs) and multipotent cells. Top-ranked enriched biological terms included, besides stemness, genes implicated in mRNA splicing and encoding mitochondrial ribosomal proteins (Supplementary Table 3, Supplementary Data 1). This is consistent with recent studies demonstrating that mitochondrial activity influences the global transcription and splicing rate of cells35,36,37, and that variations in such activity may influence stemness and differentiation38,39,40,41,42. Finally, we also point out that signalling entropy and its Pearson correlation approximation are not equivalent, as there exist networks where both measures yield very different answers (Methods). For instance, in networks where hubs are not connected to each other (unlike our PPI networks where hubs are generally connected to each other), a positive correlation could lead to a lower signalling entropy (Supplementary Fig. 7D).
Quantifying single-cell expression heterogeneity with SCENT
Given that signalling entropy correlates with differentiation potency, we used it to develop the SCENT algorithm (Fig. 1c). Briefly, SCENT uses the estimated single-cell entropies to infer the distribution of discrete potency states across the cell population (Fig. 1c, Methods). Thus, SCENT can be used to quantify expression heterogeneity at the level of potency. In addition, SCENT can be used to directly order single cells in pseudo-time14 to facilitate reconstruction of lineage trajectories. A key feature of SCENT is the assignment of each cell to a unique potency state and co-expression cluster, which results in the identification of potency clusters (which we call ‘landmarks’), through which lineage trajectories are then inferred (Methods).
We first tested SCENT on the scRNA-Seq data from Chu et al., which profiled pluripotent and multipotent cells (Supplementary Table 1). SCENT correctly predicted a parsimonious two-state model, with a high potency pluripotent state and a lower potency non-pluripotent progenitor-like state (Fig. 3a). Interestingly, a small fraction (∼4%) of hESCs were deemed to be non-pluripotent cells (Fig. 3b), consistent with previous observations that pluripotent cell populations contain cells that are already primed for differentiation into specific lineages5,6. Supporting this, these non-pluripotent ‘hESCs’ exhibited lower cycling scores and higher expression levels of neural (HES1/SOX2) and mesoderm (PECAM1) stem-cell markers, compared to the pluripotent hESCs (Supplementary Fig. 8). Whereas all HFFs and ECs were deemed non-pluripotent, DEPs, TBs and NPCs exhibited mixed proportions, with NPCs exhibiting approximately equal numbers of pluripotent and non-pluripotent cells (Fig. 3b). Correspondingly, the Shannon index (SI), which quantifies the level of heterogeneity in potency, was highest for the NPC population (Fig. 3c). In total, SCENT predicted six co-expression clusters, which combined with the two potency states, resulted in a total of seven landmark clusters (Fig. 3d). These landmarks correlated very strongly with cell-type, with only NPCs being distributed across two landmarks of different potency (Fig. 3e). SCENT correctly inferred a lineage trajectory between the high potency NPC subpopulation and its lower potency counterpart, as well as a trajectory between hESCs and DEPs (Fig. 3f). The other cell-types exhibited lower entropies (Fig. 2b, Fig. 3f), and correspondingly did not exhibit a direct trajectory to hESCs, suggesting several intermediate states which were not sampled in this experiment.
To ascertain the biological significance of the two NPC subpopulations (Fig. 3b,e,f), we first verified that the NPCs deemed pluripotent did indeed have a higher pluripotency score (Supplementary Fig. 9A), as assessed using the independent pluripotency gene expression signature from Palmer et al.22 We further reasoned that well-known transcription factors marking neural stem/progenitor cells, such as HES1, would be expressed at a much lower level in the NPCs deemed pluripotent compared to the non-pluripotent ones, since the latter are more likely to represent bona fide NPCs. Confirming this, NPCs with low HES1 expression exhibited higher differentiation potential than NPCs with high HES1 expression (Wilcoxon rank-sum test P<0.0001, Fig. 3g). Similar results were evident for other neural progenitor/stem cell markers such as PAX6 and SOX2 (Supplementary Fig. 9B). Of note, NPCs expressing the lowest levels of PAX6, HES1 or SOX2 were generally always classified by SCENT into a pluripotent-like state (Fig. 3g, Supplementary Fig. 9B). Thus, these results indicate that SCENT provides a biologically meaningful characterization of intercellular transcriptomic heterogeneity.
SCENT reconstructs lineage trajectories in differentiation
We next tested SCENT in the context of a differentiation experiment of human myoblasts14, involving skeletal muscle myoblasts which were first expanded under high mitogen conditions and later induced to differentiate by switching to a low serum medium (Trapnell et al. set, Supplementary Table 1). A total of 96 cells were profiled with RNA-Seq at differentiation induction, as well as at 24 and 48 h after medium switch, with a remaining 84 cells profiled at 72 h. As expected, signalling entropy was highest in the myoblasts, with a switch to lower entropy occurring at 24 h (Fig. 4a). No further decrease in entropy was observed between 24 and 72 h, indicating that commitment of cells to become differentiated skeletal muscle cells already happens early in the differentiation process. Over the whole time course, SCENT predicted a total of 3 potency states, with a distribution consistent with the time of sampling (Fig. 4b). Cells sampled at differentiation induction were made up primarily of two potency states (Fig. 4c, PS1 & PS2), which differed in terms of CDK1 expression, consistent with one subset (PS1) defining a highly proliferative subpopulation and with the rest (PS2) representing cells that have exited the cell cycle (Supplementary Fig. 10). In total, SCENT predicted four landmarks, with one landmark defining undifferentiated (t=0) myoblasts of high potency (Fig. 4d). Another landmark of lower potency contained cells at all time points, with cells expressing markers of mesenchymal cells (for example, PDFGRA and FN1/LTBP2) (Fig. 4d). Cells from this landmark which were present at differentiation induction exhibited intermediate potency expressing low levels of CDK1 (Supplementary Fig. 10, Fig. 4d), suggesting that these are ‘contaminating’ interstitial mesenchymal cells that were already present at the start of the time course, in line with previous observations14,15. Importantly, SCENT correctly predicts that the potency of all these mesenchymal cells in this landmark does not change during the time-course, consistent with the fact that these cells are not primed to differentiate into skeletal muscle cells, but which nevertheless aid the differentiation process14,15. Another landmark of intermediate potency predicted by SCENT defined a trajectory made up of cells expressing high levels of myogenic markers (MYOG & IGF2) from 24 h onwards (Fig. 4d). Thus, this landmark corresponds to cells that are effectively committed to becoming fully mature skeletal muscle cells. The final landmark consisted of cells exhibiting the lowest level of potency and emerged only at 48 h, becoming most prominent at 72 h (Fig. 4d). As with the previous landmark, cells in this group also expressed myogenic markers, and likely represent a terminally differentiated and more mature state of skeletal muscle cells. In summary, SCENT inferred lineage trajectories that are highly consistent with known biology and with those obtained by previous algorithms such as Monocle14 and MPath15. However, in contrast to Monocle and MPath, SCENT inferred these reconstructions without the explicit need of knowing the time-point at which samples were collected.
SCENT detects drug resistant cancer stem cell phenotypes
Cancer cells are known to be less differentiated and to acquire a more plastic phenotype compared to non-malignant cells. Hence their signalling entropy should be higher than that of non-malignant cell-types. We confirmed this using scRNA-Seq data from 12 melanomas (Melanoma set23, Supplementary Table 1), for which sufficient normal and cancer cells had been profiled (Fig. 5a, Supplementary Fig. 11). Although there was some variation in the signalling entropy of cancer cells between individuals, this variation was relatively small in comparison to the difference in entropy between cancer and normal cells. Combining data across all 12 patients, demonstrated a dramatic increase in the signalling entropy of single cancer cells compared to non-malignant ones (Wilcoxon rank-sum test P<1e−500, Fig. 5b).
Since signalling entropy is increased in cancer and correlates with stemness, it could, in principle, be used to identify putative cancer stem cells (CSC) or drug resistant cells. To test this, we first computed and compared signalling entropy values for 38 acute myeloid leukaemia (AML) bulk samples from 19 AML patients, consisting of 19 diagnostic/relapse pairs43. Confirming that signalling entropy marks drug resistant cell populations, we observed a higher entropy in the relapsed samples (paired Wilcox test P=0.004, Fig. 5c). For one relapsed sample, scRNA-Seq for 96 single-AML cells was available (AML set, Supplementary Table 1). We posited that comparing the signalling entropy values of these 96 cells would allow us to identify a CSC-like subset responsible for relapse. Since in AML there are well accepted CSC markers (CD34, CD96), we tested whether expression of these markers in high entropy AML single cells is higher than in low entropy AML single cells (Fig. 5d). Both CD34 and CD96 were more highly expressed in the high entropy AML single cells (Wilcox test P=0.008 and 0.032, respectively, Fig. 5d).
We next computed signalling entropies for 73 CTCs derived from 11 castration resistant prostate cancer patients (CTC-PrCa set, Supplementary Table 1), of which five patients exhibited progression under treatment with enzalutamide (an androgen receptor (AR) inhibitor) (n=36 CTCs), with the other six patients not having received treatment (n=37 CTCs)44. Although of marginal significance, signalling entropy was higher in the CTCs from patients exhibiting resistance (Wilcox test P=0.047, Fig. 5e). Among putative prostate CSC markers (for example, CD44, CD133, KLF4 and ALDH7A1)44, we observed a positive association of signalling entropy with ALDH7A1 expression, suggesting that ADLH7A1 (and not other markers such as CD44) may mark specific prostate CSCs which are resistant to enzalutamide treatment (Fig. 5f).
Regulation of single-cell expression heterogeneity
It has been proposed that expression heterogeneity of cell populations is regulated in the sense that the transcriptomes of individual cells within the population differ in a manner which optimizes an objective function, such as pluripotency or homeostasis3. To test whether signalling entropy can predict such regulated expression heterogeneity, we compared the distribution of single-cell entropies to the signalling entropy of the bulk population. Specifically, we devised a ‘measure of regulated heterogeneity’ (MRH), which measures the likelihood that the signalling entropy of the cell population could have been observed from picking a single cell at random from that population (online Methods, Fig. 6a). We first estimated MRH for the data from Chu et al., for which matched bulk and scRNA-Seq data is available. We first note that although for bulk samples entropy differences between cell-types were smaller, that they were nevertheless consistent with the trends seen at the single-cell level (Supplementary Fig. 12, Fig. 2c). The MRH for each of the six cell-types (hESCs, NPCs, DEPs, TBs, HFFs and ECs) in Chu et al., revealed evidence of regulated heterogeneity, with the entropy values of bulk samples being significantly higher than that of single cells (Fig. 6b). As a negative control, the signalling entropy of the average expression over bulk samples did not exhibit regulated heterogeneity (normal deviation test P=0.30, Fig. 6b), as required since bulk samples are not linked in space or time and represent non-interacting cell populations.
We note that for the previous analysis, matched bulk RNA-Seq data is not absolutely required since bulk samples can be approximated by averaging the expression profiles of individual cells in the population. We verified this, although, as expected, the entropy values for the true bulk samples were always marginally higher, in line with the fact that single-cell assays only capture a subpopulation of the bulk sample (Fig. 6c). We also verified that MRH results were not driven by the larger number of dropouts in scRNA-Seq data. Specifically, we simulated bulk samples by aggregating single cells representing the same cell-type and then resampling transcript counts matching to the average number of transcripts seen in single cells (Methods). We observed that signalling entropy of the simulated bulk did not alter appreciably upon downsampling and that results were unchanged (Supplementary Fig. 13).
Next, we repeated the MRH analysis for T-cells and B-cells found in melanomas (Melanoma set, Supplementary Table 1), for which sufficient numbers of single cells had been profiled. In all cases, signalling entropies of the bulk were much higher than expected based on the distribution of single-cell entropies (Supplementary Fig. 14). Evidence for regulated expression heterogeneity was also seen among the melanoma cancer cells from each of 12 patients (combined Fisher test P<1e−6, Supplementary Fig. 15). We also analysed RNA-Seq data for 96 single cancer cells from a relapsed patient with acute myeloid leukaemia(AML set43, Supplementary Table 1). The signalling entropy for the AML cell population was 0.88, significantly larger than the maximal value over the 96 cells (SR=0.82, Normal deviation test P<0.001, Fig. 6d). Again, as a negative control we analysed all 19 bulk AML samples at relapse and diagnosis, treating bulk samples from independent AML patients as if they were single cells from a common population. Estimating the signalling entropy of the average expression profile over all 19 bulk samples did not reveal a value significantly higher than that of the individual bulk samples (normal deviation test P=0.32, Fig. 6d).
Discussion
Although Waddington proposed his famous epigenetic landscape of cellular differentiation many decades ago10, it has proved challenging to construct a robust molecular correlate of a cell’s elevation in this landscape. Here we have made significant progress, demonstrating that the differentiation potency and phenotypic plasticity of single cells, be they normal or malignant, can be estimated in silico from their RNA-Seq profile using signalling entropy. As we have seen, signalling entropy can accurately discriminate pluripotent from multipotent and differentiated cells, without the need for feature selection or training, outperforming a pluripotency gene expression signature and providing a more general measure of differentiation potency.
Importantly, signalling entropy should not be confused with other transcriptional entropy measures, which are estimated over populations of single cells45,46. For instance, the ‘transcriptional entropy’ of Richard et al.45 is estimated for single genes across single cells, and therefore reflects the amount of intercellular heterogeneity in the expression of a given gene. Our signalling entropy measure is estimated for a single cell across genes in the context of a large gene network, which therefore incorporates systems-level information and is genome-wide (Fig. 1a,b). While the signalling entropy of single cells will influence the amount of transcriptional heterogeneity and entropy as defined by Richard et al., the precise relation between the two entropies is non-trivial. Indeed, we have here shown how we can assign single cells into potency states, from which a SI over the whole cell population (that is, using the distribution of potency states over single cells) can then be estimated (Fig. 1c). This SI is more analogous to the transcriptional entropy of Richard et al. Indeed, we have shown how this SI is higher in a population of NPCs than in a population of hESCs (Fig. 3c). Thus, the SI has nothing to do with potency as such, that is, it does not measure the average differentiation potency of single cells in a cell population. In contrast, our signalling entropy does measure potency of single cells in a cell population. Thus, there is no requirement for our single-cell signalling entropy measure to exhibit a peak before a critical cell-fate transition occurs45,46. In contrast, the SI of a cell population derived from signalling entropy may exhibit the expected hallmarks of criticality. It will be interesting in future to test this with upcoming high-resolution time course and genome-wide scRNA-Seq data.
The ability of signalling entropy to independently order single cells according to differentiation potency is a central component of the SCENT algorithm, which, as shown here, can help quantify and identify biologically relevant intercellular expression heterogeneity and cell subpopulations. Indeed, key findings which strongly support the validity of SCENT are the following: (i) using SCENT we were able to correctly predict that a hESC population contains a small fraction of cells of lower potency which are primed for differentiation, (ii) SCENT inferred that an assayed NPC population was made up two distinct subsets, correctly predicting that only the lower potency subset represents bona fide NPCs (as determined by expression of known neural stem cell markers) and (iii) in a time-course differentiation experiment of human myoblasts, SCENT correctly identified a contaminating interstitial mesenchymal cell population, whose potency did not change appreciably during the experiment. We note that this particular insight is not readily obtainable using other algorithms such as Monocle or MPath14,15. Thus, the ability of SCENT to assign single cells and cell subpopulations to specific potency states thus adds novel insight and functionality over what can be achieved with other existing algorithms. Alternatively, signalling entropy could be combined with existing algorithms like Monocle14 or DPT17,47 to empower their inference, since signalling entropy provides a more unbiased, independent, approach to ordering single cells in pseudo-time, that is, it constitutes an approach which does not need prior knowledge such as the time point or markers of specific cell-types.
In a proof of principle analysis, we further demonstrated the ability of SCENT to identify putative drug resistant CSCs, encompassing two different cancer types (AML and prostate cancer), including CTCs. The ability to quantify stemness in cancer cell populations, either in tissue or in circulation, is a task of enormous importance. As shown here, as well as in our previous work on bulk cancer tissue9,11,13, signalling entropy is, so far, the only single sample measure to have been conclusively demonstrated to robustly correlate with stemness in both normal and cancer cells. Indeed, a recent study by Gruen et al.18 explored a very different measure of transcriptome entropy, but which was not demonstrated to correlate well with differentiation potency or cancer. Likewise, signalling entropy is a more general measure of stemness/plasticity outperforming existing pluripotency expression signatures, as shown here and previously11.
Importantly, signalling entropy also provides a computational framework in which to understand differentiation potency at the macroscopic (cell population) level from the corresponding potencies of single cells. As shown here, signalling entropy of cell populations, be they normal or malignant cells, exhibit synergy, with the entropy of the bulk being substantially higher than the entropy values of single cells. While no existing assay can measure all single cells in a population, we nevertheless demonstrated that our result is non-trivial, since mixing up bulk samples (to serve as a negative control) did not reveal such synergy. We also showed that these results were not confounded by the larger number of dropouts in scRNA-Seq data. Biologically, increased potency of a cell population as a result of synergistic cell–cell interactions, supports the view that features such as pluripotency are best understood at the cellular population level3.
Finally, it is important to discuss the technical and biological properties of signalling entropy that underlie its robustness as a measure of differentiation potency. First of all, gene expression values enter the computation of signalling entropy only as gene ratios. Taking ratios of gene expression values and introducing a regularization term to offset dropouts, makes the resulting inference much less sensitive to the sequencing depth, absolute scale and normalization procedure of scRNA-Seq data. Second, signalling entropy is estimated over a fairly large number of genes (8,000–10,000), making it naturally robust to single gene dropouts. Third, its biological robustness stems in part from differentiation potency being encoded by a subtle positive correlation between the transcriptome and connectome, similar to our previous observations in the context of cancer12. Since there is no reason to expect that technical dropouts in scRNA-Seq should correlate with the connectivity of the corresponding protein in a PPI network, such technical effects are expected to average out. Finally, it is worth emphasizing in this context that signalling entropy provided a more accurate and robust measure of differentiation potency than other transcriptomic entropy-based measures (those used in StemID and SLICE) which do not use network information.
To conclude, signalling entropy and the SCENT algorithm provide a computational framework to advance our understanding of single-cell biology. We envisage that SCENT will be of great value for quantifying biologically relevant intercellular heterogeneity and for identifying putative normal and cancer stem-cells from scRNA-Seq data.
Methods
Single cell and bulk RNA-Seq data sets
The main data sets analysed here, the NGS platform used and their public accession numbers are listed in Supplementary Table 1. Below is a more detailed description of the samples in each data set:
Chu et al . set. This RNA-Seq data set derives from Chu et al.28 This set consisted of four experiments. Experiment-1 generated scRNA-Seq data for 1,018 single cells, composed of 374 hESCs (212 single cells from H1 and 162 from H9 cell line), 173 NPCs, 138 DEPs, 105 mesoderm-derived ECs, 69 TB cells and 159 HFFs. Experiment-2 is a time course differentiation of single cells, specifically of hESCs induced to differentiate into the definite endoderm, via a mesoendoderm intermediate. Time points assayed were before induction (t=0 h, n=92), 12 h after induction (12 h, n=102), 24 h (n=66), 36 h (n=172), 72 h (n=138) and 96 h (n=188). Experiment-3 matches experiment-1 and consists of RNA-Seq data from 19 bulk samples: 7 representing hESCs, 2 representing NPCs, 2 TBs, 3 HFFs, 3 ECs and 2 DEPs. Experiment-4 consists of 15 RNA-Seq profiles from bulk samples, profiled as part of the time-course differentiation experiment (Experiment-2), with three samples per time-point (12 h, 24 h, 36 h, 72 h, 96 h).
Melanoma set. This scRNA-Seq data set derives from Tirosh et al.23, and consists of 4,645 single cells derived from the tumour microenvironment of 19 melanoma patients. Of these, 3,256 are non-malignant cells, encompassing T-cells (n=2,068), B-cells (n=515), NK cells (n=52), Macrophages (n=126), endothelial cells (EndC, n=65) and CAFs (n=61). The rest of single cells profiled were malignant melanoma cells (n=1,257).
AML set. This set derives from Li et al.43 A total of 96 single cells from a relapsed AML patient (patient ID=130) were profiled. In addition, 38 paired bulk AML samples were profiled from 19 patients (all experiencing relapse), with 19 samples obtained at diagnosis and with the other matched 19 samples obtained at relapse.
hESC set. This set derives from Yan et al.30 It consists of 124 single-cell profiles, of which 90 are from different stages of embryonic development, with 34 cells representing hESCs. These 34 hESCs were derived from the inner cell mass, with eight cells profiled at primary outgrowth and 26 profiled at passage-10. The 90 single cells from the pre-implantation embryo were distributed as follows: Oocyte (n=3), Zygote (n=3), 2-cell embryo (n=6), 4-cell embryo (n=12), 8-cell embryo (n=20), morulae (n=16) and late blastocyst (n=30).
Trapnell et al . set. This scRNA-Seq set derives from Trapnell et al.14 It consists of a time-course differentiation experiment of human myoblasts, which profiled a total of 372 single cells: 96 cells at t=0 (time at which differentiation was induced), 96 at t=24 h after induction, another 96 at t=48 h after induction and 84 cells at 72 h post induction.
CTC-PrCa set. This scRNA-Seq data set derives from Miyamoto et al.44 We focused on a subset of 73 single cells from castration resistant prostate cancers, of which 36 derived from patients who developed resistance to enzulatamide treatment, with the remaining 37 derived from treatment-naïve patients.
Treutlein set. This scRNA-Seq data set derives from Treutlein et al.29 There are a total of 201 single cells assayed at four different stages in the developing mouse epithelium, including embryonic day 14, 16, 18 and adulthood. At E18, a subset of single cells were characterized into alveolar type-1 and type-2 cells (AT1 & AT2), as well as a putative bipotent (BP) subgroup.
The single-cell entropy algorithm
There are five steps to the SCENT algorithm: (1) estimation of the differentiation potency of single cells via computation of signalling entropy, (2) inference of the potency state distribution across the single-cell population, (3) quantification of the intercellular heterogeneity of potency states, (4) inference of single-cell landmarks, representing the major potency-coexpression clusters of single cells and (5) lineage trajectory (or dependency network) reconstruction between landmarks. We now describe each of these steps:
Computation of signalling entropy. The computation of signalling entropy for a given sample proceeds using the same prescription as used in our previous publications9,11. Briefly, the normalized genome-wide gene expression profile of a sample (this can be a single cell or a bulk sample) is used to assign weights to the edges of a highly curated PPI network. The construction of the PPI network itself is described in detail elsewhere11, and is obtained by integrating various interaction databases which form part of Pathway Commons (www.pathwaycommons.org)48. The weighting of the network via the transcriptomic profile of the sample provides the biological context. The weight of an edge between protein i and protein j, denoted by wij, is assumed to be proportional to the normalized expression levels of the coding genes in the sample, that is, we assume that wij∼xixj. We interpret these weights (if normalized) as interaction probabilities. The above construction of the weights is based on the assumption that in a sample with high expression of i and j, that the two proteins are more likely to interact than in a sample with low expression of i and/or j. Viewing the edges generally as signalling interactions, we can thus define a random walk on the network, assuming we normalize the weights so that the sum of outgoing weights of a given node i is 1. This results in a stochastic matrix, P, over the network, with entries
where N(i) denotes the neighbours of protein i, and where A is the adjacency matrix of the PPI network (Aij=1 if i and j are connected, 0 otherwise, and with Aii=0). The signalling entropy is then defined as the entropy rate (denoted Sr) over the weighted network, that is,
where π is the invariant measure, satisfying πP=π and the normalization constraint πT1=1. The invariant measure, also known as steady-state probability, represents the relative probability of finding the random walker at a given node in the network (under steady-state conditions that is, long after the walk is initiated). Nodes with high values thus represent nodes that are particularly influential in distributing signalling flux in the network. In the steady-state we can assume detailed balance (conservation of signalling flux, that is, ), and it can be shown9 that πi=xi(Ax)i/(xTAx). Given a fixed adjacency matrix A (that is, fixing the topology), it can also be shown9 that the maximum possible Sr among all compatible stochastic matrices P, is the one with where ⊗ denotes product of matrix entries and where v is the dominant eigenvector of A, that is, Av=λv with λ the largest eigenvalue of A. We denote this maximum entropy rate by maxSr, and define the normalized entropy rate (with range of values between 0 and 1) as
Throughout this work, we always display this normalized entropy rate.
Inference of potency states. In this work, we show that signalling entropy (that is, the entropy rate SR) provides a proxy to the differentiation potential of single cells. We can model a cell population as a statistical mechanical model, in which each single cell has access to a number of different potency states. For a large collection of single cells we can estimate their signalling entropies, and infer from this distribution of signalling entropies the number of underlying potency states using a mixture modelling framework. Since SR is bounded between 0 and 1, we first conveniently transform the SR value of each single cell into their logit scale, that is, y(SR)=log2(SR/(1−SR)). Subsequently, we fit a mixture of Gaussians to the y(SR) values of the whole cell population, and use the Bayesian information criterion (as implemented in the mclust R-package)49 to estimate the optimal number K of potency states, as well as the state-membership probabilities of each individual cell. Thus, for each single cell, this results in its assignment to a specific potency state.
Quantifying intercellular heterogeneity of potency states. For a population of N cells, we can then define a probability distribution pk over the inferred potency states. For K inferred potency states, one can then define a normalized SI:
which measures the amount of heterogeneity in potency within the single-cell population (1=high heterogeneity in potency, 0=no heterogeneity in potency).
Inference of co-expression clusters and landmarks. With each cell assigned to a potency state, we next perform clustering (using the scRNA-seq profiles) of the single cells. We use the partitioning-around-medoids (PAM) algorithm with the average silhouette width to estimate the optimal number of clusters, a combination which was found to be among the most optimal clustering algorithms in applications to omic data50. Clustering of the cells is performed over a filtered set of genes that are identified as those driving most variation in the complete data set, as assessed using singular value decomposition (SVD). In detail, we perform a SVD on the full z-scored normalized RNA-Seq profiles of the cells, selecting the significant components using random matrix theory (RMT)51 and picking the top 5% genes with largest absolute weights in each significant component. The final set of genes is obtained by the union of those identified from each significant component. PAM clustering (with a Pearson distance correlation metric) of all cells results in the assignment of each cell into a co-expression cluster, with a total number of np cell clusters. Thus, each cell is assigned to a unique potency state and co-expression cluster. Finally, landmarks are identified by selecting potency-state cluster combinations containing at least 1–5% of all single cells. Importantly, each of these landmarks has a specific potency state and mean signalling entropy value, allowing ordering of these landmarks according to potency.
Inference of lineage trajectories. For each landmark in step-4, we compute centroids of gene expression using only cells that are contained within that landmark and defined only over the genes used in the PAM clustering. Partial correlations52,53 between the centroid landmarks are then estimated to infer trajectories/dependencies between landmarks. Significant positive partial correlations may indicate transitions between landmarks. Since each landmark has a signalling entropy value associated with it, directionality is inferred by comparing their respective potency states.
A fast Pearson correlation approximation
Under certain assumptions (to be discussed below), there is a useful approximation to signalling entropy, which also provides important biological insight. It entails first using an approximation for the steady-state probability (invariant measure) π. As before, in the steady-state, we can assume the detailed balance condition (conservation of signalling flux: that is, ), so that the invariant measure satisfies πi∼xi(Ax)i (ref. 9). If we now take a global mean field approximation, that is, if we replace the expression values of the neighbours of gene i, with the mean expression value over all genes in the network, it then follows that πi∼xiki, where ki is the connectivity of gene/protein i in the network. Hence, , which is effectively the three-way correlation between the transcriptome, connectome and local signalling entropies. If we assume further that the dynamic range of local signalling entropies is small (which for realistic PPI networks is often the case12), and also assuming that the local entropies correlate positively with node-degree, we obtain that SR∼ xiki, that is, the signalling entropy is approximately the Pearson correlation of the cell´s transcriptome and the connectome from the PPI network.
Importantly, we stress that (i) this approximation is an empirical one which works reasonably well for the realistic PPI networks considered here, and (ii) that the signalling entropy and its Pearson correlation approximation are not equivalent, since there exist networks where the two measures give widely different answers. In particular, if a network has scale-free topology, but with the hubs not connected to each other, then a positive correlation between expression and connectivity may not lead to a higher signalling entropy. For instance, if the low-degree nodes (‘bottlenecks’) linking the hubs have very low expression then signalling flux cannot be distributed over the network, leading to a lower entropy rate compared to an expression configuration where all genes have similar expression values (Supplementary Fig. 7). For realistic PPI networks, hubs are generally connected to each other and for these type of networks, the Pearson approximation works well. We note that for a 8,393 node network with 300,916 edges, the computation of SR for 100 samples takes ∼370 s on an Intel Xeon CPU E3-1575M 3.00 GHz, whereas that of its Pearson correlation approximation only takes 1/10 s, thus although the approximation is computationally much faster, the computation of SR for 1 sample only takes about 4 s.
Ranking genes according to differential local entropy
Since signalling entropy is obtained as a weighted average over local signalling entropies (that is, ) with the local entropies defined by , the latter can be used to identify genes in the network where the signalling flux distribution differs between two phenotypes. Specifically, we use the normalized version of the local signalling entropy, defined by , which is bounded between 0 and 1, thus allowing genes of different connectivity to be compared. Thus, for each gene and each sample, we can compute a local entropy and genes can then be ranked according to the difference in local entropy using an empirical Bayes framework11,54 to derive moderated t-statistics which reflect the significance in differential local entropy. Adjustment for multiple-testing was performed using the Benjamini–Hochberg procedure.
Gene set enrichment analysis
We performed GSEA on the top-ranked genes, ranked according to differential local entropy between pluripotent and non-pluripotent cells. Specifically, we focused on the genes exhibiting increased local signalling entropy in pluripotent cells, and focused on a range of thresholds (top 500, 600, 700, 800, 900 and 1,000) to assess robustness. Enrichment was performed using a one-tailed Fisher’s exact test, as implemented by us previously55. Enrichment was assessed against the Molecular Signatures Database (http://software.broadinstitute.org/gsea/msigdb)34.
Application to mouse scRNA-Seq data
In our application to mouse scRNA-Seq data, we first converted mouse gene Ensembl IDs into their human homologues using the AnnotationTools Bioconductor package56. Only those mapping to a unique human homologue were considered. The resulting set of genes were then integrated with our human PPI network.
Estimation of cell-cycle and TPSC pluripotency scores
To identify single cells in either the G1-S or G2-M phases of the cell-cycle we followed the procedure described in ref. 23. Briefly, genes whose expression is reflective of G1-S or G2-M phase were obtained from refs 32, 33. A given normalized scRNA-Seq data matrix is then z-score normalized for all genes present in these signatures. Finally, a cycling score for each phase and each cell is obtained as the average z-scores over all genes present in each signature.
To obtain an independent estimate of pluripotency we used the pluripotency gene expression signature of Palmer et al.22, which we have used extensively before11. This signature consists of 118 genes that are overexpressed and 39 genes that are underexpressed in pluripotent cells. The TPSC score for each cell with scRNA-Seq data is obtained as the t-statistic of the gene expression levels between the overexpressed and underexpressed gene categories. Optionally, the scRNA-Seq is z-score normalized beforehand and the t-statistic is obtained by comparing expression z-scores. However, we note that the z-score procedure uses information from all single cells, so the fairest comparison to signalling entropy means we ought to compare expression levels. We note that the TPSC scores obtained from z-scores or expression levels were highly correlated and did not affect any of the conclusions in this paper.
Comparison analysis of bulk and single-cell RNA-Seq data
Since SR can be computed for each single cell, one can compare the predicted entropies of bulk samples (cell population) to those of the single cells making up that population. To test whether the entropy of the bulk deviates markedly from that of single cells, we computed a z-score, by comparing the entropy of the bulk to that of the single cells where the latter distribution is modelled as a Gaussian. This z-score is called MRH, since it assesses whether the transcriptomes of single cells differ in a regulated synergistic manner, increasing entropy (potency) well above that of single cells. In the case where matched bulk samples were not available, we simulated bulk samples in two distinct ways. In one approach, we simply averaged the single-cell transcriptomes before computing SR. In a second approach, which corrects for the large number of dropouts present in scRNA-Seq data, by first aggregate the transcript counts of all single cells, and then downsample counts so as to match to the average number of transcripts per single-cell. Robustness to the specific downsampling draw was tested by performing 100 Monte-Carlo samplings.
Other entropy measure proxies for differentiation potency
Briefly, we describe two other entropy-based measures for approximating differentiation potency in a single-cell context, but which do not make use of a PPI network. One measure is part of the StemID algorithm18. However, the original StemID algorithm does not estimate differentiation potency of single cells. Instead it provides estimates for single-cell clusters, which are inferred by clustering the expression profiles of single cells. Thus, for a given cluster
k, StemID computes a potency which is proportional to δEk, where
where Ec is the information entropy of cell c, defined by (where N is the number of genes and where qgc is the normalized number of reads mapping to gene g in cell c). Thus, to objectively compare to our signalling entropy measure, which does not use information of other cells when estimating potency of a given cell, we here use Ec as the potency estimate from StemID. Another information entropy-based measure is part of the SLICE algorithm, proposed by Guo et al.31 Briefly, in this approach, genes are first clustered into related GO-terms to define m functional gene clusters. For a given cell c, relative activity of each functional cluster k is estimated from the average expression of genes mapping to that cluster. These activity scores are then normalized so that they can be interpreted as probabilities qkc, and subsequently the potency of cell c is estimated as the information entropy where the expectation is taken over a number of bootstraps over genes. We compute this information entropy using the R-script provided in Guo et al.31
Code availability
Signalling entropy is available as part of the Single Cell Entropy (SCENT) R-package and is freely available from github: https://github.com/aet21/SCENT.
Data availability
All data analysed in this manuscript is already publicly available from the following GEO (www.ncbi.nlm.nih.gov/geo/) accession numbers: GSE72056, GSE83533, GSE75748, GSE36552, GSE52529, GSE67980 and GSE52583. All data is also available on request from the authors.
Additional information
How to cite this article: Teschendorff, A. E. et al. Single-cell entropy for accurate estimation of differentiation potency from a cell’s transcriptome. Nat. Commun. 8, 15599 doi: 10.1038/ncomms15599 (2017).
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
Buettner, F. et al. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat. Biotechnol. 33, 155–160 (2015).
Levsky, J. M., Shenoy, S. M., Pezo, R. C. & Singer, R. H. Single-cell gene expression profiling. Science 297, 836–840 (2002).
MacArthur, B. D. & Lemischka, I. R. Statistical mechanics of pluripotency. Cell 154, 484–489 (2013).
Stegle, O., Teichmann, S. A. & Marioni, J. C. Computational and analytical challenges in single-cell transcriptomics. Nat. Rev. Genet. 16, 133–145 (2015).
Pina, C. et al. Single-cell network analysis identifies DDIT3 as a nodal lineage regulator in hematopoiesis. Cell Rep. 11, 1503–1510 (2015).
Pina, C. et al. Inferring rules of lineage commitment in haematopoiesis. Nat. Cell Biol. 14, 287–294 (2012).
Kalmar, T. et al. Regulated fluctuations in nanog expression mediate cell fate decisions in embryonic stem cells. PLoS Biol. 7, e1000149 (2009).
Chambers, I. et al. Nanog safeguards pluripotency and mediates germline development. Nature 450, 1230–1234 (2007).
Teschendorff, A. E., Sollich, P. & Kuehn, R. Signalling entropy: a novel network-theoretical framework for systems analysis and interpretation of functional omic data. Methods 67, 282–293 (2014).
Waddington, C. R. Principles of Development and Differentiation Macmillan Company (1966).
Banerji, C. R. et al. Cellular network entropy as the energy potential in Waddington's differentiation landscape. Sci. Rep. 3, 3039 (2013).
Teschendorff, A. E., Banerji, C. R., Severini, S., Kuehn, R. & Sollich, P. Increased signaling entropy in cancer requires the scale-free property of protein interaction networks. Sci. Rep. 5, 9646 (2015).
Banerji, C. R., Severini, S., Caldas, C. & Teschendorff, A. E. Intra-tumour signalling entropy determines clinical outcome in breast and lung cancer. PLoS Comput. Biol. 11, e1004115 (2015).
Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381–386 (2014).
Chen, J., Schlitzer, A., Chakarov, S., Ginhoux, F. & Poidinger, M. Mpath maps multi-branching single-cell trajectories revealing progenitor cell progression during development. Nat. Commun. 7, 11988 (2016).
Marco, E. et al. Bifurcation analysis of single-cell gene expression data reveals epigenetic landscape. Proc. Natl Acad. Sci. USA 111, E5643–E5650 (2014).
Haghverdi, L., Buttner, M., Wolf, F. A., Buettner, F. & Theis, F. J. Diffusion pseudotime robustly reconstructs lineage branching. Nat. Methods 13, 845–848 (2016).
Grun, D. et al. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell 19, 266–277 (2016).
Lee, T. I. et al. Control of developmental regulators by Polycomb in human embryonic stem cells. Cell 125, 301–313 (2006).
Gomez-Gardenes, J. & Latora, V. Entropy rate of diffusion processes on complex networks. Phys. Rev. E Stat. Nonlinear Soft Matter Phys. 78, 065102 (2008).
Chu, L. F. et al. Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome Biol. 17, 173 (2016).
Palmer, N. P., Schmid, P. R., Berger, B. & Kohane, I. S. A gene expression profile of stem cell pluripotentiality and differentiation is conserved across diverse solid and hematopoietic cancers. Genome Biol. 13, R71 (2012).
Tirosh, I. et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 352, 189–196 (2016).
Lacorre, D. A. et al. Plasticity of endothelial cells: rapid dedifferentiation of freshly isolated high endothelial venule endothelial cells outside the lymphoid tissue microenvironment. Blood 103, 4164–4172 (2004).
Oliver, G. & Srinivasan, R. S. Endothelial cell plasticity: how to become and remain a lymphatic endothelial cell. Development 137, 363–372 (2010).
Kalluri, R. The biology and function of fibroblasts in cancer. Nat. Rev. Cancer 16, 582–598 (2016).
Chen, W. J. et al. Cancer-associated fibroblasts regulate the plasticity of lung cancer stemness via paracrine signalling. Nat. Commun. 5, 3472 (2014).
Chu, L.-F. et al. Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definite endoderm. Genome Biol. 17, 173 (2016).
Treutlein, B. et al. Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature 509, 371–375 (2014).
Yan, L. et al. Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells. Nat. Struct. Mol. Biol. 20, 1131–1139 (2013).
Guo, M., Bao, E. L., Wagner, M., Whitsett, J. A. & Xu, Y. SLICE: determining cell differentiation and lineage based on single cell entropy. Nucleic Acids Res. 45, e54 (2017).
Macosko, E. Z. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
Whitfield, M. L. et al. Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Mol. Biol. Cell 13, 1977–2000 (2002).
Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).
das Neves, R. P. et al. Connecting variability in global transcription rate to mitochondrial variability. PLoS Biol. 8, e1000560 (2010).
Johnston, I. G. et al. Mitochondrial variability as a source of extrinsic cellular noise. PLoS Comput. Biol. 8, e1002416 (2012).
Guantes, R. et al. Global variability in gene expression and alternative splicing is modulated by mitochondrial content. Genome Res. 25, 633–644 (2015).
Schieke, S. M. et al. Mitochondrial metabolism modulates differentiation and teratoma formation capacity in mouse embryonic stem cells. J. Biol. Chem. 283, 28506–28512 (2008).
Wanet, A., Arnould, T., Najimi, M. & Renard, P. Connecting mitochondria, metabolism, and stem cell fate. Stem Cells Dev. 24, 1957–1971 (2015).
Sukumar, M. et al. Mitochondrial membrane potential identifies cells with enhanced stemness for cellular therapy. Cell Metab. 23, 63–76 (2016).
Hu, C. et al. Energy metabolism plays a critical role in stem cell maintenance and differentiation. Int. J. Mol. Sci. 17, 253 (2016).
Folmes, C. D. & Terzic, A. Energy metabolism in the acquisition and maintenance of stemness. Semin. Cell Dev. Biol. 52, 68–75 (2016).
Li, S. et al. Distinct evolution and dynamics of epigenetic and genetic heterogeneity in acute myeloid leukemia. Nat. Med. 22, 792–799 (2016).
Miyamoto, D. T. et al. RNA-Seq of single prostate CTCs implicates noncanonical Wnt signaling in antiandrogen resistance. Science 349, 1351–1356 (2015).
Richard, A. et al. Single-cell-based analysis highlights a surge in cell-to-cell molecular variability preceding irreversible commitment in a differentiation process. PLoS Biol. 14, e1002585 (2016).
Mojtahedi, M. et al. Cell fate decision as high-dimensional critical state transition. PLoS Biol. 14, e2000640 (2016).
Angerer, P. et al. Destiny: diffusion maps for large-scale single-cell data in R. Bioinformatics 32, 1241–1243 (2016).
Cerami, E. G. et al. Pathway commons, a web resource for biological pathway data. Nucleic Acids Res. 39, D685–D690 (2011).
Yeung, K. Y., Fraley, C., Murua, A., Raftery, A. E. & Ruzzo, W. L. Model-based clustering and data transformations for gene expression data. Bioinformatics 17, 977–987 (2001).
Wiwie, C., Baumbach, J. & Rottger, R. Comparing the performance of biomedical clustering methods. Nat. Methods 12, 1033–1038 (2015).
Teschendorff, A. E., Zhuang, J. & Widschwendter, M. Independent surrogate variable analysis to deconvolve confounding factors in large-scale microarray profiling studies. Bioinformatics 27, 1496–1505 (2011).
Schafer, J. & Strimmer, K. An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics 21, 754–764 (2005).
Barzel, B. & Barabasi, A. L. Network link prediction by global silencing of indirect correlations. Nat. Biotechnol. 31, 720–725 (2013).
Smyth, G. K. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3, (2004).
Teschendorff, A. E. et al. An epigenetic signature in peripheral blood predicts active ovarian cancer. PLoS ONE 4, e8274 (2009).
Kuhn, A., Luthi-Carter, R. & Delorenzi, M. Cross-species and cross-platform gene expression studies with the Bioconductor-compliant R package 'annotationTools’. BMC Bioinformatics 9, 26 (2008).
Acknowledgements
This work was supported by NSFC (National Science Foundation of China) grants, grant numbers 31571359 and 31401120 by a Royal Society Newton Advanced Fellowship (NAF project number: 522438, NAF award number: 164914) and by a Medical Research Council grant (number 519159). The author also wishes to thank Guo-Cheng Yuan for stimulating discussions.
Author information
Authors and Affiliations
Contributions
Manuscript was conceived and written by A.E.T. Statistical analyses were performed by A.E.T. T.E. contributed useful feedback.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Information
Supplementary Figures, Supplementary Tables and Supplementary References (PDF 885 kb)
Supplementary Data
Table of enriched biological terms (from the Molecular Signatures Database http://software.broadinstitute.org/gsea/msigdb ) with columns labeling the biological term, the odds ratio (OR) of enrichment, the Benjamini-Hochberg adjusted P-value (adjP) and all the genes exhibiting increased local signaling entropy in pluripotent vs non-pluripotent cells and which also map to that biological term. The table represents a consensus result over 6 different top-ranked numbers of genes (500,600,700,800,900,1000). The P-value was derived from a one-tailed Fisher's exact test. We only list terms which achieved a significant adjusted P-value and with an OR>5. They have been ranked according to OR. (XLS 78 kb)
Rights and permissions
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
About this article
Cite this article
Teschendorff, A., Enver, T. Single-cell entropy for accurate estimation of differentiation potency from a cell’s transcriptome. Nat Commun 8, 15599 (2017). https://doi.org/10.1038/ncomms15599
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/ncomms15599
This article is cited by
-
Identification of female-enriched and disease-associated microglia (FDAMic) contributes to sexual dimorphism in late-onset Alzheimer’s disease
Journal of Neuroinflammation (2024)
-
Transcriptomic and proteomic profiles of fetal versus adult mesenchymal stromal cells and mesenchymal stromal cell-derived extracellular vesicles
Stem Cell Research & Therapy (2024)
-
Single-cell integrative analysis reveals consensus cancer cell states and clinical relevance in breast cancer
Scientific Data (2024)
-
Charting cellular differentiation trajectories with Ricci flow
Nature Communications (2024)
-
Single-cell transcriptome analysis of epithelial, immune, and stromal signatures and interactions in human ovarian cancer
Communications Biology (2024)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.