Single-cell entropy for accurate estimation of differentiation potency from a cell's transcriptome

The ability to quantify differentiation potential of single cells is a task of critical importance. Here we demonstrate, using over 7,000 single-cell RNA-Seq profiles, that differentiation potency of a single cell can be approximated by computing the signalling promiscuity, or entropy, of a cell's transcriptome in the context of an interaction network, without the need for feature selection. We show that signalling entropy provides a more accurate and robust potency estimate than other entropy-based measures, driven in part by a subtle positive correlation between the transcriptome and connectome. Signalling entropy identifies known cell subpopulations of varying potency and drug resistant cancer stem-cell phenotypes, including those derived from circulating tumour cells. It further reveals that expression heterogeneity within single-cell populations is regulated. In summary, signalling entropy allows in silico estimation of the differentiation potency and plasticity of single cells and bulk samples, providing a means to identify normal and cancer stem-cell phenotypes.

O ne of the most important tasks in single-cell RNA-sequencing studies is the identification and quantification of 'intercellular transcriptomic heterogeneity', that is, variation between the transcriptomes of single cells that is of biological relevance [1][2][3][4] . Although some of the observed intercellular transcriptomic variation represents stochastic noise, a substantial component has been shown to be of functional importance 1,[5][6][7][8] . Very often, this biologically relevant heterogeneity can be attributed to cells occupying states of different potency or plasticity. Thus, quantification of differentiation potency, or more generally functional plasticity, at the single-cell level is of paramount importance. However, currently there is no concrete theoretical and computational model for estimating such plasticity at the single-cell level.
Here we make significant progress towards addressing this challenge. We propose a very general model for estimating cellular plasticity. A key feature of this model is the computation of signalling entropy 9 , which quantifies the degree of uncertainty, or promiscuity, of a cell's gene expression levels in the context of a cellular interaction network. In effect, signalling entropy uses the transcriptomic profile of a cell to quantify the relative activation levels of its molecular pathways, and more generally that of biological processes, as defined over an a priori specified protein interaction network. We show that signalling entropy provides an excellent and robust proxy to the differentiation potential of a cell in Waddington's epigenetic landscape 10 , and further provides a framework in which to understand the overall differentiation potency and transcriptomic heterogeneity of a cell population in terms of single-cell potencies. Attesting to its general nature and broad applicability, we compute and validate signalling entropy in over 7,000 single cells of variable degrees of differentiation potency and phenotypic plasticity, including time-course differentiation data, neoplastic cells and circulating tumour cells (CTCs). This extends entropy concepts that we have previously demonstrated to work on bulk tissue data 9,11-13 to the single-cell level. On the basis of signalling entropy, we develop a novel algorithm called single-cell entropy (SCENT), which can be used to identify and quantify biologically relevant expression heterogeneity in single-cell populations, as well as to reconstruct cell-lineage trajectories from time-course data. In this regard, SCENT differs substantially from other single-cell algorithms like Monocle 14 , MPath 15 , SCUBA 16 , Diffusion Pseudotime 17 or StemID 18 , in that it uses single-cell entropy to independently order single cells in pseudo-time (that is, differentiation potency), without the need for feature selection or clustering.

Results
The signalling entropy framework. A pluripotent cell (by definition endowed with the capacity to differentiate into effectively all major cell-lineages) does not express a preference for any particular lineage, thus requiring a similar basal activity of all lineage-specifying transcription factors 9,19 . Viewing a cell's choice to commit to a particular lineage as a probabilistic process, pluripotency can therefore be characterized by a state of high uncertainty, or entropy, because all lineage choices are equally likely (Fig. 1a). In contrast, for a differentiated cell, or for a cell committed to a particular lineage, signalling uncertainty/entropy is reduced, as this requires activation of a specific signalling pathway reflecting that lineage choice (Fig. 1a). Thus, a measure of global signalling entropy, if computable, could provide us with a relatively good proxy of a cell's overall differentiation potential. Here we propose that differentiation potential can be estimated in silico by integrating a cell's transcriptomic profile with a high quality protein-protein interaction (PPI) network to define a cell-specific probabilistic signalling process (in effect, a random walk) on the network (Methods). Mathematically, this random walk is described by a stochastic matrix whose entries reflect the relative interaction probabilities. Underlying the construction of these probabilities is the assumption that two genes, which can interact at the protein level, are more likely to do so if both are highly expressed (Fig. 1a, Methods). Given this stochastic matrix, global signalling entropy is then computed as the entropy rate (abbreviated as SR) of this probabilistic signalling process on the network 20 (Fig. 1b, Methods), and can be thought of as quantifying the overall level of signalling promiscuity of biological processes within the network. In effect, this quantifies the efficiency, or speed, with which signalling can diffuse over the whole network, and therefore measures the number of separate biological processes which are in some sense 'active'. Since a committed, or differentiated cell, preferentially activates and deactivates specific processes (pathways) in the network, the expectation is that this would manifest itself as a lower entropy rate since signalling cannot diffuse to the regions of the network describing inactive processes.
To further test the general validity and robustness of signalling entropy we computed it for scRNA-Seq profiles of 3,256 nonmalignant cells derived from the microenvironment of 19 melanomas (Melanoma set 23 , Supplementary Table 1). Cells profiled included T-cells, B-cells, natural killer (NK) cells, macrophages, fully differentiated ECs and cancer-associated fibroblasts (CAFs). For a given cell-type and individual, variation between single cells was substantial and similar to the variation seen between individuals ( Supplementary Fig. 1). Mean entropy values however, were generally stable, showing little inter-individual variation, except for T-cells from 4 out of 15 patients, which exhibited a distinctively different distribution ( Supplementary Fig. 1). To assess overall trends, we pooled the single-cell entropy data from all patients together, which confirmed that all lymphocytes (T-cells, B-cells and NK cells) had similar average signalling entropy values (Fig. 2d).
1. Compute signaling entropy rate: Intra-tumour macrophages, which are derived from monocytes, exhibited a marginally higher signalling entropy (Fig. 2d). The highest signalling entropy values were attained by ECs and CAFs ( Fig. 2d), consistent with their known high phenotypic plasticity [24][25][26][27] . Importantly, the entropy values for all of these non-malignant differentiated cell-types were distinctively lower compared to those of hESCs and progenitor cells from Chu et al. (Fig. 2a,d), consistent with the fact that hESCs and progenitors have much higher differentiation potency. To test this formally, we compared hESCs, mesoderm progenitors, and terminally differentiated cells within the mesoderm lineage (which included all ECs and lymphocytes), which revealed a consistent decrease in signalling entropy between all three potency states (Wilcoxon rank test Po1e À 50, Fig. 2e). Of note, signalling entropy could discriminate progenitor and differentiated cells better than the score derived from the pluripotency gene expression signature 22 , attesting to its increased robustness as a general measure of differentiation potency (Fig. 2f, Supplementary Fig. 2). Next, we assessed signalling entropy in the context of a time-course differentiation experiment, whereby hESCs were induced to differentiate into DEPs via the mesoendoderm intermediate 28 . scRNA-Seq for a total of 758 single cells, obtained at six time points, including origin, 12, 24, 36, 72 and 96 h post induction were available (Methods) 28 . We observed that single-cell entropies exhibited a particular large decrease only after 72 h (Fig. 2g), consistent with previous knowledge that differentiation into definite endoderm occurs around 3-4 days after induction 28 . To demonstrate the validity of signalling entropy in another species, we next considered a scRNA-Seq data of cells sampled at different embryonic stages in the development of the mouse lung epithelium 29 ('Treutlein set', Supplementary Table 1, Methods). Signalling entropy decreased continuously until adulthood in line with a gradual increase in differentiation (Fig. 2h). Moreover, at embryonic day 18, it could discriminate alveolar type cells from a recently discovered bipotent progenitor subgroup 29 , albeit with marginal significance due to small cell numbers ( Supplementary Fig. 3A).
To demonstrate the critical importance of the interaction network, we recomputed signalling entropy in the Chu and Treutlein data sets after randomly reshuffling gene expression values over the network (100 and 1,000 permutations, respectively). As expected, upon reshuffling, signalling entropy lost its power to discriminate pluripotent from non-pluripotent cells ( Fig. 2i), and did not exhibit a consistent decrease with developmental stage in Treutlein's set ( Supplementary Fig. 3B).
Robustness to choice of PPI network and NGS platform. Given the importance of the PPI network, it is therefore equally important to verify that signalling entropy is robust to the choice of network. Results were largely unchanged using a different version of a PPI network ( Supplementary Fig. 4). To test the robustness of signalling entropy across independent studies, we analysed scRNA-Seq data for an independent set of single-cell hESCs derived from the primary outgrowth of the inner cell mass ('hESC set' 30  Comparison of signalling entropy to StemID and SLICE. To further highlight the importance of the PPI network, we decided to compare Signalling Entropy to two other entropy-based potency measures, proposed as part of the StemID 18 and SLICE 31 algorithms, which we note do not use any network information. To provide an objective evaluation, we compared the entropy measures of single cells from well-separated differentiation stages, or by comparing start and end points in time-course differentiation experiments, as these cells ought to differ substantially in terms of potency. Adopting this strategy in four scRNA-Seq and one bulk RNA-Seq data set, we observed that signalling entropy was able to provide high discriminative power in each data set (Table 1). In contrast, we did not find StemID and SLICE to be as accurate or robust (Table 1).
Correlation with potency is independent of cell-cycle phase. A major source of variation in scRNA-Seq data is cell-cycle phase 23,32 . We explored the relation between signalling entropy and cell-cycle phase in a large scRNA-Seq data set encompassing 3,256 non-malignant and 1,257 cancer cells derived from the microenvironment of melanomas (Melanoma set 23 , Supplementary Table 1). A cycling score for both G1-S and G2-M phases and for each cell was obtained using a validated procedure 23,32,33 , and compared to signalling entropy, which revealed a strong yet highly non-linear correlation  Table lists one-tailed Wilcoxon rank-sum test P values and associated (one-tailed) AUCs, testing whether entropy is higher in the pluripotent or multipotent cells compared to the less potent cells in various scRNA-Seq and bulk RNA-Seq data sets. In Chu1, the comparison is between pluripotent (hESCs, n ¼ 374, Pl) and non-pluripotent (n ¼ 644, NonPl) single cells. In Chu2, the comparison is between hESCs (0 h, n ¼ 92) and definite endoderm progenitors sampled 96 h later (n ¼ 188). In Trapnell, the comparison is between human myoblasts (0 h, n ¼ 96) and differentiated skeletal muscle cells (72 h, n ¼ 84). In Treutlein, the comparison is between early lung progenitors (E14, n ¼ 45) and mature alveolar cells (n ¼ 46). In Chu3, the comparison is between bulk hESCs (n ¼ 7) and non-pluripotent samples (n ¼ 12  Fig. 5). Specifically, we observed that cells with a low signalling entropy were never found in either the G1-S or G2-M phase ( Supplementary Fig. 5). In contrast, cells with high signalling entropy could be found in either a cycling or non-cycling phase. These results are consistent with the view that cycling cells must increase expression of promiscuous signalling proteins and hence exhibit an increased signalling entropy. Thus, we next asked if signalling entropy correlates with potency when restricting to non-cycling cells. Using the Chu et al. data set, we observed that, although discrimination accuracies were reduced upon correction for cell-cycle phase, signalling entropy could still accurately classify pluripotent from non-pluripotent cell-types (AUC40.9, Po1e À 5, Supplementary Fig. 6, Supplementary Table 2). Consistent with this (and now using both cycling and non-cycling cells), the correlation between signalling entropy and potency remained significant when adjusted for cell-cycle scores (Supplementary Table 2).
Correlation of expression with degree partly drives potency.
To gain further biological insight into signalling entropy, we derived an approximation for signalling entropy in terms of the three-way correlation between the transcriptome, connectome and local signalling entropies (Methods). This approximation implies that if, on average, network hubs are more highly expressed than low-degree nodes and if they exhibit an increase in their local signalling entropy, then this should generally lead to a more efficient distribution of signalling over the network, and hence to an increased global signalling entropy 12 . We thus posited that in cells with a demand for high phenotypic plasticity (for example, pluripotent cells), hubs tend to be overexpressed and exhibit increased signalling promiscuity. Using scRNA-Seq data from Chu et al. 21 , we were able to confirm a weak (Pearson correlation of B0.2) but significant (Po1e À 50) positive correlation of differential gene expression (between hESCs and multipotent cells) with connectivity ( Supplementary Fig. 7A). Importantly, the differential local signalling entropy between hESCs and multipotent cells correlated more strongly with connectivity (Pearson correlation of B0.64, Po1e À 100, Supplementary Fig. 7A), thus confirming the notion that the increased SR in pluripotent cells is also driven by a more distributed signalling (that is, increased local entropy) at network hubs. To demonstrate that the Pearson correlation between transcriptome and connectome can be used to approximate signalling entropy (SR), we computed it for all 1,018 single cells in Chu et al., obtaining an excellent agreement with SR (R 2 ¼ 0.96, Supplementary Fig. 7B), and hence also with potency ( Supplementary Fig. 7C). However, we stress that this Pearson correlation approximation is not a substitute for SR, since the definition of SR includes the local signalling entropies (Fig. 1b), from which important biological information can be extracted.
To demonstrate this, we ranked genes in the network according to their differential local signalling entropy (Methods) and performed gene set enrichment analysis (GSEA) 34 on the genes exhibiting the most significant increases in local entropy between pluripotent (hESCs) and multipotent cells. Top-ranked enriched biological terms included, besides stemness, genes implicated in mRNA splicing and encoding mitochondrial ribosomal proteins (Supplementary Table 3, Supplementary Data 1). This is consistent with recent studies demonstrating that mitochondrial activity influences the global transcription and splicing rate of cells [35][36][37] , and that variations in such activity may influence stemness and differentiation [38][39][40][41][42] . Finally, we also point out that signalling entropy and its Pearson correlation approximation are not equivalent, as there exist networks where both measures yield very different answers (Methods). For instance, in networks where hubs are not connected to each other (unlike our PPI networks where hubs are generally connected to each other), a positive correlation could lead to a lower signalling entropy ( Supplementary Fig. 7D).
Quantifying single-cell expression heterogeneity with SCENT. Given that signalling entropy correlates with differentiation potency, we used it to develop the SCENT algorithm (Fig. 1c). Briefly, SCENT uses the estimated single-cell entropies to infer the distribution of discrete potency states across the cell population ( Fig. 1c, Methods). Thus, SCENT can be used to quantify expression heterogeneity at the level of potency.
In addition, SCENT can be used to directly order single cells in pseudo-time 14 to facilitate reconstruction of lineage trajectories.
A key feature of SCENT is the assignment of each cell to a unique potency state and co-expression cluster, which results in the identification of potency clusters (which we call 'landmarks'), through which lineage trajectories are then inferred (Methods). We first tested SCENT on the scRNA-Seq data from Chu et al., which profiled pluripotent and multipotent cells (Supplementary Table 1). SCENT correctly predicted a parsimonious two-state model, with a high potency pluripotent state and a lower potency non-pluripotent progenitor-like state (Fig. 3a). Interestingly, a small fraction (B4%) of hESCs were deemed to be non-pluripotent cells (Fig. 3b), consistent with previous observations that pluripotent cell populations contain cells that are already primed for differentiation into specific lineages 5,6 . Supporting this, these non-pluripotent 'hESCs' exhibited lower cycling scores and higher expression levels of neural (HES1/SOX2) and mesoderm (PECAM1) stem-cell markers, compared to the pluripotent hESCs ( Supplementary Fig. 8). Whereas all HFFs and ECs were deemed non-pluripotent, DEPs, TBs and NPCs exhibited mixed proportions, with NPCs exhibiting approximately equal numbers of pluripotent and non-pluripotent cells (Fig. 3b). Correspondingly, the Shannon index (SI), which quantifies the level of heterogeneity in potency, was highest for the NPC population (Fig. 3c). In total, SCENT predicted six co-expression clusters, which combined with the two potency states, resulted in a total of seven landmark clusters (Fig. 3d). These landmarks correlated very strongly with cell-type, with only NPCs being distributed across two landmarks of different potency (Fig. 3e). SCENT correctly inferred a lineage trajectory between the high potency NPC subpopulation and its lower potency counterpart, as well as a trajectory between hESCs and DEPs (Fig. 3f). The other cell-types exhibited lower entropies (Fig. 2b, Fig. 3f), and correspondingly did not exhibit a direct trajectory to hESCs, suggesting several intermediate states which were not sampled in this experiment.
To ascertain the biological significance of the two NPC subpopulations (Fig. 3b,e,f), we first verified that the NPCs deemed pluripotent did indeed have a higher pluripotency score ( Supplementary Fig. 9A), as assessed using the independent pluripotency gene expression signature from Palmer et al. 22 We further reasoned that well-known transcription factors marking neural stem/progenitor cells, such as HES1, would be expressed at a much lower level in the NPCs deemed pluripotent compared to the non-pluripotent ones, since the latter are more likely to represent bona fide NPCs. Confirming this, NPCs with low HES1 expression exhibited higher differentiation potential than NPCs with high HES1 expression (Wilcoxon rank-sum test Po0.0001, Fig. 3g). Similar results were evident for other neural progenitor/stem cell markers such as PAX6 and SOX2 ( Supplementary Fig. 9B). Of note, NPCs expressing the lowest levels of PAX6, HES1 or SOX2 were generally always classified by SCENT into a pluripotent-like state (Fig. 3g, Supplementary  Fig. 9B). Thus, these results indicate that SCENT provides a biologically meaningful characterization of intercellular transcriptomic heterogeneity.
SCENT reconstructs lineage trajectories in differentiation. We next tested SCENT in the context of a differentiation experiment of human myoblasts 14  were profiled with RNA-Seq at differentiation induction, as well as at 24 and 48 h after medium switch, with a remaining 84 cells profiled at 72 h. As expected, signalling entropy was highest in the myoblasts, with a switch to lower entropy occurring at 24 h (Fig. 4a). No further decrease in entropy was observed between 24 and 72 h, indicating that commitment of cells to become differentiated skeletal muscle cells already happens early in the differentiation process. Over the whole time course, SCENT predicted a total of 3 potency states, with a distribution consistent with the time of sampling (Fig. 4b). Cells sampled at differentiation induction were made up primarily of two  Fig. 10). In total, SCENT predicted four landmarks, with one landmark defining undifferentiated (t ¼ 0) myoblasts of high potency (Fig. 4d). Another landmark of lower potency contained cells at all time points, with cells expressing markers of mesenchymal cells (for example, PDFGRA and FN1/LTBP2) (Fig. 4d). Cells from this landmark which were present at differentiation induction exhibited intermediate potency expressing low levels of CDK1 ( Supplementary Fig. 10, Fig. 4d), suggesting that these are 'contaminating' interstitial mesenchymal cells that were already present at the start of the time course, in line with previous observations 14,15 . Importantly, SCENT correctly predicts that the potency of all these mesenchymal cells in this landmark does not change during the time-course, consistent with the fact that these cells are not primed to differentiate into skeletal muscle cells, but which nevertheless aid the differentiation process 14  non-malignant cell-types. We confirmed this using scRNA-Seq data from 12 melanomas (Melanoma set 23 , Supplementary  Table 1), for which sufficient normal and cancer cells had been profiled (Fig. 5a, Supplementary Fig. 11). Although there was some variation in the signalling entropy of cancer cells between individuals, this variation was relatively small in comparison to the difference in entropy between cancer and normal cells. Combining data across all 12 patients, demonstrated a dramatic increase in the signalling entropy of single cancer cells compared to non-malignant ones (Wilcoxon rank-sum test Po1e À 500, Fig. 5b). Since signalling entropy is increased in cancer and correlates with stemness, it could, in principle, be used to identify putative cancer stem cells (CSC) or drug resistant cells. To test this, we first computed and compared signalling entropy values for 38 acute myeloid leukaemia (AML) bulk samples from 19 AML patients, consisting of 19 diagnostic/relapse pairs 43 . Confirming that signalling entropy marks drug resistant cell populations, we observed a higher entropy in the relapsed samples (paired Wilcox test P ¼ 0.004, Fig. 5c). For one relapsed sample, scRNA-Seq for 96 single-AML cells was available (AML set, Supplementary  Table 1). We posited that comparing the signalling entropy values of these 96 cells would allow us to identify a CSC-like subset responsible for relapse. Since in AML there are well accepted CSC markers (CD34, CD96), we tested whether expression of these markers in high entropy AML single cells is higher than in low entropy AML single cells (Fig. 5d). Both CD34 and CD96 were more highly expressed in the high entropy AML single cells (Wilcox test P ¼ 0.008 and 0.032, respectively, Fig. 5d   progression under treatment with enzalutamide (an androgen receptor (AR) inhibitor) (n ¼ 36 CTCs), with the other six patients not having received treatment (n ¼ 37 CTCs) 44 . Although of marginal significance, signalling entropy was higher in the CTCs from patients exhibiting resistance (Wilcox test P ¼ 0.047, Fig. 5e). Among putative prostate CSC markers (for example, CD44, CD133, KLF4 and ALDH7A1) 44 , we observed a positive association of signalling entropy with ALDH7A1 expression, suggesting that ADLH7A1 (and not other markers such as CD44) may mark specific prostate CSCs which are resistant to enzalutamide treatment (Fig. 5f).
Regulation of single-cell expression heterogeneity. It has been proposed that expression heterogeneity of cell populations is regulated in the sense that the transcriptomes of individual cells within the population differ in a manner which optimizes an objective function, such as pluripotency or homeostasis 3 . To test whether signalling entropy can predict such regulated expression heterogeneity, we compared the distribution of single-cell entropies to the signalling entropy of the bulk population. Specifically, we devised a 'measure of regulated heterogeneity' (MRH), which measures the likelihood that the signalling entropy of the cell population could have been observed from picking a single cell at random from that population (online Methods, Fig. 6a). We first estimated MRH for the data from Chu et al., for which matched bulk and scRNA-Seq data is available. We first note that although for bulk samples entropy differences between cell-types were smaller, that they were nevertheless consistent with the trends seen at the single-cell level ( Supplementary  Fig. 12, Fig. 2c). The MRH for each of the six cell-types (hESCs, NPCs, DEPs, TBs, HFFs and ECs) in Chu et al., revealed evidence of regulated heterogeneity, with the entropy values of bulk samples being significantly higher than that of single cells (Fig. 6b). As a negative control, the signalling entropy of the average expression over bulk samples did not exhibit regulated heterogeneity (normal deviation test P ¼ 0.30, Fig. 6b), as required since bulk samples are not linked in space or time and represent non-interacting cell populations.
We note that for the previous analysis, matched bulk RNA-Seq data is not absolutely required since bulk samples can be approximated by averaging the expression profiles of individual cells in the population. We verified this, although, as expected, the entropy values for the true bulk samples were always marginally higher, in line with the fact that single-cell assays only capture a subpopulation of the bulk sample (Fig. 6c). We also verified that MRH results were not driven by the larger number of dropouts in scRNA-Seq data. Specifically, we simulated bulk samples by aggregating single cells representing the same cell-type and then resampling transcript counts matching to the average number of transcripts seen in single cells (Methods). We observed that signalling entropy of the simulated bulk did not alter appreciably upon downsampling and that results were unchanged ( Supplementary Fig. 13).
Next, we repeated the MRH analysis for T-cells and B-cells found in melanomas (Melanoma set, Supplementary Table 1), for which sufficient numbers of single cells had been profiled. In all cases, signalling entropies of the bulk were much higher than expected based on the distribution of single-cell entropies ( Supplementary Fig. 14). Evidence for regulated expression heterogeneity was also seen among the melanoma cancer cells from each of 12 patients (combined Fisher test Po1e À 6, Supplementary Fig. 15). We also analysed RNA-Seq data for 96 single cancer cells from a relapsed patient with acute myeloid leukaemia(AML set 43 , Supplementary Table 1). The signalling entropy for the AML cell population was 0.88, significantly larger than the maximal value over the 96 cells (SR ¼ 0.82, Normal deviation test Po0.001, Fig. 6d). Again, as a negative control we analysed all 19 bulk AML samples at relapse and diagnosis, treating bulk samples from independent AML patients as if they were single cells from a common population. Estimating the signalling entropy of the average expression profile over all 19 bulk samples did not reveal a value significantly higher than that of the individual bulk samples (normal deviation test P ¼ 0.32, Fig. 6d).

Discussion
Although Waddington proposed his famous epigenetic landscape of cellular differentiation many decades ago 10 , it has proved challenging to construct a robust molecular correlate of a cell's elevation in this landscape. Here we have made significant progress, demonstrating that the differentiation potency and phenotypic plasticity of single cells, be they normal or malignant, can be estimated in silico from their RNA-Seq profile using signalling entropy. As we have seen, signalling entropy can accurately discriminate pluripotent from multipotent and differentiated cells, without the need for feature selection or training, outperforming a pluripotency gene expression signature and providing a more general measure of differentiation potency.
Importantly, signalling entropy should not be confused with other transcriptional entropy measures, which are estimated over populations of single cells 45,46 . For instance, the 'transcriptional entropy' of Richard et al. 45 is estimated for single genes across single cells, and therefore reflects the amount of intercellular heterogeneity in the expression of a given gene. Our signalling entropy measure is estimated for a single cell across genes in the context of a large gene network, which therefore incorporates systems-level information and is genome-wide (Fig. 1a,b). While the signalling entropy of single cells will influence the amount of transcriptional heterogeneity and entropy as defined by Richard et al., the precise relation between the two entropies is non-trivial. Indeed, we have here shown how we can assign single cells into potency states, from which a SI over the whole cell population (that is, using the distribution of potency states over single cells) can then be estimated (Fig. 1c). This SI is more analogous to the transcriptional entropy of Richard et al. Indeed, we have shown how this SI is higher in a population of NPCs than in a population of hESCs (Fig. 3c). Thus, the SI has nothing to do with potency as such, that is, it does not measure the average differentiation potency of single cells in a cell population. In contrast, our signalling entropy does measure potency of single cells in a cell population. Thus, there is no requirement for our single-cell signalling entropy measure to exhibit a peak before a critical cell-fate transition occurs 45,46 . In contrast, the SI of a cell population derived from signalling entropy may exhibit the expected hallmarks of criticality. It will be interesting in future to test this with upcoming high-resolution time course and genomewide scRNA-Seq data.
The ability of signalling entropy to independently order single cells according to differentiation potency is a central component of the SCENT algorithm, which, as shown here, can help quantify and identify biologically relevant intercellular expression heterogeneity and cell subpopulations. Indeed, key findings which strongly support the validity of SCENT are the following: (i) using SCENT we were able to correctly predict that a hESC population contains a small fraction of cells of lower potency which are primed for differentiation, (ii) SCENT inferred that an assayed NPC population was made up two distinct subsets, correctly predicting that only the lower potency subset represents bona fide NPCs (as determined by expression of known neural stem cell markers) and (iii) in a time-course differentiation experiment of human myoblasts, SCENT correctly identified a contaminating interstitial mesenchymal cell population, whose potency did not change appreciably during the experiment. We note that this particular insight is not readily obtainable using other algorithms such as Monocle or MPath 14,15 . Thus, the ability of SCENT to assign single cells and cell subpopulations to specific potency states thus adds novel insight and functionality over what can be achieved with other existing algorithms. Alternatively, signalling entropy could be combined with existing algorithms like Monocle 14 or DPT 17,47 to empower their inference, since signalling entropy provides a more unbiased, independent, approach to ordering single cells in pseudo-time, that is, it constitutes an approach which does not need prior knowledge such as the time point or markers of specific cell-types.
In a proof of principle analysis, we further demonstrated the ability of SCENT to identify putative drug resistant CSCs, encompassing two different cancer types (AML and prostate cancer), including CTCs. The ability to quantify stemness in cancer cell populations, either in tissue or in circulation, is a task of enormous importance. As shown here, as well as in our previous work on bulk cancer tissue 9,11,13 , signalling entropy is, so far, the only single sample measure to have been conclusively demonstrated to robustly correlate with stemness in both normal and cancer cells. Indeed, a recent study by Gruen et al. 18 explored a very different measure of transcriptome entropy, but which was not demonstrated to correlate well with differentiation potency or cancer. Likewise, signalling entropy is a more general measure of stemness/plasticity outperforming existing pluripotency expression signatures, as shown here and previously 11 .
Importantly, signalling entropy also provides a computational framework in which to understand differentiation potency at the macroscopic (cell population) level from the corresponding potencies of single cells. As shown here, signalling entropy of cell populations, be they normal or malignant cells, exhibit synergy, with the entropy of the bulk being substantially higher  than the entropy values of single cells. While no existing assay can measure all single cells in a population, we nevertheless demonstrated that our result is non-trivial, since mixing up bulk samples (to serve as a negative control) did not reveal such synergy. We also showed that these results were not confounded by the larger number of dropouts in scRNA-Seq data. Biologically, increased potency of a cell population as a result of synergistic cell-cell interactions, supports the view that features such as pluripotency are best understood at the cellular population level 3 . Finally, it is important to discuss the technical and biological properties of signalling entropy that underlie its robustness as a measure of differentiation potency. First of all, gene expression values enter the computation of signalling entropy only as gene ratios. Taking ratios of gene expression values and introducing a regularization term to offset dropouts, makes the resulting inference much less sensitive to the sequencing depth, absolute scale and normalization procedure of scRNA-Seq data. Second, signalling entropy is estimated over a fairly large number of genes (8,000-10,000), making it naturally robust to single gene dropouts. Third, its biological robustness stems in part from differentiation potency being encoded by a subtle positive correlation between the transcriptome and connectome, similar to our previous observations in the context of cancer 12 . Since there is no reason to expect that technical dropouts in scRNA-Seq should correlate with the connectivity of the corresponding protein in a PPI network, such technical effects are expected to average out. Finally, it is worth emphasizing in this context that signalling entropy provided a more accurate and robust measure of differentiation potency than other transcriptomic entropy-based measures (those used in StemID and SLICE) which do not use network information.
To conclude, signalling entropy and the SCENT algorithm provide a computational framework to advance our understanding of single-cell biology. We envisage that SCENT will be of great value for quantifying biologically relevant intercellular heterogeneity and for identifying putative normal and cancer stem-cells from scRNA-Seq data.

Methods
Single cell and bulk RNA-Seq data sets. The main data sets analysed here, the NGS platform used and their public accession numbers are listed in Supplementary  Table 1 CTC-PrCa set. This scRNA-Seq data set derives from Miyamoto et al. 44 We focused on a subset of 73 single cells from castration resistant prostate cancers, of which 36 derived from patients who developed resistance to enzulatamide treatment, with the remaining 37 derived from treatment-naïve patients.
Treutlein set. This scRNA-Seq data set derives from Treutlein et al. 29 There are a total of 201 single cells assayed at four different stages in the developing mouse epithelium, including embryonic day 14, 16, 18 and adulthood. At E18, a subset of single cells were characterized into alveolar type-1 and type-2 cells (AT1 & AT2), as well as a putative bipotent (BP) subgroup.
The single-cell entropy algorithm. There are five steps to the SCENT algorithm: (1) estimation of the differentiation potency of single cells via computation of signalling entropy, (2) inference of the potency state distribution across the single-cell population, (3) quantification of the intercellular heterogeneity of potency states, (4) inference of single-cell landmarks, representing the major potency-coexpression clusters of single cells and (5) lineage trajectory (or dependency network) reconstruction between landmarks. We now describe each of these steps: Computation of signalling entropy. The computation of signalling entropy for a given sample proceeds using the same prescription as used in our previous publications 9,11 . Briefly, the normalized genome-wide gene expression profile of a sample (this can be a single cell or a bulk sample) is used to assign weights to the edges of a highly curated PPI network. The construction of the PPI network itself is described in detail elsewhere 11 , and is obtained by integrating various interaction databases which form part of Pathway Commons (www.pathwaycommons.org) 48 . The weighting of the network via the transcriptomic profile of the sample provides the biological context. The weight of an edge between protein i and protein j, denoted by w ij , is assumed to be proportional to the normalized expression levels of the coding genes in the sample, that is, we assume that w ij Bx i x j . We interpret these weights (if normalized) as interaction probabilities. The above construction of the weights is based on the assumption that in a sample with high expression of i and j, that the two proteins are more likely to interact than in a sample with low expression of i and/or j. Viewing the edges generally as signalling interactions, we can thus define a random walk on the network, assuming we normalize the weights so that the sum of outgoing weights of a given node i is 1. This results in a stochastic matrix, P, over the network, with entries where N(i) denotes the neighbours of protein i, and where A is the adjacency matrix of the PPI network (A ij ¼ 1 if i and j are connected, 0 otherwise, and with A ii ¼ 0). The signalling entropy is then defined as the entropy rate (denoted Sr) over the weighted network, that is, where p is the invariant measure, satisfying pP ¼ p and the normalization constraint p T 1 ¼ 1. The invariant measure, also known as steady-state probability, represents the relative probability of finding the random walker at a given node in the network (under steady-state conditions that is, long after the walk is initiated). Nodes with high values thus represent nodes that are particularly influential in distributing signalling flux in the network. In the steady-state we can assume detailed balance (conservation of signalling flux, that is, p i p ij ¼ p j p ji ), and it can be shown 9 that . Given a fixed adjacency matrix A (that is, fixing the topology), it can also be shown 9 that the maximum possible Sr among all compatible stochastic matrices P, is the one with P ¼ 1 g v À 1 A v where # denotes product of matrix entries and where v is the dominant eigenvector of A, that is, Av ¼ lv with l the largest eigenvalue of A. We denote this maximum entropy rate by maxSr, and define the normalized entropy rate (with range of values between 0 and 1) as Throughout this work, we always display this normalized entropy rate. Inference of potency states. In this work, we show that signalling entropy (that is, the entropy rate SR) provides a proxy to the differentiation potential of single cells. We can model a cell population as a statistical mechanical model, in which each single cell has access to a number of different potency states. For a large collection of single cells we can estimate their signalling entropies, and infer from this distribution of signalling entropies the number of underlying potency states using a mixture modelling framework. Since SR is bounded between 0 and 1, we first conveniently transform the SR value of each single cell into their logit scale, that is, y(SR) ¼ log 2 (SR/(1 À SR)). Subsequently, we fit a mixture of Gaussians to the y(SR) values of the whole cell population, and use the Bayesian information criterion (as implemented in the mclust R-package) 49 to estimate the optimal number K of potency states, as well as the state-membership probabilities of each individual cell. Thus, for each single cell, this results in its assignment to a specific potency state.
Quantifying intercellular heterogeneity of potency states. For a population of N cells, we can then define a probability distribution p k over the inferred potency states. For K inferred potency states, one can then define a normalized SI: which measures the amount of heterogeneity in potency within the single-cell population (1 ¼ high heterogeneity in potency, 0 ¼ no heterogeneity in potency).
Inference of co-expression clusters and landmarks. With each cell assigned to a potency state, we next perform clustering (using the scRNA-seq profiles) of the single cells. We use the partitioning-around-medoids (PAM) algorithm with the average silhouette width to estimate the optimal number of clusters, a combination which was found to be among the most optimal clustering algorithms in applications to omic data 50 . Clustering of the cells is performed over a filtered set of genes that are identified as those driving most variation in the complete data set, as assessed using singular value decomposition (SVD). In detail, we perform a SVD on the full z-scored normalized RNA-Seq profiles of the cells, selecting the significant components using random matrix theory (RMT) 51 and picking the top 5% genes with largest absolute weights in each significant component. The final set of genes is obtained by the union of those identified from each significant component. PAM clustering (with a Pearson distance correlation metric) of all cells results in the assignment of each cell into a co-expression cluster, with a total number of n p cell clusters. Thus, each cell is assigned to a unique potency state and co-expression cluster. Finally, landmarks are identified by selecting potency-state cluster combinations containing at least 1-5% of all single cells. Importantly, each of these landmarks has a specific potency state and mean signalling entropy value, allowing ordering of these landmarks according to potency.
Inference of lineage trajectories. For each landmark in step-4, we compute centroids of gene expression using only cells that are contained within that landmark and defined only over the genes used in the PAM clustering. Partial correlations 52,53 between the centroid landmarks are then estimated to infer trajectories/dependencies between landmarks. Significant positive partial correlations may indicate transitions between landmarks. Since each landmark has a signalling entropy value associated with it, directionality is inferred by comparing their respective potency states.
A fast Pearson correlation approximation. Under certain assumptions (to be discussed below), there is a useful approximation to signalling entropy, which also provides important biological insight. It entails first using an approximation for the steady-state probability (invariant measure) p. As before, in the steady-state, we can assume the detailed balance condition (conservation of signalling flux: that is, p i p ij ¼ p j p ji ), so that the invariant measure satisfies p i Bx i (Ax) i (ref. 9). If we now take a global mean field approximation, that is, if we replace the expression values of the neighbours of gene i, with the mean expression value over all genes in the network, it then follows that p i Bx i k i , where k i is the connectivity of gene/protein i in the network. Hence, SR ¼ P i p i S i $ P i x i k i S i , which is effectively the three-way correlation between the transcriptome, connectome and local signalling entropies. If we assume further that the dynamic range of local signalling entropies S i ¼ À P j2NðiÞ p ij log p ij is small (which for realistic PPI networks is often the case 12 ), and also assuming that the local entropies correlate positively with node-degree, we obtain that SRB x i k i , that is, the signalling entropy is approximately the Pearson correlation of the celĺs transcriptome and the connectome from the PPI network.
Importantly, we stress that (i) this approximation is an empirical one which works reasonably well for the realistic PPI networks considered here, and (ii) that the signalling entropy and its Pearson correlation approximation are not equivalent, since there exist networks where the two measures give widely different answers. In particular, if a network has scale-free topology, but with the hubs not connected to each other, then a positive correlation between expression and connectivity may not lead to a higher signalling entropy. For instance, if the low-degree nodes ('bottlenecks') linking the hubs have very low expression then signalling flux cannot be distributed over the network, leading to a lower entropy rate compared to an expression configuration where all genes have similar expression values (Supplementary Fig. 7). For realistic PPI networks, hubs are generally connected to each other and for these type of networks, the Pearson approximation works well. We note that for a 8,393 node network with 300,916 edges, the computation of SR for 100 samples takes B370 s on an Intel Xeon CPU E3-1575M 3.00 GHz, whereas that of its Pearson correlation approximation only takes 1/10 s, thus although the approximation is computationally much faster, the computation of SR for 1 sample only takes about 4 s.
Ranking genes according to differential local entropy. Since signalling entropy is obtained as a weighted average over local signalling entropies (that is, SR ¼ P i p i S i ) with the local entropies defined by S i ¼ À P j2NðiÞ p ij log p ij , the latter can be used to identify genes in the network where the signalling flux distribution differs between two phenotypes. Specifically, we use the normalized version of the local signalling entropy, defined by NS i ¼ À 1 logki P j2NðiÞ p ij log p ij , which is bounded between 0 and 1, thus allowing genes of different connectivity to be compared. Thus, for each gene and each sample, we can compute a local entropy and genes can then be ranked according to the difference in local entropy using an empirical Bayes framework 11,54 to derive moderated t-statistics which reflect the significance in differential local entropy. Adjustment for multiple-testing was performed using the Benjamini-Hochberg procedure.
Gene set enrichment analysis. We performed GSEA on the top-ranked genes, ranked according to differential local entropy between pluripotent and non-pluripotent cells. Specifically, we focused on the genes exhibiting increased local signalling entropy in pluripotent cells, and focused on a range of thresholds (top 500, 600, 700, 800, 900 and 1,000) to assess robustness. Enrichment was performed using a one-tailed Fisher's exact test, as implemented by us previously 55 . Enrichment was assessed against the Molecular Signatures Database (http:// software.broadinstitute.org/gsea/msigdb) 34 .
Application to mouse scRNA-Seq data. In our application to mouse scRNA-Seq data, we first converted mouse gene Ensembl IDs into their human homologues using the AnnotationTools Bioconductor package 56 . Only those mapping to a unique human homologue were considered. The resulting set of genes were then integrated with our human PPI network.
Estimation of cell-cycle and TPSC pluripotency scores. To identify single cells in either the G1-S or G2-M phases of the cell-cycle we followed the procedure described in ref. 23. Briefly, genes whose expression is reflective of G1-S or G2-M phase were obtained from refs 32,33. A given normalized scRNA-Seq data matrix is then z-score normalized for all genes present in these signatures. Finally, a cycling score for each phase and each cell is obtained as the average z-scores over all genes present in each signature.
To obtain an independent estimate of pluripotency we used the pluripotency gene expression signature of Palmer et al. 22 , which we have used extensively before 11 . This signature consists of 118 genes that are overexpressed and 39 genes that are underexpressed in pluripotent cells. The TPSC score for each cell with scRNA-Seq data is obtained as the t-statistic of the gene expression levels between the overexpressed and underexpressed gene categories. Optionally, the scRNA-Seq is z-score normalized beforehand and the t-statistic is obtained by comparing expression z-scores. However, we note that the z-score procedure uses information from all single cells, so the fairest comparison to signalling entropy means we ought to compare expression levels. We note that the TPSC scores obtained from z-scores or expression levels were highly correlated and did not affect any of the conclusions in this paper.
Comparison analysis of bulk and single-cell RNA-Seq data. Since SR can be computed for each single cell, one can compare the predicted entropies of bulk samples (cell population) to those of the single cells making up that population. To test whether the entropy of the bulk deviates markedly from that of single cells, we computed a z-score, by comparing the entropy of the bulk to that of the single cells where the latter distribution is modelled as a Gaussian. This z-score is called MRH, since it assesses whether the transcriptomes of single cells differ in a regulated synergistic manner, increasing entropy (potency) well above that of single cells. In the case where matched bulk samples were not available, we simulated bulk samples in two distinct ways. In one approach, we simply averaged the single-cell transcriptomes before computing SR. In a second approach, which corrects for the large number of dropouts present in scRNA-Seq data, by first aggregate the transcript counts of all single cells, and then downsample counts so as to match to the average number of transcripts per single-cell. Robustness to the specific downsampling draw was tested by performing 100 Monte-Carlo samplings.
Other entropy measure proxies for differentiation potency. Briefly, we describe two other entropy-based measures for approximating differentiation potency in a single-cell context, but which do not make use of a PPI network. One measure is part of the StemID algorithm 18 . However, the original StemID algorithm does not estimate differentiation potency of single cells. Instead it provides estimates for single-cell clusters, which are inferred by clustering the expression profiles of single cells. Thus, for a given cluster k, StemID computes a potency which is proportional to dE k , where dE k median c2k E c ð ÞÀmin l median c2l E c ð Þ ð Þ ; where E c is the information entropy of cell c, defined by E c ¼ À P N g¼1 q gc log q gc (where N is the number of genes and where q gc is the normalized number of reads mapping to gene g in cell c). Thus, to objectively compare to our signalling entropy measure, which does not use information of other cells when estimating potency of a given cell, we here use E c as the potency estimate from StemID. Another information entropy-based measure is part of the SLICE algorithm, proposed by Guo et al. 31 Briefly, in this approach, genes are first clustered into related GO-terms to define m functional gene clusters. For a given cell c, relative NATURE COMMUNICATIONS | DOI: 10.1038/ncomms15599 ARTICLE NATURE COMMUNICATIONS | 8:15599 | DOI: 10.1038/ncomms15599 | www.nature.com/naturecommunications activity of each functional cluster k is estimated from the average expression of genes mapping to that cluster. These activity scores are then normalized so that they can be interpreted as probabilities q kc , and subsequently the potency of cell c is estimated as the information entropy H c ¼E B ½ À P m k¼1 q kc log q kc where the expectation is taken over a number of bootstraps over genes. We compute this information entropy using the R-script provided in Guo et al. 31