## Introduction

Single-cell RNA-sequencing (scRNA-Seq) studies are revolutionizing our understanding of cellular development, helping us elucidate the hierarchical organization of cell-types within complex tissues and how this organization may be altered in diseases like cancer1,2,3,4,5,6,7,8,9,10,11,12,13,14. An outstanding challenge is how best to identify progenitor or stem-like cells within the large single-cell populations. This task is particularly important for understanding oncogenesis, since the prevailing view is that it is the adult progenitor/stem-like cells that give rise to cancer15,16,17,18. Specifically, it is believed that inherited molecular alterations, as well as somatic ones that accrue in these cells as a function of age and exposure to risk factors, may eventually predispose these cells to oncogenic transformation.

So far, the most common approach to identifying progenitor/stem-like states in scRNA-Seq data has been to use prior knowledge of specific progenitor or stemness markers, which may, however, introduce bias19,20,21. In certain circumstances, this bias can be substantial, especially if knowledge of suitable markers is not available or is at best controversial, as is the case for the mammary epithelium22,23. Moreover, the high technical dropout rate of scRNA-Seq data means that reliance on well-established markers may not be possible24. In this regard, it is worth emphasizing that lineage-trajectory inference algorithms4,14,25,26,27, including recent state-of-the-art ones such as Monocle-328,29, still require specification of a “root-state”, in order to give the trajectories a “temporal” direction, or to define differentiation potency gradients. In the absence of temporal data, the specification of this root state may rely on existing biological knowledge and is therefore equally subject to bias. Alternatively, the high dropout rate of scRNA-Seq data may preclude the use of traditional stemness markers to assign this root-state. Another related and key problem is that cell-types are typically inferred as clusters of relatively high cell density in a two-dimensional reduced space, a procedure which does not necessarily allow for the identification of cellular states19. Cellular states such as cell-cycle phase or differentiation potency represent additional dimensions of variation, which are generally not well captured or observed by single-cell dimensional reduction and clustering methods. For instance, a single-cell cluster may typically include cells from different cell-cycle stages. Likewise, identifying novel progenitor or stem-like states within a cell-type may not be possible using two-dimensional clustering alone, since potency/stemness may be defined by additional latent dimensions.

Here, we show that these outstanding challenges can be overcome with a marker-free systems biology approach, called LandSCENT (Landscape of Single Cell Entropy), which builds upon our SCENT framework30 to assign each cell, not only to a specific cell-type, but also to a specific potency/entropy state. We stress that the assignment of cells to potency states is achieved without the need for prior knowledge or assumptions, using a potency model that has been extensively validated across many independent scRNA-Seq and bulk RNA-Seq data sets, irrespective of cell lineage, technology, or species30,31. LandSCENT combines the inferred cell-types and potency states into a multilayered single-cell landscape, where cell-states are defined by clusters of single cells within a potency state. This places cells into specific cellular states, allowing novel cellular phenotypes to be identified, for instance novel progenitor or stem-like states within complex epithelial tissues. Importantly, this also allows a natural and unbiased assignment of a root-state, as the one of highest potency, from which lineage trajectories and bifurcation patterns can be subsequently learned using appropriate algorithms such as Diffusion Maps27,32,33. We illustrate LandSCENT in the context of the breast epithelium, constructing a combined cell-type and potency landscape at the single-cell level, which, in conjunction with diffusion maps, predicts a novel bipotent progenitor or stem-like cell-state. We provide extensive validation of the bipotent stem-like nature of this state in many orthogonal bulk expression data sets, as well as in scRNA-Seq assays from two different technologies, encompassing altogether data from six different women. We point out that these results could not have been obtained with competing state-of-the-art clustering or lineage-trajectory inference methods, which highlights the importance of the LandSCENT/SCENT paradigm.

## Results

### Rationale for a marker-free approach to identify stem-like cells

We reanalyzed scRNA-Seq data from a previous study that used the 10X Genomics Chromium assay to profile over 25,000 mammary epithelial cells from four nulliparous healthy women34. We note that due to the high dropout rate of the 10X data, this study had not been able to use the 10X data to confidently identify a stem-like state34. We verified that the median dropout rate per cell was over 90% for each of the four women, affecting some of the proposed stemness markers like ALDH1A1, ZEB1, and TCF434,35 (Supplementary Fig. 1A, B). For instance, for ZEB1 and TCF4, the two stemness markers proposed by Nguyen et al.34, the number of cells with a read count larger than 2 in each of the four women was only 1, 0, 0, and 0 for ZEB1 and only 1, 2, 0, and 1 for TCF4, despite thousands of cells having been measured in each woman. Thus, in the absence of stemness marker expression, and to avoid potential biases associated with picking ab initio other markers like CD44 or ITGA6, we decided to apply our marker-free single-cell signaling entropy (SCENT)30,31 model, which provides robust estimates of cell potency14,36,37. We posited that exploring the distribution of inferred potency values across single-cell clusters may help to identify novel cell-states, including a putative bipotent progenitor or stem-like state. LandSCENT achieves this by combining maps of cell potency and single-cell clusters within a novel “cell-density” visualization framework, which could naturally reveal novel single-cell states (Fig. 1a, the “Methods” section). Importantly, the estimation of cell potency for each single cell allows potency gradients to be naturally inferred, therefore allowing unbiased assignment of “root-states” (i.e., states of highest potency), which can be subsequently used as input for lineage-trajectory inference algorithms (Fig. 1b, the “Methods” section).
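
The two dropout summaries quoted above (the median per-cell dropout rate and the number of cells in which a marker exceeds a read-count cutoff) can be illustrated with a minimal sketch. The toy count matrix, its values, and the function names are hypothetical and serve only to show the calculation; this is not the code used for the analysis.

```python
from statistics import median

def dropout_rate_per_cell(counts):
    """Fraction of genes with zero reads in each cell.

    counts: dict mapping gene name -> list of read counts, one entry per cell.
    """
    genes = list(counts)
    n_cells = len(counts[genes[0]])
    rates = []
    for cell in range(n_cells):
        zeros = sum(1 for gene in genes if counts[gene][cell] == 0)
        rates.append(zeros / len(genes))
    return rates

def cells_above(counts, gene, cutoff=2):
    """Number of cells whose read count for `gene` is strictly larger than `cutoff`."""
    return sum(1 for x in counts[gene] if x > cutoff)

# Hypothetical toy matrix: 4 genes x 5 cells.
toy = {
    "ZEB1":  [0, 0, 3, 0, 0],
    "TCF4":  [0, 1, 0, 0, 0],
    "KRT14": [5, 0, 0, 0, 2],
    "KRT18": [0, 0, 0, 7, 0],
}
rates = dropout_rate_per_cell(toy)
print(median(rates))             # median dropout rate across cells
print(cells_above(toy, "ZEB1"))  # cells with a ZEB1 read count > 2
```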

### LandSCENT predicts a high-potency state enriched in basal cells

We observed that only for one of the four women (denoted “Ind-4”) did the top principal component of variation correlate with expression of basal and luminal markers (Supplementary Fig. 2). For the other three women, the top PC correlated with total read count and coverage, accounting for twice as much variance as lower ranked biological components (Supplementary Fig. 2), suggesting that these scRNA-Seq assays were not particularly successful. Thus, we decided to apply LandSCENT to the 3473 single epithelial cells that survived quality control from Ind-4. Performing t-SNE38 followed by density-based spatial clustering39 revealed three main single-cell clusters (Fig. 2a, the “Methods” section), in line with previous observations34, and consistent with known biology: one cluster expressed high levels of KRT14, a well-known basal marker, whereas the other two expressed KRT18, a well-known luminal marker (Fig. 2b). Consistent with the report of Nguyen et al.34, the two luminal clusters were distinguished by expression of lactotransferrin (LTF) and luminal differentiation markers (GATA3/FOXA1), as well as hormone receptors (ESR1/PGR) (Fig. 2b), suggesting that the higher LTF-expressing cluster represents a more immature (alveolar-like) luminal phenotype. Next, we estimated the differentiation potency of each single cell using our Signaling Entropy Rate (SR) measure (“Methods” section), which revealed the existence of three main potency states (Fig. 2c, the “Methods” section). Of note, using known luminal and basal differentiation markers, we were able to validate potency-state assignments within the basal and luminal clusters separately (Supplementary Notes). We observed that the highest potency state represented a minority population, with only 169 single cells (approximately 5%) falling into this putative progenitor or stem-like state (Fig. 2c).
To explore the biological characteristics of this state, we assessed the distribution of potency states across the three main single-cell clusters, as well as across those cells not assigned to any cluster (“peripheral cells”) (Fig. 2d, e). Cells in the high potency state were found primarily within the basal compartment, but also mapped preferentially to the common peripheral area between the basal and immature luminal clusters, and were therefore also relatively overrepresented among peripheral cells (Fig. 2d, e).
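
For readers unfamiliar with the Signaling Entropy Rate used above, the following is a minimal sketch of how such a quantity can be computed, based on our reading of the published SCENT formulation: a random walk on a gene interaction network is weighted by the expression of each gene's neighbours, and its entropy rate is normalized by the maximum attainable on that network. The Methods section remains the authoritative definition. The toy network, the expression vectors, and the function name are hypothetical, and the sketch assumes a connected, symmetric network of at least three nodes with strictly positive expression values.

```python
import math

def signaling_entropy_rate(adj, expr, n_iter=200):
    """Normalized entropy rate of an expression-weighted random walk on a network.

    adj  : symmetric 0/1 adjacency matrix (list of lists), connected, no self-loops
    expr : strictly positive expression value per node (gene)
    """
    n = len(expr)
    # (A x)_i: total expression over the neighbours of node i
    ax = [sum(adj[i][j] * expr[j] for j in range(n)) for i in range(n)]
    z = sum(expr[i] * ax[i] for i in range(n))

    sr = 0.0
    for i in range(n):
        pi_i = expr[i] * ax[i] / z   # stationary probability of node i
        s_i = 0.0                    # local entropy of the walk leaving node i
        for j in range(n):
            if adj[i][j]:
                p_ij = adj[i][j] * expr[j] / ax[i]
                s_i -= p_ij * math.log(p_ij)
        sr += pi_i * s_i

    # The maximum entropy rate is log(lambda), with lambda the largest
    # eigenvalue of A. Power iteration is run on A + I, because the shift
    # avoids oscillation on bipartite networks; lambda = shifted - 1.
    v = [1.0] * n
    shifted = 1.0
    for _ in range(n_iter):
        w = [v[i] + sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        shifted = max(w)
        v = [x / shifted for x in w]
    return sr / math.log(shifted - 1.0)

# Hypothetical toy network: four genes, all pairwise connected.
k4 = [[0, 1, 1, 1], [1, 0, 1, 1], [1, 1, 0, 1], [1, 1, 1, 0]]
uniform = signaling_entropy_rate(k4, [1, 1, 1, 1])   # promiscuous signaling, close to 1
skewed = signaling_entropy_rate(k4, [10, 1, 1, 1])   # one dominant gene lowers the value
print(uniform, skewed)
```

In this picture, a cell whose expression is spread evenly over the network scores close to the maximum, whereas a cell dominated by a few highly expressed genes scores lower, which is the intuition behind using the measure as a proxy for potency.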

### LandSCENT diffusion map analysis predicts a bipotent state

To explore the high-potency state in more detail, we first used LandSCENT to create cell-density elevation maps for all cells, and separately also for all highly potent cells, within the two-dimensional t-SNE landscape, which confirmed that the maximum density of the highly potent cells defined a peak within the basal cluster, but with a ridge connecting it to another peak within the immature luminal (L1) cluster (Fig. 3a), suggestive of a bipotent cell population. In line with this, we observed that among all cells categorized into the high potency (PS3) state, those falling within this density peak also exhibited the highest levels of signaling entropy (i.e., cell potency) (Supplementary Fig. 3). To exclude the possibility that these putative bipotent cells may be doublets, we estimated doublet scores for all cells using a novel simulation approach40. In line with the expected doublet rate for 10X technology, this analysis revealed that 2% of assayed cells are potential doublets (Supplementary Fig. 4A). As expected, most of these mapped to the peripheral area between the major luminal and basal clusters, yet they clearly also did not substantially overlap with the most highly potent cells within the basal and luminal clusters (Supplementary Fig. 4B–D): in fact, 108 of the 169 highly potent cells (i.e., 64%) had zero doublet scores, and only 17 of the 169 highly potent cells, i.e., as few as 10%, attained high doublet scores (Supplementary Fig. 4C), clearly indicating that a substantial majority of the highly potent cells are not doublets. We verified that similar results were obtained had we used another method for estimating doublet scores (Supplementary Fig. 5, the “Methods” section).
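
The idea of a cell-density elevation map can be illustrated with a plain two-dimensional Gaussian kernel density estimate evaluated on a grid. This is a sketch under stated assumptions rather than the LandSCENT implementation: the kernel, the bandwidth, the grid, and the toy coordinates are all hypothetical choices made for illustration.

```python
import math

def density_map(points, grid_x, grid_y, bandwidth=1.0):
    """Gaussian kernel density of 2D points on a grid (rows follow grid_y)."""
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    surface = []
    for gy in grid_y:
        row = []
        for gx in grid_x:
            total = sum(
                math.exp(-((gx - px) ** 2 + (gy - py) ** 2) / (2.0 * bandwidth ** 2))
                for px, py in points
            )
            row.append(norm * total)
        surface.append(row)
    return surface

def density_peak(surface, grid_x, grid_y):
    """Grid coordinates (x, y) at which the density surface is highest."""
    best = max(
        (surface[r][c], grid_x[c], grid_y[r])
        for r in range(len(grid_y))
        for c in range(len(grid_x))
    )
    return best[1], best[2]

# Hypothetical embedding coordinates: a larger group near (0, 0), a smaller one near (4, 0).
cells = [(0.0, 0.0), (0.2, 0.1), (-0.1, -0.2), (0.1, 0.0), (4.0, 0.0), (4.1, 0.1)]
grid_x = [-1, 0, 1, 2, 3, 4, 5]
grid_y = [-1, 0, 1]
surface = density_map(cells, grid_x, grid_y)
print(density_peak(surface, grid_x, grid_y))
```

Computing one such surface for all cells and another for the highly potent cells only, on the same grid, allows the two landscapes to be compared directly.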

Although investigation of specific marker expression is difficult in this high dropout 10X data (Supplementary Fig. 1C), we nevertheless explored the variation in expression of proposed markers for bipotent, luminal-restricted progenitor, and myoepithelial-restricted progenitor cells41. Focusing on the highly potent cells, we first observed that although all these cells expressed EPCAM, those falling within the luminal clusters exhibited higher levels of EPCAM expression compared to those mapping to the basal compartment (Fig. 3b), consistent with the view that luminal-restricted progenitors express higher levels of EPCAM41. Next, we plotted the expression of MUC1 versus CD10 for all the highly potent cells, as EPCAMhi/MUC1+ and EPCAMlow/CD10+ cells have been proposed to be luminal-restricted and myoepithelial-restricted progenitors, respectively, whilst EPCAMlow/MUC1- cells are enriched for bipotent progenitors41. This scatterplot revealed three substates: an exclusively basal CD10+/MUC1- cluster (n = 18), a predominantly luminal CD10-/MUC1+ cluster (n = 44), and a larger double negative CD10-/MUC1- cluster (n = 85) (Fig. 3b). The CD10-/MUC1- cluster was made up of 38 basal cells, 11 Lum-1 cells, and 4 Lum-2 cells, in addition to 32 peripheral cells, i.e., cells mapping in-between the basal and immature luminal clusters. Thus, these data are highly consistent with the prevailing view that the CD10+ subpopulation correlates with a basal-restricted progenitor subtype, that the MUC1+ subpopulation associates with a luminal-restricted progenitor subtype, and that the MUC1-/CD10- cluster contains a bipotent subtype.
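
The partition of highly potent cells into CD10/MUC1 substates can be sketched as a simple two-marker gate. The positivity cutoff and the toy expression values are hypothetical, and in high-dropout data a zero count does not establish true absence of expression, so such a gate is only a coarse summary.

```python
from collections import Counter

def gate_cell(cd10, muc1, cutoff=0.0):
    """Assign one cell to a CD10/MUC1 substate; `cutoff` is a hypothetical threshold."""
    cd10_pos = cd10 > cutoff
    muc1_pos = muc1 > cutoff
    if cd10_pos and muc1_pos:
        return "CD10+/MUC1+"
    if cd10_pos:
        return "CD10+/MUC1-"
    if muc1_pos:
        return "CD10-/MUC1+"
    return "CD10-/MUC1-"

# Hypothetical (CD10, MUC1) expression values for five cells.
cells = [(2.1, 0.0), (0.0, 1.4), (0.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
substates = Counter(gate_cell(cd10, muc1) for cd10, muc1 in cells)
print(substates)
```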

In order to substantiate the above findings, we next applied Diffusion Maps, a powerful tool for inferring bifurcation points and lineage trajectories in scRNA-Seq data27,32. We observed that while diffusion components 1 and 2 correlated strongly with the three main clusters (basal, Lum-1, and Lum-2) (Fig. 3c), diffusion component 3 was highly correlated with our SR cell potency measure (Fig. 3d). We defined the root state as the cell of highest SR (i.e., potency). This cell mapped to the periphery of the basal cluster, and the resulting diffusion map naturally predicts a bifurcation from this root state into marginally lower but still high-potency basal and luminal states (Fig. 3d). Differentiated basal and luminal clusters emerge from these restricted progenitor states along their respective basal and luminal lineages, as required (Fig. 3d). Confirming this, diffusion pseudotime (DPT) analysis predicted two major terminal tip-points, one in the basal cluster and another in the mature luminal-2 state, with no direct transition between the basal and luminal-2 clusters (Fig. 3e), i.e., DPT analysis correctly predicts that the mature luminal-2 state is only reached after passing through the immature luminal-1 cluster, consistent with it containing the luminal progenitor population.

### Validation of the single-cell stem-like state

If the bipotent cell cluster identified by LandSCENT is stem-like, the expectation would be that these cells are transcriptionally similar to previously characterized mammary stem cells. To explore this, we performed differential expression analysis between high and low potent cells. The great majority of genes were downregulated in the more potent cells, with only 72 exhibiting overexpression (Bonferroni adjusted P < 0.05, Fig. 4a, Supplementary Table 1). Remarkably, performing rank-based GSEA42 on the 72 overexpressed genes revealed strong enrichment for genes upregulated in mammary stem-cells (Fig. 4b). In particular, we observed a relatively strong enrichment (12 gene overlap, OR = 39, BH-adjusted Fisher-test P < 1e−10) with a previously characterized mammary stem-cell signature43. Of note, among the 12 overlapping genes, 9 were ribosomal proteins (RPS2, RPS7, RPS10, RPL8, RPS18, RPS3, RPL10A) or ubiquitin ribosomal fusion proteins (UBA2 and FAU), consistent with recent findings that expression of ribosomal proteins may be a universal marker of stemness and potency30,44 (Supplementary Fig. 6). We stress that the higher mRNA expression levels of ribosomal genes with increased cell potency are also observed in bulk samples30,36, which rules out that the observed association is an artifact of single-cell analysis. Among the other three genes, we observed NACA, a protein that associates with the upregulated transcription factor BTF3, and TXN (thioredoxin), a protein involved in the response to intracellular nitric oxide.
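
An overlap enrichment of the kind reported above (an odds ratio with a Fisher-test P-value) can be sketched as follows. The gene-set sizes and the size of the gene universe in the example are hypothetical toy numbers, not those of the analysis; the function returns the sample odds ratio of the 2x2 table and a one-sided P-value from the hypergeometric distribution, before any multiple-testing adjustment.

```python
from math import comb

def fisher_enrichment(overlap, sig_size, set_size, universe):
    """Sample odds ratio and one-sided Fisher exact P for a gene-set overlap.

    overlap  : genes shared by the signature and the reference gene set
    sig_size : number of genes in the signature
    set_size : number of genes in the reference gene set
    universe : total number of genes considered
    """
    a = overlap                                   # in both
    b = sig_size - overlap                        # signature only
    c = set_size - overlap                        # reference set only
    d = universe - sig_size - set_size + overlap  # in neither
    odds_ratio = (a * d) / (b * c) if b * c > 0 else float("inf")
    # P(overlap >= observed) under the hypergeometric null
    tail = sum(
        comb(set_size, k) * comb(universe - set_size, sig_size - k)
        for k in range(overlap, min(sig_size, set_size) + 1)
    )
    return odds_ratio, tail / comb(universe, sig_size)

# Hypothetical toy example: 3 shared genes between two 5-gene sets in a 20-gene universe.
odds_ratio, p_value = fisher_enrichment(overlap=3, sig_size=5, set_size=5, universe=20)
print(odds_ratio, p_value)
```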

To confirm the results of the GSEA, we obtained and normalized mRNA expression data from Pece et al.43, consisting of FACS sorted pools representing quiescent mammary stem-cells and transit-amplifying progenitors, as derived from mammosphere-growing assays (“Methods” section). Confirming the association with stemness, the 12 overlapping genes exhibited increased expression in three separate pools of quiescent mammary stem-cells compared to their derived transit-amplifying progenitors (Fig. 4c, d, Wilcox test P = 0.001, the “Methods” section), a result which remained significant compared to randomly selected genes (Fig. 4d, Monte Carlo P = 0.0001). Results remained significant when we used all 72 genes (63 genes had representation on the Affymetrix platform used in Pece et al.43) from the upregulated stem-like signature (Fig. 4e, Supplementary Fig. 7). Although this validation uses data generated in vitro, and therefore ignores in vivo effects, the data nevertheless support the view that the cells deemed to be stem-like according to our LandSCENT algorithm are indeed related to mammary stemness. Of note, the identification of the stem-like state was not possible using other state-of-the-art lineage-trajectory inference algorithms such as Monocle-228 (Supplementary Notes).
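
A comparison against randomly selected genes of the kind mentioned above can be sketched as an empirical Monte Carlo P-value: the mean statistic of the signature genes is compared with the means of equally sized random gene sets. The per-gene statistic (for example, a difference in expression between stem-cell and progenitor pools), the number of draws, and the add-one convention in the estimate are assumptions of this sketch rather than details taken from the Methods.

```python
import random

def monte_carlo_p(score_by_gene, signature, n_draws=10000, seed=0):
    """Empirical P that a random gene set of equal size scores at least as high.

    score_by_gene : dict mapping gene -> per-gene statistic
    signature     : list of genes whose mean statistic is being tested
    """
    rng = random.Random(seed)
    genes = list(score_by_gene)
    k = len(signature)
    observed = sum(score_by_gene[g] for g in signature) / k
    hits = 0
    for _ in range(n_draws):
        draw = rng.sample(genes, k)
        if sum(score_by_gene[g] for g in draw) / k >= observed:
            hits += 1
    # Add-one estimate, so that the reported P is never exactly zero.
    return (hits + 1) / (n_draws + 1)

# Hypothetical toy data: 50 genes whose statistic equals their index.
scores = {f"gene{i}": float(i) for i in range(50)}
p_top = monte_carlo_p(scores, ["gene47", "gene48", "gene49"], n_draws=2000)
p_bottom = monte_carlo_p(scores, ["gene0", "gene1", "gene2"], n_draws=2000)
print(p_top, p_bottom)
```

With this estimator the smallest attainable P-value is 1/(n_draws + 1), so the number of randomizations bounds the resolution of the reported significance.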

### Validation in independent 10X and Fluidigm C1 data

While the quality of the 10X scRNA-Seq assay from the other three women is questionable (Supplementary Fig. 2), we nevertheless aimed to further validate the single-cell stem-like transcriptomic signature in these data. We reasoned that the average expression of the identified 72 upregulated genes should be a stemness marker in the 10X data from these three women. Confirming this, for each woman we observed significantly increased expression of these 72 genes in the single cells deemed to be of highest potency according to our highly validated SR measure (Fig. 5a, Wilcox test P < 1e−30).

As a further validation, we would expect the identified stem-like cells to preferentially overexpress previously characterized stemness markers. Despite the high dropout rate of the 10X data (Supplementary Fig. 1), we nevertheless first assessed correlations between the 72 upregulated genes and a panel of 6 stemness markers (ALDH1A1, ALDH1A3, CD44, ITGA6, ZEB1, and TCF4)34,35 in the 10X data, finding small but significant positive correlations for ALDH1A3, CD44, and ITGA6 (Supplementary Fig. 8, Fisher Z, P < 1e−5). We further tested for correlations between our upregulated signature genes and expression of the stemness markers in three independent higher-coverage scRNA-Seq datasets from the mammary epithelium generated with the Fluidigm C1 platform34 (“Methods” section). We observed a statistically significant correlation with ALDH1A3 and CD44 expression (Fig. 5b, Fisher Z, P < 1e−10). Thus, while the stem-like state identified in the 10X data from Ind-4 is clearly not identifiable via single stemness marker expression, we observed partial but significant correlations with ALDH1A3 and CD44 in both 10X and Fluidigm C1 data.

### Single-cell stem-like signature is increased in luminal progenitors

Having validated the stem-like nature of the highly potent cell cluster, we next asked if the transcriptome of these cells may also mark luminal progenitors (LPs). This is reasonable, because although the highly potent cells were mostly enriched in the basal cluster, a considerable number did map to the more immature luminal cluster, occupying a topologically central position close to those in the basal cluster (Fig. 3a). To test our hypothesis, we analyzed bulk expression data from four FACS sorted cell populations, three representing putative LP subclasses and one representing differentiated luminal cells45. We observed that the average expression of the 72 upregulated genes was highest for the EpCAM+/ITGA6+/ALDH+ luminal progenitor population (Fig. 6a, Wilcox test P = 0.004), consistent with the view that it is the ALDH+ cells that are most likely to represent LPs45. Studying the individual genes in the 12-gene and 72-gene signatures revealed that the great majority were overexpressed in the EpCAM+/ITGA6+/ALDH+ population compared to all other luminal/LP populations, a result which was highly significant as assessed using 100,000 Monte-Carlo randomizations (Fig. 6b, P < 1e−5). These data further support the view that the identified stem-like state may be bipotent, as it shares similarity with both basal and luminal progenitors.

### Bipotent-like cells are marked by YBX1 and ENO1 overexpression

As noted earlier, the great majority of genes were downregulated in the stem-like cell cluster, with only 72 exhibiting overexpression. Correspondingly, among the 1369 transcription factors, 582 exhibited differential expression (Bonferroni adjusted P < 0.05) with only 3 TFs (ENO1, YBX1, and BTF3) exhibiting higher expression in the more potent cells (Fig. 4a). Remarkably, YBX1 and ENO1 are two transcription factors whose targets are highly enriched for breast cancer GWAS eQTLs46, thus implicating them in breast cancer risk. In addition, siRNA against YBX1 in a normal ER- cell-line (MCF10A) resulted in significantly reduced cell-confluence and growth, even when compared to other breast cancer risk TFs46. We confirmed that the associations of YBX1 and ENO1 expression with potency remained after adjustment for cell-cycle phase (Supplementary Fig. 9, the “Methods” section), and that their expression correlated with cell potency in the 10X scRNA-Seq data from each of the four women (Supplementary Fig. 10). We note that the correlation of YBX1 expression with potency was particularly evident in the luminal compartment (Supplementary Fig. 11). Moreover, YBX1 expression was also higher in the more immature luminal alveolar-like phenotype, in line with the fact that these alveolar luminal cells should be more enriched for progenitors, and that YBX1 expression was also highest in the FACS-sorted ALDH+ luminal progenitor population (Supplementary Fig. 12).

Of note, both YBX1 and ENO1 also exhibited significant positive correlations with the ALDH1A3 and CD44 stemness markers in the Fluidigm C1 data (Supplementary Fig. 13, Fisher Z, P < 1e−10), but were not upregulated in the quiescent mammary stem cells compared to the transit-amplifying progenitor cells (Supplementary Fig. 7), suggesting that YBX1 and ENO1 expression may be associated with an amplifying (bipotent) progenitor state.

### Bipotent signature marks basal breast cancer and poor clinical outcome

Given that our stem-like signature was derived from single cells and is therefore free from the confounding effect of cell-type heterogeneity, we decided to test it in primary breast cancer tissue. Since the stem-like state was enriched within the basal compartment, we hypothesized that the signature may mark basal breast cancer and be prognostic within this subtype. We confirmed the association with basal breast cancer using 2000 primary breast cancers profiled as part of the METABRIC study47 (Supplementary Fig. 14). The average expression over the 72 genes was also associated with clinical outcome, although only marginally so in the basal subtype (Supplementary Fig. 15). In order to construct a single-cell derived stemness score, we also considered an expanded 144-gene expression signature which, besides the 72 upregulated genes, included the 72 most significantly downregulated genes within the high potency single-cell cluster (Supplementary Table 1). This strategy allowed us to compute a Pearson correlation between the 144-gene signed signature and the expression profile of each METABRIC sample, which should yield a more robust “stemness/bipotency score” (“Methods” section). This score was also significantly higher in the basal subtype (Wilcox test P < 1e−50, Fig. 7a), and correlated with poor clinical outcome (HR = 1.46 (95% CI: 1.32–1.62), P = 6e−13, Fig. 7b), which remained significant in a multivariate analysis adjusted for ER-status, grade, age, stage, and tumor size (HR = 1.26 (95% CI: 1.10–1.43), P = 0.0006, Supplementary Table 2). Importantly, the association with overall survival remained significant within the basal subtype (HR = 1.28 (95% CI: 1.05–1.56), P = 0.02, Fig. 7c) even when adjusted for age, stage, and tumor size (HR = 1.30 (95% CI: 1.02–1.66), P = 0.03, Supplementary Table 3).
The difference in the 3-year overall survival rate between the lowest and highest quartiles was substantial: while those with the lowest stemness score exhibited a 90% 3-year survival rate, those in the highest quartile showed a 30% reduction (Fig. 7c).
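
The signed-signature score described above can be illustrated with a minimal sketch: each signature gene is encoded as +1 (upregulated in the high-potency state) or −1 (downregulated), and the score of a sample is the Pearson correlation between this sign vector and the sample's expression values over the same genes. The gene names and values below are hypothetical, and the sketch assumes that expression has already been normalized (for instance centered per gene across samples); the “Methods” section remains the reference for the actual procedure.

```python
import math

def signed_signature_score(sample_expr, up_genes, down_genes):
    """Pearson correlation between a +1/-1 signature and a sample's expression.

    sample_expr : dict mapping gene -> (normalized) expression in one sample
    up_genes    : genes encoded as +1
    down_genes  : genes encoded as -1
    Genes missing from the sample are skipped.
    """
    up = set(up_genes)
    genes = [g for g in list(up_genes) + list(down_genes) if g in sample_expr]
    signs = [1.0 if g in up else -1.0 for g in genes]
    values = [sample_expr[g] for g in genes]
    n = len(genes)
    mean_s = sum(signs) / n
    mean_v = sum(values) / n
    cov = sum((s - mean_s) * (v - mean_v) for s, v in zip(signs, values))
    var_s = sum((s - mean_s) ** 2 for s in signs)
    var_v = sum((v - mean_v) ** 2 for v in values)
    if var_s == 0 or var_v == 0:
        return float("nan")  # correlation is undefined without variation
    return cov / math.sqrt(var_s * var_v)

# Hypothetical samples: one matching the signature, one opposing it.
up, down = ["up1", "up2"], ["down1", "down2"]
stem_like = {"up1": 2.0, "up2": 2.0, "down1": 0.0, "down2": 0.0}
differentiated = {"up1": 0.0, "up2": 0.0, "down1": 2.0, "down2": 2.0}
score_high = signed_signature_score(stem_like, up, down)
score_low = signed_signature_score(differentiated, up, down)
print(score_high, score_low)
```

Including downregulated genes gives the score a contrast to work with, which is one reason a signed signature may be more robust than an average over upregulated genes alone.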

## Discussion

Here we have demonstrated “proof-of-concept” that our signaling entropy based cell potency measure can identify rare subpopulations representing novel progenitor or stem-like cells. Indeed, application to 3473 single cells from the mammary epithelium identified a minor (<5%) high potency subpopulation, which we argue likely represents a mammary bipotent progenitor or stem-like state. These high-potency cells were not randomly distributed: they were over-represented within the basal compartment, but also mapped preferentially to the periphery of the basal and immature alveolar luminal clusters, with a smaller fraction of marginally lower potency also being exclusive to this luminal cluster. A novel visualization technique, based on generating and comparing cell-density surface maps for all inferred potency states, confirmed that cells in the high-potency state clustered most strongly at the periphery of the basal cluster, with others defining a distinctive bi-modal ridge between the basal and alveolar luminal clusters. Of note, cells defining the peak of maximum cellular density were also the ones attaining the highest potency values. Without having to invoke any prior assumptions, this topologically central position predicts that these highly potent cells may represent a bipotent stem-like state that gives rise not only to basal cells but also to luminal progenitors, in direct analogy with the topologically central positions observed for, e.g., hematopoietic stem cells in the hematopoietic system48.

Many analyses substantiate this view. First, using only high-potency cells, a scatterplot of expression of CD10 and MUC1, two markers that have been proposed to differentiate bipotent progenitors from basal-restricted and luminal-restricted progenitors41, revealed three states: a CD10+/MUC1- population composed only of basal cells, a CD10-/MUC1+ population composed almost exclusively of immature luminal cells, and a double negative CD10-/MUC1- population which was composed mainly of basal cells, but which also included a number of peripheral “ridge-defining” cells as well as a few immature luminal cells. Thus, consistent with previous literature41, the CD10+/MUC1- and CD10-/MUC1+ cells likely represent basal-restricted and luminal-restricted progenitor populations, respectively, with the basal and peripheral CD10-/MUC1- cells defining a bipotent-like state. Second, we used our entropy potency measure to define a natural root-state as the cell attaining the highest potency, from which a diffusion map process was then inferred. This predicted a bifurcation, with one lineage giving rise to basal-restricted progenitors and fully differentiated basal cells, and with the other giving rise to luminal-restricted progenitors and differentiated luminal cells. Third, we found that among the top overexpressing genes in the bipotent stem-like state there was strong enrichment for genes that mark quiescent mammary stem cells43 and stemness generally30,44. We stress that this validation of the single-cell stem-like state was obtained in bulk mRNA expression data comparing quiescent mammary stem-cells to transit-amplifying progenitors, which therefore strongly reinforces the validity of our potency assignments. Fourth, the stem-like single cell signature, which was derived from the 10X scRNA-Seq assay from one woman, also exhibited variability in the 10X scRNA-Seq assays from another three women, in each case correlating with our highly validated potency measure. 
Fifth, we found that our stem-like single-cell signature also correlated significantly with the expression of ALDH1A3 and CD44, two well-known putative mammary stem-cell markers, in independent higher coverage C1 Fluidigm data from another three women. We stress that although significant correlations with these two markers were also observed in the 10X data, these correlations were relatively weak and only significant due to the larger number of cells. This is important because using ALDH1A3 or CD44 expression itself did not allow identification of the novel stem-like state, even if used in conjunction with a state-of-the-art tool like Monocle-2. Sixth, the single-cell expression signature characterizing this stem-like state was also found to be elevated in FACS sorted ALDH+ luminal progenitor cells compared to differentiated luminal and other less differentiated luminal subtypes. This suggests that the signature is not only marking basal progenitors but also luminal progenitors, further supporting a bipotent interpretation.

Of note, a recent scRNA-Seq study performed in the mouse mammary gland, which also used diffusion maps49, reached the conclusion that basal and luminal lineages were separate without evidence of a bifurcation, therefore questioning the existence of a bipotent state. Interestingly, this is in line with a recent neutral lineage study in mice50, which did not find evidence for bipotent cells in the mammary gland. However, if the bipotent cells are in a highly quiescent state, they may not have been found in such lineage tracing studies50. Moreover, a likely explanation for the discrepancy with the mouse scRNA-Seq study is the fact that this previous study did not use an independent potency measure to define a reliable root state. Indeed, reliance on stemness or progenitor marker expression alone to define such a root state does not allow reliable identification of stem-like cells in high dropout rate scRNA-Seq data, as evidenced here but also in this previous study. It is clear that whether diffusion maps predict specific bifurcation points will depend critically on the identification of a reliable root state, especially since cells transiting between bipotent and lineage-restricted progenitor states are sparse. Thus, it will be necessary to profile even larger numbers of cells and at higher read-depth (average read depth of the 10X data considered here was 60,000 reads per cell) to conclusively address this question. Higher read-depth would allow full characterization of the transcriptome of this bipotent stem-like state, which may in turn help pinpoint specific surface markers.

The putative bipotent state as revealed by LandSCENT may have important implications for basal breast cancer. It is indeed striking that of the three TFs overexpressed in the stem-like state, two (YBX1 and ENO1) have been implicated in basal breast cancer risk46. Specifically, it has been observed that genes within the YBX1 and ENO1 regulons are strongly enriched for GWAS breast cancer eQTLs46. The third TF (BTF3) has been shown to be necessary for proliferation and EMT in gastric cancer51. YBX1 merits further study as it has been shown to play a key role in maintaining the self-renewal and proliferative capacity of basal cells46. There is also substantial evidence demonstrating that YBX1 transforms mammary epithelial cells, via binding to the BMI1 promoter and chromatin remodeling, leading to basal breast cancer52. In line with this, YBX1 is also more highly expressed in basal breast cancer compared to all other breast cancer subtypes (Supplementary Fig. 14). Interestingly, YBX1 and the associated stem-like signature were also highly expressed in luminal progenitors, which is important because a subset of basal breast cancers, notably BRCA1 mutant ones, are thought to arise from misprogrammed luminal progenitors45,53. Indeed, the single-cell landscape inferred with LandSCENT underscores the similarity of the highly potent cells within the basal compartment with those in the immature luminal cluster, strongly suggesting that the cell of origin for basal breast cancer may well be a bipotent-like cell that shares an expression profile similar to that of luminal progenitors. YBX1 has also been shown to interact with ESR1, and via FGFR2 signaling may contribute to tamoxifen resistance54.
Interestingly, although the majority of the 72 upregulated genes were also overexpressed in the quiescent mammary stem-cells derived from mammosphere-growing assays, both YBX1 and ENO1 were not overexpressed relative to the transit-amplifying progenitors, suggesting that they may not be stemness markers per se, but markers of a bipotent early progenitor state. Beyond YBX1, we characterized the putative bipotent cells in terms of a 144-gene “bipotent” expression signature, which clearly marked basal breast cancer, and which also correlated with poor overall survival within the basal subtype independently of standard prognostic factors, all consistent with it defining a “poor outcome stemness signature”. While poor outcome stemness signatures derived from bulk data have been widely reported in breast cancer55,56,57,58, this study presents a prognostic stemness signature derived from single cells and therefore free from the confounding effects of cell-type heterogeneity. Thus, the observation that the single-cell stem-like signature correlates with clinical outcome in basal breast cancer, whilst also including a TF that is oncogenic for basal breast cancer and which has also been implicated in basal breast cancer risk, is in our opinion an important finding. Indeed, there is growing evidence that molecular alterations (both inherited and somatic) affecting the adult stem/progenitor cell pool of a tissue are a main risk factor for epithelial cancer development16,17,18,59,60. Thus, we speculate that it is the genetic and epigenetic alterations that accumulate within the bipotent progenitor cell pool identified here, which may confer the risk of breast cancer, especially basal breast cancer.

In summary, we have showcased an unbiased, marker-free computational approach for estimating cell potency which, when applied to the human mammary epithelium, identified a novel putative bipotent stem-like state whose transcriptome exhibits associations with basal breast cancer risk and outcome. Our LandSCENT algorithm and findings may serve as a general paradigm for analogous scRNA-Seq studies in other tissue types, including those performed on cancer tissue which aim to identify putative cancer stem-cells5,8,30.

## Methods

### Single-cell data and preprocessing

10X Genomics set: The main scRNA-Seq data analyzed in this work derives from the study of Nguyen et al.34, who used the 10X Genomics Chromium platform to sequence a total of 24,646 cells from reduction mammoplasty specimens from four separate nulliparous women (Ind4–7), at an average read-depth of 60,000 reads per cell. Mapped read count data from the four individuals was downloaded from GEO (GSE113197), and further normalized as follows: for each cell we counted the number of expressed genes (“coverage per cell”), and for each gene we also counted the number of cells in which it was expressed (“coverage per gene”). For each cell, we also computed the total read count mapping to mitochondrial genes, which revealed low cell coverage for those cells having a high proportion of mitochondrial gene read counts. Based on this, we selected all cells expressing at least 1000 genes and with a proportion of mitochondrial read counts <0.05, leaving a total of 23,369 cells. Mitochondrial genes were removed and the total read count per cell c recomputed (TRCc). Denoting the maximum of TRCc by maxC, and the read count matrix by RCM, the latter was normalized with the following transformation: LSCgc = log2(RCMgc*maxC/TRCc + 1.1). Finally, we only used Entrez gene ID annotated genes, which resulted in a log-normalized single-cell matrix of dimension 22,049 genes and 23,369 cells (3473 for Ind-4, 6811 for Ind-5, 5807 for Ind-6, and 7278 for Ind-7).
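The filtering and normalization just described can be illustrated with a minimal Python sketch. It is not the code used in this study. The function name `qc_and_normalize`, the genes-by-cells list-of-lists layout, and the choice to compute coverage and the mitochondrial proportion before removing mitochondrial genes are assumptions made for illustration.

```python
import math

def qc_and_normalize(rcm, genes, mito_genes, min_genes=1000, max_mito=0.05):
    """Filter cells, then log-normalize a genes-by-cells read count matrix.

    rcm        : list of rows (one per gene), each a list of raw counts per cell
    genes      : gene identifiers, one per row of rcm
    mito_genes : set of identifiers treated as mitochondrial (supplied by the caller)
    Returns (kept cell indices, retained gene identifiers, normalized matrix LSC).
    """
    n_cells = len(rcm[0])
    is_mito = [g in mito_genes for g in genes]

    kept = []
    for c in range(n_cells):
        column = [row[c] for row in rcm]
        coverage = sum(1 for v in column if v > 0)      # "coverage per cell"
        total = sum(column)
        mito = sum(v for v, m in zip(column, is_mito) if m)
        if coverage >= min_genes and total > 0 and mito / total < max_mito:
            kept.append(c)

    # Remove mitochondrial genes, then recompute the total read count per cell.
    nuclear = [row for row, m in zip(rcm, is_mito) if not m]
    kept_genes = [g for g, m in zip(genes, is_mito) if not m]
    trc = [sum(row[c] for row in nuclear) for c in kept]
    max_c = max(trc)

    # LSC_gc = log2(RCM_gc * maxC / TRC_c + 1.1)
    lsc = [[math.log2(row[c] * max_c / t + 1.1) for c, t in zip(kept, trc)]
           for row in nuclear]
    return kept, kept_genes, lsc
```

Because of the 1.1 offset, a zero count maps to log2(1.1) > 0, so every entry of the normalized matrix is strictly positive.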

Fluidigm C1 set: In addition, we also analysed the corresponding Fluidigm C1 scRNA-Seq set, also from Nguyen et al.34. We downloaded the FPKM-valued matrix of 33,694 features and 815 cells encompassing cells from three different women. We selected cells expressing at least 1000 genes and with a mitochondrial proportion less than 0.3, leaving a matrix of 33,681 features and 715 cells. The FPKM matrix was log2-normalized with a pseudocount of 1. We only kept genes mapping to an Entrez gene ID, which resulted in a normalized expression matrix over 22,049 genes and 715 cells. The numbers of cells for the three individuals were 198 (Ind-1), 195 (Ind-2) and 322 (Ind-3).

### The LandSCENT algorithm

LandSCENT is a direct extension of the SCENT algorithm. There are four steps to the LandSCENT algorithm: (1) Inference of potency states: estimation of the differentiation potency of single cells via computation of the signaling entropy rate (SR) and subsequent inference of the potency state distribution across the single cell population. (2) Inference of cell-types: we perform t-SNE38 followed by density-based spatial clustering (dbscan)39 on a suitably dimensionally reduced LSC matrix. (3) Identification of cell-states, i.e., potency state single-cell cluster pairs that contain a minimum number of cells30, and construction of cell-density landscapes for each potency-state. (4) Identification of a root-state, i.e., the cell state of highest entropy/potency (SR), and subsequent application of Diffusion Maps27,32 to infer bifurcations and lineage trajectories. We note that step-1 is the exact same procedure as used in our original SCENT algorithm30.

Step-1 Inference of potency states: We estimate differentiation potency of each single cell by computing the signaling entropy, as described previously31,61. Briefly, the normalized genome-wide gene expression profile of a sample (this can be a single cell or a bulk sample), which provides the biological context, is used to assign weights to the edges of a highly curated protein–protein interaction (PPI) network. The construction of the PPI network itself is described in detail elsewhere31, and is obtained by integrating various interaction databases which form part of Pathway Commons (www.pathwaycommons.org)62. The PPI network as used here is available from https://github.com/ChenWeiyan/LandSCENT/tree/master/data under filename net13Jun12.m.RData. The weight of an edge between protein i and protein j, denoted by wij, is assumed to be proportional to the normalized expression levels of the coding genes in the cell, i.e., we assume that wij~xixj, and we interpret these weights (if normalized) as interaction probabilities. Thus, in a sample with high expression of i and j, the two proteins are more likely to interact than in a sample with low or absent expression of i and/or j. Normalizing the weights results in a random walk defined by a stochastic matrix, P, over the network, with entries

$$p_{ij} = \frac{{x_j}}{{\mathop {\sum }\nolimits_{k \in N(i)} x_k}} = \frac{{x_j}}{{(Ax)_i}}$$

where N(i) denotes the neighbors of protein i, and where A is the adjacency matrix of the PPI network (Aij=1 if i and j are connected, 0 otherwise, and with Aii= 0). The signaling entropy is then defined as the entropy rate (denoted Sr) over the weighted network, i.e.,

$$Sr\left( {\vec x} \right) = - \mathop {\sum }\limits_{i = 1}^n \pi _i\mathop {\sum }\limits_{j \in N(i)} p_{ij}\log p_{ij}$$

where π is the invariant measure, satisfying πP = π and the normalization constraint π^T1 = 1. The invariant measure, also known as steady-state probability, represents the relative probability of finding the random walker at a given node in the network (under steady state conditions, i.e., long after the walk is initiated). Nodes with high values thus represent nodes that are particularly influential in distributing signaling flux in the network. In the steady-state we can assume detailed balance (conservation of signaling flux, i.e., πipij = πjpji), and it can be shown61 that πi = xi(Ax)i/(x^TAx). Given a fixed adjacency matrix A (i.e., fixing the topology), it can also be shown61 that the maximum possible Sr among all compatible stochastic matrices P is attained by $$P = \frac{1}{\lambda }v^{ - 1} \otimes A \otimes v$$ where ⊗ denotes the element-wise product of matrix entries (i.e., pij = Aijvj/(λvi)) and where v is the dominant eigenvector of A, i.e., Av = λv with λ the largest eigenvalue of A. We denote this maximum entropy rate by maxSr, and define the normalized entropy rate (with range of values between 0 and 1) as

$$SR\left( {\vec x} \right) = \frac{{Sr(\vec x)}}{{maxSr}}$$
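As an illustration of the three formulas above, the following Python sketch computes Sr and SR for a small toy network. It is not the LandSCENT implementation. It assumes a connected network with strictly positive expression values, and it uses a standard property of the maximum-entropy random walk that is not stated explicitly above, namely maxSr = log λ. The dominant eigenvalue is obtained by power iteration on A + I, an illustrative choice that avoids oscillation on bipartite networks.

```python
import math

def entropy_rate(x, adj):
    """Sr(x) = -sum_i pi_i sum_{j in N(i)} p_ij log p_ij, with
    p_ij = x_j / (Ax)_i and pi_i = x_i (Ax)_i / (x^T A x)."""
    n = len(x)
    ax = [sum(adj[i][k] * x[k] for k in range(n)) for i in range(n)]
    xax = sum(x[i] * ax[i] for i in range(n))
    sr = 0.0
    for i in range(n):
        pi_i = x[i] * ax[i] / xax
        local = 0.0
        for j in range(n):
            if adj[i][j]:
                p = x[j] / ax[i]
                local -= p * math.log(p)
        sr += pi_i * local
    return sr

def dominant_eigenvalue(adj, n_iter=1000):
    """Largest eigenvalue of A, via power iteration on the shifted matrix A + I."""
    n = len(adj)
    v = [1.0] * n
    for _ in range(n_iter):
        w = [v[i] + sum(adj[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    av = [sum(adj[i][k] * v[k] for k in range(n)) for i in range(n)]
    return sum(v[i] * av[i] for i in range(n))  # Rayleigh quotient (v has unit norm)

def normalized_entropy_rate(x, adj):
    """SR(x) = Sr(x) / maxSr, assuming maxSr = log(lambda)."""
    return entropy_rate(x, adj) / math.log(dominant_eigenvalue(adj))

# Toy example: a fully connected 3-node network.
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
sr_uniform = normalized_entropy_rate([1.0, 1.0, 1.0], triangle)
sr_skewed = normalized_entropy_rate([1.0, 1.0, 4.0], triangle)
```

In this toy network, uniform expression attains the maximum entropy rate (SR = 1), whereas concentrating expression on one node lowers SR, in line with the interpretation of SR as a measure of signaling promiscuity.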

Since SR is bounded between 0 and 1, we next transform the SR value of each single cell into their logit-scale value, i.e., y(SR) = log2(SR/(1−SR)). Subsequently, we fit a mixture of Gaussians to the y(SR) values of the whole cell population, and use the Bayesian Information Criterion (as implemented in the mclust R-package)63 to estimate the optimal number K of potency states, as well as the state-membership probabilities of each individual cell. Thus, for each single cell, this results in its assignment to a specific potency state.
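The logit transformation and mixture-model step can be sketched as follows. This is a simplified stand-in for the mclust analysis, not a reimplementation of it. It assumes a one-dimensional Gaussian mixture with a single shared variance, fitted by a basic EM loop with a deterministic initialization, and a BIC defined as −2 log L + p log n (lower is better). The function names and the `max_k` default are hypothetical.

```python
import math

def logit2(sr):
    """y(SR) = log2(SR / (1 - SR)), defined for 0 < SR < 1."""
    return math.log2(sr / (1.0 - sr))

def normal_pdf(y, mu, var):
    return math.exp(-((y - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def e_step(ys, weights, means, var):
    """Return membership probabilities per cell and the total log-likelihood."""
    resp, loglik = [], 0.0
    for y in ys:
        dens = [w * normal_pdf(y, m, var) for w, m in zip(weights, means)]
        total = sum(dens)
        resp.append([d / total for d in dens])
        loglik += math.log(total)
    return resp, loglik

def fit_mixture(ys, k, n_iter=200):
    """EM for a k-component 1D Gaussian mixture with shared variance (needs len(ys) >= k)."""
    n = len(ys)
    ordered = sorted(ys)
    # Deterministic start: component means are the means of k equal-sized chunks.
    chunks = [ordered[j * n // k:(j + 1) * n // k] for j in range(k)]
    means = [sum(c) / len(c) for c in chunks]
    weights = [1.0 / k] * k
    grand_mean = sum(ys) / n
    var = sum((y - grand_mean) ** 2 for y in ys) / n
    for _ in range(n_iter):
        resp, _ = e_step(ys, weights, means, var)
        nk = [sum(r[j] for r in resp) for j in range(k)]
        weights = [v / n for v in nk]
        means = [sum(r[j] * y for r, y in zip(resp, ys)) / nk[j] for j in range(k)]
        var = sum(r[j] * (y - means[j]) ** 2
                  for r, y in zip(resp, ys) for j in range(k)) / n
    resp, loglik = e_step(ys, weights, means, var)
    n_params = 2 * k                      # (k - 1) weights + k means + 1 variance
    bic = -2.0 * loglik + n_params * math.log(n)
    states = [max(range(k), key=lambda j: r[j]) for r in resp]
    return {"means": means, "var": var, "bic": bic, "resp": resp, "states": states}

def infer_potency_states(sr_values, max_k=3):
    """Pick the number of potency states K by BIC and assign each cell to a state."""
    ys = [logit2(s) for s in sr_values]
    fits = {k: fit_mixture(ys, k) for k in range(1, max_k + 1)}
    best_k = min(fits, key=lambda k: fits[k]["bic"])
    return best_k, fits[best_k]
```

Each cell is assigned to the component with the highest membership probability. The per-cell probabilities are kept in `resp` so that ambiguous cells can be inspected.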

Step-2 Inference of cell-types: Cell-types are inferred as clusters using cell-density in the two-dimensional t-SNE space as the main criterion. Preliminary dimensional reduction is achieved by first selecting genes with a mean expression larger than 1, and also a standard deviation larger than 1. These thresholds were chosen after inspection of the mean-variance plot, and in the case of Ind-4 this resulted in 4261 highly variable and expressed genes. To map the high-dimensional data matrix to a two-dimensional subspace we used t-SNE with an initial dimension of 30, a perplexity parameter of 30, 1000 maximum iterations and epoch parameter set to 100. We then used the dbscan algorithm (density-based spatial clustering) with eps = 5 and minPts = 15 to identify clusters. Thus, after steps-1 and 2, each cell is assigned to a unique potency state and co-expression cluster (cell-type).
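The gene-selection and clustering parts of this step can be sketched in plain Python as below. The t-SNE embedding itself is omitted, so `dbscan` is assumed to receive two-dimensional coordinates that have already been computed. The sketch is a textbook DBSCAN rather than the dbscan R-package, and it assumes a sample standard deviation for gene selection and a neighborhood count that includes the point itself.

```python
import math
import statistics

def select_genes(lsc, min_mean=1.0, min_sd=1.0):
    """Indices of rows (genes) with mean > min_mean and sample SD > min_sd."""
    return [g for g, row in enumerate(lsc)
            if statistics.mean(row) > min_mean and statistics.stdev(row) > min_sd]

def dbscan(points, eps=5.0, min_pts=15):
    """Density-based clustering; returns one label per point (-1 denotes noise)."""
    n = len(points)
    labels = [None] * n                       # None = not yet visited

    def neighbors(i):
        # Includes point i itself (an assumption about how minPts is counted).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1                    # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster           # noise point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_pts:         # j is a core point: keep expanding
                queue.extend(reach)
    return labels
```

The brute-force neighbor search is quadratic in the number of cells, which is adequate for a toy example but not for thousands of cells.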

Step-3 Identification of cell-states and construction of cell-density landscapes for each potency state: Specific potency-state single-cell cluster pairs may contain many cells and therefore represent clear candidates for defining cell-states. However, in principle, cells in the same state, whilst being in the same cluster, may not necessarily be that close in the tSNE embedding. For this reason, we also construct cell-density elevation maps for all single cells within each of the inferred potency states. In these surface maps, the elevation is directly proportional to cell-density. By comparing the resulting landscapes for each potency state, this may reveal novel cellular states characterized by high cell-density.

Step-4 Inference of bifurcations and lineage trajectories: From step-3, it is assumed that a cell-state of highest potency is identifiable. This provides a natural and unbiased way of assigning a root-state for subsequent application of a lineage-trajectory inference algorithm. We used Diffusion Maps27, as implemented in the destiny Bioconductor package32, with k = 30 and otherwise default parameters. Pseudotime, specifically diffusion pseudotime (DPT), over the inferred trajectories was also computed using destiny.
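The bookkeeping behind steps 3 and 4 can be illustrated with a short Python sketch. It tabulates potency-state/cluster pairs, keeps those with enough cells, and picks the pair of highest potency as the root-state. Several choices here are assumptions: the `min_cells` threshold is a placeholder, the root-state is taken as the retained pair with the highest mean SR, and the root cell is the cell of highest SR within that pair. Diffusion map inference itself is not shown.

```python
from collections import defaultdict

def find_root_state(potency_states, clusters, sr, min_cells=2):
    """Identify cell-states and a root-state from per-cell annotations.

    potency_states, clusters, sr : equal-length per-cell sequences.
    Returns (cell_states, root_state, root_cell), where cell_states maps each
    retained (potency state, cluster) pair to the indices of its cells.
    """
    members = defaultdict(list)
    for idx, pair in enumerate(zip(potency_states, clusters)):
        members[pair].append(idx)

    # Cell-states: pairs containing at least min_cells cells.
    cell_states = {pair: idxs for pair, idxs in members.items()
                   if len(idxs) >= min_cells}
    if not cell_states:
        raise ValueError("no potency-state/cluster pair has enough cells")

    def mean_sr(idxs):
        return sum(sr[i] for i in idxs) / len(idxs)

    root_state = max(cell_states, key=lambda pair: mean_sr(cell_states[pair]))
    root_cell = max(cell_states[root_state], key=lambda i: sr[i])
    return cell_states, root_state, root_cell
```

Requiring a minimum number of cells prevents an isolated high-SR cell, which could be a technical outlier, from being used as the root.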

### Estimation of cell-cycle and TPSC pluripotency scores

To identify single cells in either the G1-S or G2-M phases of the cell-cycle we followed the procedure described in ref. 5. Briefly, genes whose expression is reflective of G1-S or G2-M phase were obtained from refs. 64,65. The normalized scRNA-Seq data matrix for a given individual is then z-score normalized for all genes present in these signatures. Finally, a cycling score for each phase and each cell is obtained as the average z-score over all genes present in each signature. When adjusting differential expression analyses for cell-cycle phase, we included the G1-S and G2-M scores as covariates in the linear models.
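A minimal Python sketch of this scoring scheme is given below. The dictionary-based data layout and the function name are assumptions. Signature genes that are absent from the data, or that have zero variance across cells, are skipped, a choice that is not specified above.

```python
import statistics

def cycling_scores(expr, signature):
    """Average z-score over signature genes, returned as one score per cell.

    expr      : dict mapping gene -> list of normalized expression values per cell
    signature : iterable of gene identifiers for one phase (e.g., G1-S or G2-M)
    """
    z_rows = []
    for gene in signature:
        if gene not in expr:
            continue                          # gene not measured in this data set
        values = expr[gene]
        sd = statistics.stdev(values)         # sample SD (an assumption)
        if sd == 0:
            continue                          # z-score undefined for constant genes
        mu = statistics.mean(values)
        z_rows.append([(v - mu) / sd for v in values])
    if not z_rows:
        raise ValueError("no usable signature genes")
    n_cells = len(z_rows[0])
    return [sum(row[c] for row in z_rows) / len(z_rows) for c in range(n_cells)]
```

The function would be called once per phase signature, yielding two scores per cell that can then be used as covariates.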

### Bulk expression datasets

In this study we used three mRNA expression datasets from bulk samples. One dataset consists of 38 FACS sorted bulk samples (Illumina expression beadarrays), as profiled by Shehata et al.45. Of the 38 samples, 10 were categorized as luminal non-clonogenic (L), i.e., terminally differentiated cells, with the rest (n = 28) comprising relatively differentiated (EpCAM+/CD49f+/ALDH−, n = 17) and undifferentiated (EpCAM+/CD49f+/ALDH+, n = 11) luminal progenitor (LP) populations. The two undifferentiated LP populations were further distinguished by expression or not of ERBB3. We used the normalized data, as described in ref. 45.

The second dataset is the METABRIC study, which profiled almost 2000 primary breast cancers using Illumina expression beadarrays47. We used the assignment of tumors to PAM50 intrinsic and integrative cluster subtypes as given by the METABRIC study. We used the normalized data, as provided by the METABRIC consortium.

A third Affymetrix mRNA expression dataset derives from Pece et al.43. This set consists of three separate pools of FACS sorted cell populations. Each pool contains a quiescent putative mammary stem cell population, as well as a population of derived progeny, consisting of transit-amplifying progenitor cells, thus a total of six bulk samples. We normalized the HGU133 plus2 data using the affy BioC package, specifically, the rma function. Only probes mapping to an Entrez gene ID were used, data was quantile normalized using limma, and probes mapping to the same gene were averaged, resulting in a normalized data matrix over 20,186 genes and six samples.

### Differential expression analysis

When performing differential expression analysis within the main single-cell clusters, differences in expression are smaller and therefore more susceptible to confounding by the technical dropout rate. Thus, when comparing gene expression of single-cell subgroups within a main single cell cluster, we always restrict to cells where the gene is expressed. That is, we remove all dropouts and do not impute data. When correlating to potency, we used a linear model between the normalized expression profile and the potency estimates, optionally adjusting for the two cell-cycle scores computed earlier. In the case of the Illumina beadarray datasets, we used the normalized data from the respective publications45,47 and called DE using the empirical Bayes limma framework66. We always use Bonferroni-adjusted thresholds to call statistical significance unless there are too few hits, in which case we relax the threshold using FDR < 0.05.
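The restriction to expressing cells can be illustrated for a single gene as follows. The sketch is not the analysis code. It assumes that dropouts are cells with a raw count of zero, that expression is regressed on potency, and it omits the optional cell-cycle covariates. It returns the slope and its t-statistic, from which a P-value would be derived using a t-distribution with n − 2 degrees of freedom.

```python
import math

def potency_association(raw_counts, norm_expr, potency):
    """Regress normalized expression on potency using expressing cells only.

    raw_counts, norm_expr, potency : equal-length per-cell lists for one gene.
    Returns (slope, t_statistic, number_of_cells_used).
    """
    keep = [i for i, count in enumerate(raw_counts) if count > 0]  # drop dropouts
    n = len(keep)
    if n < 3:
        raise ValueError("need at least three expressing cells")
    x = [potency[i] for i in keep]
    y = [norm_expr[i] for i in keep]
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((a - x_mean) ** 2 for a in x)
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    # Standard error of the slope; zero (hence undefined t) for a perfect fit.
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope / se, n
```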

### Construction of the 144-gene bipotent signature and score

We performed differential expression analysis, as described in the previous section, between the single cells in the high potency (putative bipotent) cluster and the cells in the other two potency states, using a linear model. A Bonferroni-adjusted P < 0.05 threshold was used to call significance. Because the great majority of differentially expressed genes were downregulated in the high potency state, with only 72 being upregulated, we defined a 144-gene signature consisting of the top 72 downregulated genes plus the 72 upregulated ones. The bipotency score in independent samples (e.g., METABRIC) was then obtained as the Pearson correlation of the signed 144-gene signature (i.e., using +1 for upregulated genes, and −1 for downregulated genes) with the expression profile of the independent sample.
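The score computation can be sketched in a few lines of Python. The gene identifiers in the usage example are placeholders rather than members of the actual signature, and restricting the correlation to signature genes present in the sample is an assumption.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def bipotency_score(signature, sample):
    """Correlate a signed signature with one sample's expression profile.

    signature : dict mapping gene -> +1 (upregulated) or -1 (downregulated)
    sample    : dict mapping gene -> expression value in the independent sample
    """
    genes = [g for g in signature if g in sample]   # genes measured in the sample
    signs = [signature[g] for g in genes]
    values = [sample[g] for g in genes]
    return pearson(signs, values)

# Placeholder gene identifiers, for illustration only.
toy_signature = {"UP_A": 1, "UP_B": 1, "DOWN_A": -1, "DOWN_B": -1}
toy_sample = {"UP_A": 5.0, "UP_B": 5.0, "DOWN_A": 1.0, "DOWN_B": 1.0, "OTHER": 3.0}
toy_score = bipotency_score(toy_signature, toy_sample)
```

A sample in which upregulated signature genes are high and downregulated ones are low scores close to +1, and the reverse pattern scores close to −1.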

### Doublet score analysis

We used two different simulation-based methods to derive doublet scores for each cell and to identify those more likely to be doublets. One approach used the simulation method of Dahlin et al.40 to obtain doublet scores for all single cells that passed QC and for each individual separately. Specifically, we used the doubletCells function (using approximate = TRUE option) from the scran R-package (version 1.10.1)67. In the second approach we used the Python package Scrublet68 (https://doi.org/10.1101/357368). Within Scrublet, the scrub_doublets function, which is responsible for computing doublet scores and predicting doublets within a dataset, was run using default parameters.

### Statistics and reproducibility

All statistical analyses were performed with R version 3.6.0. P-values were estimated using Wilcoxon rank sum tests or linear regression, as indicated. Cox proportional hazards regression was used for survival analysis. The hazard ratio, 95% confidence interval, and P-value as derived from the score-test are given for univariate analyses. In multivariate analysis, the P-value derives from the Wald-test. We used the following open-source Bioconductor/R packages: mclust_5.4.2, dbscan_1.1-3, tsne_0.1-3, igraph_1.2.4, monocle_2.99.3, scran_1.10.1, destiny_2.14.0.

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.