Breast cancer is not a single disease. Rather, it relies on several different tumor subtypes each with their own phenotype and clinical outcome1 (for review see ref. 2). One of the most difficult-to-treat subtypes is the basal-like. Basal-like breast cancer originates from progenitor cells within the normal breast, typically among premenopausal women. We and others have previously narrowed down a luminal progenitor, which is double positive for K14 and K19 as a likely candidate cell of origin of basal-like breast cancer3,4,5,6. In basal-like breast cancer, apparent equivalents to double-positive cells are believed to contribute to aggressive behavior by taking on a leader role in invasion7. Indeed, knockdown of K14 in these cancer cells is sufficient to block what is referred to as collective invasion7. In primary tumors, the basal-like cells, reminiscent of normal double-positive cells are considered progenitors. In a tumor setting, these cells exhibit the potential of acquiring a hybrid epithelial-mesenchymal transition (EMT) state with a permanent aggressive potential8. Moreover, the EMT state seems to govern the level of progenitor activity as well as malignant behavior8,9,10,11. It is therefore important to understand in more detail the relationship between double-positive cells in the normal breast and the cells of the basal-like subtype of breast cancer.

Previous studies have emphasized the importance of spatial mapping based on combining micro-collection of organoids directly from reduction mammoplasties with quantitative fluorescence-activated cell sorting (FACS) and multicolor imaging to investigate human breast progenitors3,12. These and other studies have provided compelling evidence for the site-specific presence of distinct progenitors in ducts and TDLUs3,12,13,14. This observation is potentially much more far-reaching if viewed in context with data of distinct disease-free survival rates exclusively determined by cell of origin in ducts and lobules as determined by mammography and histology15,16. That duct-derived tumors may exhibit the worst prognosis15,16 emphasizes the importance of establishing further evidence for a relationship between normal double-positive progenitors in ducts and basal-like breast cancer.

Studies by others have recently employed scRNA-seq to describe the diversity of the epithelial cells in the human breast17,18,19,20. While these confirm the existence of three distinct epithelial populations irrespective of donor age and measures taken to enrich for epithelial cells prior to analysis, it remains unanswered whether the luminal epithelial compartment can be resolved further and whether transcriptional profiles relate to anatomical position. To address this, we here used micro-collected primary normal breast organoids to isolate trophoblast antigen 2+ (TROP2+)/CD271 luminal epithelial cells from ducts and TDLUs and performed scRNA-seq. We discovered that the most immature luminal human breast epithelial cells reside in ducts and exhibit a unique expression profile that includes high levels of KRT15. This signature was found to correlate strongly with basal-like breast cancer. Together, our data provide a new level of resolution of phenotypes in the human normal breast for precision of cancer cell of origin studies.


Combined micro-collection of organoids, FACS, and scRNA-seq lead to spatial mapping of differentially expressed genes of luminal epithelial cells in the human breast gland

The presence of a luminal stem cell zone in ducts based on functional assays has been well described3,5,21. However, relatively little is known about the molecular constitution enabling the cells to function as progenitors or about the relationship between the stem cell hierarchy and tissue architecture. To address these questions, we micro-collected ducts and TDLUs from reduction mammoplasties, used lineage-specific cell-surface markers to enrich for luminal epithelial cells by a FACS protocol including TROP2 instead of EpCAM in combination with CD271 previously shown to optimize the separation of luminal and myoepithelial cells12,22,23, and subjected the resulting populations to scRNA-seq (Fig. 1a–c). To minimize variation due to age, parity, and ductal-lobular ratio12,24,25, we used biopsies from three same-aged young adults (age 18), collected in the range of 30 to 50 TDLUs and ducts, respectively, from each, and sorted a total number of 36,000 ductal- and 36,000 TDLU- derived cells for scRNA-seq. For an integrative analysis of the luminal lineage, we performed clustering of a total of 20,286 cells and on average 1300 genes per cell using Seurat (version 3.0)26, which identified 12 clusters with distinct gene expression profiles (Fig. 1c). Analysis of cluster entropy indicated unskewed contribution from the three biopsies to each cluster (Supplementary Fig. 1a). The cluster designated 0 included a minor proportion of cells (146 cells, 0.7% of the total population) reflecting the presence of immune cells, and thus, was excluded from further analysis. The cells were separated in two major groups of ten clusters labeled 1.1–1.4 and 2.1–2.6, respectively. The remaining cluster located in between was labeled cluster 3. Since the clustering relied on compiled data from ducts and lobules in separate, the contribution of each to the collective image was readily resolved and revealed a higher contribution of duct-derived cells to clusters 1.1–1.4, and to a significant level in cluster 1.2, and TDLU-derived cells to clusters 2.1–2.6 (Fig. 1d and Supplementary Fig. 1b).

Fig. 1: Combined micro-collection, FACS, and scRNA-seq assign duct- and TDLU-derived cells to separate clusters.
figure 1

a Phase-contrast micrographs of micro-collected ducts (left) and TDLUs (right) from human breast organoids. Scale bar = 100 μm. b FACS diagrams showing the gating of luminal cells (TROP2+/CD271) from ducts (red circle, left) and TDLUs (blue circle, right) for scRNA-seq. Colors in FACS diagrams indicate frequency (red = high, blue = low frequency). c Uniform Manifold Approximation and Projection (UMAP) plot showing the clustering of luminal cells of three age-matched biopsies (20,286 cells). Clusters are denoted by color and labeled in the plot. Clusters: 0 as immune cells, 1.1–1.4 as luminal progenitors, and 2.1 to 2.6 and 3 as mature luminal cells according to differentially expressed genes (DEGs), some of which are shown in Fig. 2a. d (left) Same UMAP plot as in c showing contribution of duct- (red) and TDLU-derived (blue) luminal cells. (right) Bar graph illustrating contribution (relative frequency of cells in percent) of duct- (red) and TDLU-derived (blue) luminal cells to groups 1 and 2 of clusters. Error bars represent mean + standard error of the mean (SEM), n = 3 biopsies. *p < 0.05 by two-tailed t-test. Source data are provided as a Source Data file.

The ductal-enriched group comprises immature luminal progenitors including K14+/K19+ double-positive cells and a human-specific population of K15+ cells

To infer the roles of the clusters, differentially expressed genes (DEGs) were examined and summarized in Fig. 2a and Supplementary Data 1. From these data, it became obvious that clusters 1.1–1.4 in general may represent immature progenitors by expression of, e.g., ALDH1A3 (aldehyde dehydrogenase 1 family member A3) and KRT153,27,28,29, whereas clusters 2.1–2.6 and 3 represent more mature luminal epithelial cells by expression of BCL2, FOXA1 (forkhead box A1) and several endocrine receptors (Fig. 2a)28,30,31. This hierarchical division was echoed in a screen of lineage-specific cell-surface markers within the list of the in silico human surfaceome32. Thus, we found expression of established progenitor markers TNFRSF11A (RANK), CD55, PROM1 (prominin 1) and KIT (c-Kit)33,34,35,36,37,38,39 in clusters 1.1–1.4 and differentiated luminal epithelial markers ALCAM (CD166), AREG (amphiregulin), and TNFSF11 (RANKL) in clusters 2.1–2.6 (Fig. 2a)28,40,41,42. The full list of DEGs encoding cell-surface proteins is available in Supplementary Data 2. Intriguingly, the significantly duct-enriched cluster 1.2 accumulates MCAM (CD146), KRT14 and KRT15 expressing cells on a KRT19-positive background (Fig. 2b, c)—a combination of phenotypes, which have been amply validated in functional progenitor assays and which have been localized primarily to ducts at the protein level3,21,28. Cluster 3 was characterized by a significant expression of prolactin induced protein (PIP) and mucin like 1 (MUCL1) compared to other clusters. Also, immunoglobulin superfamily 1 (IGSF1) encoding a cell-surface molecule is exclusively upregulated in cluster 3 (Fig. 2a). To uncover any species related controversy concerning the generation of epithelial lineages43, we compared our data with existing scRNA-seq data based on the mouse mammary gland44. Of note, in accordance with others45, we found that expression of KRT15 in the luminal compartment is specific for the human breast, and moreover that the mapping of the human stem cell hierarchy differs from that of mice (Supplementary Fig. 2a).

Fig. 2: The ductal-enriched clusters 1.1–1.4 include immature progenitors.
figure 2figure 2

a Bubble plots showing a subset of cluster-specific DEGs of luminal lineage markers (left) and DEGs encoding cell-surface proteins (right). Circle size denotes the percentage of cells expressing the gene in each cluster, while color indicates the average expression level (red = high expression, blue = low expression). b Bar graph showing log-transformed gene expression of KRT15, KRT14, and MCAM in cluster 1.2 (orange) and in the other clusters (green). Error bars represent mean ± SEM. ****p < 0.001 by multiple two-tailed t-tests using False Discovery Rate (FDR) for multiple comparisons. c Violin plots showing expression of KRT19, KRT15, KRT14, and MCAM in all luminal clusters. d Principal component (PC) plots showing the trajectories of luminal differentiation (left) and pseudotime (right) using principal curves. Colors denote clusters (left) and pseudotime (right). The origin and endpoints were identified without supervision. Slingshot inferred cluster 1.1 as the origin and arrow heads indicate endpoints. e Heat maps showing expression of genes related to epithelium development (adj p < 0.001, g: Profiler analysis, left) and anatomical structure and morphogenesis (adj p < 0.0001, g:Profiler analysis, right) in the luminal clusters. f (left) Representative fluorescence multicolor imaging of normal breast stained for PODXL (green), c-Kit (red), and nuclei (blue). Scale bar = 25 μm. (right) Dot plot showing percentage of colony-forming units (CFUs) per 96 well-plate of sorted PODXL/c-Kit- mature luminal (ML), PODXL/c-Kit+, and PODXL+/c-Kit+ luminal cells. Filled squares indicate PODXL+/c-Kit−/+. PODXL+ cells form significantly more colonies compared to ML cells (n = 4 biopsies). Error bars represent mean ± standard deviation (SD). *p < 0.05 by Kruskal–Wallis test with Dunn’s multiple comparisons test. Source data are provided as a Source Data file.

To further substantiate the analysis of maturation among clusters based on DEGs, we applied the lineage inference algorithm Slingshot46 in the search for a potential hierarchy in an unbiased and unsupervised manner. In short, this method identifies a trajectory based on a minimum spanning tree algorithm towards the most differentiated state. Cells that are placed closer to the beginning of the trajectory belong to an early time point in the lineage. Using our single-cell transcriptome data of luminal epithelial cells, Slingshot built several trajectories all starting in cluster 1.1 and ending in either clusters 2.1–2.3 or cluster 1.4, as summarized in Fig. 2d. Accordingly, estimation of pseudotime places the least differentiated cells in clusters 1.1–1.2 and the most differentiated cells of the luminal lineage in clusters 2.1–2.3 (Fig. 2d). These data were confirmed by geneset enrichment analysis showing that whereas clusters 1.1 and 1.2 are particularly high in genes involved in epithelium development (adj p < 0.001, Fig. 2e), suggesting a role upstream of epithelial differentiation within a hierarchy, clusters 2.1–2.6 are enriched for genes involved in anatomical structure morphogenesis (adj p < 0.0001, Fig. 2e). Additional gene sets in support of a hierarchical organization with respect to response to extracellular signaling, MAPK signaling, mammary gland development, nuclear receptor and ERBB signaling are highlighted in Supplementary Fig. 2b (adj p < 0.05). However, cluster 1.4 is somewhat of a conundrum by being the end of a separate trajectory never leaving the progenitor compartment (Fig. 2d). Therefore, in order to characterize this subcluster relative to the rest of cluster 1, we sought for a marker suitable for prospective isolation of subcluster 1.4 progenitors in a FACS-based protocol. We identified podocalyxin-like (PODXL) as an ideal candidate. PODXL is a gene encoding a member of the CD34 sialomucin protein family, which is expressed in hematopoietic stem cells47, and whose expression has been associated with basal-like breast cancer48. At the protein level, it is expressed at the apical surface of a subset of c-Kit+ luminal progenitors35 (Fig. 2f and Supplementary Fig. 2c), and PODXL+ cells from reduction mammoplasties can be readily identified in a FACS protocol with c-Kit included (Supplementary Fig. 2d). By this protocol we defined mature luminal cells as PODXL/c-Kit-, c-Kit+ progenitors as PODXL/c-Kit+, and PODXL+ progenitors as PODXL+/c-Kit+/−, respectively (Supplementary Fig. 2d). These three cell types were gated for and plated at clonal density in a colony formation assay (Supplementary Fig. 2e). Indeed, PODXL+ cells turned out to exhibit the highest colony-forming capability (Fig. 2f), confirming also at the functional level that cluster 1.4 represents a progenitor population.

Micro-collection- and cluster-based spatial mapping is confirmed by multicolor imaging in situ

In order to validate the scRNA-seq cluster profiling and the apparent enrichment of duct-derived progenitors in cluster 1.2 and late progenitors in cluster 1.4 at the protein level with particular emphasis on surface markers, we searched our antibody repository and identified CD55 or annexin A1 and SLC34A2, respectively, as promising candidates. In line with our scRNA-seq data, we have previously found ductal-enriched expression of K15 and heterogeneous distribution of c-Kit expression by immunostaining3,33. Here, in an effort to classify cluster 1.2 we found that K15, due to its higher expression level, was superior to K14, which we have otherwise used as a ductal progenitor marker3,21. Thus, here we compared cluster 1.2-associated markers CD55 or annexin A1 with K15 as well as a cluster 1.4-associated marker SLC34A2 with c-Kit in ten different human breast biopsies using multicolor imaging. As inferred from the scRNA-seq (Fig. 3a), the majority of biopsies showed strong co-staining of CD55 with K15 (7 out of 10 biopsies, Fig. 3b) and to some extent co-staining of annexin A1 with K15 (5 out of 10 biopsies) in ducts compared to TDLUs (Fig. 3b), while SLC34A2 essentially co-stained with c-Kit (10 out of 10 biopsies) in both ducts and TDLUs (Supplementary Fig. 3a). To assess whether CD55 added to the established c-Kit protocol28 for enrichment of progenitors, smears were recovered from different gate combinations of FACS and stained for K15 (Supplementary Fig. 3b). A significantly higher frequency of strongly K15+ (K15high) cells was seen from the combined CD55/c-Kit gate compared to c-Kit alone (Supplementary Fig. 3c). We have previously found evidence of progenitor heterogeneity by comparing c-Kit+ and CD146+ cells functionally21. Our present data add to these differentiation programs, since CD55 co-stained with CD146 in ductal luminal cells, while CD146 rarely overlapped with the cell-surface marker PODXL, which is a cluster 1.4-associated cell-surface marker as shown above (Supplementary Fig. 3d, e)—all in line with the scRNA-seq data. The lack of overlap between the preferentially ductal CD55 and PODXL did not only unfold in ducts. Rather, the strongest PODXL staining was seen in CD55neg TDLUs including the lobules proper (Supplementary Fig. 3f). Since we have previously shown that primarily the lobules and TDLUs are rich in hormone receptor-positive late progenitors and differentiated cells28,49, we here co-stained with PODXL and the surrogate marker, Ks20.8, of estrogen/progesterone receptor-positive cells28. Clearly, the hormone receptor-positive-cells were negative for PODXL (Supplementary Fig. 4a). Collectively, these data are in favor of the existence of a ductal luminal progenitor CD55+/K15high cell, which is phenotypically and functionally distinct from those of TDLUs as summarized in Supplementary Fig. 4b.

Fig. 3: Cluster 1.2-associated CD55, annexin A1, and K15 preferentially localize to ducts.
figure 3

a Feature plots showing highest expression of KRT15, CD55, and ANXA1 (annexin A1) in the area of cluster 1.2 and highest expression of KIT, SLC34A2, and PODXL in the area of cluster 1.4. b Representative fluorescence multicolor imaging of cryostat sections of normal breast stained for CD55 or annexin A1 (green) and K15 (red), and nuclei (blue). Arrows indicate CD55+/K15+ or annexin A1+/K15+ luminal cells. Images with higher magnification of selected areas (white squares) are shown below. Note that annexin A1 also stains myoepithelial cells. Annexin A1 and CD55 staining colocalizes with K15 preferentially in ductal luminal cells. Scale bars = 50 μm.

Comparison of duct and TDLU expression profiles with those of breast cancer subtypes

Finally, we investigated whether the identified single-cell transcriptome signatures overlapped with those of breast cancer. Gene expression and molecular characteristics of breast cancer have allowed the classification of breast cancer into several subtypes1,50. It has previously been shown that gene expression signatures of luminal progenitors are significantly correlated with basal-like breast cancer35. By adapting the method by Lim et al.35, we calculated the expression signature scores of ductal and TDLU luminal DEGs (Supplementary Data 3) and compared them to those of breast cancer subtypes51. As shown in Fig. 4a, the transcriptome signature of luminal cells in ducts exhibited an expression profile that was much more similar to those of the basal-like subtype of breast cancer than those of the luminal subtypes. The TDLU-derived luminal signature, on the other hand, showed a stronger compatibility with the luminal breast cancer subtypes (Fig. 4a). In addition, we asked which clusters aligned with the basal-like breast cancer subtypes using the established Prediction Analysis of Microarray 50 (PAM50) subtyping52, as well as a new subtyping classifier based on scRNAseq of breast cancer cells using the “SCSubtype” gene signatures53. Accordingly, PAM50 subtyping showed that clusters 1.1–1.4, in contrast to clusters 2 and 3, are more closely related to basal-like breast cancer (Supplementary Table 1). Likewise, upon comparison with SCSubtype gene signatures, clusters 1.2, 1.3, and 1.4 show a positive correlation with basal-like breast cancer (Supplementary Fig. 5). Recent molecular profiling has stratified basal-like breast cancer within triple-negative breast cancer (TNBC)54,55,56. We therefore further compared TNBC subtype gene signatures with clusters 1.1 to 1.4. Our analysis according to TNBC subtyping of DEGs among cluster 1, reveals that clusters 1.1–1.4 are related to basal-like 1/mesenchymal, basal-like 2/mesenchymal, basal-like 2 and immunomodulatory subtypes, respectively (Supplementary Fig. 6 and Supplementary Data 4). Intriguingly, assessment of the expression level of KRT15 in a dataset of 2,164 breast cancer biopsies subdivided according to the most widely used classification showed a significantly higher expression level in basal-like breast cancer compared to any of the other subtypes (Fig. 4b)57. While others have reported that K15 protein is expressed among TNBC, HER2 and Luminal A carcinomas58, we here sought to corroborate that K15 is a marker of basal-like breast cancer also at the protein level. In a series of TNBC biopsies, 22/36 (61%) biopsies stained positive for K14 and 9/36 (25%) of these stained positive for K15. All nine K15+ biopsies contained K14+/K15+ cells in addition to K14+ and K15+ cells (Fig. 4c). Taken together, the data suggest that progenitors with a ductal profile represent the most immature cell type within the luminal lineage of the human breast and a likely source of basal-like breast cancer.

Fig. 4: Correlation between expression profiles of duct- and TDLU-derived normal luminal cells and breast cancer subtypes.
figure 4

a Whisker box plots of expression signature scores comparing DEGs of duct- (left) and TDLU-derived (right) luminal cells with gene expression profiles of different breast cancer subtypes (basal-like (Basal), normal-like (Normal), Claudinlow (Claudin low), HER2-enriched (Her2), luminal type B (LumB), and luminal type A (LumA)). Boxes represent median ± quantiles. Whiskers indicate the minimum and maximum values. **p < 0.01 by Kruskal–Wallis test with Dunn’s multiple comparisons test. b Whisker box plot showing KRT15 gene expression in the different breast cancer subtypes. KRT15 is significantly upregulated in basal-like breast cancer compared to the other breast cancer subtypes (n = 2164, GENT2 database). Boxes represent median ± quantiles. Whiskers indicate the minimum and maximum values. ****p < 0.001 by one-way ANOVA with Tukey’s multiple comparison test. Source data are provided as a Source Data file. c Representative fluorescence multicolor imaging of cryostat sections of basal-like breast carcinomas stained for K15 (red), K14 (green), and nuclei (blue). Nine out of 36 basal-like breast carcinomas stained positive for both K15 and K14. Scale bar = 50 μm.


The present work demonstrates that the two major segments of the human breast ductal tree, i.e., the ducts and the TDLUs, are specifically enriched for cells reminiscent of the major breast cancer subtypes. This agrees with clinical data that duct- and TDLU-derived breast carcinomas exhibit unique histological and radiological appearances as well as clinical outcomes15. Our study opens for precision cell of origins comparisons with breast cancer subtypes. For example, we find that within the luminal epithelial lineage the most immature progenitors are characterized by the expression of basal-like breast cancer-associated K15 and a localization preferentially to ducts. Furthermore, we provide a proof of principle that luminal progenitors close to the apex of the hierarchy can be isolated by new combinations of surface markers revealing progenitors lending themselves to mechanistic studies of breast cancer subtype specific transformation and evolution. In the present study, scRNA-seq is based on biopsies from three age-matched young women. While this may serve as a starting point, importantly, we and others have shown that aberrant basal-like luminal cells accumulate with age25,59,60, thus implying that further studies of the resemblance between the transcriptomic profiles of normal luminal breast cells and breast cancer should take the age of the donors into account.

It is becoming increasingly clear not least by scRNA-seq that the luminal epithelial compartment consists of a multitude of progenitors and differentiated cells exhibiting molecular profiles overlapping with breast cancer subtypes18,25,61,62,63,64. Nevertheless, as far as cell of origin of breast cancer is concerned, the current view is centered around an estrogen receptor-negative, c-Kit-positive progenitor, which constitutes a relatively large proportion of cells widely distributed along the entire ductal-lobular tree35,65,66. This concurs with the widely held notion in the mouse mammary gland field based on early transplantation experiments that any part of the gland can give rise to the entire complement of the ductal-lobular tree if transplanted to a cleared fat pad (for review see67). Therefore, the alternative explanation to breast cancer subtypes is that they depend on the order and magnitude of genomic aberrations rather than distinct cells of origin, which is furthered here (for review see ref. 67). Indeed, early studies based on X-chromosome inactivation patterns showed that an entire duct and ductal-lobular unit had the same genomic constitution68. Therefore, the differences recorded in the present and previous studies are most likely governed by microenvironmental cues eventually leading to spatially determined, more permanent states of immaturity or differentiation12,13,14,69.

The present work highlights the value of ductal expression of K15 as a marker of an immature progenitor cell zone. K15 expression has been widely used as a biomarker for epithelial stem cells70,71,72. The antibody clone used here against K15 (LHK15) has been exhaustively shown to be entirely specific for K1573. However, the protein expression pattern in ducts of the human breast appears to be broader than what one would expect from cells near the apex of a differentiation hierarchy43. In general, human breast stem cells are believed to reside in the basal layer12,74. K15 is, however, expressed by luminal cells and not basal cells and as such it may have another function more relevant for progenitor or even differentiated cells. Analogously, in the esophagus, K15+ progenitors are found in the non-stem cell suprabasal layer (for review see also ref. 75). The fact that K15 in some tissues stain stem cells and in other tissues stain their progeny has been explained by different mechanisms of K15 regulation: a differentiation-specific mechanism involving the PKC/AP1 pathway and a basal-specific mechanism mediated by FOXM176. Therefore, K15 expression in the normal breast undoubtedly covers more progenitors than those suspected of being cells of origin to breast cancer.

We also discovered additional markers of human breast progenitors. One of these, CD55, is enriched in cluster 1.2, spatially mapped primarily to ducts, and expressed coordinately to a great extent with K15. CD55 is a glycosylphosphatidylinositol-anchored protein, which regulates complement activation pathway, and it is also referred to as decay-accelerating factor 1 (Daf1). In human breast cancer cell lines, it renders cells resistant to apoptosis and thus facilitates tumorigenesis77. However, its function in normal breast is not known. A recent scRNA-seq study in the mouse mammary gland suggests CD55 as an early progenitor subset marker in the basal compartment and a marker of luminal transit cells during development78. In the adult mouse gland, CD55+ cells were found exclusively in the luminal epithelial compartment78. In addition, based on colony-forming assays, the CD55+ cells exhibited an about three times higher colony-forming capacity as an indication of their progenitor status78. Our colony formation data including c-Kit+ and PODXL+ luminal cells representing cells in group 1 of clusters are in good agreement with a progenitor status of this entire group of cells. We have previously shown a progenitor potential of CD146+ cells21. Whereas CD55 adds to the value of c-Kit for identifying progenitors in this compartment, the PODXL+ cells seem to mark a separate progenitor compartment. PODXL expression is of particular interest since others have reported that high PODXL expression is associated with higher risk categories, and that breast carcinomas with high PODXL expression are more likely to exhibit characteristics of basal-like cancers48. In support of such association between PODXL expression and prognosis, silencing of PODXL expression in a basal-like human breast cancer cell line reduces primary tumor formation and metastasis79, and analysis of the EMT program in an immortalized transformed human breast cell line reveals PODXL as a key promotor of extravasation80. The finding of PODXL primarily outside the CD55+ compartment combined with functional progenitor activity implicates the existence of a residential progenitor zone within the lobules. This may be at variance with our previous findings of limited colony-forming activity in cells from TDLUs3. It is possible, though, that the differences can be explained either by the use of different culture media in the two studies or that in the present study, both ductal and TDLU-derived PODXL+ cells contribute to the CFU activity. Importantly, however, we here find good agreement between staining, FACS and mRNA as far as CD146, CD55 and PODXL are concerned. Nevertheless, this does not exclude that broader populations of cells may be captured in particular at the protein level depending, e.g., on the threshold setting of detection81.

Collectively, we here provide evidence for a hitherto underappreciated spatial distribution of luminal progenitors and unravel a progenitor cell population in the ducts of the human breast with resemblance to basal-like breast cancer. These findings emphasize the relevance of cells of origin in breast cancer in general and pave the way for further investigation of the development and progression of basal-like breast cancer in particular.


Human tissue

The use of human tissue has been approved by the Scientific Ethical Committee of Region Hovedstaden and the Danish Data Protection Agency with reference to H-2-2011-052 and 2011-41-6722, respectively, and patients agreed to donate tissue by written consent. Normal breast tissue was acquired from 29 female donors undergoing reduction mammoplasty for cosmetic purposes. Donors remain anonymous except their ages at the time of surgery. 36 breast carcinoma specimens were donated by women undergoing mastectomy for primary breast cancer. Tissue was cut into pieces for cryo-sectioning or cut finely prior to dissociation using 900 U/ml collagenase solution (Worthington Biochemical) in DMEM/F12 (Gibco) supplemented with 2 mM glutamine and 50 μg/ml gentamycin (Biological Industries) to release epithelial organoids, upon collagenase digestion comprised of epithelium and adjacent stromal cells82, which were then stored in liquid nitrogen with 90% fetal bovine serum (F7524, Sigma-Aldrich) and 10% dimethyl sulfoxide (D2650, Sigma-Aldrich), which we find, is the optimal condition for freezing, thawing and survival83. Some of the biopsies used in this study have been included in previous studies21,23,28,84.

Fluorescence-activated cell sorting (FACS)

Primary breast organoids were used for micro-collection (collection under the microscope) of ducts and TDLUs under the Leica DMIL microscope3,12,23. Organoids were dissociated using 0.25% trypsin in 1 mM EDTA (Sigma, E5134), following resuspension in HEPES buffer (Sigma, H3375) and filtering with a 100 μm filter12,22. Details regarding antibodies including the dilutions and catalog numbers used for all experiments included are summarized in Supplementary Table 2. Hereafter, samples were incubated at 4 °C with antibodies against an epithelial marker, TROP2 (brilliant violet (BV) 421 or BV510 conjugated) and a myoepithelial marker, CD271 (PE or APC conjugated) when sorting cells for scRNA-seq or with additional antibodies against c-Kit (BV421 or PE conjugated) and PODXL when cells were used for colony formation assays. Secondary antibody BV421-anti mouse IgM was added for staining the non-conjugated PODXL primary antibody. The secondary antibody was added to the control. For the comparison of PODXL and CD146, following antibodies were used; a luminal marker, EpCAM (CD326, BV786 conjugated), CD271 (PE conjugated), CD146 (AF647 conjugated) and PODXL, which is mentioned above. To sort luminal cells according to their expression of CD55 and c-Kit, TROP2 (BV510 conjugated), CD271 (APC conjugated) were used together with c-Kit (PE conjugated) and unconjugated CD55 in combinations with BV421-anti mouse IgM. After incubation, cells were washed twice with HEPES buffer and filtered through a 20 μm filter (BD, 340624) following addition of 1 μg/ml Fixable Viability Stain 780 (BD Horizon, 565388). Cell sorting was performed using the FACSAria™ Fusion Cytometer (BD) or the BD FACSAria™ III Cytometer with a 100 μm nozzle and prior multicolor compensations.

Single-cell RNA sequencing

Organoids representing ducts and TDLUs (30–50 of each from each biopsy) from normal breast tissue of three age matched (18 years old) individuals were micro-collected and TROP2+/CD271 luminal cells of each sample were sorted using FACS as described above. The myoepithelial cells were used in a separate study23.

Chromium Single-Cell 3’ Reagent Kits were employed for single-cell transcriptome sequencing. Version 2 (first two donors, PN-120237, PN-120236, PN-120262) or version 3 (last donor, PN-1000075, PN-1000073) of the kits were used for RNA isolation, cDNA amplification and library preparation. Hereafter, the Illumina® NextSeq500/550 High Output Kit v2 for 150 cycles (20024907) was used according to the manufacturer’s instructions for sequencing. Resulting files were demultiplexed and aligned to the human reference genome GRCh38-1.2.0.pre-mRNA. Hereafter, the data were filtered and barcodes as well as unique molecular identifiers were counted (10x Genomics Cell Ranger software). The R package “Seurat” version 3.0 was used for quality control, pre-processing (filtering, normalization, integration) and data analysis (clustering, data visualization, detection of differentially expressed genes)26. During filtering, cells with a feature count out of a range between 200 and 2000 as well as with a mitochondrial count higher than 10% were excluded. In order to analyze samples of three different donors together, datasets were integrated according to the “Integration and Label Transfer” tutorial of Seurat26. The equal contribution of biopsies to each cluster and cluster entropies were calculated to confirm successful integration of the data with an adapted function of the R package “Conos” (version 1.4.5)85. After clustering, the dimensionality reduction technique UMAP was applied to the whole dataset for visualization. DEGs for clusters were defined using a log2 fold-change (FC) cutoff of 0.1, and a threshold of 1% for the relative number of cells expressing the gene in the given group. Finally, only statistically significant genes (adjusted p-value < 0.05, Wilcoxon rank sum test together with Bonferroni correction for multiple testing) were used for further analysis. The in silico human surfaceome was utilized to search for genes encoding cell-surface proteins32.

Comparison of DEGs to molecular breast cancer subtype classifiers

DEGs between ductal and TDLU luminal cells (Supplementary Data 3) were compared to the gene expression profile of breast cancer subtypes using a method that calculates “expression signature scores” based on log2FCs of DEGs in a geneset and gene expression values of the same genes in a reference dataset.35. Gene expression data of breast cancer were extracted from the Gene Expression Omnibus (GEO) with accession number GSE3165 and the platform GPL887 explained in more detail by Herschkowitz et al.51. The expression values were background-corrected and normalized by loess normalization using the R packages “affy” (version 1.68.0) and “limma” (version 3.46.0)86,87. An expression signature score was calculated for each combination of DEGs of ductal to TDLU and samples of breast cancer subtypes (Basal: n = 28, Normal: n = 6, Claudin low: n = 6, Her2: n = 14, LumB: n = 17, LumA: n = 22). Used values were the log2FC of marker genes in our single-cell sequencing data, including markers with a negative fold-change, and the expression value of the same genes in the breast cancer samples. A high log2FC as well as a high expression of a gene in the breast cancer subtype resulted in a larger expression signature score. Thus, high scores indicate a high similarity of the single-cell cluster with the cancer sample. Kruskal–Wallis one-way analysis of variance with Dunn’s multiple comparisons test was used for statistical analysis. In order to compare to molecular subtypes based on PAM50 gene signature52, we employed R package “genefu” (version 2.26.0)88 and estimated the probability of each cluster belonging to each subtype. For comparison with a single-cell method of breast cancer subtype classification “SCSubtype”53, expression signature scores were computated based on similarity between cluster-specific DEGs (Supplementary Data 1) and SCSubtype gene lists according to Lim et al.35. To further examine the basal-like breast cancer signatures within clusters 1.1 to 1.4, Lehmann´s TNBC gene signatures55 were compared to DEGs among clusters 1.1, 1.2, 1.3, 1.4 (Supplementary Data 4) using Lim et al.’s method above. Since only the gene lists and not the actual expression values were available for SCSubtype and TNBC gene signatures, we set a value of 1 as upregulated genes, while a value of –1 was used for downregulated genes for the calculations of expression signature scores.

Pathway analysis

The functional enrichment analysis was performed using g:Profiler (version e102_eg49_p15_7a9b4d6, with a significance threshold of 0.05 after correcting for multiple testing89. Among the DEGs for cell clusters (Supplementary Data 1), genes of log2FC > 0.5, with >1.5 fold-change of the percentage of cells expressing the gene in a cluster compared to the rest of cells were tested for the “g:GOSt functional profiling” function and genes were sorted descendingly by log2FC.

Trajectory inference

By use of a lineage inference tool, “Slingshot” (version 1.8.0)46, we calculated a differentiation trajectory and pseudotime of luminal cells. Cells from cluster 0 (immune cells) were excluded from the analysis. The values from the principal component analysis (PCA) and Seurat-annotated clusters were used as input. Neither starting nor ending clusters were pre-defined and trajectories were identified in an unsupervised manner.

KRT15 gene expression in breast cancer subtypes and in mice

KRT15 gene expression in breast cancer subtypes was retrieved from the gene expression database of normal and tumor tissues (GENT2)57. Gene expression data of Krt15 in the mouse mammary gland was obtained from Tabula Muris, a compendium of scRNA-seq data derived from mouse tissue44.

Colony formation assay

Using FACS as described above, PODXL+, KIT+ and KIT-/PODXL cells were sorted into BioCoat™ Collagen I 96 Clear Well Plates (Corning, 354407) with 5 cells per well (1 cell/6.4 mm2). Cells were incubated for 3 weeks in TGFβR2i medium ((DMEM (Gibco, 21068028) and Ham’s F12 Nutrient Mix (Gibco, 21765029) 3:1 with 2 mM glutamine (Sigma, G7513), 5% FCS (Sigma-Aldrich, F7524), 0.5 μg/ml hydrocortisone (Sigma-Aldrich, H-0888), 5 μg/ml insulin (Sigma-Aldrich, I6634), 10 ng/ml cholera toxin (Sigma-Aldrich, C-8052), 10 ng/ml EGF (Peprotech, AF-100-15), 180 μM adenine (Sigma-Aldrich, A3159), 10 μM Y-27632 (AbMole BioScience, M1817), 5 nM amphiregulin (Peprotech, 100-55B), 25 μM RepSox (Sigma-Aldrich, R0158), and 10 μM SB431542 (Axon Medchem, 1661))21,28. Hereafter, cells were fixed with methanol (VWR Chemicals, 20847.307) for 5 min at –20 °C and stained with hematoxylin (Sigma-Aldrich, MHS16). For representative images, cells were stained with 0.4% crystal violet (Sigma Life Sciences, C6158) in 1:1 PBS and 96% ethanol. The number of wells with colonies was counted under a microscope (Leica DM2000), using a cutoff of 15 cells to define a colony. The assay was performed with four different biopsies.

Immunohistochemistry and immunocytochemistry

Snap-frozen normal breast biopsies from reduction mammoplasties and archival TNBCs, characterized as such as estrogen receptor-negative, progesterone receptor-negative and human epidermal growth factor receptor 2-low/negative, K5-positive and/or K17-positive, were cut into 6 μm thick sections using a CryoStar NX50 cryostat (Thermo Scientific). Cryostat sections and cell smears upon FACS were fixed either with methanol (VWR Chemicals, 20847.307) for 5 min at –20 °C, or with 3.7% formaldehyde (Merck, 104002) for 10 min at room temperature following permeabilization with 0.1% Triton X-100 (Sigma-Aldrich, X-100)3,28. Sections were incubated with primary antibodies for 2 h (Supplementary Table 2) and secondary antibodies for 30 min, with PBS washes in between. Finally, ProLong™ Gold anti fade reagent with DAPI (Molecular Probes, P36934) was applied. Images were acquired using confocal microscopes (Leica DM5500B equipped with a DFC550 camera, or a Zeiss LSM710 confocal system). Staining results were strictly dependent on the fixation protocol and dilutions of antibodies as specified in Supplementary Table 2. For the quantification of K15high cells in smears, ImageJ (version 1.53k) was used with a lower threshold of 20 for detection of cells with intense K15 staining. The archival breast carcinomas that had been characterized as TNBCs84 were stained for K5 and K17, before co-staining for K14 and K15.

Statistical analysis

All statistical analyses were performed using the softwares R Studio (version 1.2.5001 and version 3.6.2) or GraphPad Prism (version 9.0.0). Data were tested for normal distribution using Shapiro–Wilk and Kolmogorov–Smirnov tests. Tests to determine significant differences between datasets were chosen separately for each experiment and are specified in the figure legends. Significance is indicated as follows: p < 0.05*, p < 0.01**, p < 0.005***, p < 0.001****.