Comprehensive characterization of claudin-low breast tumors reflects the impact of the cell-of-origin on cancer evolution

Claudin-low breast cancers are aggressive tumors defined by the low expression of key components of cellular junctions, associated with mesenchymal and stemness features. Although they are generally considered as the most primitive breast malignancies, their histogenesis remains elusive. Here we show that this molecular subtype of breast cancers exhibits a significant diversity, comprising three main subgroups that emerge from unique evolutionary processes. Genetic, gene methylation and gene expression analyses reveal that two of the subgroups relate, respectively, to luminal breast cancers and basal-like breast cancers through the activation of an EMT process over the course of tumor progression. The third subgroup is closely related to normal human mammary stem cells. This unique subgroup of breast cancers shows a paucity of genomic aberrations and a low frequency of TP53 mutations, supporting the emerging notion that the intrinsic properties of the cell-of-origin constitute a major determinant of the genetic history of tumorigenesis. Claudin-low tumors are a rare aggressive subtype of breast cancers. In this study, the authors use a multiomics approach to demonstrate that these tumors are heterogeneous and comprise three main subgroups that emerge from different evolutionary processes.

B reast cancer is a highly heterogeneous group of diseases with variable biological and clinical behaviors. Providing early insights into this diversity, gene-expression profiling analyses initially resulted in the identification of four clinically relevant molecular subtypes, known as intrinsic subtypes (luminal A, luminal B, HER2-enriched and basal-like, according to the PAM50 classification) mostly corresponding to hormone receptor and HER2 status, and a normal breast-like group [1][2][3][4] . Among these intrinsic subtypes, the basal-like subtype appears to be the most distinct, as it is characterized by the unique expression of cytokeratins typically expressed by the basal layer of the skin and a very low level of expression of luminal-related genes 2,5 . This observation led to the hypothesis that breast cancers may arise from the transformation of two distinct cell types of origin or developmental stages of mammary epithelial cell development, one generating basal-like tumors and the other non-basal-like malignancies 5 . Although appealing in its simplicity, this model overlooks the potential reprogramming of lineage-restricted populations initiated by an oncogenic event or a microenvironmental signal over the course of tumor development [6][7][8] . Moreover, it does not account for the intrinsic diversity of each breast cancer subtype, notably basal-like tumors known for their great heterogeneity [9][10][11] . An additional intrinsic subtype of breast cancers, known as claudin-low, has recently been identified in human and mouse tumors and in breast cancer cell lines, showing several common features with basal-like tumors and reflecting the diversity of tumors with a low luminal differentiation status 4,12 .
Basal-like and claudin-low tumors form the majority of triplenegative breast cancers (TNBCs), an aggressive subgroup of breast malignancies defined as tumors lacking expression of the estrogen receptor (ER), progesterone receptor (PR), and HER2. A hallmark of the claudin-low subtype is the low expression level of critical cell-cell adhesion molecules, including claudins 3, 4, and 7, occludin, and E-cadherin. Tumors of this subtype are highly enriched in mesenchymal traits and stem cell features and are therefore considered as the most primitive breast cancers 13 . Although in vivo preclinical data suggested that basal-like tumors arise from the transformation of a luminal progenitor (LP) 14 , the putative cell-of-origin of claudin-low tumors remains unknown. The prevailing hypothesis is that, over the course of tumor progression, basal cancer cells undergo an epithelial-mesenchymal transition (EMT) in response to acquired oncogenic events and/ or microenvironmental signals, thereby gaining mesenchymal and stemness features [15][16][17] . An alternative hypothesis is that claudin-low tumors originate from an early epithelial precursor with inherent stemness features 4,13 . To gain further insight into the developmental origin of claudin-low tumors, we first sought to analyze their genomic architecture. Indeed, we have recently demonstrated that, whereas the oncogene-driven transformation of mature luminal (mL) cells and progenitor luminal cells triggers massive oncogene-induced DNA damage and an early onset of chromosomal instability (CIN), normal human mammary stem cells (MaSCs) can withstand an aberrant mitogenic activity 18 . The endogenous expression of the ZEB1 EMT-inducing transcription factor prevents replication and oxidative stress, leading to a process of malignant transformation in the absence of exacerbated genomic instability 18 . As basal-like breast cancers generally exhibit numerous genomic aberrations, we thus speculated that the extent of genomic aberrations might be used as a molecular indicator of the developmental origin of claudin-low breast cancers.
To comprehensively characterize the claudin-low breast tumor subtype, we used a multi-omics approach that reveal the existence of three distinct subgroups with specific transcriptomic, epigenetic, and genetic characteristics, strongly supporting the hypothesis of different cells-of-origin.

Results
Selection of claudin-low tumors. Claudin-low tumors exhibit marked immune and stromal cell infiltration, when compared with all other breast cancer subtypes 4,19 . As gene expression analyses do not allow to accurately discriminate mesenchymal cancer cells from normal stromal cells, we used the allele-specific copy number analysis of tumors (ASCAT) copy number-based tumor purity estimation method to assess tumor cell fraction 20 ( Supplementary Fig. 1a). To avoid any substantial bias due to non-tumor cell contamination, a stringent purity threshold was determined by applying Wilcoxon test, for which the degree of contamination of claudin-low tumors was not statistically different from that of non-claudin-low tumors, thus enabling a comparable distribution of tumors ( Supplementary Fig. 1b). Using this stringent threshold, 45 out of 152 tumors classified as claudin-low from Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) were selected for further studies. Demonstrating an unexpected degree of diversity, only 35.6% of these tumors were classified as TNBCs ( Supplementary  Fig. 1d). This result was not due to the purity selection, as the percentage of TNBCs within all claudin-low tumors was similar before applying the purity threshold ( Supplementary Fig. 1c). Of note, only four normal-like tumors were estimated as pure. Due to the lack of statistical power, they were excluded from subsequent analyses.
Identification of CNA-devoid claudin-low tumors. We next analyzed claudin-low tumors using the stratification based on gene expression and copy number alterations (CNAs), previously defined by Curtis et al. 21,22 (Fig. 1a). As expected for aggressive neoplasms, basal-like tumors mostly belonged to the integrative cluster 10, characterized by multiple CNAs affecting most chromosomes. Luminal A and B tumors were distributed across clusters 1, 2, 3, 6, 7, 8, and 9, whereas HER2-positive tumors were mainly found in cluster 5. Interestingly, claudin-low tumors were stratified into three main clusters, again demonstrating a significant heterogeneity. Only 17.8% were found in cluster 10, whereas 48.8% and 17.8% were found in integrative clusters 4 and 3, respectively, regardless of tumor purity when compared with non-claudin-low tumors ( Supplementary Fig. 2). Both integrative clusters 3 and 4 are marked by a paucity of copy number and cisacting alterations. Integrative cluster 3 displays a simplex pattern of rearrangements, dominated by whole arm gain of chromosome 1q and 16p and loss of 16q 11,21,22 . It was initially described as essentially composed of luminal breast tumors. Integrative cluster 4 includes both ER-positive and ER-negative cases. It was defined as a CNA-devoid subgroup, with an expression profile dominated by immune-related genes, leading to the hypothesis that it may be essentially composed of samples with a high infiltration of normal cells 11 . As the method used in the present study to select tumors with a high index of purity does not rely on gene expression patterns, it argues against this hypothesis. Nevertheless, to formally demonstrate the existence of claudin-low tumors with a flat genomic landscape, we analyzed the genomic architecture of tumor samples exhibiting a very low level of aberrations (fraction of genome altered, FGA < 1%). After having previously removed germline copy number variations (CNVs), we analyzed B allele frequency (BAF) and log R ratio (LRR) profiles of singlenucleotide polymorphisms (SNPs) localized in genomic regions of deletion (LRR segment mean cutoff value < −0.4) ( Fig. 1b and Supplementary Fig. 3a, c). This analysis led to the identification of regions of single-copy allelic loss displaying a clear separation between BAF values (<0.8 and >0.2). As previously demonstrated, this observation is incompatible with a massive contamination of the tumor tissue by normal cells 23 . Consistent with these data, the analysis of somatic mutations highlighted the presence of several point mutations with a variant allele frequency higher than 0.4 ( Fig. 1c and Supplementary Fig. 3b). These two results unequivocally demonstrate the existence of claudin-low tumors that develop in the absence of gross CIN.
Genomic diversity of claudin-low tumors. Although most claudin-low tumors are diploid (Fig. 2a), the extent of FGA was highly variable across tumors (Fig. 2b), confirming their intrinsic diversity 24,25 . The use of Gaussian finite mixture models revealed a trimodal distribution, reflecting the existence of three subgroups of claudin-low tumors relative to their level of FGA (Fig. 2c, d). The subgroup 1 of claudin-low tumors (CL1), defined by a low FGA (<10%), was mostly composed of ER-negative tumors that were all stratified into integrative cluster 4 ( Fig. 2d-f). The subgroup 2 (CL2) showed an intermediate level of FGA (>10% and <30%), similar to that of luminal A breast tumors. Consistent with this finding, CL2 was mostly composed of ER-positive tumors mainly stratified in two luminal-related clusters, whereas 35% of CL2 tumors belonged to cluster 4. The subgroup 3 (CL3) displayed a high FGA (>30%), similar to the one found in basallike breast cancers. Consistent with this notion, 33% of CL3 tumors were classified into the genomically unstable cluster 10. However, this subgroup was heterogeneous, with both ERnegative or ER-positive tumors, and a significant fraction of CL3 tumors were found in luminal-related clusters and in cluster 4. Of note, although these analyses were performed on tumors selected for their high tumor purity, similar data were obtained when studying the whole cohort of claudin-low tumors ( Supplementary  Fig. 4), further strengthening the characterization of the three claudin-low subgroups.
Distinct gene expression signatures of claudin-low subgroups. Gene-set enrichment analysis (GSEA) was next used to identify pathways differentially regulated among the three claudin-low subgroups, by comparing one with the other two (CL1 vs. CL2/3, CL2 vs. CL1/3, and CL3 vs. CL1/2). More than 15,000 pathways, available in H (Hallmarks), C2 (encompassing curated gene sets from literature and canonical pathways such as KEGG, REAC-TOME, or BIOCARTA), and C5 (comprising GO pathways) gene sets from Molecular Signature Database (MSigDB), were tested. These analyses highlighted as follows: (i) an enrichment in stem cell-related signatures in CL1, (ii) an enrichment in luminalrelated signatures in CL2, and (iii) an enrichment in cell cyclerelated pathways in CL3 tumors (Fig. 3a). Of note, global gene expression signatures demonstrated that the CL2 and CL3 subgroups showed lower luminal-and basal-related signature scores, respectively, compared with luminal-and basal-like breast tumors, while exhibiting a significant enrichment in invasiveness-related signatures ( Supplementary Fig. 5a, b). We then performed single-sample GSEA (ssGSEA) analyses to determine the differentiation profile of the three claudin-low   subgroups. Gene expression signatures for MaSCs, LPs, and mL cells were generated based on the transcription profiles of subpopulations of normal human mammary epithelial cells isolated from reduction mammoplasty tissues, as shown in Morel et al. 18 (Supplementary Data 1). Consistent with previous GSEA analysis, CL1 tumors showed a strong enrichment in the MaSC signature, whereas CL2 tumors exhibited a mL signature (Fig. 3b). Again, CL3 tumors showed a significant heterogeneity with a trend toward the LP signature. Notably, very similar data were obtained when using an alternative set of MaSC, LP, and mL signatures 14 (Fig. 3c).
Identification of a gene expression-based classifier. We next generated a gene expression-based classifier by using the nearest shrunken centroid method 26 , with the objective of discriminating claudin-low subgroups. A claudin-low gene list of 137 genes discriminating the 3 claudin-low subgroups was thus created (Supplementary Fig. 6a, b). This expression-based classifier showed an accuracy of 91% when compared with the FGA-based classification using Gaussian finite mixture models (Supplementary Fig. 5c, d). The classifier was then applied to claudin-low samples from The Cancer Genome Atlas Network (TCGA; https://www.cancer.gov/about-nci/organization/ccg/research/ structural-genomics/tcga) and the Cancer Cell Line Encyclopedia (CCLE 27 ), and the level of FGA was used as a readout to verify the characteristics of each claudin-low subgroup. Validating the accuracy of our classifier, CL1, CL2, and CL3 tumors and cell lines showed low, moderate, and elevated FGA levels, respectively (Supplementary Fig. 6e, f). Of note, we performed a pathway analysis of CL1 discriminant genes that revealed an enrichment in stem cell and pediatric cancer markers ( Supplementary Fig. 6g).
Distinct gene methylation profiles of claudin-low subgroups. Differentially methylated gene analysis was next performed in claudin-low subgroups followed by pathway-enrichment analysis, using over 15,000 pathways available in MSigDB (HALLMARKS, C2, and C5 gene sets) ( Supplementary Fig. 7). When compared with CL2 and CL3, the CL1 subgroup presented 40 differentially methylated genes, and among them 75% were hypomethylated and enriched in stemness and EMT markers (Fig. 4a). The CL2 subgroup encompassed 68 differentially methylated genes compared with CL1 and CL3, and among them 88% were hypermethylated and enriched in basal-like and EMT pathways (Fig. 4b). Finally, the CL3 subgroup was characterized by 219 differentially methylated genes compared with CL1 and CL2; among them, 58% were hypermethylated and related to luminal pathways and 42% presented a hypomethylation profile and were associated with basal-like pathways (Fig. 4c). To correlate these distinct gene methylation profiles with gene expression, we performed a gene expression analysis of stemness and EMT gene markers in all three claudin-low subgroups and in the intrinsic subtypes of breast cancers from METABRIC, TCGA, and CCLE cohorts. The CL1 subgroup had the highest score for stemness-(ALDH1A1, PROCR) and EMT-(mesenchymal markers: ZEB1,   Fig. 2 Claudin-low tumors form a genomically and molecularly heterogeneous subgroup. a Ploidy for each molecular subtype of the METABRIC cohort. Claudin-low tumors are mainly diploid. b FGA% for each molecular subtype. Claudin-low tumors display overall low levels of genomic alterations. c Gaussian mixture model application on FGA distribution across claudin-low tumors reveals three CNA-related subgroups of tumors. Each bar on the x axis corresponds to one claudin-low tumor. d FGA% in claudin-low subgroups compared with other molecular subtypes. e Integrative clusters and f breast cancer receptor status distribution in each claudin-low subgroup. The CNA-devoid CL1 subgroup is associated with IntClust4 and TN status, whereas CL2 and CL3 are respectively related to luminal (luminal clusters and hormone receptor expression) and basal-like (IntClust10 and TN  VIM, CDH2; epithelial markers: EPCAM, CDH1) related genes, whereas CL2 and CL3 displayed an intermediate score compared with CL1 tumors and with their relative luminal/basal counterparts ( Supplementary Fig. 8).
Distinct oncogenic pathways in claudin-low subgroups. To characterize the biological pathways driving the development of the three subgroups of claudin-low tumors, we determined, for each breast tumor sample from METABRIC and TCGA cohorts, ssGSEA scores for all MSigDB hallmark gene-set signatures. We then conducted an unsupervised clustering based on the median score of each of the 48 biological processes computed per breast molecular subtype (Fig. 5). Interestingly, in both datasets, the three claudin-low subgroups clustered together in a specific branch of the dendogram, which further discriminated CL1 from the CL2/CL3 tumors, highlighting the specific biological processes distinguishing claudin-low from non-claudin-low and CL1 from the CL2/CL3 tumors. Furthermore, unsupervised pathway clustering identified specific biological functions differentially enriched in claudin-low, basal, and luminal breast tumors. Among the pathways specifically enriched in all claudin-low subgroups, some of them were linked to immunity function, supporting the hypothesis of privileged infiltration by immune cells within the microenvironment of claudin-low tumors 4,19 . Additional gene expression analyses performed on METABRIC and TCGA cohorts using two different cellular signature deconvolution algorithms confirmed that these tumors were highly infiltrated by immune cells (Supplementary Fig. 9a-d). Nevertheless, unsupervised clustering of microenvironment signatures (immune and non-immune cell subsets) did not allow to discriminate the different claudin-low subgroups ( Supplementary Fig. 9e-h). The additional pathways specifically enriched in all claudin-low subgroups included the TP53-dependent pathway, the mitogenactivated protein kinase (RASMAPK) signaling pathway and a differentiation/EMT-related pathway (Fig. 5). CL1 presented the highest enrichment in claudin-low-related pathways, when compared with all other subtypes. Of note, the CL2 subgroup displayed a concomitant enrichment in claudin-low and luminal pathways, whereas the CL3 subgroup was enriched in claudin-low and basal pathways (Fig. 5).
To validate these observations, we generated ssGSEA scores for the pathways previously identified in Fig. 5 (EMT, RAS/MAPK, proliferation, estrogen, and DNA damage response (DDR)), using MSigDB C2 curated gene sets (KEGG, GO, REACTOME…). The same pathways were also analyzed at the protein level using reverse-phase protein array (RPPA) data from TCGA 28 . Supporting our previous results, CL2 and CL3 showed, among the three claudin-low subgroups, the highest score for luminal-(estrogenrelated response) and basal-related (proliferation and DDR) signatures, respectively ( Supplementary Fig. 10  compared with their related molecular subtype. Furthermore, both CL2 and CL3 showed an activation of the EMT process ( Fig. 6a, b, g), coherent with the downregulation of hormone (in CL2) and proliferation/DDR (in CL3) signatures, compared with luminal and basal tumors, respectively (Supplementary Fig. 10 and Fig. 6g). Moreover, CL1 displayed a low proliferation activity and a low DDR ( Supplementary Fig. 10a, b, e, f), as well as a strong TP53 signaling pathway signature associated with a low frequency of TP53 mutations (Fig. 6c, d, g). Finally, CL1 exhibited a high activation of the RAS-MAPK signaling pathway, a feature also found in CL2 and CL3, although at a lower level (Fig. 6e-g).
Consistent with these findings, claudin-low cell lines were significantly more sensitive to MEK inhibitors (MEKi) than non-claudin-low cell lines (Fig. 6h).
Altogether, our data strongly support the notion that the CL1 subgroup of claudin-low tumors was related to normal MaSCs, CL2 to normal mL cells and to luminal breast cancers, and CL3 to basal-like breast cancers. Unstratified claudin-low tumors have been previously associated with poor prognosis 4,29 . When stratified by CL subgroups, variations in survival outcome were found, consistent with their presumed origin (Supplementary Fig. 11). Indeed, the basal-like-related CL3 subgroup of  Fig. 4 Claudin-low subgroups show distinct gene methylation profiles relative to their differentiation status. Heatmaps and pathway-enrichment analysis of differentially methylated genes between a CL1 vs. CL2 and CL3 tumors from TCGA, b CL2 vs. CL1 and CL3 tumors from TCGA, and c CL3 vs. CL1 and CL2 tumors from TCGA. CL1 tumors mainly display hypomethylated genes that are related to a stemness phenotype, CL2 tumors are overall hypermethylated for genes associated with basal-like features, and CL3 tumors show more heterogeneous methylation profiles with both hypermethylated genes (linked to luminal phenotype) and hypomethylated genes (relative to basal-like characteristics) (see Supplementary Fig. 5 for methodology).
claudin-low tumors was associated with low overall and diseasefree survivals, as compared with the luminal-related CL2 subgroup. In line with their low proliferation index and their low frequency of TP53 mutations, tumors of the CL1 subgroup were associated with a favorable prognosis.

Discussion
Here we have characterized the intrinsic diversity within the claudin-low breast cancers by demonstrating the existence of three main molecular subgroups. Consistent with the data from a recent study 30 , published during the revision process of our manuscript, we have shown that these subgroups are associated with distinct survival outcomes. Moreover, our study has delved into the underlying reasons of the observed heterogeneity, revealing that claudin-low tumors exhibit different developmental origins.
Supporting the prevailing hypothesis that some of these tumors are generated from basal-like breast cancers, the comprehensive characterization of METABRIC and TCGA databases have revealed the existence of claudin-low breast cancers (CL3 subgroup) with characteristics of basal mammary epithelial cells and of basal-like breast cancers. These traits include the hypomethylation of genes related to basal-related pathways, the hypermethylation of genes related to luminal-related pathways, a high level of expression of proliferation-related genes and a strongly disturbed genomic landscape. Several experimental data substantiate the hypothesis of the basal origin of claudin-low breast cancers, including the demonstration that the oncogenic transformation of basal mammary epithelial cells and of LPs can generate malignant cells with mesenchymal characteristics through the activation of an EMT process 17,31 . However, highlighting a first degree of heterogeneity, a significant fraction of claudin-low tumors (CL2 subgroup) displays a gene expression signature of mL cells and shares common genetic, epigenetic and transcriptomic features with the luminal A intrinsic subtype. Although unexpected, this finding is reminiscent of the experimental observation that EMT induction in luminal breast cancer cell lines confers them with a claudin-low expression pattern 32 . A second degree of heterogeneity is shown by the CL1 subgroup of claudin-low breast cancers. Among the three subgroups of claudin-low tumors, CL1 appears as the most distinct, with differential transcriptomic, epigenetic, and genetic traits that comprise a prominent stemness signature and a paucity of genomic aberrations. Although TNBCs with negligible CNAs were previously reported 21,22 , their existence was questioned due to the marked immune and stromal cell infiltration attributed to claudin-low tumors 11 . Here, the identification of focal genetic alterations in tumors selected for their purity unequivocally demonstrates the existence of CNA-devoid claudin-low tumors. Beyond their stemness characteristics and their low CIN, CL1 tumors have additional characteristic features when compared to all other breast cancers. These traits include (i) a high expression of the ZEB1 EMT-inducing transcription factor and of its target, the methionine sulfoxide reductase MSRB3; (ii) a frequent activation of the RAS-MAPK signaling pathway; (iii) a low activation of the DDR; and (iv) a low frequency of TP53 mutations (Supplementary Figs. 8 and 10, and Fig. 6). These findings are highly consistent with the stemness features of CL1 tumors. Indeed, we demonstrated recently that normal human MaSCs exhibit a preemptive antioxidant program driven by ZEB1 and MSRB3, providing them with the unique capacity to readily adapt to an aberrant activation of the RAS-MAPK pathway 18 . As a consequence, RAS-driven malignant transformation of MaSCs occurs in the absence of DNA damage, thereby alleviating the selective pressure on the inactivation of the TP53-dependent failsafe program 8,18 . Altogether, these findings led us to propose a model with two alternative paths leading to the development of claudinlow tumors (Fig. 7)  Overall, this model illustrates the emerging concept of cellular pliancy 8 . The level of pliancy is defined by the intrinsic susceptibility of a discrete cell state to undergo malignant transformation or degeneration after sustaining a particular oncogenic insult 8,33 . In this regard, the low genomic instability of CL1 tumors might thus reflect the high pliancy of human MaSCs for RAS activation. Interestingly, although CL1 tumors show the most frequent activation of the RAS-MAPK pathway, it is noteworthy that the activation of this mitogenic pathway is also a common event in CL2 and CL3 tumors, compared with other breast cancer subtypes. This latter finding may reflect the capacity of the RAS signaling pathway to induce EMT in permissive conditions [34][35][36] . Implicit in this notion is that the activation of the RAS-MAPK pathway may be a primary oncogenic event in the tumorigenic process leading to CL1 tumors and a secondary event in the course of tumor progression in a fraction of basal-like and luminal breast cancers, eventually leading to CL2 and CL3 tumors.
Although this hypothesis remains to be tested, our data may have significant clinical implications in light of the previously described resistance of EMT-related tumors to a variety of anticancer therapies [37][38][39] . Indeed, the identification of the recurrent activation of the RAS-MAPK pathway in claudin-low breast cancers may lead to new therapeutic opportunities that need to be tested in preclinical models.

Methods
Samples. Breast tumor samples used in this study were from METABRIC 21 and TCGA Research Network (https://www.cancer.gov/tcga) cohorts, and breast cancer cell lines were from CCLE database 27 .
Statistics. All analyses and statistical tests were carried out with the R software (version 3.6.1) 40 . Heatmaps were generated with ComplexeHeatmap, principal component analysis was completed with ade4, Gaussian finite models were performed with mclust and Rmixmod, and survival analyses were conducted with survival R packages 25,[41][42][43] . All statistical tests were two-tailed. Figures were created using either the R software or GraphPad Prism 8.0 (GraphPad Software, Inc., San Diego, USA).
Expression data processing. METABRIC microarray expression data from discovery and validation sets were extracted from the EMBL-EBI archive (EGA, http://www.ebi.ac.uk/ega/; accession number: EGAS00000000083) (Normalized expression data files) 21 . Normalized expression data per probe of the discovery set and the validation set were combined after normalization of each set independently with a median Z-score calculation for each probe. The expression levels of different probes associated with the same Entrez Gene ID were averaged for each sample in order to obtain a single expression value by gene.
The  METABRIC and TCGA cohorts) and the previously described claudin-low and non-claudin-low centroids for tumor samples were determined, using the 1667 genes defined by Prat et al. 4 as significantly differentially expressed between claudin-low tumors and all other molecular subtypes. The tumor-related centroids (differentially expressed genes between claudin-low and non-claudin-low tumors using significance analysis of microarrays) were used rather than the cell-linerelated centroids (nine cell lines), as cell culture dramatically changes tumor phenotype and thus transcriptomic signature of primary samples. The use of a high-purity threshold for tumor selection allowed to circumvent the contamination bias related to the tumor-related centroid ( Supplementary Fig. 1). For CCLE breast cancer cell lines, claudin-low status assignment was performed using the nine-cellline predictor via the R package genefu 44 .
Pathway enrichment analysis. All pathway-enrichment analyses were conducted using MSigDB gene sets H, C2, and C5 from msigdbr R package 50 . GSEAs were carried out using fgsea R package 51 . Gene lists were pre-ranked using Signal2Noise metric. ssGSEA scores were computed through gsva R package 52 . Pan-cancer transcriptomic EMT signature, defined by Tan et al. 53 , was used to compute EMT score for each sample.
Normal mammary cell transcriptomic signatures. Previously published microarray expression data were used to generate lists of differentially expressed genes along the mammary epithelial cell hierarchy (MaSC1/2/3, LP, and mL1/2) 18 . Microarray data were robust multiarray average-normalized through oligo R package and differential expression analysis was performed using limma R package 54,55 . The top 500 false discovery rate (FDR) p-value ranked genes differentially expressed between one subpopulation (MaSC or LP, or mL cells) vs. the two others (FDR p-value < 0.05) were used as transcriptomic signature for each of the three cell subpopulations (Supplementary Data 1).
Claudin-low subgroups classifier. Expression-based classifier for the three claudin-low subgroups was identified using shrunken nearest centroid method through pamr R package 26 . A 1.87 threshold for centroid shrinkage was defined after examination of training errors and the cross-validation results. Finally, the nearest shrunken centroid classifier encompassed 137 genes whose expression discriminate the three FGA-related claudin-low subgroups in METABRIC cohort ( Supplementary Fig. 6).
Microenvironment analysis. Estimation of immune and non-immune cell fractions from tumor microenvironment were determined through gene expression analysis using Immunedeconv R package 56 together with Xcell and MCPCounter deconvolution methods 57,58 .
Copy number data processing. METABRIC segmented copy number data from discovery and validation sets were extracted from the EGA (http://www.ebi.ac.uk/ ega/; accession number: EGAS00000000083) (Segmented (CBS) copy number aberrations (CNA) files) 21 . TCGA BRCA segmented copy number data were extracted from the GDC data portal repository (files corresponding to alignments on the hg19 version of the human genome without germline CNV were chosen). FGAs was evaluated from TCGA and METABRIC segmented copy number data (both generated from Affymetrix SNP6.0 arrays) as follows: For each segment i, CN i is the mean LRR along segment i, L(i) is the length of segment i, WM is the weighted median of CN i by L(i) for each sample I, and T is the threshold value of the CN i above which the segments are considered to be altered. In other words, FGA is the ratio of the sum of the lengths of all segments with signal above the threshold to the sum of all segment lengths. For CCLE cell lines analysis, T was set as 0.2, whereas for METABRIC and TCGA tumors analysis, T was set as 0.1. ASCAT ploidy and purity estimates were extracted from COSMIC data repository (https://cancer.sanger.ac.uk/cosmic/) 20 . To avoid any substantial bias due to non-tumor cell contamination, tumors without available estimation of ASCAT aberrant tumor cell fraction were removed from the whole cohort. Furthermore, a stringent purity threshold (ASCAT purity > 0.38) was determined, by applying Wilcoxon test, to select purest tumor samples from TCGA and METABRIC databases when applicable.
Somatic mutation data. METABRIC somatic mutation data from targeted sequencing were obtained from Pereira et al. 63 (https://github.com/cclab-brca/ mutationalProfiles/tree/master/Data) and TCGA somatic mutation data from whole-exome sequencing were obtained from Ellrott et al. 64 . Only somatic mutations annotated as exonic, in-frame, and non-silent were used for mutation analysis.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.