Molecular signatures of in situ to invasive progression for basal-like breast cancers: An integrated mouse model and human DCIS study

Ductal carcinoma in situ (DCIS) of the breast is a non-obligate precursor of Invasive Ductal Carcinoma (IDC) and thus the identification of features that may predict DCIS progression would be of potential clinical value. Experimental mouse models can be used to address this challenge by studying DCIS-to-IDC biology. Here we utilize single cell RNA sequencing (scRNAseq) on the C3Tag genetically engineered mouse model that forms DCIS-like precursor lesions and for which many lesions progress into end-stage basal-like molecular subtype IDC. We also perform bulk RNAseq analysis on 10 human synchronous DCIS-IDC pairs comprised of estrogen receptor (ER) positive and ER-negative subsets and utilize 2 additional public human DCIS data sets for comparison to our mouse model. By identifying malignant cells using inferred DNA copy number changes from the murine C3Tag scRNAseq data, we show the existence of cancer cells within the C3Tag pre-DCIS, DCIS, and IDC-like tumor specimens. These cancer cells were further classified into proliferative, hypoxic, and inflammatory subpopulations, which change in frequency in DCIS versus IDC. The C3Tag tumor progression model was also associated with increase in Cancer-Associated Fibroblasts and decrease in activated T cells in IDC. Importantly, we translate the C3Tag murine genomic findings into human DCIS where we find common features only with human basal-like DCIS, suggesting there are intrinsic subtype unique DCIS features. This study identifies several tumor and microenvironmental features associated with DCIS progression and may also provide genomic signatures that can identify progression-prone DCIS within the context of human basal-like breast cancers.


INTRODUCTION
Breast cancer (BC) involves the transformation of the normal breast ducts through a variety of histopathologic recognized precursor non-invasive states into fully transformed malignant tumors 1 . Ductal carcinoma in situ (DCIS) is believed to be a precursor of invasive ductal carcinoma not otherwise specified (IDC), the most common BC histologic type 2 . DCIS comprises 20-30% of BC in the US and worldwide 3,4 . Similar molecular profiles, and DNA clonality commonalities, exist between many DCIS and IDC lending support to the precursor status of DCIS to IDC 5 . However, approximately only 20-40% of DCIS progress to IDC if left untreated, and this progression can be in part predicted by the histological grade of the DCIS [6][7][8] . These studies highlight that there is a subset of DCIS that are true precursors to IDC, and that features like grade can predict a higher propensity of a given DCIS lesion to turn into an IDC. This DCIS subset is likely enriched in cell populations containing genetic and/or genomic aberrations that increase the risk of malignant progression. However, there is a lack of biological understanding and diagnostic methods to robustly identify progression-prone DCIS beyond grade, thus leading to a present state of clinical consensus of DCIS overtreatment 8,9 .
Based on gene expression, IDC can be subdivided into "intrinsic" subtypes with basal-like subtype showing the worst clinical prognosis 10,11 . DCIS can also be similarly subtyped using gene expression like IDC suggesting a molecular continuum as DCIS progresses into cancer 5,[12][13][14] . Specifically, these studies suggest the existence of basal-like DCIS as a distinct entity that is unique from other DCIS. These studies also highlight that basal-like DCIS to basal-like IDC transition is associated with a microenvironment immune cell changes unlike DCIS-IDC transitions of other molecular subtypes 12 . However, PDX models fail to include the complete microenvironment changes in DCIS progression accurately and thus there is a pressing need to use animal models with intact immune systems for studying DCIS progression. In this regard, the C3(1)/SV40 T-antigen GEM model (henceforth called C3Tag) forms early Mammary Intraepithelial Neoplasia (MIN; DCIS equivalent term in veterinary histopathology) that histologically resembles human DCIS and end-state IDC-like tumors of basal-like subtype, and therefore might be a good model to study progression-prone DCIS [15][16][17][18] . Our hypothesis is that specific tumor and/or microenvironmental changes occur that governs DCIS to IDC transformation, and that these changes may be identified in a GEM model and also occur in human DCIS as well. Our aim was to identify these molecular changes in C3Tag MIN (henceforth called DCIS) and IDC-like tumors (henceforth called Tumor) utilizing single cell RNA sequencing (scRNAseq) and analyze these findings relative to human DCIS to identify possible commonalities. the C3Tag mouse model, we performed 6 scRNAseq experiments on the whole mammary glands from three distinct disease states/ timepoints (n = 2 for each timepoint) including: (1) Prepuberty: 5-6 weeks; (2) DCIS: 12-16 weeks; and (3) Invasive IDC-like Tumor: more than 16 weeks with presence of a palpable tumor. For the gland harvested for DCIS and prepuberty disease state scRNAseq, we hemi-sectioned the mammary gland and performed a routine formalin-fixed paraffin embedded (FFPE) hematoxylin and eosin (H&E) staining to confirm that the gland contained MIN/DCIS lesions and normal ducts before conducting the scRNAseq experiment on the contralateral gland of the same mouse ( Fig.  1a). At the same time, we also harvested another mammary gland of the same mouse to perform bulk RNAseq from the prepuberty and DCIS disease states (Fig. 1a). We defined prepuberty and DCIS disease states as per established developmental timepoints for normal ductal and MIN lesions found in C3Tag mouse model 15,19 . Normal ducts are seen in the C3Tag mouse when the SV40-large Tantigen is not fully activated and therefore the mice from these time points were collected before they attained puberty and thus called "Prepuberty" state. "DCIS" disease state was defined as the time frame after puberty and where the ducts start containing MIN lesions that are veterinary pathological entities like human DCIS. Furthermore, for this time point we sampled mammary glands that showed higher grade of MIN and presence of areas of central necrosis like human DCIS by routine H&E staining (Fig. 1b). For "Tumor" disease state, palpable tumor was detected, harvested, and sectioned for scRNAseq and bulk RNAseq.
For our primary analyses of C3Tag scRNAseq we captured 21,332 cells from 6 scRNAseq experiments and identified multiple cell type specific clusters comprising the mouse mammary gland milieu (Fig. 1c). We identified statistically significant differentially expressed (DE) genes defining each cluster and used previously determined marker genes to annotate the UMAP cell group clusters. Clusters 0, 2, 9, 18, and 7 were epithelial (Epcam, Krt8 and Krt18 high; Fig. 1d; Supplementary File 1). We also mapped the SV40-large-T-antigen gene sequences on our scRNAseq data and found that clusters 0, 2, 18, and 7 expressed this feature with a significant high expression in cluster 2 (Fig. 1d). Moreover, cluster 7 was composed of cells with SV40-large-T-antigen expression and many proliferative genes like Cenpe and Mki67 indicating a mitotically active subset of epithelial cells. Epithelial cluster 9 was devoid of SV40-large-T-antigen, showed low expression of proliferative genes, and had high expression of luminal genes namely Cd24a and Prlr indicating that this population was likely non-tumor normal epithelial cells (Fig. 1d). We also identified myoepithelial cells as a separate cluster 6 ( Fig. 1d) based on the expression of Krt5, Tagln and Acta2, and no expression of SV40large-T-antigen. In a similar fashion, we were able to identify several cell types of the microenvironment in all 3 disease states: Although all the cell clusters in the merged data set were found to contain cells from all three disease states, we found that the relative proportions of cells differed across the disease states. For example, both cluster 9 (normal luminal cells) and cluster 6 (myoepithelial cells) were shown to be almost entirely composed of cells from the Prepuberty and the DCIS disease states (Fig. 1e). To assess the statistical significance of the changes in the cellular composition across all 3 disease states, we utilized a generalized linear regression model that accounted for the batch effects and the 2 technical replicates for each state (Supplementary File 1). The results of this analysis showed that there was statistical significance of differences in cell numbers of epithelial clusters 2, 7, and 11 in the Tumor state versus the DCIS (p value < 0.001). In fact, the odds of finding a cell from cluster 2 that was high in SV40large-T-antigen, was 8-fold (odds ratio: 8.05) higher in the Tumor state than the DCIS state. Interestingly, the T lymphocyte cluster 11 (odds ratio: 4.46) and macrophage cluster 5 (odds ratio: 1.35) were also significant for an increased odds ratio to be found in the Tumor state. The same analysis between the Prepuberty and DCIS state showed that the fibroblast and endothelial cells to be significantly more in the prepuberty disease state than the DCIS disease state (Supplementary File 1).
Genomic and cell biological approaches to identify cancer cell populations and epithelial cells across C3Tag prepuberty, DCIS and IDC-like Tumor states Since gene expression "dropout" events are known to be associated with scRNAseq data, we utilized inferCNV [20][21][22] to identify epithelial cells with DNA copy number changes (i.e., tumor cells) in the C3Tag mouse model, as opposed to only using SV40large-T-antigen expression to identify malignant cells. InferCNV identifies Copy Number Aberrations (CNA) from scRNAseq data as chromosomally located regions of common high, or low, gene expression levels, and is considered a robust means of identifying cancer cells in a mix of normal and cancer cells. InferCNV identifies regions of CNAs using comparisons to normal reference cells, and hence we utilized our previously published scRNAseq data (n = 2) of FVB/NJ normal (12 weeks) mouse mammary glands 23 , which are the same strain background as C3Tag mice used here; specifically, we utilized the FVB normal mammary gland epithelial cells to call CNA events in our dataset.
Using the InferCNV approach we were able to identify CNA+ cells in all three disease states ( Supplementary Fig. 1), albeit at very different frequencies. Only small numbers of CNA+ cells were identified in our Prepuberty ( Supplementary Fig. 1a) and DCIS disease states, with shared events between the two DCIS samples identified ( Supplementary Fig. 1a). However, there were distinct human breast CNA events identified in the two Tumor samples where one tumor -'Tumor1' was identified as KRAS altered with a chr6 related KRAS amplification ( Supplementary Fig. 1a), which is a previously noted common event in this tumor model 24 . We also performed array CGH (aCGH) analysis on bulk DNA harvested from DCIS and Tumor states and validated many of the CNA profiles coming from the scRNAseq InferCNV calls ( Supplementary Fig. 1b, c). For DCIS states, chr2q deletion (del) was identified in both scRNAseq and bulk aCGH. For Tumor, chr1 amplification (amp), chr3 amp, chr6 amp, and chr10 del were identified in bulk aCGH and inferCNV ( Supplementary Fig. 1c). Lastly, to robustly identify malignant CNA+ cells for downstream analyses, we used a correlation approach and calculated two correlation values for each cell in relation to i) the CNA profile of the top 5% of nonepithelial cells in each individual disease timepoint and ii) the CNA profile of normal FVB epithelial cells ( Supplementary Fig. 2a-e). Cells were designated as CNA high cancer cells if they had a higher score than the median correlation value to the top 5% of non-epithelial cells and had a score lower than the median correlation value to normal epithelial cells. We also looked at the SV40 large -T-antigen expression in our CNA inferred cancer cells and found that 80-90% of these cells were SV40-large-T-antigen positive. Thus, we identified 2025 CNA high cancer cells (Prepuberty 1: 62 cells; DCIS 1: 319 cells; DCIS 2: 267 cells; Tumor 1: 986 cells; Tumor 2: 391 cells) out of 9679 epithelial cells from our 6 scRNAseq data combined; no CNA high cancer cells were found in Prepuberty 2.
We next utilized IKAP (Identifying K mAjor cell Pory ducts at prepuberty diseapulation) 25 to identify the optimal cluster number using the CNA altered cells and identified 5 subpopulations of cancer cells (Figs. 2a, b and 3). Since these are relatively unknown subpopulations, we relied on known breast cancer gene signatures 26  File 2) that are involved in HIF-1 alpha signaling. Similarly, cluster 4 had significant high expression of Gbp2, Usp18, Irf7, Ifit1, B2m, and Stat1, which are genes involved in the interferon pathway (Supplementary File 2). Cluster 1 and cluster 2 had no specific enriched breast cancer signatures however cluster 1 had several ribosomal and ER stress-associated genes (Rpl41, Rps27, Rps29, Fosb, and Jund). Cluster 2 had high expression of other mammary gland-specific genes (Fxyd3, Trf, Wfdc18, Plekhb1, and Lcn2). We next identified 141 genes that were constitutively high within all cancer cells in all 3 disease states. These conserved genes comprised predominately of 78 proliferation genes including Ube2c, Cdc20, and Cenpf ( Supplementary Fig. 4a), and 43 proinflammatory genes such as Cxcl10, Ifit1, and Isg15 (Supplementary Fig. 4b; Supplementary File 2). These genes arose early even in InferCNV+ prepuberty cells and remained high in the C3Tag DCIS-Tumor transformation process indicating that these genes could be early markers of cancer transformation. This finding of proliferation-associated genes is also directly related to the natural biology of the SV40-Large T antigen that drives tumorigenesis in C3Tag mice by inactivating p53 and Rb. In fact, 30/78 of genes from our proliferation signature had an E2F transcription binding site (Supplementary File 2).
Since we identified many genes aberrantly high in the DCIS state, we sought to examine the panel of OncotypeDx DCIS genes 27,28 in our mouse models cells dataset. Oncotype Dx DCIS score was specifically developed as a prognostic score to identify biologically aggressive human DCIS and consists of 5 proliferative genes (Ki67, STK15, Survivin, CCNB1, and MYBL2), 2 nonproliferation genes (PR, GSTM1) and 5 housekeeping reference genes. Interestingly we found 5/7 of the Oncotype Dx DCIS nonhousekeeping genes constitutively high in the C3Tag cancer cells from all 3 states except Pgr, which was <10% in cells of our murine DCIS states and completely not present in the murine tumor state ( Supplementary Fig. 4c); these finding highlights that our CNA high C3Tag cancer cells are expressing genes already used to identify biologically aggressive human DCIS.
We also constructed disease state specific gene signatures for the 3 states. Since both our C3Tag tumors showed different CNA profiles ( Supplementary Fig. 1a), they each showed unique upregulated genes with Tumor1 exhibiting high level of Kras (Fig. 2d). Cancer cells (i.e., CNA+) from the prepuberty disease state had significant high expression of genes involved in the innate immunity and chemokine signaling pathway (Ccl2, Ltf, Ccl7, Ccl20; Fig. 2d). The genes enriched in the DCIS state were associated with regulation of stress response (Hspa1a, Hspa1b), apoptosis (Txnip, Bex3, Gadd45a, and Ankrd1), proliferation (Cebpd, Nfkbia) and inflammation (Ccl20, F3, Icam1); however, most genes identified between the cancer cells in prepuberty, and DCIS states were shared in both states ( . Thus, using our CNA high cancer cells, we reidentified proliferation gene signatures but also put forth gene signatures that may be associated with basal-like precursor states.

Microenvironment-specific disease state signatures show similarities between prepuberty/DCIS state and the tumor state
The surrounding fibroblasts and immune cells are considered to play a role in influencing DCIS-Tumor transition 12,30 . We first examined fibroblasts from the 3 disease states and when examined alone these cells clustered into 2 broad populations including a myofibroblast like group ( Supplementary Fig. 5a, b; cluster 1) and an inflammatory like group ( Supplementary Fig. 5a, b; clusters 0, 2) based upon marker genes from published fibroblasts subsets in BC 31 (Supplementary File 3). Upon construction of disease state-specific gene signatures, we found that the fibroblasts from the Tumor disease state were distinct (Fig. 3a); these fibroblasts from the Tumor state were associated with increased expression of genes involved in extracellular matrix organization and signaling pathways like Integrin and FGF signaling pathways (Col3a1, Col5a1, Tnc, Sdc2, and Spp1; Supplementary File 3). We also applied a published breast tumor Cancer-Associated Fibroblast/CAF gene signature 32 from the Molecular Signatures Database (MSigDB) 33,34 and found it was significantly higher in the Tumor vs DCIS and Prepuberty fibroblasts (Fig. 3b).
Analysis of immune cells across all 3 disease states revealed 4 broad immune cell types ( Fig. 1d and Supplementary Fig. 5c, d), which included T and B lymphocytes, monocytes, and macrophages. Disease state specific T cells signatures showed unique gene expression profiles for DCIS and Tumor states ( Fig. 3c; Supplementary File 3). Specifically, DCIS T Lymphocytes exhibited significant upregulation of the Pdcd1 gene (Fig. 3c, f), which was lower in the Tumor state. We also applied gene signature modules from fractionated T cells from human DCIS and BC 35 (Supplementary File 3) and noted that the C3Tag mouse DCIS T Lymphocytes had a significant upregulation of activated T cell gene signature (Fig. 3d). Furthermore, we also found a significant upregulation of the cytotoxic T cell gene signature in the Tumor state T cells (Fig. 3d).
Similarly, analysis of macrophages from the 3 states revealed that there were distinct genes in the Tumor macrophages which includes genes like Cotl1, Tgfbi, Fos, and Fn1 along with complement activation genes C1qa, C1qb, and C1qc ( Fig. 3e;  Supplementary File 3). Finally, we saw a significant high expression of immune checkpoint markers PDL1/Cd274 in DCIS macrophages (Fig. 3f) and PD1/Pdcd1 in DCIS T cells (Fig. 3g), both of which drop in the Tumor state.   (Fig. 4a, b).
Next, we harvested RNA from archival FFPE DCIS-IDC pairs from our hospital from samples containing synchronous DCIS and IDC within the same specimen; the DCIS and IDC regions for each were individually cored, RNA isolated, Ribo-Zero bulk RNAseq performed, and then we focused on those specimens with PAM50 subtyping characterization of basal-like DCIS (n = 4) and basal-like IDC (n = 3), LumA DCIS (n = 6) and LumA IDC (n = 5). Since RNA was collected from tissue sections containing both epithelial and non-epithelial components, we calculated our C3Tag DCIS malignant cells and microenvironment signatures (Supplementary File 4) on these samples. We observed a significant enrichment of the C3Tag DCIS signatures in the basal-like DCIS samples relative to all other DCIS or IDC samples tested (Fig. 4c). We also observed a significant enrichment of the C3Tag DCIS fibroblast signature in human basal-like DCIS (Fig. 4d), and enrichment of C3Tag DCIS immune in human basal-like DCIS (Fig. 4e). All these microenvironment and tumor cell changes are summarized in Fig. 5. Lastly, we investigated NFKB associated gene signatures that were enriched in the CNA+ C3Tag DCIS cells, in all the above human datasets (Supplementary Fig. 6a-c). Basallike DCIS in 2/3 human datasets (Balleine et al. and the present study) showed a statistically significant upregulation of the C3Tag DCIS-like NFKB gene signature.

DISCUSSION
It is the current consensus that DCIS is being overtreated leading to physical, emotional, and economic burden for patients and society 39 . This is possibly due to the increased detection of DCIS from increased radiologic screening 40 , and also the knowledge that if left alone, most DCIS would not progress to an invasive disease 6,7 while yet many DCIS patients receive either systemic and/or local therapies. A challenge in studying DCIS biology is that to study it experimentally one needs to study the dynamic interaction of multiple cells within the controlled environment of a duct through to the occurrence of the invasive disease, or a sufficiently long enough time to know that DCIS will not progress to invasive disease. Currently, many human DCIS basic biology studies utilize MCF10DCIS cell line either alone [41][42][43][44] or injected into a mouse duct (MIND model) [45][46][47][48] to study DCIS in the laboratory. Although these studies add to our biological knowledge, they do not fully mimic human DCIS disease biology as it might interact with the adaptive immune system, which is likely an important component of progression potential. Importantly, a recent study by Risom et al. showed that it is those DCIS with an intact basement membrane and myoepithelial cell activation, and not those with direct tumor-to-microenvironment interactions, that were the most likely to progress 49 ; thus, model systems that contain all microenvironment components would be valuable to study DCIS progress. In addition, most human studies utilize human synchronous DCIS-IDC FFPE/frozen tissue to estimate molecular similarities between the two entities, however, these studies link the DCIS features when an invasive component is already present and does not follow the natural course of DCIS progression. Indeed, currently, there are 3 ongoing clinical trials that are studying the progression of low-risk DCIS naturally till the incidence of breast cancer [50][51][52] . With all these challenges in mind, we used the C3Tag mouse model that spontaneously forms highgrade DCIS-like lesions towards its natural course of forming IDC basal-like mammary tumors 15 .
Here we utilized this consistent murine model, and single-cell RNA sequencing, to study cell type-specific features in the DCIS and invasive disease states, including both tumor cells and nonepithelial cells. Copy number aberrant malignant cells were identified and showed increased expression of genes associated with unique biological pathways including Interferon response in prepuberty state, NFKB pathway in DCIS state, and cancer specific pathways like KRAS, p53, Myc, MTORC1 in the IDC state (Fig. 5). Importantly we also identified in the CNA+ high cancer cells, regardless of disease state, a set of genes with sustained high expression of proliferative and pro-inflammatory genes. For the microenvironment, there was an increase in the number of T cells and macrophages as a normal duct transition to DCIS to IDC, however, there were significant cellular changes in each disease state. There was an increase in PD1+ T cells at the DCIS state in comparison to prepuberty and IDC states. This was also associated with an increased activated T cell signature in the DCIS state (Fig.  5). Conversely there was a reduction of PD1+ T cells and PDL1+ macrophages in the IDC state. The IDC T cells were also more cytotoxic in nature, and it should be noted that the IDC tumors are rapidly increasing in size, and thus the cytotoxic T cells are not keeping the tumor in check despite their presence. Finally, cancerassociated fibroblasts (CAFs) were only found in the IDC state.
Using a methodology of inferring copy number to identify malignant cells from scRNAseq data, we show that gene signatures of glycolysis and hypoxia, along with sustained expression of genes associated with proliferation and interferon pathway, are present in both DCIS and CNA+ tumor/IDC cells. This finding suggests that certain genes and pathways are already initiated in pre-cursor states. Casasent et al. reported the same finding using single cell sequencing on human DCIS-IDC pairs putting forth a polyclonal mechanism of DCIS-invasive transformation 53 . In line with this, we report genes associated with other broad human DCIS pathways like hypoxia 42,54 , glycolysis 54,55 and proliferation 27 , which have been previously used for mathematic modeling of DCIS progression 54 . Importantly, we recapitulated some of the findings of the prognostic Oncotype Dx DCIS assay and identified 78 proliferation-associated genes sustained in CNA high cancer cells at the C3Tag DCIS stage, including 5/7 exactly found in the OncotypeDX DCIS proliferation feature.
Using our C3Tag DCIS cancer cell data, we report that there is a NFKB pathway enriched in these cells. NFKB has been reported involved in hypoxia and proliferation in breast pre-cursor disease 42,56,57 . Muggerud et al. also reported that the NFKB gene signature was specifically enriched in the ER-high-grade DCIS in Fig. 2 The different subpopulations and gene pathways identified in copy number high C3Tag cancer cells across the disease states. a UMAP plot of 2025 copy number high cancer cells with RNAseq data colored by the disease state (Top panel) and by the cell populations identified (Bottom panel). b Heatmap of top 10 significant upregulated genes identified per cancer cell subpopulation using the Wilcoxon rank sum test. c UMAP plots highlighting significant breast cancer gene signatures from Fan et al. 26 12,35 . Our C3Tag immune cell findings are similar to their findings including higher number of PD1 expressing T cells in the C3Tag DCIS state. IHCbased studies on pure human DCIS FFPE samples have found that almost all subsets of T cells are increased in ER-negative DCIS 61,62 , but then many go lower in the IDC state. We report a similar decrease in pdcd1(PD1)+ T cells in DCIS vs IDC, however, we also saw an increase in cytotoxic T cell signature in the tumor state and subsequent decrease in PDL1 expressing macrophages in the tumor state, which are features of basal-like invasive breast cancer 61,63 . Recently a study on pure human DCIS has also reported the importance of studying immune-epithelial cell interactions in DCIS and DCIS can exhibit 3 states based on this active, suppressed, and excluded 64 . This study further highlighted that the exclusion of T cell infiltration in the DCIS duct seen in the excluded state could be seen as an early immunesuppressive event influencing the cancer cells in DCIS to become more aggressive and start showing lack of MHC class I expression 64 . We thereby add to the current knowledge and put forth a specific immune gene signature for basal-like DCIS from immune cells identified in C3Tag DCIS that is loss of PD1+ immune cells as a marker of possible progression, and support more studies in this regard to study immune-epithelial relationships with more spatial methods.
Few studies have analyzed the DCIS-Invasive transition in a subtype-specific manner, but most of them used microarray data from FFPE tissues 12,65 . Our findings strengthen the previous results and put forth new genes of interest. However, we admit that there are limitations to our study. First our mouse tumor single cell analysis may not include all cell types that can be found in the human DCIS setting. Indeed, there might be unique cell subpopulations with other defining gene features that may be present in the human DCIS setting yet not present in our current mouse model. Second, our two C3Tag invasive tumors were molecularly distinct, with each showing unique inferred DNA copy number changes and gene expression features; nonetheless, we were able to find common features between these tumors. Lastly, we only examined a single mouse model, and our sample size of two specimens per time point is noted; however, this does represent >2000 cells per time point per specimen, and thus each specimen was well represented, and each time point showed common tumor cell and microenvironmental features, many of which were also seen in human basal-like DCIS.
In conclusion, we build upon the need to study DCIS based upon a molecular stratification and propose C3Tag mouse model as a good model to study human basal-like progression. We put forth scRNAseq-derived cell type-specific DCIS gene signatures that can be relevant in understanding DCIS biology and clinical behavior, especially since one of our signatures reiterates a major feature of the Oncotype Dx DCIS assay. Finally, we encourage the application of single cell technologies in studying the roles played by cancer cells and microenvironment cells in the malignant transformation of DCIS of other tumor subtypes.

Cell Suspension Preparation details for scRNAseq
The mammary glands for prepuberty and DCIS were placed in 10 ml of a digestion medium containing EpiCult™-B Mouse Medium Kit (#05610, StemCell Technologies), Collagenase/Hyaluronidase (#07912, StemCell Technologies), and 1% penicillin-streptomycin (Gibco). The mammary gland was digested overnight in a thermocycler maintained at 37°C with continuous rotation. The C3(1)-Tag tumors were digested with the Miltenyi tumor dissociation kit (#130-096-730, Miltenyi Biotech) under a gentle agitation setting. The cell pellets retrieved from these

Single-cell scRNAseq library construction and alignment
The cell suspensions were loaded on a 10× Genomics Chromium instrument to generate single-cell gel beads in emulsion (GEMs) for targeted retrieval of approximately 5000 cells. Resulting fastq files were aligned to the mouse mm10 reference genome using the STAR aligner algorithm 67 . The resulting BAM files were sorted and indexed using Samtools and quality control was performed using Picard. Transcript read counts were determined was performed using Salmon 68 . Genes with no reads across any of the samples were removed.  Single-cell scRNAseq preprocessing and data analysis InferCNV was run using standard settings in 'sample' mode of cutoff = 0.1, window_length = 101 and max_centered_threshold = 3. The inferCNV CNA scores were used to calculate correlations in two ways. First, correlation was calculated between the CNA profile of each cell and the average CNA profile of all copy number altered cells within the sample which is similar to the approach by Neftel et al. 20 . Second, correlation was calculated between the CNA profile of each cell and the average CNA profile of all normal cells within the sample. The final cancer cells were identified by plotting the two correlations values for each cell and identifying cells with high correlation to copy number altered cells and low correlation to normal cells. The limits for both correlation scores were identified using the mean +− 2SD.
Breast cancer gene signatures 26 were calculated within the single cell gene space by using the Seurat scaled.data in the "RNA" assay tab of the integrated datasets. Individual signature values for each cell were calculated as an average expression of all genes present in the gene signature. Once calculated, the significant gene signatures were identified using the Wilcoxon rank sum test.
GSVA was calculated using the log transformed data in the "RNA" assay slot of the integrated datasets in R. We utilized the Hallmark gene sets (H) and the immunologic gene sets (C7) for "mus musculus" using msigdbr R package. Once the GSVA scores were calculated, they were fit into a linear model, and cluster identity, or disease state labels were used to identify significant gene signatures per clusters or disease states.
The generalized linear regression model for cell proportions was constructed using the emmeans R package.

Human external microarray gene expression PAM50 centroid calculations and data analysis
Microarray data from DCIS studies were downloaded from Balleine et al. 37 (GSE7882) and LeSurf et al. 12 (GSE59246). Individual datasets were gene median centered before application of the conventional PAM50 centroid predictions using the 50 gene PAM50 predictor 72 . The median centered values were used for C3TAG DCIS signature calculations. Significance testing was done using t-test with Benjamini-Hochberg correction of p values.
Human Bulk Ribo-Zero library construction and data analysis All human tissue was procured under IRB approval from the University of North Carolina at Chapel Hill with written consent from patients to participate. FFPE sections of tumor specimens with co-occurring DCIS and IDC were identified from the medical records, examined by a pathologist, and the DCIS and IDC regions separately cored using 1 mm coring technology typically used to make Tissue Microarrays. Each core was placed into a separate Eppendorf tube and RNA was isolated using the RNeasy Mini Kit (QIAGEN, Hilden, Germany) according to manufacturer protocol. Next, Ribo-Zero libraries were made using Illumina Ribo-Zero plus rRNA Depletion Kit #20037135 following the manufacturer's protocol. Paired end (2 × 50bp) sequencing was performed on the Illumina HiSeq 2000/2500 sequencer at the UNC High Throughput Sequencing Facility (HTSF). Resulting fastq files were aligned to the human hg38 reference genome using the STAR aligner and transcript read counts were determined was performed using Salmon 67,68 . Genes with no reads across any of the samples were removed. The data were upper-quartile normalized, log-transformed, and median centered before calculating the C3TAG DCIS signatures. Significance testing was done using t-test with Benjamini-Hochberg correction of p values.

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

DATA AVAILABILITY
All C3Tag mouse 10× single cell RNAseq data generated from the 10× Genomics Cell Ranger pipeline and C3Tag mouse bulk mRNAseq count data are available in GEO database (GSE182389) and raw FASTQs for are deposited in SRA (SRX11865213). All aCGH DNA data are available in GEO database (GSE182389). All the raw human data FASTQs are deposited in dbGAP (phs002443) and in SRA (SRX11865213). Processed human gene counts matrix is deposited in GEO database (GSE182389). The source data underlying Figs. 1c-e, 3a, c, e-g and Supplementary Fig. 5 are provided as Source Data file 1. The source data underlying Fig. 2 and Supplementary Figs. 1-4, are provided as Source Data File 2. The source data underlying Fig. 4 and Supplementary  Fig. 6 are provided as Source Data File 3.

CODE AVAILABILITY
There were no special new codes generated for any analysis in this paper. To determine regions of Copy Number Aberration (CNA), we utilized the R package SWITCHdna (version 1.0). All scRNAseq was done by using Seurat R package (version 3.0). Inferred copy number was determined using InferCNV (version 1.10.0). GSVA was calculated using GSVA R package (version 1.41.4) and msigdbr (version 7.4.1). Cell proportion analysis was done using emmeans R package (version 1.5.2). All bulk RNAseq subtyping was performed by PAM50 R functions (https://genomepublications.bioinf.unc.edu/PAM50/).