Mass cytometric and transcriptomic profiling of epithelial-mesenchymal transitions in human mammary cell lines

Epithelial-mesenchymal transition (EMT) equips breast cancer cells for metastasis and treatment resistance. However, detection, inhibition, and elimination of EMT-undergoing cells is challenging due to the intrinsic heterogeneity of cancer cells and the phenotypic diversity of EMT programs. We comprehensively profiled EMT transition phenotypes in four non-cancerous human mammary epithelial cell lines using a flow cytometry surface marker screen, RNA sequencing, and mass cytometry. EMT was induced in the HMLE and MCF10A cell lines and in the HMLE-Twist-ER and HMLE-Snail-ER cell lines by prolonged exposure to TGFβ1 or 4-hydroxytamoxifen, respectively. Each cell line exhibited a spectrum of EMT transition phenotypes, which we compared to the steady-state phenotypes of fifteen luminal, HER2-positive, and basal breast cancer cell lines. Our data provide multiparametric insights at single-cell level into the phenotypic diversity of EMT at different time points and in four human cellular models. These insights are valuable to better understand the complexity of EMT, to compare EMT transitions between the cellular models used here, and for the design of EMT time course experiments. Measurement(s) RNA-seq gene expression profiling assay • cell surface proteins • protein expression at the single-cell level • Epithelial-to-Mesenchymal Transition • breast cancer cell • mammary gland epithelial cell Technology Type(s) mRNA Sequencing • Flow Cytometry • cytometry time of flight assay • Cell Culture Factor Type(s) protein level • gene expression • cell morphology Sample Characteristic - Organism Homo sapiens Sample Characteristic - Environment cell culture Measurement(s) RNA-seq gene expression profiling assay • cell surface proteins • protein expression at the single-cell level • Epithelial-to-Mesenchymal Transition • breast cancer cell • mammary gland epithelial cell Technology Type(s) mRNA Sequencing • Flow Cytometry • cytometry time of flight assay • Cell Culture Factor Type(s) protein level • gene expression • cell morphology Sample Characteristic - Organism Homo sapiens Sample Characteristic - Environment cell culture Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.16989301

EMT time courses and cell harvesting. EMT was induced in the MCF10A cell line by prolonged stimulation with 5 ng/ml TGFβ1 (Cell Signaling Technology) for eight days 24 . For this, 0.8 million cells were seeded per 10 cm cell culture dish (Nunc) and incubated at 37 °C and 5% CO 2 according to ATCC recommendations. TGFβ1 treatment and vehicle treatment using Dulbecco's phosphate buffer saline (PBS, Sigma Aldrich) started 24 hours after seeding and was applied daily together with a growth medium exchange.
EMT was induced in the HMLE cell line by prolonged stimulation with 4 ng/ml TGFβ1 (Cell Signaling Technology) for 14 days 9 . For this, 0.5 million cells were seeded per 10 cm cell culture dish (Nunc) and incubated at 37 °C and 5% CO 2 . TGFβ1 treatment and vehicle treatment using PBS started 24 hours after seeding and was applied daily. The growth medium was exchanged every other day.
EMT was induced in the HTER and HSER cell lines by prolonged stimulation with 4 ng/ml 4-hydroxytamoxifen (4OHT; Sigma Aldrich) for 14 days 9 . For this, 0.5 million cells were seeded per 10 cm cell culture dish (Nunc) and incubated at 37 °C and 5% CO 2 . 4OHT treatment and vehicle treatment using methanol (Thommen Furler) started 24 hours after seeding and was applied daily. The growth medium was exchanged every other day.
To avoid over-confluence and senescence during the time course of HMLEs, HTERs, and HSERs, the cells were split and re-seeded on day four and eight. For this, the cells were washed once with pre-warmed PBS, incubated for 5 min at 37 °C with 4 ml pre-warmed TrypLE 1X Express (Gibco), quenched with pre-warmed growth medium, pelleted at 350 × g for 5 min at room temperature, resuspended in pre-warmed growth medium, and re-seeded using 0.5 million cells per 10 cm cell culture dish. FACS sorting and RNA sequencing. For live cell FACS sorting, cells were washed once with pre-warmed PBS, incubated for 5 min at 37 °C with 4 ml pre-warmed TrypLE 1X Express (Gibco), pipetted off the cell culture dish, and collected in 4 °C PBS. Cells were pelleted at 350 × g for 5 min at 4 °C, re-suspended in 4 °C PBS with 1% BSA, and stained with E-Cadherin-AF647 (clone 67A4, 5 µg/ 100 µl, Biolegend) and CD44-PE (clone IM7, 1.25 µg/ 100 µl, Biolegend) for 20 min at 4 °C in the dark. Cells were washed once using PBS with 1% BSA and kept on ice until FACS sorting using the FACSAria III (BD Biosciences). For RNA isolation, cells were pelleted at 350 × g for 5 min at 4 °C and lysed in 350 µl RLT buffer of the RNeasy Mini Kit (Qiagen). RNA was isolated according to the manufacturer's instructions. Briefly, RNA was collected on the RNeasy spin column, washed with 70% ethanol (Merck), and DNA was removed by incubation with DNAse I (Qiagen). RNA was collected in 30-50 µl diethylpyrocarbonate (DEPC, Sigma Aldrich)-containing water and stored at −80 °C. DEPC water was prepared by dissolving 1 ml DEPC in 1 L ddH 2 O prior to autoclaving. The RNA quality was assessed using a NanoDrop (Thermo Scientific) and Bioanalyzer (Agilent). RNA sequencing was performed using the HiSeq. 2500 System (Illumina) in SR 50 mode (50 base reads) after poly (A) enrichment and stranded library preparation.
Antibodies and antibody labeling. All antibodies and corresponding clone, provider, and metal or fluorescence tag are listed in Mendeley Table 1 and Mendeley Table 17 on Mendeley Data 22 . Target specificity of the antibodies was confirmed by the provider and in our laboratory. Antibodies were obtained in carrier/ protein-free buffer or were purified using the Magne Protein A or G Beads (Promega) according to manufacturer's instructions. Metal-labeled antibodies were prepared using the Maxpar X8 Multimetal Labeling Kit (Fluidigm) according to manufacturer's instructions. After conjugation, the protein concentration was determined using a NanoDrop (Thermo Scientific), and the metal-labeled antibodies were diluted in Antibody stabilizer PBS (Candor Bioscience) to a concentration of 200 or 300 µg/ml for long-term storage at 4 °C. Optimal concentrations for antibodies were determined by titration, and antibodies were managed using the cloud-based platform AirLab as previously described 27 .
Antibody staining and cell volume quantification for mass cytometry. Antibody staining was performed on pooled samples after mass-tag cellular barcoding. The pooled samples were washed once with CSM. Cells were stained with the EMT antibody panel (Mendeley Table 17 on Mendeley Data 22 ) and incubated for 45 min at 4 °C followed by three washes with CSM. For mass-based cell detection, cells were stained with 500 µM nucleic acid intercalator iridium ( 191 Ir and 193 Ir, Fluidigm) in PBS with 1.6% PFA (Electron Microscopy Sciences) for 1 h at room temperature or overnight at 4 °C. Cells were washed once with CSM and once with 0.03% saponin in PBS. For cell volume quantification, cells were stained with 12.5 µg/ml Bis(2,2′-bipyridine)-4′-methyl-4-carb oxybipyridine-ruthenium-N-succidimyl ester-bis(hexafluorophos-phate) ( 96 Ru, 98-102 Ru, 104 Ru, Sigma Aldrich) in 0.1 M sodium hydrogen carbonate (Sigma Aldrich) for 10 min at room temperature as previously described 23 . Cells were then washed twice with CSM, twice with 0.03% saponin in PBS, and twice with ddH 2 O. For mass www.nature.com/scientificdata www.nature.com/scientificdata/ cytometry acquisition, cells were diluted to 0.5 million cells/ml in ddH 2 O containing 10% EQ TM Four Element Calibration Beads (Fluidigm) and filtered through a 40 µm filter cap FACS tube. Samples were placed on ice and introduced into the Helios upgraded CyTOF2 (Fluidigm) using the Super Sampler (Victorian Airship) introduction system; data were collected as .fcs files.
For the mass cytometry experiment including fifteen breast cancer cell lines, cells were stained with the following modifications: Purified Galectin-3 (clone Gal397) was applied at 1 µg/ml for 15 min at 4 °C, the cells were washed with CSM, stained with anti-mouse IgG (polyclonal)-148 Nd for 15 min at 4 °C, washed, and then the EMT antibody panel was applied as above, but using Ki-67 (clone B56) in channel 198 Pt. We observed a strong background signal in the channel 175 Lu (position of Keratin 7) even in Keratin 7-negative cell lines such as MDA-MB-231 and PBMCs and thus drew a gate to exclude this background signal from downstream analyses.
Mass cytometry data preprocessing. Mass cytometry data were concatenated using the.fcs File Concatenation Tool (Cytobank, Inc.), normalized using the MATLAB version of the Normalizer tool 28 , and debarcoded using the CATALYST R/Bioconductor package 29 . The .fcs files were uploaded to the Cytobank server (Cytobank, Inc.) for manual gating on populations of interest. The resulting population was exported as .fcs files and loaded into R v4.1.0 (R Development Core Team, 2015) for downstream analysis.
Flow cytometry surface marker screen data processing. Flow cytometry data were compensated on the LSRFortessa Cell Analyzer (BD Biosciences) using single-stained samples. The .fcs files were uploaded to the Cytobank server (Cytobank, Inc.) for manual debarcoding and gating on populations of interest. The mean signal intensity per well and population of interest was exported as an excel sheet. The mean signal intensity of the 'Blank' wells of the screen and the signal intensity of the respective 'Isotype control' well were subtracted. From the resulting intensity values, log2-transformed fold changes were calculated.
Clustering analyses and heatmap. For PhenoGraph 31 clustering of mass cytometry data, the R RPhenograph package (https://github.com/i-cyto/Rphenograph) was used. PhenoGraph clustering was performed per cell line, using 1,000 cells per condition and replicate and k = 30. All markers except IdU, Cyclin B1, Ki-67, and cleaved CASP3/PARP1 were used. For the heatmap we performed hierarchical clustering on the z scores of the shown markers, using Euclidean distance and ward.D linkage. The z scores were calculated on the arcsinh-transformed data per marker.
RNA sequencing data analysis. The RNA sequencing data was processed using an analysis setup derived from the ARMOR workflow 32 . Quality control of the raw FASTQ files was performed using FastQC v0.11.8 (Andrews S, Babraham Bioinformatics, https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Transcript abundances were estimated using Salmon v1.2.0 33 , using a transcriptome index based on Gencode release 34 34 , including the full genome as decoy sequences 35 and setting the k-mer length to 23. For comparison, the reads were also aligned to the genome (GRCh38.p13) using STAR v2.7.3a 36 . Transcript abundances from Salmon were imported into R v4.0.2 and aggregated on the gene level using the tximeta Bioconductor package, v1.6.2 37 . The quasi-likelihood framework of edgeR, v3.30.0 38,39 was used to perform differential gene expression analysis, accounting for differences in the average length of expressed transcripts between samples 40 . In each comparison, edgeR was used to test the null hypothesis that the true absolute log2-fold change between the compared groups was less than 1. edgeR was also used to perform exploratory analyses and generate a low-dimensional representation of the samples using multidimensional scaling (MDS). The analysis scripts were run via Snakemake 41 and all the code is available on GitHub (https://github.com/csoneson/WagnerEMT2020).

Data Records
A detailed list of all materials used in this study can be found as Mendeley Table 1 Table 17).

Technical Validation
Optimizing the time courses for in vitro induction of EMT. We induced EMT in four non-cancerous human mammary epithelial cell lines by prolonged ectopic stimulation with TGFβ1 or 4OHT over several days ( Fig. 1a; Methods); all four systems are widely used models of EMT 9,16,21 . We initially carried out a basic characterization of these models and optimized each induction time course to yield the maximum percentage of cells www.nature.com/scientificdata www.nature.com/scientificdata/    www.nature.com/scientificdata www.nature.com/scientificdata/ with mesenchymal (M) phenotype, characterized by loss of E-Cadherin and concomitant gain of expression of Vimentin 4 . We excluded apoptotic cells from the analysis (Fig. 1b).
On day 12 of prolonged exposure to TGFβ1, the HMLE cell line yielded 25% of cells with an M-phenotype, 33% of cells with a hybrid epithelial-mesenchymal (EM) phenotype with increased Vimentin expression but no downregulation of E-Cadherin, 28% of an E-Cadherin high Vimentin low phenotype (E1), and 14% of an E-Cadherin low Vimentin low phenotype (E2) (Fig. 1c,d). In comparison, on day twelve, 2% of control HMLEs exhibited an M-phenotype, 5% an EM-phenotype, 84% an E1-phenotype, and 9% an E2-phenotype (Fig. 1c,d). Control HMLEs with EM-or E2-phenotype were most abundant during sparse growth conditions, such as after splitting (Fig. 1d, Methods), indicating a regulation of E-Cadherin and Vimentin levels by growth density 16,43 .   www.nature.com/scientificdata www.nature.com/scientificdata/ As previously reported, treatment with TGFβ1 induced spindle-like morphological changes 44 and resulted in lower cell density compared with control 45 (Fig. 1e).
In the MCF10A cell line, induction of EMT by TGFβ1 treatment occurred in a different time frame. The percentage of cells with an M-phenotype increased from 54% on day two to 70% on day eight, the percentage of EM cells (28%) and E1 cells (2%) remained stable across the time course, and the percentage of E2 cells dropped from 10% to 2% (Fig. 1f,g). In control, cells with M-phenotype were at 26% on day 2 and 10% on day 8, cells with EM phenotype more than doubled from 25% to 64%, the percentage of E1 cells stayed stable at 22%, and the E2 cells decreased from 29% to 1% over the time course (Fig. 1f,g). As reported, TGFβ1-treated MCF10A cells acquired spindle-like morphologies while control cells retained their cobblestone shape (Fig. 1h) 16 . Together, these data show that under sparse growth conditions on day 2, MCF10A cells exhibit mesenchymal-like phenotypes even without TGFβ1 treatment, reflecting the basal-like character of the cell line 16 . An increase in cell density over time is accompanied by upregulation of E-Cadherin and therefore loss of the M-phenotype in control, while stimulation with TGFβ1 inhibits an E-Cadherin upregulation and induces an upregulation of Vimentin. In TGFβ1-treated cells, a decrease in the percentage of cells with M-phenotype on day eight compared with day six, suggests that cell density may inhibit further EMT 46 .
In the HTER and HSER cell lines, EMT was induced by prolonged treatment with 4OHT (Methods). We detected the highest percentage (14%) of 4OHT-treated HTER cells with M-phenotype on day ten, at which   www.nature.com/scientificdata www.nature.com/scientificdata/ point 26% of cells exhibited an EM-phenotype (Fig. 1i,j,k). The percentage of 4OHT-treated HSER cells with M-phenotype peaked at 12% on day eight and 28% of cells exhibited an EM-phenotype at this time point (Fig. 1l,m). For both cell lines, treatment with 4OHT induced spindle-like morphologies and was accompanied by reduced cell density compared with control (Fig. 1k,n), as previously reported 9 . We then assessed possible effects of the 4OHT treatment on HMLEs in the absence of the Twist1-ER or SNAIL1-ER fusion proteins. As expected, treatment with 4OHT did not induce EMT or morphological changes in HMLEs (Fig. 1o,p,q). In treated and control, the percentage of cells with M-phenotype was below 1% and cells with EM-phenotype at 11% at all time points, indicating a basal-like character of the cell line 9 . The majority of treated and control HMLEs maintained an E1-phenotype throughout the time course (Fig. 1o).
In conclusion, we could induce EMT in four in vitro human cell line models of this process. We observed phenotypic variability, including both full and partial EMT phenotypes, in response to 1-2 weeks of prolonged stimulation with TGFβ1 or 4OHT. Each model followed a unique EMT timeline and showed varying extents of transition to the mesenchymal phenotype.
Transcriptomic profiling of cells undergoing EMT. We next used RNA sequencing to identify markers that distinguish EMT-undergoing cells from control and markers that distinguish cells with EM-phenotype from cells with E-or M-phenotype. From the resulting markers, candidates were selected to inform a mass cytometry antibody panel design. For RNA sequencing, EMT-undergoing HTER cells on day eight and day twelve were sorted by fluorescence-activated cell sorting (FACS) into three populations: E-Cadherin high CD44 low (E1-phenotype), E-Cadherin int CD44 int (EM-phenotype), and E-Cadherin low CD44 high (M-phenotype) (Fig. 2a,  Methods). CD44 served as a surrogate M-phenotype marker for intracellular Vimentin to avoid cell permeabilization and RNA loss 9 . As control, day-matched untreated HTER cells with E1-phenotype were used (Fig. 2a). As a second type of control to monitor possible effects of 4OHT independent of EMT, we included 4OHT-treated and untreated HMLE cells. We included two to four pairs of independent biological replicates per condition and collected high quality RNA for all samples (Mendeley Table 2, Methods). www.nature.com/scientificdata www.nature.com/scientificdata/ RNA sequencing yielded above 20 million reads per sample assigned to genes, except one sample with 19 million reads (Fig. 2b, Mendeley Table 2). Mean Phred scores ranged between 35 and 36, indicating high base call accuracy, and GC content distribution across samples did not indicate any noticeable contamination (Fig. 2c, Mendeley Table 2). For all samples, more than 82% of the reads could be uniquely aligned to the human reference genome using STAR 36 . Mapping to the transcriptome index using Salmon 33 showed that more than 86% of fragments were assigned to a transcript, with little variation across samples.
We next assessed the similarity of samples based on global gene expression levels using multidimensional scaling 38,39 (Methods). This showed that the respective pairs of biological replicates were similar (Fig. 2d). Control HTER cells were similar to day-matched 4OHT-treated and control HMLE cells, indicating few effects of 4OHT on transcription independent of EMT. This analysis further revealed that 4OHT-treated HTER cells with E-, EM-, and M-phenotype were all separate from their respective day-matched control (Fig. 2d). Differential gene expression analysis showed that more genes were significantly differentially expressed between HTER cells with M-phenotype or EM-phenotype and control than between E-phenotype and control on day eight (Fig. 2e, Mendeley Tables 3-5). Among differentially expressed genes between M-phenotype and control, we found upregulation of canonical markers of EMT, such as the transcription factors ZEB1, ZEB2, FOXC2, and PRRX1, as well as downregulation of typical epithelial markers such as EPCAM 1 (Mendeley Table 3). We then asked, which genes were significantly differentially expressed between HTER cells with EM-phenotype and cells with E-or M-phenotype on day eight and found three genes (HHIP, FBN1, HHIP-AS1) and one gene (KIAA1755), respectively (Fig. 2f, Mendeley Tables 6 and 7). When comparing HTER cells on day twelve, more genes were significantly differentially expressed between cells with M-phenotype and control than between E-phenotype and control (Fig. 2g, Mendeley Tables 8 and 9).
In conclusion, 4OHT-treated HTER cells with M-phenotype or EM-phenotype deviated transcriptionally more from control than cells with E-phenotype. Also, 4OHT-treated cells with E-phenotype are transcriptionally distinct from control cells with E-phenotype.
Surface protein expression screen during EMT. We then carried out a flow cytometry surface protein screen to identify further markers that distinguish EMT-undergoing cells from control and to design a mass cytometry antibody panel. Treated and control samples of the HTER, HMLE, and MCF10A cell lines were fixed at multiple time points, fluorescently barcoded, and co-stained with a combination of surface epithelial markers, E-Cadherin and/or EpCAM, and a surface mesenchymal marker, CD44. The resulting flow cytometry data  www.nature.com/scientificdata www.nature.com/scientificdata/ were compensated, debarcoded, and gated for cell populations of interest (Fig. 3a-c, Methods). We detected expected surface protein abundance differences between cell populations, such as elevated levels of CD51 in EMT-undergoing cells compared with control 47 , confirming the quality of the screening results (Fig. 3d). We identified multiple surface proteins that were more than two-fold differentially expressed between treated (TGFβ1-treated or 4OHT-treated) and control samples (Tables 1-3, Mendeley Tables 14-16). Several of these were regulated in all three cell lines (CD51, CD83, CD266) or in two cell lines (e.g., CD90, CD146, CD166,     www.nature.com/scientificdata www.nature.com/scientificdata/ EGFR, N-Cadherin, Notch 3, Podoplanin) and most were regulated in the same direction (up or down) relative to control (Fig. 3e). Based on these flow cytometry screen results and the RNA sequencing analysis, we assembled a panel of candidate targets to assess phenotypic heterogeneity during EMT using a multiplex mass cytometry workflow (Fig. 3f, Mendeley Table 17).

Mass cytometric profiling of EMT phenotypes.
Mass cytometry is uniquely suited to assess phenotypic heterogeneity during EMT due to its ability to measure about 40 targets at the single-cell level 20,48 . To ensure high data quality, all antibodies against the candidate targets were titrated using samples that represent epithelial phenotypes (HMLE and MCF10A control cells), mesenchymal phenotypes (fibroblasts, TGFβ1-treated HMLE and MCF10A cells), and non-epithelial, non-mesenchymal phenotypes (peripheral blood mononuclear cells) (Fig. 4a). We then selected EMT-undergoing and control samples at four to six time points for each of the HMLE, HTER, HSER, and MCF10A cell lines, totaling 92 samples ( Table 4). The single-cell suspensions were fixed and mass-tag barcoded 26 to allow pooling and simultaneous antibody staining of the samples (Methods). We used antibodies against cleaved CASPASE-3 (cl. CASP3) and cleaved poly(ADP-ribose)-polymerase 1 (cl. PARP1) to exclude apoptotic cells, yielding more than 1 million live cells for downstream analysis (Fig. 4b). Comparing three biological replicates of the MCF10A or the HMLE cell lines using the dimensionality reduction algorithm Uniform Manifold Approximation and Projection (UMAP) 30 showed a strong similarity of the triplicates for each cell line (Fig. 4c). For MCF10A, the UMAP showed good discrimination of treated and control samples, including differences in E-Cadherin and Vimentin levels ( Fig. 4d; Methods). The separation of day 2 control MCF10A cells from other control MCF10A cells is likely caused by the very low expression of epithelial markers and strong expression of the proliferation marker Ki-67 in day 2 cells in contrast to later time points and may reflect growth at lower confluence. Sparse growth conditions have previously been associated with more basal/ mesenchymal-like phenotypes in the MCF10A cell line 16 . In the HMLE cell line, TGFβ1-treated and control samples were less separable (Fig. 4e). In the HTER and HSER cell lines, we observed a separation of 4OHT-treated cells with E-Cadherin low Vimentin high phenotype from their respective control on the UMAP (Fig. 4f). In contrast and as expected, 4OHT-treated HMLE cells were indistinguishable from control and displayed only low levels of Vimentin, indicating the absence of an EMT (Fig. 4g). Together, our multiplex mass cytometry data shows that EMT is associated with strong phenotypic changes in all four cell lines.  Table 4. Types of samples used for mass cytometry analysis in Fig. 4a-g. www.nature.com/scientificdata www.nature.com/scientificdata/ We next wanted to assess the phenotypic diversity of EMT-undergoing cells in more detail and in the context of other cell types and cell lines, specifically fibroblasts (i.e., mesenchymal cells), fifteen breast cancer cell lines spanning luminal epithelial, HER2-positive, and basal/mesenchymal-like epithelial phenotypes, and peripheral blood mononuclear cells (PBMC; i.e., neither epithelial nor mesenchymal cells). For this, we repeated the mass cytometry analysis for 35 markers and using a subset of time points for the HMLE, MCF10A, HTER, and HSER cell lines. We included four luminal (MCF-7, T47D, ZR-75-1, MDA-MD-134 VI), one HER2-positive (SKBR-3), eight basal Vimentin-positive (MDA-MB-436, MDA-MB-231, HCC38, HCC1395, BT459, CAL-51, HDQ-P1, and Hs578T), and two basal Vimentin-negative breast cancer cell lines (DU4475, HCC1806), fibroblasts, and PBMCs (Table 5). We then applied the algorithm PhenoGraph 31 to each EMT model individually, which grouped treated and control cells into nine to eleven phenotypically diverse clusters per cell line based on expression of all 35 markers (Fig. 4h, Methods). The majority of clusters contained mostly treated or untreated cells, indicating a treatment-based separation, while other clusters contained cells of both conditions. We observed this separation for all eleven clusters for the MCF10A cell line, six of eleven clusters for HMLE, eight of eleven clusters for HTER, and two of nine clusters for HSER (Fig. 4h). In MCF10A cells, we observed upregulation of CD44, Podoplanin, CD146, and CD51 upon EMT induction compared with control, and concomitant downregulation of E-Cadherin and K5. In the HMLE, HTER, and HSER cell lines, Vimentin, CD44, CD90, CD51, and CD10 were upregulated in EMT-undergoing cells compared with control (Fig. 4h). Several clusters containing EMT-undergoing cells were phenotypically similar to different basal breast cancer cell lines, as determined by hierarchical clustering (Fig. 4h, Methods). For example, cluster HSER_2 contained 4OHT-treated cells from days 5 and 10 and shared high levels of CD44, CD90, and CD146 and low levels of E-Cadherin, EpCAM, and cytokeratins with the basal Hs578T cell line and with fibroblasts (Fig. 4h, blue rectangle). In another example, the clusters MCF10A_1-6 contained TGFβ1-treated cells from days 4 and 8 or day 2 control cells and shared high levels of Vimentin, CD44, N-Cadherin, and Galectin-3 and low levels of EpCAM and E-Cadherin with the MDA-MB-231, BT549, HCC1395, and MDA-MB-436 basal cell lines. Low levels of epithelial markers in day 2 control cells likely reflects growth at low confluence 16 . In contrast, all luminal and HER2-positive breast cancer cell lines clustered separately from the EMT transition phenotype clusters (Fig. 4h).
In conclusion, we assembled an antibody panel for multiplex mass cytometry characterization of EMT and discovered a vast phenotypic diversity of EMT states among four widely used human in vitro models of this process. Several of these EMT states displayed phenotypic similarities with basal breast cancer cell lines and fibroblasts, suggesting that EMT in normal mammary epithelial cells can induce phenotypes observed among aggressive breast cancer cell lines.   www.nature.com/scientificdata www.nature.com/scientificdata/

Usage Notes
We provide here a comprehensive characterization of EMT transition phenotypes in four human mammary epithelial cell lines. We characterize transcriptomes and multidimensional protein-level single-cell phenotypes of these cell lines during EMT. We place these transition phenotypes in the context of the multidimensional phenotypes of fifteen luminal, HER2-positive or basal breast cancer cell lines, fibroblasts, and PBMCs. It has previously been shown that EMT in the here used models is associated with increased mammosphere formation 9 , or induction of invasion and migration 49 . A detailed functional assessment of the different molecular phenotypes of EMT-undergoing cells presented here is not part of this study and may be of interest.

Code availability
The code used for RNA sequencing data analysis can be found on GitHub (https://github.com/csoneson/ WagnerEMT2020) and can be accessed without restrictions. Please refer to the Methods section above for more details on software versions.