Evaluation of a developmental hierarchy for breast cancer cells to assess risk-based patient selection for targeted treatment

This study proposes that a novel developmental hierarchy of breast cancer (BC) cells (BCCs) could predict treatment response and outcome. The continued challenge to treat BC requires stratification of BCCs into distinct subsets. This would provide insights on how BCCs evade treatment and adapt dormancy for decades. We selected three subsets, based on the relative expression of octamer-binding transcription factor 4 A (Oct4A) and then analysed each with Affymetrix gene chip. Oct4A is a stem cell gene and would separate subsets based on maturation. Data analyses and gene validation identified three membrane proteins, TMEM98, GPR64 and FAT4. BCCs from cell lines and blood from BC patients were analysed for these three membrane proteins by flow cytometry, along with known markers of cancer stem cells (CSCs), CD44, CD24 and Oct4, aldehyde dehydrogenase 1 (ALDH1) activity and telomere length. A novel working hierarchy of BCCs was established with the most immature subset as CSCs. This group was further subdivided into long- and short-term CSCs. Analyses of 20 post-treatment blood indicated that circulating CSCs and early BC progenitors may be associated with recurrence or early death. These results suggest that the novel hierarchy may predict treatment response and prognosis.

Despite improved treatments, breast cancer (BC) remains a clinical problem. BC cells (BCCs) can remain dormant for decades, commonly referred as cellular dormancy [1][2][3][4][5][6][7][8] . Cellular dormancy is a method by which the BCCs enter a state of cellular quiescence until it receives a que from the environment to proliferate 9 . Clinical outcome studies have documented disease re-occurrence from 1-20 years after initial treatment regardless of lymph node involvement 10 . There could be a long lag time between the initiation of the tumor to clinical diagnosis 11 . During this lag time, metastatic BCCs could escape into the circulation from undetectable, but developing tumor 8,[12][13][14] .
The bone marrow (BM) can facilitate the survival of dormant BCCs for decades 15,16 . Thus, the BM is a significant organ when considering treatment for BC. Approximately 30% of BC patients have BM metastasis and about 50% of those may have cancer recurrence 17 . There is a strong correlation between BCCs in the BM and relapse. However, a direct evidence on cause-effect relationship between BCCs in the BM and metastatic recurrence requires additional studies. Regardless, it is evident that the presence of BCCs in the BM may be prognostic 8,18 . Thus, studies of BCCs using a developmental hierarchy as part of the characterization should be considered in future studies to correlate any association between developmental phenotype and outcome events including response. The stratification of BCCs into a robust hierarchy is missing in the literature. This study has begun to address this problem using gene chip arrays.

Results
Affymetrix Gene Array Analyses/Differential Expression. As the first step to stratify BCCs, we performed global gene expression analyses using MDA-MB-231. The method used to select three different subsets was previously described 3 , and based on the relative GFP intensities of cells stably transfected with the Oct4A promoter linked to GFP. Based on previous studies, we showed a direct correlation between Oct4A and GFP intensity 3 . Despite previous studies showing low levels of Oct4 protein maintaining pluripotency 39 , in our model, which is based on previous report, functional and biochemical studies indicated that the highest expression of Oct4A is linked to 'stemness' of BCCs 3 . We should also note that most of the Oct4 pseudogenes are not translated to proteins. Thus, for the purpose of simplifying our model, we have designated relative GFP intensities with Oct4A promoter as Oct4 hi (5%), Oct4 med (30%) and Oct4 low (10%) 3,26 (Fig. 1A). We then performed global gene expression analyses with total RNA from four different sets of MDA.MB-231, using the Affymetrix Genechip Human Genome U133 Plus 2.0 arrays. Oct4 hi and Oct4 low were analyzed in quadruplicates and Oct4 med , in duplicate (Fig. S1). The data were analyzed with multiple database and software tools, Partek ® Genomics Suite software, Gene Spring GX 10.0.2, QIAGEN's Ingenuity ® iReport, and BRB-Array Tools.
We conducted principal component analyses (PCA) to determine the similarities and differences among array results of each sample. The results indicated confidence with respect to similarities as indicated by clustering of the replicates from each BCC subset (Fig. 1B). Additionally, the clusters were far apart, indicating distinct separation of the overall gene expression pattern among the subsets (Fig. 1B).
BRB-Array Tools provided a multi-dimensional scaling analysis, supporting distinct gene expression among the samples and depicted the tight grouping of replicate sample and greater separation among the BCC subsets (Fig. S2). The similarities and differences were obvious in the hierarchical clustering of the samples linking Oct4 med and Oct4 lo (Fig. 1C). Both subsets were linked to Oct4 hi BCCs, further supporting a hierarchy among the three Oct4A subsets.
Heat maps generated with Partek and BRB-Array Tools incorporated the differentially expressed genes. The map with 6,300 genes showed clustering among the replicates, but with distinct gene intensities among the three BCC subsets (Fig. 1D, left). A hierarchical clustering of genes with −/+1.2 fold changes (2,227 genes) showed distinct pattern of gene expression among the three subsets (Fig. 1D, right). The results showed distinct global gene expression among BC subsets.
The analyses programs could not incorporate the data for Oct4 med BCCs due insufficient replicates. Thus, we compared Oct4 hi and Oct4 lo BCCs for differentially expressed genes using the normalized signal values. Those at higher statistical relevance are highlighted (p value, < 0.05; fold change > 2.0) (Fig. 1E, Table S1).
Biological networks analysis/canonical pathways. The IPA knowledge base compared genes with >1.5 fold for those associated with breast and ovarian cancers, and then determined the connections to Oct4/ POU5F1 (Fig. 1F). Molecule Activity Predictor illustrates activation or inhibition of regulators/molecules in the networks. In Oct4 hi BCCs, Oct4/POU5F1 expression was highly activated including its 34 transcriptional activators. Activation used in the network's output implies functional activation and increase in the gene expression. High expression of Oct4 showed upregulation/activation of genes linked to TNFα and TGFβ signaling, TGFB1, TGFB2, INHBA, SMAD3 and SMAD4, suggesting that Oct4 may be linked to microenvironmental functions. In general, the tumor cells are within a niche of cytokines. In Oct4 med BCCs, TNFα was inhibited and TGFβ signaling molecules: either downregulated (TGFB2) or poorly activated (TGFB1, TGFBR2, SMAD3 and SMAD4). Interleukins (IL-1, IL-2 and IL-4) and the proinflammatory CCL2 were highly activated in SNAI1, SNAI2, TWIST1, ZEB1 BCCs. The key molecules linked to epithelial mesenchymal transition (EMT), SNAI1, SNAI2, TWIST1, ZEB1, CXCL12, VIM, CDH2, MMP2, CXCR4, ERBB2, RAC1, Notch, PARD6A and PARD6G, were highly activated whereas E-cadherin (CDH1) and connexin-43 (GJA1) were inhibited. We also analyzed the data set for membrane-spanning genes, which is the focus of this paper. Our goal was to use the dataset to identify novel genes that can stratify BCCs. Hierarchical clustering of the samples using 2227 genes differentially expressed at p < 0.001 and FDR < 1.12%. (D) Heat map with hierarchal clustering generated from Partek compares all three conditions and represents 6300 genes differentially expressed (1.2+/−fold) at p < 0.1 and 10% FDR (false discovery rate). Two-sample T-test or F-test (more than 2 classes) based p value was calculated for each significant gene with random variance model. (E) Molecular interaction network of BC overrepresented genes associated with OCT4expressing BCCs generated from the information in the Ingenuity knowledge base. Top upstream regulators, identified by IPA core analysis, were incorporated in the network. Molecule Activity Predictor (MAP) illustrates upstream/downstream activation or inhibition of molecules in the network. Solid lines represent direct relationship and dashed line represents indirect relationship between nodes. Prediction legend and gene product's functional class are shown in the legend key. (F) A scatter plot developed in Genespring using normalized signal values in Log 10 scale. The statistical significance was assessed with an unpaired t-test between Oct4 hi and Oct4 low , cut-off points were p-value < 0.05 and fold-change > 2.0. Specific genes of interest have been annotated on the plot.
We showed the results for a few, in particular those with immediate interest since we sought membrane proteins for the stratification of BCCs; hence the selection of FAT4, TMEM98 and GPR64 ( Fig. 2A). The results for these three genes from the arrays were validated by western blot (Figs 2B and S3). RAP1A, which showed the highest difference in the Affymetrix data could not be validated by PCR or western blot and was not selected for further studies. Fibronectin (FN) was studied since as a control since it is an EMT marker and is therefore expected in the Oct4 hi subset, which has been shown to be the CSC 3 . In summary, the findings validated FAT4, TMEM98 and GPR64 as key membrane proteins for further development of a hierarchical rank among BCC subsets. ALDH1 and Telomere Length in BCC subsets. CSCs have been reported to be within the Oct4 hi BCC subset 3 . Since ALDH1 has been reported in CSCs (38), we examine the different BCCs for ALDH1. Positive control with K562 cells showed a major shift to the right (Fig. 3A). MDA-MB-231 cells showed 4.25% (+) for ALDH1 (Fig. 3B). The observation was consistent with BCCs comprise of ~5% CSCs (Fig. 3B) 3 . Next, we stratified BCCs, based on the relative intensities of fluorescence in cells stably transfected with Oct4-dsRED-SB. These analyses could not use Oct4-GFP due to the overlap with the ALDH1 substrate on the FACScan. BCCs were stably transfected with pOct4-dsRED-SB. Since we used this vector for the first time, we validated its function by treating the transfectants with Bix 3 . The purpose of using Bix is to inhibit methytransferases and deacetylation of the Oct4 regulatory region within pOct4-dsRED in all BCC subsets, regardless of endogenous expression 40 . The results with Bix indicated that pOct4-dsRED was functional (Fig. 3C). Using the gating scheme in Fig. 3D, 40% ALDH1(+) was observed in Oct4 hi , 9% in Oct4 hi/med and no detection in the Oct4A reduced BCC subsets (Fig. 3E).
High telomerase generally correlates with increase in the telomere length 41 . We therefore examined Oct4 hi , Oct4 med and Oct4 low BCCs for telomere length. The results showed a direct correlation between Oct4 levels and telomere length (Fig. 3F).

Distribution of TMEM98, GPR64 and FAT4 within CD44+/CD24−
which were reported to be markers of CSCs, needs further hierarchical stratification as this subset was shown to be heterogeneous with respect to CSCs 3,38 . We therefore studied how the CD44 + CD24− BCCs can be further stratified based on TMEM, GPR64 and FAT4 expressions. We observed a shift to the right for each marker, indicating their expression ( Fig. 4/middle panels). Next we analyzed CD44 + CD24− BCCs based on Oct4 (hi, medium, low) for TMEM98, GPR64 and FAT4. The geometric (Geo) mean for TMEM98 showed its presence on Oct4 (hi/med/lo) and, GRP64 and FAT4 on Oct4 (med/lo) CD44 + CD24− BCCs.
Assembling BCC hierarchy. The methods used for cell lines, analyses of ALDH1 used to assemble the hierarchy of BCCs are summarized in a flow diagram (Fig. S4). Although Oct4 hi BCCs were shown to contain CSCs 3 , we only found 40% ALDH1+ with 9% at Oct4 hi and Oct4 med interface (Oct4 hi/med ) (Fig. 3E). We assigned the Oct4 hi/med BCCs with reduced Oct4A-GFP intensity as short-term BC repopulating cells (BC-RC) and assigned Oct4 hi subset with the highest GFP intensity as term long-term BC-RC (

Identifying circulating tumor cells in BC patients.
We investigated if the developed hierarchy can predict treatment response and prognosis in 20 BC patients (Table 1). Due to ethical reasons, we studied time points, mostly after treatment using blood remaining after clinical tests. Table 1 describes the demographics and disease course of the patients. Flow cytometry with markers shown in the hierarchy (Fig. 5) assigned the developmental stage of BCCs for the patients ( Table 2). The method used to analyze the patients' samples were similar to analyses of cell lines as outlined in Fig. S4. Subject 5 showed high amounts of CSCs before treatment and early BC progenitors after treatment. At the time of writing (6 months post-treatment), this subject was diagnosed with bone metastasis. Similarly, subject 13 ended treatment with circulating CSCs and shortly after was diagnosed with bone pain. The preliminary studies indicated that the markers in the hierarchy may predict BC outcome and treatment response.

Discussion
We report on the genetic and phenotypic differences in BCC subsets with a working hierarchy. The population with high Oct4A showed heterogeneity depending on the scatter pattern and CD44/CD24 expression (Fig. 5). ALDH1, which is generally linked to stem cells was incorporated to demarcate the most primitive BCCs into more and less mature cells, with the possibility of varied multipotency (Fig. 5).
We applied the hierarchy (Fig. 5) by testing patients' blood during and after treatment (Tables 1 and 2). The analyses were done with limited amount of blood since the samples were those remaining after clinical diagnosis. The data indicated that the hierarchy may be relevant to treatment response and long-term outcome. The study lacked a timeline change due to the ethics of drawing blood for research when the patients were on chemotherapy. Subject (#5) was assessed before and after treatment since she was not anemic and donated additional blood samples. As expected Subject 5, who was a relapsed patient started with high CSCs and ended with early progenitors. After 6 months, this patient was presented with bone metastasis. The information provided with the blood samples suggested the hierarchy may aid be relevant to precise treatment for BC. Residual BCCs at the lower end of the hierarchy may predict a good outcome, perhaps due to their inability to initiate a tumor. This may contrast residual BCCs in the upper hierarchy, which could indicate poor prognosis. Also, BCCs the more primitive residual cells may dedifferentiate into CSCs.
Despite the reproducibility among the replicates in the arrays, some of the changes could not be validated by real time PCR. This may be explained by the RMA normalization method, which entails averaging the intensity of all probes. Since we were interested in selecting membrane proteins for clinical application, we focused on those genes then selected TMEM, GPR64 and FAT4, which contributed to an expanded BCC hierarchy 3 .
CD44 + /CD24− BCCs showed varied expressions of TMEM98, GPR64 and FAT4 in subsets based on Oct4A levels (Fig. 4). These findings suggested that CD44 + /CD24− BCCs while encompassing CSCs, could be heterogeneous. The identification of surface markers to stratify BCCs is important for the isolation of primary subsets thereby avoiding in vitro changes of the primary tumor. The hierarchy developed from this study can be used to select primary BCC subsets as would be required for precise analyses in cancer biology.
FN in Oct4 hi BCCs was highly significant due to its link to epithelial-mesenchymal-transition. Since the CSCs can adapt dormancy, the production of FN1 within the stromal compartment could be an indicator that the dormant BCCs have become part of the normal hematopoietic microenvironment, ensuring their survival. In BM, the dormant BCCs will need to survive within a complex system of hematopoiesis, and FN1 is part of the BM stromal microenvironment.
The established a network with the microarray data explained how Oct4 (POU5F1) was linked to the validated genes (Fig. S5). Since Oct4 expression was used to isolate the different BCC subsets, the link between Oct4 and other genes provided insights into how this stem cell-associated gene Oct4, which can demarcate different subsets, is linked to developmental stages of BCCs, and the three membrane markers, TMEM98, FAT4 and GPR64. Not a surprising link, Oct4 connected to Twist, which is gene associated with EMT 42 . Similar links to Oct4 was shown for additional markers in Fig. 1E. The link among genes in the Oct4 hi BCCs, shown to comprise the CSC subset, is relevant to an understanding of CSC survival, and the development of targeted treatment 3,43 .
Despite the lack of reliable markers of CSCs, this study has begun to show several potential markers.
It is yet to determine if the newly identified markers can identify BCCs in all microenvironments, or if the markers might be involved in the behavior of the BCCs within a specific organ. EpCAM, reported on CSCs, showed a 1.1 fold change in our array 44 . In our hierarchy, we identified the Oct4 hi BCCs, which were previously shown to be CSCs 3 . Similarly, it is possible that EpCAM-expressing BCCs may require further stratification. Also, the variations among antibodies will require robust validation to translate the findings to patients. As suggested for EpCAM 44,45 , the markers shown in this study may be important for follow-up studies to target using the chimeric antigen receptor approach.
The hierarchy shown in Fig. 5 were generated based on the relative intensities of Oct4-GFP/RFP in BCCs. The other genes linked to these broad subsets, if used to isolate the respective BCC subsets could allow for functional information, in addition to the discussed utility for prognosis. Oct4 is shown to link with genes that have been reported to have significant roles in tumor progression, such as EMT markers (Fig. 1F). IPA analyses with Oct4 gene as the focus indicated important molecules in BC progression and at the same time, displayed how these molecules are connected, their expression levels, and sub-cellular localizations. Our analyses added upstream regulators of Oct4. In summary, this study has just begun to determine how the Oct4 gene is linked to gene expressions and function. Together the data can be combined for long-term outcome studies, and to sort subsets of BCCs for functional studies. The telomere length shown in this study begins to examine functional studies. We anticipate to isolate primary BCC subsets to expand on these studies, in addition to other functional analyses.
The biological replicates presented in this study used one cell line. The data seem to fit the triple positive BCCs since T47D cells were used in the flow cytometry studies. The hierarchy developed from these studies were applied to patients with phenotypically different tumors, with respect to hormone receptor expression. Thus, although the limitation in the initial studies is the lack of analyses with T47D, the data provide information to expand the clinical analyses in a large cohort of patients.
This paragraph discusses the translational relevance of the study. The stratification of BC developmental subsets for clinical translational purposes is a major hurdle. The consequence being the lack of clarity to determine the cancer subsets being targeted. If the relative maturity of the surviving BCCs can be identified, this may provide insights on the patient outcome, such as time to cancer re-occurrence. We applied genomic data from Affymetrix analyses to develop a hierarchy (working) in which the least mature BCCs are placed at the top. The working hierarchy forms the basis for future studies to expand the relative maturity of BCCs. The hierarchy incorporated TMEM98, GPR64 and FAT4, CD44, CD24, ALDH1 and Oct4A. The significance of the hierarchy was evaluated by analyzing the blood of patients. Due to ethical constraints, we mostly collected blood samples after treatment to evaluate what specific cancer subset remained. The patients' outcome was monitored up to 1-2 years after the last treatment, which coincided with the blood testing. Although 24 blood samples were collected, we could only test 20 due to technical difficulties. Despite the lack of a larger study with statistical analyses, the current data strongly suggested that the hierarchy could predict treatment response and outcome. A larger multi-center control trials could lead to the development of precise treatment for breast cancer patients.
The final paragraph discusses the limitation of the study and provide a plan for future studies. We previously performed functional assay to show that Oct4 hi BCCs comprise the tumor initiating cells 3 . The present study used phenotypic studies and apply the findings to test clinical samples from BC patients. However, the hierarchy shown in Fig. 5 requires functional studies using in vivo models. We are in the process of generating antibodies to the extracellular region of GPR64, FAT4 and TMEM98. The antibodies will be used to select subsets of BCCs from  BCCs, stably transfected with Oct4-GFP, were previously described 3 . Briefly, the cells were transfected with pEGFP1-Oct3/4 and then selected with neomycin. Similarly, stable transfectants of BCCs with pDsRed2-1-Oct3/4-SB were selected with neomycin.

Reagents
Ethical statement. The use of human blood was approved by the Institutional Review Board (IRB) of Rutgers, Newark Campus. All subjects signed the informed consent forms. Furthermore, the studies were conducted as outlined in the IRB protocol.
Human subjects. Blood (2-5 mL) was taken from left over samples after diagnostic tests from BC patients.
The demographics of the patients and their treatment at the time of blood draw are outlined in Table 1.
Vectors. pEGFP1-Oct3/4, which contained the Oct4A promoter, was generously provided by Dr. Wei Cui (Imperial College London, UK). The insert Oct4 insert from pEGFP1-Oct3/4 was removed using the restriction enzymes, HindIII and AgeI. Since the insert and the vector were similar size, a further digestion was performed with NotI to cut the vector backbone from pEGFP. This separated the backbone vector and the Oct4 insert. The  Table 2. Phenotype of circulating BCCs in patients. The patient demographics shown in Table 1   Mononuclear cells from the blood of patients were isolated by Ficoll Hypque Density Gradient. The cells were labeled for the markers listed above, expect for the addition of CD45, which was used to gate out the hematopoietic cells. In cases where there was sufficient cells, we included anti-pan-cytokeratin to gate these cells for analyses, and performed labelings for intracellular Oct4A.
Data were analyzed with BD CellQuest (BD Biosciences). The geometric (Geo) mean, shown in Fig. 4, was calculated by subtracting the mean fluorescence intensity (MFI) of isotype control from the experimental MFI. The relative amount of circulating tumor cells (CTCs) were determined by the relative MFI of specific subsets based on the hierarchy shown in Fig. 5.
Aldefluor Assay. Aldefluor assay was performed with BCCs, untransfected or stably transfected with Oct4-dsRed cells using a kit from Stem Cell Technologies (Vancouver, Canada). The assay was performed according to the manufacturer's instructions. The K562 cell line served as a positive control. The aldefluor reagent was added to the cells. After this, the reaction was stopped at different times with diethylaminobenzaldehyde (DEAB). The cells were immediately analyzed by flow cytometry. The untransfected BCCs were gated as the total population and the Oct4-dsRed transfectants were gated based on Oct4 level (red fluorescence). We used 15 milliwatt 488 nm laser. In the case of Aldeflour, we used bandpass 530/30 filter and for Ds-Red, 585/42 filter. We use single labeled controls to compensate spectral overlap to guide for further optimization.
Telomere Length Assay. Telomere  Affymetrix Array. MDA-MB-231-pEGFP1-Oct3/4 cells were sorted into three conditions as stated above. In four different experiments, total RNA was isolated from the sorted cells with Qiagen RNeasy Mini Kit (Qiagen). The RNA quality was assessed on the Agilent Bioanalyzer 2100. Total RNA (100 ng) was converted into cDNA, cRNA, and fragmented according to the manufacturer's protocol with the assistance of the Biomek FX P Target Prep Express with standard scripts for 3′ IVT Express for Gene Expression. Fragmented samples were hybridized to GeneChip Human Genome U133 Plus 2.0 Arrays and then washed and stained following the manufacturer's protocols. The hybridized GeneChip Arrays were scanned with a GeneChip Scanner 3000 7 G (Agilent/ Affymetrix) and the data saved in .cel files.

Data Analyses.
A CONSORT chart is outlined to show how the samples were used in the Affymetrix analyses (Fig. S1). Three BCC subsets were selected based on GFP expression, which we correlated with Oct4A expression, based on previous studies 3 . The sorted cells were assigned Oct4 hi , Oct4 med , Oct4 lo . Oct4 hi and Oct4 lo were tested in quadruplicate, except for Oct4 med , which was tested in duplicate. Genes < 1.2 fold among the different subsets were filtered and the resulting genes were subjected to further analyses with multiple programs. This allowed us to reduce inherent bias of the individual program: Partek ® Genomics Suite software, version 6.11.0801 (Partek Inc., St. Louis, MO), Gene Spring GX 10.0.2 (Agilent Technologies, Santa Clara, CA); QIAGEN's Ingenuity ® iReport (QIAGEN Redwood City, CA) and BRB-Array Tool, version 4.4.0 developed by Dr. Richard Simon and BRB-Array Tools Development Team. Oct4 hi , Oct4 low and Oct4 med samples were considered as replicates if they were under the same condition. Average intensity values across each replicate were taken into consideration for analysis. Further statistical analysis was done in each program and noted in the figures.
Background corrected hybridization intensities were imported into BRB-Array Tools Version 4.4.0 log2-transformed and robust multi-array average (RMA) normalized (36). Features not significantly above background intensity in 50% or more of the samples, and features not changing at least 1.2-fold in at least 20% of the samples were filtered out. This yielded 25,750 features that were used in subsequent analyses.
BRB class comparison was conducted to identify genes that were differentially expressed. Two-sample t-test or F-test (more than 2 classes) based p value was calculated for each significant gene with random variance model. Genes with p-values of <0.001 were considered significant. The false discovery rate was also estimated for each gene to control false positive, as described 46 . Multidimensional scaling was performed in BRB-ArrayTools to represent high-dimensional data (e.g., a selected gene list with relative expression) graphically in low dimensions. The Euclidian distance metric was used to compute a distance matrix and the principal components of the gene expression signature. Each sample was then represented by a single point and the distance between two points indicated the overall similarity of those two samples. The first three principal components of gene expression were used as axes to generate a plot. Data was also visualized using hierarchical clustering in BRB-Array Tools. The Euclidean distance metric and average linkage was used to cluster genes and generate a heat map.
Pathway and networking by Ingenuity Pathway Analysis (IPA). The significantly differentially expressed genes (p < 0.001, FDR ≪0. 6%) were imported into Ingenuity Systems (http://www.ingenuity. com/). The IPA knowledge base filter mapped 2705 transcripts to known genes in IPA. We next performed IPA core-analysis in the context of pathways and networks, biological function and/or diseases. The right-tailed Fisher's exact test was applied to calculate the p value ascertaining the probability that each biological function and/or disease. IPA core analysis identified 1721 genes as the highest significant functional category associated with cancer (p < 2.10E-15) and 344 genes relevant to BC (p < 3.10E-13). Of the 344 genes over-represented for BC, 99 of these genes have relative fold change >1.5. We used pathway analysis to determine whether Oct4 hi expression was connected to genes associated with BC at the molecular network level based on connectivity information in the IPA Knowledge Base. We added molecules suggested by the IPA "pathway explorer" in order to connect molecules of interest. Priority was given to those molecules with a high degree of connectivity within the pathway rather than molecules with many connections to molecules not on the pathway. Data availability. The dataset generated with the Affymetrix gene arrays are available through the GEO database using accession number GSE 86861.