Gene expression in breastmilk cells is associated with maternal and infant characteristics

Breastmilk is a rich source of cells with a heterogeneous composition comprising early-stage stem cells, progenitors and more differentiated cells. The gene expression profiles of these cells and their associations with characteristics of the breastfeeding mother and infant are poorly understood. This study investigated factors associated with the cellular dynamics of breastmilk and explored variations amongst women. Genes representing different breastmilk cell populations including mammary epithelial and myoepithelial cells, progenitors, and multi-lineage stem cells showed great variation in expression. Stem cell markers ESRRB and CK5, myoepithelial marker CK14, and lactocyte marker α-lactalbumin were amongst the genes most highly expressed across all samples tested. Genes exerting similar functions, such as either stem cell regulation or milk production, were found to be closely associated. Infant gestational age at delivery and changes in maternal bra cup size between pre-pregnancy and postpartum lactation were associated with expression of genes controlling stemness as well as milk synthesis. Additional correlations were found between genes and dyad characteristics, which may explain abnormalities related to low breastmilk supply or preterm birth. Our findings highlight the heterogeneity of breastmilk cell content and its changes associated with characteristics of the breastfeeding dyad that may reflect changing infant needs.

Scientific RepoRts | 5:12933 | DOi: 10.1038/srep12933 is limited to cellular changes due to feeding/milk removal 6 and the responses of breastmilk leukocytes to mother and/or infant infections during the course of lactation 12 .
Breastmilk contains a heterogeneous mix of cells including epithelial cells and leukocytes. Leukocytes are the most widely studied cell type in breastmilk due to their protective properties and their known ability to infiltrate the infant's tissues [12][13][14] . However, leukocytes constitute only a minority of cells in mature human milk when both the breastfeeding mother and infant are healthy 11,12 . On the other hand, epithelial cells are thought to be the most dominant cell type in human milk, and their properties and functions have not been intensively studied 11 .
Breastmilk epithelial cells consist of the two main types of cells, luminal and myoepithelial cells. Luminal cells express epithelial cell adhesion molecule (EPCAM) 11 whereas myoepithelial cells express smooth muscle actin (SMA) and cytokeratin 14 (CK14) 11 . Luminal epithelial cells are made up of small populations of ductal non-secretory epithelial cells which express cytokeratin 19 (CK19) and alveolar cells (lactocytes) that express cytokeratin 18 (CK18) and synthesize and secrete milk, and are thus positive for milk proteins such as α -lactalbumin (α -LA) 15 and β -casein 15 . Mammary stem-like cells positive for the markers α 6 integrin (CD49f) and cytokeratin 5 (CK5) have also been identified in breastmilk and have been proposed to act as precursors to both luminal and myoepithelial cell types 16,17 . Earlier reports have demonstrated that epithelial cells isolated from freshly expressed breastmilk were able to expand in adherent culture and form colonies of various morphologies that could be maintained through multiple passages 15,[18][19][20][21] . This suggested for the first time the presence of self-renewing cells in breastmilk 11 . These observations together with previous work by Russo et al. on the ultra-structure of lactocytes suggested that breastmilk contains both less differentiated, self-renewing cells and more differentiated milk-secretory cells 22 . Cregan et al. (2007) first reported cells with mammary stem-like properties in breastmilk, expressing ectodermal progenitor markers such as Nestin 23 . The presence of these cells has been confirmed by other investigators 16,24 and was further expanded by the identification of breastmilk stem cells (BSCs) 25 . BSCs not only self-renew in 3D spheroid culture, but also express pluripotency genes including the core transcription factors OCT4, SOX2 and NANOG and downstream targets KLF4, REX1 and GDF3. These breastmilk-derived cells are capable of differentiating into cells from all three germinal layers 25,26 . The various levels of gene expression observed within single breastmilk cell samples 25 confirmed the presence of a cellular hierarchy in breastmilk, from early-stage stem cells to progenitor cells to more differentiated lactocytes and myoepithelial cells 11,16,[23][24][25] .
The discovery of BSCs with multilineage differentiation potential raised numerous questions as to the fate of these cells in the breastfed infant and their potential use in regenerative medicine 11,27 . Recent advances in our laboratory have provided the first evidence that BSCs integrate into tissues of the neonate 13 , potentially providing developmental benefits 11 . This, together with previous observations that BSCs are not tumorigenic 25 and are naturally transferred from the mother to the infant render these cells excellent candidates for stem cell therapies 11 . However, before such applications can be further explored, it is necessary to establish factors influencing the prevalence of breastmilk stem cells in expressed milk in order to optimize their isolation. Therefore, the aim of this study was to describe the existing variation of breastmilk cell populations between women, and to explore associations of gene expression with dyad characteristics, particularly those that have been previously linked to changes in biochemical components 5,7,9,10 . A secondary aim was to examine genes not previously assessed in breastmilk cells to broaden knowledge of the cell types present in human milk and their potential functional significance.

Results
Breastmilk cell gene expression varies amongst women. The demographic characteristics of the study participants (n = 66) are described in Table 1 17.5-37.4). Bra size was used as an indicator of breast volume, as previously described 28 . The average change in bra size from pregnancy to lactation was from a C to a D cup. The infants were mainly first born (n = 40), with a range of 2-4 children from multiparous mothers (n = 26). Breastmilk sample volumes collected were on average 55 mL (range: 6-240 mL) and contained on average total cell counts of 2.77 × 10 5 cells/mL milk (range: 0.17 × 10 5 -3.75 × 10 6 cells/mL milk). The majority of cells isolated from freshly expressed breastmilk were viable (mean: 97.9%; range: 85.3-100%) ( Table 1).
The core pluripotency transcription factors OCT4, SOX2 and NANOG had similar expression profiles (Fig. 1a), with the majority of breastmilk cells expressing these genes at high levels and in some cases Scientific RepoRts | 5:12933 | DOi: 10.1038/srep12933 comparable or even higher than in hESCs (Fig. 1a, Tables 2 and 3). It was found that 96% of the breastmilk cell samples expressed these genes at levels higher than fibroblasts, 2% had higher OCT4 expression than hESCs, and over 10% of the samples expressed the three genes at levels higher than HUMECs. All breastmilk samples had significantly higher ESRRB expression than fibroblasts, with a mean of greater than 6,400 times higher expression (p < 0.001) (Tables 2 and 3). Similarly, approximately 80% and 90% of the breastmilk cell samples had higher ESRRB expression than hESCs and HUMECs, respectively. GDF3 was expressed significantly higher in all breastmilk cell samples compared to both fibroblasts and resting human mammary epithelial cells (HUMECs) (p < 0.001 and p = 0.001, respectively) (Tables 2  and 3); however, all breastmilk cell samples had significantly lower GDF3 expression compared to hESCs (p < 0.001). For all cell lines tested, KLF4 and REX1 expression was within the range seen in breastmilk cells (Tables 2 and 3).
Approximately 35% of breastmilk cell samples expressed CD49f at levels up to 3 times higher than HUMECs, 70% of the samples at levels up to 12-fold higher than hESCs, and all samples were significantly higher than fibroblasts (p = 0.001) ( Table 3). PAX6 was expressed in 90% of breastmilk samples up to 350 times more highly than both fibroblasts and hESCs (Fig. 1a). All breastmilk samples were also significantly higher than HUMECs for PAX6 and NESTIN (p < 0.001). NESTIN was significantly lower (up to 4 times) in breastmilk cells compared to hESCs (p = 0.002), and higher than fibroblasts for 90% of the samples. Gene expression for NOGGIN and PTEN for all the cell lines tested was within the range of breastmilk cells (Tables 2 and 3).
Mammary stem cell gene CK5 was variably expressed amongst the breastmilk cell samples, with HUMECs showing similar expression (Fig. 1a). On the other hand, myoepithelial marker CK14 was significantly lower in breastmilk cells, on average 70 times lower compared to HUMECs (p = 0.001). Expression of lactocyte markers α -LA, EPCAM and CK18 in breastmilk cells was up to 1000, 16 and 100 times higher, respectively than in HUMECs for the majority of samples tested (Table 3).

Breastmilk gene expression correlates with genes of similar function.
Using principal component analysis (PCA), interactions for most genes tested were found to be within the first four components explaining 85.1% of the total variance within our gene expression profiles (Fig. 2). The first principal component (PC1) explained 52.6% of the variation in gene expression. Largely weighted α -LA and EPCAM clustered together and were in opposition to similarly weighted genes PTEN and NOGGIN. Within the second principal component (PC2) (Fig. 2), which explained a further 20.1% of the variation, the variance for ESRRB was opposite to the genes KLF4 and REX1. In the third principal component (PC3) (Fig. 2), which explained 6.5% of variation, OCT4, SOX2 and NANOG clustered together with similar weighting. In the forth and final principal component (PC4), which explained 5.9% of the variation, CK14 and CK5 clustered together with strong loadings with CK18, which had a weaker loading by comparison (Fig. 2).
Breastmilk cell gene expression is associated with mother/infant demographic characteristics. Higher maternal BMI was associated with lower expression of CK18 (p = 0.029). Higher gestational age at delivery (closer to term birth) was associated with higher α -LA (p = 0.040), higher NESTIN  Table 1. Demographic and breastmilk sample characteristics of the mother/infant dyads (N = 66). * Cup sizes represent Australian bra sizes.

Discussion
Breastmilk cells are emerging as a valuable non-invasive tool to examine the normal function, development and pathologies of the mammary gland. In this study, we examined breastmilk cellular heterogeneity at the molecular level, how it varies amongst lactating women, and how mother and infant characteristics may influence this variability. Importantly, we identified genes such as estrogen-related receptor-β (ESRRB), cytokeratin 5 (CK5), cytokeratin 14 (CK14) and α -lactalbumin (α -LA), which was universally expressed at high levels amongst the breastmilk cell samples examined. Moreover, genes regulating similar functions were found to have similar expression profiles, including pluripotency-regulating genes and lactation-associated genes. Mother-infant dyad characteristics showed significant correlations with specific genes, illustrating the high variability in gene expression amongst and within mothers and providing the basis for standardisation of studies that investigate the cellular content of breastmilk.
Expression profiles for most tested genes varied widely amongst breastmilk samples. These differences provide further evidence that breastmilk cell composition is highly variable amongst women and support the notion that breastmilk cells change in response to maternal and infant characteristics and are highly dependent on the specific mother-infant dyad 11 . The high expression recorded for the genes CK5, CK14 and CK18 is not surprising as these genes are representative of mammary stem/progenitor cells, myoepithelial cells and lactocytes respectively, all of which have been previously found in breastmilk 1,11,23,25,29,30 . In addition, these genes clustered closely together suggesting tight molecular regulation and/or involvement in similar cellular pathways in the lactating mammary gland. As expected, the milk protein α -LA and the pan-epithelial marker EPCAM showed higher expression in breastmilk cells compared to resting mammary gland cells (HUMECs), and the protein for both genes was prevalent in stained lactating breast tissues (Table 3). Our data therefore illustrate a strong epithelial signature and the presence of an epithelial hierarchy in breastmilk samples obtained from healthy mother-infant dyads, which is in agreement with previous literature 11,12,25 .
Interestingly, high expression of the gene estrogen-related receptor β (ESRRB) was observed in the majority of the breastmilk samples analysed and tended to be higher in breastmilk cells (range in breastmilk cells of 115-42,655 times higher than fibroblasts, whereas HUMECs had 2.73 times higher expression than fibroblasts) compared with HUMECs. ESRRB is an orphan nuclear receptor that displays   31 . In Drosophila it functions as a metabolic switch during development 32 , however in murine embryonic stem cells it binds to NANOG to sustain pluripotency and cell self-renewal 33 . Its presence and significance has not yet been confirmed in hESCs, though it has been suggested to play a role in trophoblast stem cell differentiation [33][34][35] . ESRRB may be integral to mammary stem cell regulation and could potentially be regulated by fluctuating oestrogen that occurs naturally throughout pregnancy and lactation in the breast 36 . In the mammary gland, ESRRB has been shown to be co-expressed and correlated with estrogen receptor β (ER-β ) in breast cancer biopsies; however, the significance of these observations is not well understood 37 . To our knowledge, this is the first report of ESRRB expression in breastmilk cells. The relatively high expression of this gene in breastmilk in comparison to resting mammary cells or human embryonic stem cells suggests that this receptor and its associated ligand(s) may play an important role in the human lactating mammary gland, potentially facilitating milk synthesis and mammary gland function. ESRRB functions may relate to stem/progenitor cell maturation and maintenance during lactation to support milk synthesis, a hypothesis that is supported by its close interaction with the genes KLF4 and REX1, both of which are known to participate in the control of multi-lineage differentiation and pluripotency in human embryonic stem cells 33,34,38 .
Other genes that were examined in this study that have not been previously considered in breastmilk cells are NOGGIN and PTEN. NOGGIN has been confirmed to be an important gene controlling embryonic development in the human 39 , and is also known as a bone morphogenic protein (BMP) antagonist, functioning to prevent differentiation of mammary epithelial cells 39,40 . In breastmilk cells, NOGGIN was not highly expressed (Fig. 1a), and is consistent with the high prevalence of differentiated lactocytes both in the lactating mammary gland and in breastmilk 11,41,42 . On the other hand, PTEN is a tumour suppressor influencing diverse functions such as cell cycle, apoptosis and metastasis [43][44][45][46] , known to be expressed in normal mammary tissue, with mutations and/or downregulation being characteristic of breast and other ectodermal cancers 47,48 . PTEN was minimally expressed in all breastmilk samples tested. Expression levels reported here, represent the normal expression in the lactating mammary gland and breastmilk of healthy women who do not have a history of breast cancer, and potentially reflect the rapid cycling of cells in the lactating epithelium. These results suggest that it would be of value to investigate expression of PTEN as a potential diagnostic biomarker in breastmilk of women that had previously had breast cancer or a family history for this disease. Both PTEN and NOGGIN were found to cluster together, but were inversely associated with the genes α -LA and EPCAM, suggesting potential opposing regulation and/or function of these gene pairs.
The pluripotency genes OCT4, SOX2 and NANOG were strongly correlated (Fig. 2). These genes have been shown to be the core transcription factors controlling multi-lineage differentiation and pluripotency of ESCs as well as induced pluripotent stem cells (iPSCs) 49 . Whilst not all breastmilk samples expressed high levels of these genes, some samples had similar levels of expression to hESCs, which is  in accordance with previous literature 25 . The presence of these genes in the lactating mammary gland and in breastmilk at levels higher than the resting gland confirms previous reports from our and other laboratories 6,11,25,50 and is also depicted in stained lactating breast tissue sections (Fig. 1b). The demonstration of a tight association of these genes further supports an important function in lactation and in the multi-lineage differentiation capability of breastmilk cells 25 . Other genes found to be expressed in breastmilk cells that may also regulate their differentiation capability are CD49f and PAX6, which are representative of mammary stem cells and the neuroectodermal lineage respectively 51,52 . Both genes had similar expression profiles in the breastmilk samples examined; however, PAX6 expression in breastmilk cells was much higher compared to resting breast cells, showing a significant upregulation during lactation. This signifies a function of PAX6 in the mammary gland during its remodelling associated with pregnancy and/or lactation. In addition, it may explain the known propensity of breastmilk stem cells to differentiate towards the neural lineage 25,27 . NESTIN is another known multi-lineage stem cell marker, which has been previously detected in breastmilk cells 23 . NESTIN expression was higher in breastmilk cells compared with HUMECs, which is indicative of an upregulation of the mammary stem/progenitor cell characteristics during lactation. NESTIN protein was also confirmed to be present in abundance in lactating breast tissues (Fig. 1b). Importantly, large variation was found in gene expression amongst breastmilk samples, some of which can be explained by maternal and infant characteristics. Gestational age at delivery was associated with gene expression, in particular higher α -LA and NESTIN were found for infants with greater gestational age at delivery. This suggests that the mammary gland is more mature in mothers of term infants, containing more milk-secretory cells, but also more cells with progenitor properties thus being able to maintain some level of plasticity. For mothers of preterm infants, the levels of the stem cell marker SOX2 were much higher, accompanying the lower expression of α -LA and NESTIN compared to mothers of term infants. These findings suggest that the breast has not yet fully matured in women giving birth prematurely, providing a biological explanation for the observation that mothers of preterm infants often have a compromised initiation of lactation 53 . In this connection, a role in mammary development early in lactation may be exerted by ESRRB, which was more highly expressed earlier in lactation. The observed depletion of ESRRB+ cell populations, which occurs the longer a woman breastfeeds, may provide a an explanation for the known protection of breastfeeding against breast cancer 54 . On the other hand, GDF3 had a positive association with the stage of lactation, suggesting that this gene may be representative of a progenitor-like cell that is enriched towards later stages of lactation. This is in agreement with previous studies showing that during involution certain types of progenitor cells remain in the mammary gland and are thought to fuel the remodeling of the gland during the next pregnancy 55 . Both ESRRB and GDF3 may play fundamental roles in mammary development during pregnancy and lactation and are important to consider in further investigations.
A negative association was found between BMI and expression of CK18, a marker for luminal epithelial cells including lactocytes 30 . This suggests that women with a larger body mass index have less epithelial tissue capable of synthesizing milk. Previous studies have observed that maternal obesity before conception leads to a number of complications in postnatal breastfeeding including delayed onset of lactogenesis, impairment of lactogenesis II, and thus lower rates of breastfeeding initiation and shorter breastfeeding duration [56][57][58][59] . Our finding linking high maternal BMI with lower CK18 expression in breastmilk cells may explain the observed difficulties among breastfeeding mothers who are obese; this finding also provides a potential target for future clinical intervention.
Significant differences in the change of bra size between the pre-pregnancy and the postpartum mammary gland were found and were linked to higher levels of α -LA and EPCAM. This suggests that whilst prior studies have found that larger bra size does not necessarily translate to presence of more lactocytes, greater change in bra size, i.e. in breast volume 28 , is reflective of more epithelial cells, potentially resulting in higher milk production. This finding is interesting in light of Powe et al., who reported that larger changes in bra size were linked with greater milk energy content; however, this association became non-significant after considering other factors 10 . The lower levels of SOX2 and the greater levels of REX1 expression observed with a greater change in maternal bra cup size may facilitate this process (Fig. 3).
In addition to demographic characteristics of the mother-infant dyad, breastmilk characteristics such as the cell content were found to be associated with gene expression. Higher breastmilk cell content was related to higher levels of α -LA and GDF3, suggesting the presence of more lactocytes and GDF3+ cells in cell-rich breastmilk, which typically reflects recent emptying of the breasts 6 . Breastmilk cell content was also significantly positively related to the gene cluster α -LA, EPCAM, PTEN and NOGGIN, which mainly represents epithelial cells. Moreover, infant sex was found to relate to epithelial/lactocyte genes such as α -LA and EPCAM, which were higher in the breastmilk from mothers of female infants. This suggests that these mothers had more lactocytes in their gland, which could potentially be associated with greater milk yield. Although this was not measured in the present study, it has been previously reported in captive rhesus macaques and in the dairy cow that mothers of female infants have greater milk yield compared to mothers of male infants 60,61 . This is in agreement with our findings, which may provide a mechanistic explanation of the fetus-driven differential remodeling of the mammary gland resulting in greater milk synthesis in mothers of female offspring.
Differences in breastmilk cell gene expression were observed between samples collected in Australia and the USA, although we are cautious in interpreting these results as other factors associated with different laboratory conditions or maternal milk supply may contribute to these observations. A further limitation of this study was the analysis of a pre-selected set of genes in all the cellular content of breastmilk. Therefore, some of the variation seen in gene expression may reflect the activity of a subgroup of cells, such as lactocytes or myoepithelial cells, whereas other variation may reflect the proportion of each cell type. Future studies will further delineate these questions by isolating and examining specific subpopulations of breastmilk cells as well as utilizing more comprehensive gene expression analyses, such as next generation sequencing. Finally, given that this study examined gene expression and effects of demographic traits in a cross-sectional dataset, it will be important to confirm and further examine these associations longitudinally.

Conclusions
This study highlights the heterogeneity of breastmilk cell content and associations with characteristics of the breastfeeding mother-infant dyad. Interestingly, the lack of certain correlations, such as between breastmilk cell gene expression and maternal age, parity or mode of delivery, suggests that these characteristics may not important factors influencing total breastmilk cell gene expression in our cohorts. We identified core genes such as CK5, CK14, ESRRB and α -LA that are prevalent in breastmilk cells across mothers. Many of them show important interactions and may be involved in mammary gland function, lactation performance, and/or infant development in light of the recently reported integration of breastmilk cells into the infant 13 . Interestingly, some of the interactions found here between dyad demographic characteristics and genes such as α -LA and GDF3 may potentially explain abnormalities associated with low breastmilk supply and preterm birth, and warrant further investigation. In the future, breastmilk cellular analyses both at the protein and mRNA levels, considered together with dyad demographic characteristics, may be a useful routine practice in hospitals particularly in neonatal intensive care units, in the management of low milk supply, and during treatment of maternal and/or infant infections.

Materials and Methods
Breastmilk Sample Collection. The study was approved by the Human Research Ethics Committee of the University of Western Australia and the institutional review board (IRB) of The University of North Carolina at Chapel Hill, USA. All the methods were carried out in accordance with the approved guidelines. Breastfeeding dyads (n = 66) were recruited in Australia (n = 38) and the USA (n = 28), who were breastfeeding at least once daily (not necessarily exclusively) to attend a single session at either The University of Western Australia or the University of North Carolina. Exclusion criteria of the study were applied to participants that reported signs of breast, other organ or general/systemic infection, were pregnant or did not provide a sufficient sample to perform PCR with the extracted cellular RNA. All participants provided informed written consent and completed a confidential questionnaire including relevant demographic data. During this session, participants expressed a breastmilk sample (5-250 mL) under aseptic conditions with a Medela Symphony breast pump (Medela AG, Baar, Switzerland). Samples Scientific RepoRts | 5:12933 | DOi: 10.1038/srep12933 were transported to the laboratory immediately shielded from the light. Cell content and viability were measured, and RNA was extracted for RT-PCR analysis as previously described 25 . Breastmilk Cell Isolation. Breastmilk was diluted with equal volume of sterile phosphate buffered saline (PBS; pH 7.4, Gibco, Grand Island, NY) and centrifuged at 800 g for 20 minutes at 20 °C. The lipid layer and skim milk were removed, and the cell pellet was washed twice in PBS at 400 g for 5 minutes and was resuspended in PBS. The total cell content and viability of each sample were determined with a Neubauer haemocytometer by Trypan Blue exclusion. The cell pellet was stored at − 80 °C until RNA extraction.
Cell Culture. Cell lines used as positive and negative controls included adult dermal fibroblasts (Lonza, Walkersville, USA) and primary neonatal fibroblasts 25 , the Mel-2 embryonic stem cell line (StemCore, Brisbane, Australia), the OCT4-transduced breast cell (OTBC) line 62 , and a resting human mammary epithelial cell line (HUMEC) derived from normal resting breast tissue mammoplasties 25 . Cell lines were cultured in T75 flasks for several passages in a Sanyo CO 2 incubator MCO-17AIC (Quantum Scientific, Queensland, Australia) held at a constant temperature of 37 °C at 5% CO 2 . Fibroblasts were cultured in DMEM/F12 supplemented with 20% fetal bovine serum (FBS; Fisher Biotec, Western Australia, Australia) and 1% antibiotic/antimycotic (Invitrogen, Victoria, Australia). Mel-2 cells were cultured in flasks coated with Matrigel (In Vitro, Victoria, Australia) at a density of 1.3 μ L/cm 2 in conditioned media from human fetal fibroblast feeder cells (CMKSR; conditioned mTeSR Stem Cell Technologies, Victoria, Australia) purchased from StemCore (Brisbane, Australia). Both OTBCs and HUMECs were cultured in HUMEC complete medium (Invitrogen, Victoria, Australia) supplemented with 1% antibiotic/antimycotic.

RNA Extraction.
Total RNA was extracted with the mini RNeasy extraction kit (Qiagen, Valencia, CA). Cell pellets were incubated in 600 μ L of RLT buffer for 10 minutes and transferred to a separate Eppendorf tube whereby they were triturated through a 21G needle syringe 10 times. Lysate was mixed well and equal volumes of 70% ethanol was added before mixing and spinning the lysate through the provided spin column at a maximum speed of 8,000 g for 30 seconds. Flow through was discarded and 700 μ L of RW1 solution was added to the spin column before spinning at 8,000 g for 30 seconds. Following this, 500 μ L of RPE buffer were added and spun at 8,000 g for 30 seconds. This was repeated after the flow-through was discarded, and it was spun for 2 minutes at 8,000 g. The spin column was placed in a new tube and 30 μ L of RNAse free water was added to the center of the column, incubated for 10 minutes on ice before spinning for a final time at 8,000 g for 1 minute. After the RNA was eluted, RNA quantitation and purity was measured using a Nanodrop 1000. RNA obtained was of an acceptable quality fitting in the absorbance at 260/280 ratio of 1.8-2.2.
cDNA Generation. Total RNA was reverse transcribed using the high-capacity cDNA archive kit (Applied Biosystems, Carlsbad, CA). A 50-μ L reaction was created by adding prescribed volumes of each component contained within the kit, to make up the cDNA master mix to 25 μ L of the RNA diluted in Ultrapure RNAse free water (Gibco). Samples were incubated in a Bio-Rad C1000 96 well gradient block thermo cycler and held at 25 °C for 10 minutes, 37 °C for 120 minutes, 85 °C for 5 minutes, and held at 4 °C until collected. cDNA was stored at − 20 °C until required for quantitative real-time polymerase chain reaction (qRT-PCR).

Quantitative Real-Time Polymerase Chain Reaction (qRT-PCR). Gene transcription was quan-
tified by qRT-PCR using hydrolytic probes (Taqman, Applied Biosystems; Supporting Information Table  S1) with the 7500 Fast RT-PCR system (Applied Biosystems). Each sample was measured in triplicate or in few cases in duplicate when the extracted RNA was not adequate. Genes were standardized to fibroblasts, and each sample was controlled with an in house regulator GAPDH. Fold change in gene expression for each sample and experimental condition was calculated as 2 Ct(control)-Ct(sample) ± SD and relative quantitation was determined for each replicate. Repeated measures of the samples were averaged and the standard deviations were calculated. Standard deviations were used for quality control of the data and means were used for statistical analyses.
Immunostaining of mammary tissues. Human biopsied mammary tissues from lactating women were obtained from the tissue archive of the School of Anatomy, Physiology and Human Biology, The University of Western Australia. Formalin-fixed and paraffin-embedded tissues were sectioned at 5 μ m thickness and immunostained for fluorescence microscopy as described in Hassiotou et al. 2012 25 . The primary and secondary antibodies used are shown in Suppl. Table 2.
Flow cytometry. Flow cytometric analysis was conducted in cells isolated from freshly expressed breastmilk collected from two breastfeeding women according to Hassiotou et al. 2012 25 . In short, isolated cells were incubated with fixative (1.5% Paraformaldehyde: BDH, Poole, England; 1% Sucrose: Merck, Darmstadt, Germany in phosphate buffered saline/PBS: Invitrogen, Mulgrave, Victoria) for 10 minutes followed by washing in PBS and were then centrifuged for 30 seconds. Permeabilisation occurred by adding Tween 20 (USB, Cleveland, USA) 0.05% for 10 minutes to the cells. Primary antibodies (Table S2) Scientific RepoRts | 5:12933 | DOi: 10.1038/srep12933 were added to the cells in Tween 20 0.05% with 2% fetal bovine serum (Borogen, Essendon, Australia) and incubated for 1 hour in the dark at 4 °C following washing. Cells were washed twice and incubated with 1:200 secondary antibody solution (Table S2) as above for 30 minutes in the dark at 4 °C. Cells were washed twice in PBS/Tween 0.05%, and fixative was added. Appropriate negative internal controls were also prepared (no primary antibody). Data acquisition was done with a FACS Calibur Flow Cytometer (Becton Dickinson, Franklin Lakes, NJ, http://www.bd.com) and data analysis was done with FlowJo. Statistical Analyses. Statistical analyses were carried out using R 2.9.0 for Mac OSX, with additional packages PMA 63 and MASS 64 for sparse principle component analysis and robust regression, respectively. Summary demographics and sample details are presented as means (ranges) in the results section and medians and ranges in Table 1. Demographic data were coded as follows: bra cup sizes as A = 1, B = 2, C = 3, D = 4, DD = 5, E = 6, F = 7, FF = 8, G = 9, H = 10; mode of delivery as either vaginal delivery (including assisted) or caesarean section (emergency and elective). Gene expression was determined by relative quantitation (RQ) compared to the control fibroblasts. RQ data were generally logged due to the order of magnitude changes that exist between samples. Missing data due to varying levels of RNA extracted from breastmilk samples were taken into consideration using statistical analyses as described below. Multiple models were used to account for non-uniformity of the data, as explained below. In general, p-values below 0.05 were considered statistically significant, unless stipulated otherwise. Very small p-values are reported as p < 0.001. No adjustments were made for multiple comparisons other than the selection of more stringent alpha values for z-score analysis. Borderline associations (0.05 ≤ p < 0.1) have not been reported.
To examine how gene expression varied between breastmilk cells, gene expression distributions were examined graphically and tested for normality. Normality was tested for each gene by the Shapiro-Wilk normality test. Data were assumed to be normally distributed when p ≥ 0.1, indeterminate when 0.05 ≤ p < 0.1 and not normally distributed when p < 0.05. To determine whether cell line values fitted within the distribution of breastmilk cells, z-scores and associated p-values were calculated based on whether cell line values sat outside of 99% of the distribution. As such, any z-scores for cell lines that sat outside of |2.56| were reported to be significantly different from breastmilk cell values (p < 0.01). The hESC cell line was not tested for the mammary differentiation genes (α -LA, EPCAM, CK14 or CK18) and OTBCs were not tested for EPCAM, CK5 or GDF3 as they were not considered relevant controls of these genes.
Possible gene-gene interactions were determined using sparse principle component analysis (PCA's) that examined logged RQ variance to determine different components. The method developed by Witten et al. 2009, used here, allows for calculations of principle components in the presence of missing data. Using the defaults: non-orthoganality, 20 reiterations and centered data 63 , we looked at the first four principle components with a sum of the absolute values equal to 2. These parameters were set based on four perceived groupings (pluripotency genes, stem cell genes, mammary genes and a tumour suppressor gene). Another principle component was added to examine if there might be further groups than this; however, it was dropped out because the explanation of the variance between the 4 th and 5 th component did not increase by more than 5% when using a greater group number. Loadings of > |0.1| only were considered.
Univariate relationships between measured gene expression (log[RQ]) and continuous demographic data (maternal BMI, maternal age, difference in bra size, parity, current infant age, gestational age at delivery, and sample cell content) were assessed using regression. For all genes, ordinary least squares (OLS) regression models were fit for each demographic, after centering to meaningful theoretical values for each demographic (BMI on 25; Gestational age at delivery of 280 days [40 weeks]) and logging RQ values. Appropriateness of the model was assessed using graphical residual analysis. Where residuals were found not to be normally distributed, robust linear modeling was applied 65 and considered both the Huber and Tukey bisquared residual corrections. T-value output was converted to p-values for interpretation. Ordinary least squares regression was used for the gene ESRRB against the demographic lactation stage, and for GDF3 compared to cell content. Genes CK18, NESTIN and REX1 were found to be more appropriately represented applying the Huber correction and the exception of cell content and α -LA where the Huber correction was also found to be more appropriate. The Tukey correction was suited for the genes α -LA, SOX2, EPCAN and GDF3 when examining correlations with lactation stage.
For dichotomous demographic data (primi versus multigravida, infant sex, mode of delivery), differences in gene expression between categories were tested with one or more of Student's t-test, Welch's t-test, or the Wilcoxon signed rank test as follows. Where the Shapiro-Wilk test indicated that the data were not normally distributed (p < 0.05), the Wilcoxon sign rank test was used. Where normality of the data was indicated (p ≥ 0.1), Bartlett's test was used to determine whether the assumption of equal variances held. Where this was considered to be the case (p ≥ 0.1) the Student's t-test was used, and where it was not (p < 0.05) Welch's t-test was used. In situations where either the Shapiro-Wilk or Bartlett's test was ambiguous (0.05 ≤ p < 0.1) both conditions were run. When examining the difference between infant sexes for different genes, the multi-sex infant pair was excluded. The Welch test was found to be the most appropriate analysis for the gene α -LA and Wilcoxon for the genes EPCAM, KLF4 and NANOG. Associations between the discovered groups of genes with individual demographics were determined using the values generated by the sparse principle components. Ordinary least squares regression was then applied to determine associations with continuous demographic data.