A class of GATA3 mutation reprograms the breast cancer transcriptional network through gain and loss of function

GATA3 is frequently mutated in breast cancer; these mutations are widely presumed to be loss of function. Here, we address molecular alterations downstream of a novel class of GATA3 mutations, revealing both gain and loss of function. Mutation of one allele of GATA3 led to loss of binding and decreased expression at a subset of genes, including Progesterone Receptor. At other loci, associated with epithelial to mesenchymal transition, gain of binding at a novel sequence motif correlated with increased gene expression. These results demonstrate that not all GATA3 mutations are equivalent and that these mutations impact breast cancer through gain and loss of function.


Introduction
Breast cancer is an important cause of cancer mortality among women.
Transcriptomic data classifies breast cancer into 6 subtypes -(1) Luminal A; (2) Luminal B; (3) HER2 positive; (4) Basal-like; (5) Claudin-low; and (6) Normal breast-like -that differ not only in molecular characteristics but also in disease course and response to therapy [1][2][3] . Systems-level analyses have identified GATA3 as one of the most frequently mutated genes in breast cancers 4,5 , yet the function of GATA3 mutations in breast tumors is poorly understood.
GATA3 belongs to the zinc-finger transcription factor family that functions as a key regulator of multiple developmental pathways including mammary epithelial cell differentiation [6][7][8][9][10] . In breast cancer, the expression level of GATA3 is strongly associated with estrogen receptor alpha (ERa) 11,12 , and loss of GATA3 expression is associated with poor prognosis 13,14 . In both animal and human cell line models, GATA3 functions as a tumor suppressor by inducing epithelial and suppressing mesenchymal fates [15][16][17] . GATA3 acts as a pioneer transcription factor during mesenchymal-to-epithelial transition 18 ; chromatin binding of GATA3 is important for the recruitment of other co-factors such as ERa and FOXA1 in breast cancer cells 19,20 .
Based on the The Cancer Genome Atlas (TCGA) data cohort, approximately 10% of breast tumors harbor somatic mutations in the GATA3 gene 5,21 . These mutations are typically heterozygous and highly concentrated in the C-terminal region of GATA3, where the DNA binding domain is located. The high frequency suggests that GATA3 mutations are cancer "drivers". Mutations in the second zinc finger domain cause alterations of DNA binding activity and protein stability of GATA3 [22][23][24] . However, it is still largely unknown how GATA3 mutations influence broader breast cancer properties such as changes in gene regulatory networks and tumor growth 25 .
Here we illuminate the impact of GATA3 mutations by establishing a novel classification strategy. Utilizing the CRISPR-Cas9 technology, we develop a model to demonstrate the molecular outcomes of one class of GATA3 mutation (frame-shift mutation in zinc finger 2) in breast cancer. Our results shed new light on the impact of GATA3 mutations in breast cancer.

ZnFn2 mutations
In breast cancer, GATA3 expression is a prominent marker of luminal breast tumors, and loss of GATA3 expression is associated with aggressive tumor phenotypes. Utilizing the gene expression data from the largest available breast cancer data cohort: the Molecular Taxonomy of Breast Cancer International Consortium(METABRIC) 4 , we created two patient groups based on GATA3 gene expression (Fig. 1a). Consistent with the previous literature, breast tumors with lower GATA3 expression showed significantly worse prognosis than tumors with higher GATA3 expression (Fig. 1a). Within high GATA3 expression cases, patients harboring GATA3 mutations represent better prognosis than GATA3 wild-type cases (Fig. 1a), suggesting that GATA3 mutations are not simple lossof-function mutations.
GATA3 mutations are localized predominantly in Exons 5 and 6, impacting coding regions and splice sites (Fig. 1b, Supplementary fig. 1a-c). The majority are insertion/deletion (indel) mutations that induce frame-shifts or alternative splicing, resulting in protein truncation or extension ( Supplementary   fig. 1d). Based on the predicted protein products, we classified GATA3 mutations into five groups: (1) ZnFn2 mutations, (2) splice site mutations, (3) truncation mutations, (4) extension mutations, and (5) missense mutations (Fig.   1b, Supplementary fig. 1d). To assess features within each mutant group, we analyzed the distribution of breast cancer intrinsic subtypes in the METABRIC cohort. As expected, tumors with high GATA3 expression were frequently observed in luminal tumors, while low GATA3 expression cases were often observed in the basal subtype (Fig. 1c). Among 1,980 patient cases, 230 tumors harbored GATA3 mutations, and 75% of the mutations were observed in luminal tumors (47% in luminal A, 28% in luminal B). GATA3 mutations were very infrequent in basal tumors (Fig. 1c). Interestingly, ZnFn2 mutations were distinct amongst GATA3 mutations, as they were predominantly found in luminal B tumors (52%, 16 out of 29 cases) (Fig. 1d, Supplementary fig. 1e), whereas splice site and truncation mutations were frequently observed (62% or 68% respectively) ( Fig. 1d, Supplementary fig. 1e) in luminal A tumors. The distributions of extension and missense mutations were similar to that of GATA3 wild-type. To confirm these observations in an independent cohort, we explored the TCGA data set. Again, tumors bearing ZnFn2 mutations were typically categorized as luminal B, and the splice site mutant tumors were often categorized as luminal A (Supplementary fig. 1e- fig. 1g).

ZnFn2 mutation promotes tumor growth and alters the transcription program
To dissect the impact of GATA3 mutations in breast cancer properties, we decided to focus on ZnFn2 mutations, because (1) gene expression profiles of established luminal breast cancer cell lines are known to be classified into the luminal B subtype 26 , and the ZnFn2 mutation group is the only group that is significantly frequently classified as luminal B; (2) the ZnFn2 mutations are associated with poor clinical outcomes. The frame-shift mutation at arginine 330 (R330fs) is the only mutation found in all four breast cancer data cohorts examined (Fig. 1a, Supplementary fig. 1a-c). The mutations at R330 are typically heterozygous (9 out of 10 cases) and frequently are 1-nucleotide insertions or 2-  (Supplementary fig. 2a). To model the situation observed in patients, we introduced a 2-nucleotide deletion at R330 into T47D breast cancer cells using CRISPR-Cas9 (Fig. 2b, Supplementary fig. 2b-c) 27 . The mutant cell clone (CR3) expressed both wild-type and mutant GATA3 proteins endogenously (Fig. 2c).
The proliferation rate of mutant cells was comparable to control T47D cells in vitro (Supplementary fig. 2d), while the mutant cells exhibited different morphology as compared to control cells (Fig. 2d). The mutant (CR3) cells formed less rounded colonies, and exhibited increased cell spreading and edge protrusion as well as distinct cell-to-cell boundaries. To measure the impact of this mutation on tumor growth in vivo, we conducted xenograft experiments.
T47D cells bearing either wild-type GATA3 or the R330fs mutation were injected into the mammary fat pads of nude mice carrying subcutaneous estrogen pellets.
Three to five weeks later, GATA3 mutant tumors exhibited significantly higher luminescent signals than controls, indicating that the R330fs mutant tumors had a higher growth rate in vivo (Fig. 2e).
To investigate this phenotypic alteration at a molecular level, we conducted RNA-seq analysis from in vitro cultured cells. At a false discovery rate (FDR) <0.01, |fold change| >2, we identified 1,173 differentially expressed genes (601 genes up-regulated, 572 genes down-regulated) (Fig. 3a, Supplementary table 1). Clustering analysis showed distinctly different gene expression patterns between wild-type and GATA3-mutant cells (Fig. 3b). We performed gene pathway analyses on the differentially expressed genes (DEGs) to extract up-and down-regulated functional networks in the mutant cells. Upregulated genes, including the epithelial to mesenchymal regulators TWIST1 and SNAI2 (SLUG) 28 , are associated with cell movement-and cell invasion-related pathways, consistent with the phenotypes observed in the mutant cells (Fig. 3c).
On the other hand, down-regulated genes, including progesterone receptor (PGR), are enriched with cell differentiation-and development-related pathways, suggesting altered properties in the mutant cells (Supplementary fig. 2e).
Analysis of upstream regulators of DEGs indicated down-regulation of the progesterone receptor pathway (Fig. 3d), suggesting that the decreased PGR mRNA levels observed in the GATA3 mutant cells has a functional outcome. We also performed RNA-seq analysis of tumors excised from the mouse xenograft model. Although the global gene expression profile was distinct from that derived in vitro, the expression patterns within the differentially expressed genes identified by in vitro cell system were largely conserved in the xenografts (Fig.   3e, Supplementary fig. 2f). Finally, gene ontology analyses with cancer data panels revealed common gene signatures between the mutant cells and aggressive tumors (Supplementary table 2 To address potential off-target effects of the CRISPR-Cas9 system, we

R330fs mutation induces relocalization of GATA3 in the genome
The fundamental difference between wild-type and mutant cells is a 2nucleotide deletion at the second zinc-finger domain of GATA3. This deletion was predicted to alter DNA binding activity (Supplementary fig. 2a) and, thus, we hypothesized, to alter the genomic distribution of GATA3. To address this question, we performed ChIP-seq analyses using an N-terminal specific GATA3 antibody, which recognizes both wild-type and mutant proteins. We identified approximately equal numbers of GATA3 binding sites in wild-type and mutant cells. To dissect differential binding, we classified the peaks into 3 groups: (1) To evaluate the impact of differential binding events on gene expression, we assigned the peaks to the nearest transcription start sites (TSS) and performed Gene Set Enrichment Analysis. Increased peaks were strongly associated with up-regulated genes including TWIST1 (Fig. 3a, 3d). Decreased peaks were associated with down-regulated genes including PGR (Fig 3a, 3d, Supplementary   fig. 4a). This bias in transcriptional outcome was not observed in genes associated with unchanged peaks.
To elucidate the impact on chromatin structure, we performed ATACseq 30 . Scatterplot and metaplot analyses of each group revealed a positive correlation between GATA3 binding and chromatin accessibility (Fig. 3e,   Supplementary fig. 4b-c), consistent with its function as a pioneer factor 18,19 . The signal levels at baseline in the decreased peak group were lower than those in the increased peak group (Supplementary fig. 4d), suggesting a potential difference in chromatin architecture.
We further investigated sequence enrichment at the differential binding sites. GATA3 is known to recognize the WGATAR motif; therefore, we explored the frequency of this consensus motif in each binding group (Supplementary fig.   4e). Interestingly, the increased binding peaks contained less of the consensus motif than did the decreased binding peaks, implying the R330fs mutant might have altered DNA-binding properties. These results suggest that relocalization of GATA3 induced by the R330fs mutation contributes to the global transcriptional alterations in breast cancer cells and, importantly, that the impact of the relocalization is due to a subset of the GATA3 sites. Determining the molecular rationale for why specific GATA3 sites are critical to phenotype and outcome thus becomes an important goal for deciphering the role of GATA3 in cancer.

Genomic distribution of mutant GATA3 differs from wild-type
Although ZnFn2 mutations were predicted to have loss-of-function, our ChIP-seq data suggested both gain-and loss-of-function at specific genomic loci.
To further investigate, we determined the localization of the mutant in ExoR330fs cells by using the Ty1 epitope tag antibody (Fig. 5a, Supplementary fig. 5a), identifying 6,239 peaks. Motif enrichment analysis revealed a unique binding motif 'AGATBD' (Fig. 5b). R330fs binding signals were higher in the increased total GATA3 binding group (Fig. 3b-c) as compared to the other groups (Fig. 5c).
We also performed wild-type GATA3 specific ChIP-seq using an antibody, which binds to the C-terminal region of GATA3, thereby recognizing wild-type GATA3 but not the R330fs mutant (Fig. 5a). As expected, the wild-type specific binding peaks contained a significant enrichment of the consensus binding motif both in wild-type and mutant cells (Fig. 5b). Peak overlap analysis indicated colocalization of both wild-type and mutant GATA3 at 2,944 loci, the unique presence of the R330fs mutant alone at 3,295 loci, and wild-type GATA3 unique localization at 6,751 loci ( Fig. 5d-e). The 'AGATBD' motif was significantly enriched in R330fs unique peaks (Fig. 5f, Supplementary fig. 6b) while the consensus motif was depleted. To confirm the mutant specific localization in the endogenous expression cell system, we analyzed the total GATA3 ChIP-seq data along with R330fs and wild-type GATA3 ChIP-seq data. At a subset of R330fs unique peak regions, we observed increased binding signals from total GATA3 ChIP-seq but not from wild-type GATA3 ChIP-seq in mutant cells, suggesting unique localization of the mutant when expressed under physiological control (Supplementary fig. 5a). Consistently, metaplot analysis at R330fs unique peaks showed increased levels of total GATA3 binding in mutant cells compared to wild-type cells, while the wild-type GATA3 binding was decreased at those peaks (Supplementary fig. 5c). These data indicated that the endogenously expressed PR expression is a critical prognostic marker in breast cancer, and lower expression of PR is associated with poor prognosis 31 . We confirmed that PR expression was dramatically reduced at the protein level in the mutant T47D cells We then evaluated potential enhancer function of the GATA3-bound region upstream of PGR. The CRISPR-Cas9 technique was utilized to delete specific chromatin regions. The gRNAs were designed to target two major peaks (Pk1, Pk2) and a control region (Ctrl) (Fig. 6b, Supplementary fig. 6b). PR expression in the Pk1-depleted cells was reduced, but the Pk2-depleted cells and control cells did not exhibit a significant reduction (Fig. 6b, Supplementary fig.   6c). These results implicate the Pk1 locus as rate-limiting for PR expression. Conversely, PR knockdown in control T47D cells generated a progesterone tolerance similar to GATA3 mutant cells (Fig. 6f, Supplementary fig. 6e). We extended these observations to a different system by altering expression levels of wild-type and mutant GATA3 in KPL-1 cells, which has a ZnFn2 mutation (D336fs). Depletion of endogenous GATA3 in these cells (both wild-type and mutant versions) partially rescued PR expression. Overexpression of wild-type GATA3 further restored the expression of PR at both RNA and protein levels (Supplementary fig. 6f-g). These data indicated that ZnFn2 mutations modulate PR expression and progesterone sensitivity in multiple cellular settings.
To further explore this apparent growth advantage in the presence of progesterone, we turned to the xenograft model. In this assay, progesterone pellets were simultaneously implanted into mice along with estrogen pellets. The GATA3-mutant tumors exhibited a distinct growth advantage in vivo over wild-type T47D cells in the presence of progesterone pellets (Fig 6g). Analysis of RNA extracted from the developing tumors at necropsy confirmed a reduced expression of PR in the R330fs mutant tumors (Supplementary fig. 6h). These results suggest that PR is a major downstream target of GATA3, and its expression is down-regulated in the presence of ZnFn2 mutations.

An aberrant PR-regulatory network in GATA3 mutant tumors
We used a systems approach to explore the molecular mechanisms by which decreased PR expression contributes to the phenotypic outcomes in the To understand the contribution of PR action to gene expression responses, we determined the chromatin localization of PR in T47D control cells as well as GATA3 mutant cells. A large number (60,966) of PR binding peaks were observed in control T47D cells after progesterone treatment (Fig 7c). Notably, 39,639 PR binding peaks were also detected in the mutant cells, and many of them were overlapped with PR peaks in T47D control cells. At a FDR <0.05, |fold change| >1.5, 18,175 loci were defined as part of an increased binding group, and 16,214 loci were defined as the decreased binding group, while 33,834 sites were unchanged (Fig. 7d). The GSEA analysis indicated that those loci displaying differential binding levels were strongly correlated with differential gene expression between control and mutant T47D cells (Supplementary fig. 7f).
Finally, to assess the relevance of our findings in these model systems, we returned to patient data. In both the METABRIC 4 and TCGA breast cancer cohorts 21 , PR expression in ZnFn2 tumors was significantly lower than that in other GATA3 mutant tumors or in luminal breast tumors that express wild-type GATA3 (Fig. 7e, Supplementary fig. 7g). Immunohistochemistry data in the TCGA cohort also supported lower expression of PR in ZnFn2 tumors To elaborate on the gene expression analysis, we performed ONCOMINE Concept Analyses. As an input gene set, we used 437 genes uniquely upregulated by progesterone in control T47D cells but not in CR3 cells (Fig. 7a-b).
Consistent with our in vitro data, expression of many of these genes were positively correlated with PR expression status in breast tumors. In the "Overexpression in Breast Carcinoma -Progesterone Receptor Positive 37 " data cohort, the expression data of 341 genes (out of 437 input genes) were available, and 94 genes were significantly (p < 0.05) associated with PR positive status (Fig. 7f).
Gene ontology analysis also revealed a significant overlap between downregulation of those genes and aggressive tumor phenotypes, including worse prognosis and metastasis (Supplementary table 4). In the case of the "Underexpression in Breast Carcinoma -Dead at 3 Years 38 " data, 53 genes out of 345 genes were down-regulated in the category 'patient deceased at 3 years' (Supplementary table 1), and 38 genes were categorized in the top 5% underexpressed genes (Fig. 7g, Supplementary table 4).

Discussion
Recent large-scale breast cancer genomic profiling identified frequent mutations in the GATA3 gene. However, the clinical and molecular outcomes of GATA3 mutations are poorly understood, because there are very few studies describing the functions of mutant GATA3 23,24,39,40 . In this study, classification of the GATA3 mutations revealed different molecular and clinical outcomes.
Overall, patients with tumors with GATA3 mutations live longer than patients with wild-type GATA3. GATA3 mutations found in zinc-finger 2, conversely, lead to distinct features: frequent detection in luminal B tumors and worse prognosis. Splice site and truncation mutations (found in Exons 5-6) are frequently detected in luminal A; these patients have better prognosis than those with wild-type GATA3. The differential outcomes between splice site mutations and ZnFn2 mutations suggest that the impact of complete loss of the second zincfinger motif differs from that of partial loss.
We utilized the CRISPR-Cas9 system to investigate the molecular

Enhancer deletion
The guide RNAs were designed by the Feng Zhang Laboratory web tool.

ChIP-seq
ChIP experiments were performed as previously described 18 . Briefly, cells were fixed with 1% formaldehyde at room temperature for 10 min and quenched with glycine. Fixed cells were further treated with hypotonic buffer 18 and resuspended in lysis buffer (20 mM Tris-HCl pH 8.0, 2 mM EDTA, 0.5 mM EGTA, 0.5 mM PMSF, 5 mM sodium butyrate, 0.1% SDS and protease inhibitor cocktail).
Chromatin was fragmented by sonication with Covaris S220, then diluted to adjust the SDS concentration to 0.05%. Immunoprecipitation was performed with indicated antibodies (Supplementary table 6). A protein A/G Dynabeads mixture (1:1 ratio) was used to capture antibodies. Eluted DNA was reverse crosslinked at 65 o C, followed by the incubation with proteinase K. DNA was purified by AMPure XP (Beckman Coulter). The sequencing libraries were prepared by the NEXTflex Rapid DNA-seq kit (Bioo Scientific Corporation) and sequenced on NextSeq 500 (Illumina) at the NIEHS Epigenomics Core Facility. Reads were filtered based on a mean base quality score >20, and mapped to hg19 genome by Bowtie 0.12.8 allowing only uniquely-mapped hits 49 . Duplicate reads were removed using MarkDuplicates.jar from picard-tools-1.107 package. Paired-end reads were converted to a single fragment for metaplot analyses and visualization on the UCSC Genome Browser. For the PR ChIP-seq, the cells were treated with 150 nM progesterone or the same volume of the vehicle (ethanol) for 3h.

Peak call
Peak calling analyses were performed using HOMER v4.1 with default parameters (Heinz et al., 2010). Only the first read of each read pair was used for peak calling.

Peak classification
EdgeR 50 was used to classify the ChIP-seq peaks (GATA3, ERa, FOXA1) into three groups (Increased, Decreased, Unchanged) based on the ChIP-seq signal differences between control cells and CR3 cells. Read counts per 400 bp window, centered on peaks, were collected then evaluated with the GLM method.
Classification thresholds were set at FDR <0.05 and |fold change| >1.5. For heatmap visualization, read counts were normalized to 35 million total mapped fragments per sample.

Clinical data analysis
The METABRIC clinical data and gene expression/mutation data were obtained from the Cancer Genomics Data Server R (cgdsr) package (hosted by the Computational Biology Center at Memorial-Sloan-Kettering Cancer Center, the data were obtained on October 24th 2016) and cBio Cancer Genomics Portal (http://cbioportal.org). For the PAM50 subtype analyses of the METABRIC cases, we used the previously defined classification, described in 51 . The TCGA breast cancer data were obtained from the public TCGA data portal.

Immunostaining
For the F-actin staining, cells were fixed with 2% formaldehyde for 15 min at room temperature and permeabilized with PBS buffer containing 0.1% Triton-

X100. The cells were further incubated with Phalloidin CruzFluor 488
Conjugate (Santa Cruz Biotechnology) and mounted in Vectashield mounting medium with DAPI.

Immunoblotting
Cells were lysed with sample loading buffer, and extracted proteins were separated by SDS-PAGE. Anti-GATA3 (Cell Signaling), anti-progesterone receptor (Cell Signaling), and anti-b-actin (Abcam) antibodies were used as primary antibodies. Anti-rabbit IgG or mouse IgG HRP-conjugated antibody (GE Healthcare) was used as a secondary antibody. The chemiluminescent signals were detected by the Odyssey Fc Imaging System, and analyzed by the Image Studio software (LI-COR).

Cell proliferation assay
Cells were resuspended in 5% FBS-containing DMEM and spread on a 24-well plate (50,000 cells/well). After incubation at 37 o C for 24 h, the medium was replaced to progesterone-or vehicle-(ethanol) containing medium. The medium was replaced every day for 3 days, at which time the total cell number was counted using a Coulter counter (Beckman).

Gene expression analysis
Total RNAs from cultured breast cancer cells were purified by RNeasy Kit or miRNeasy kit (QIAGEN). For the progesterone stimulation, cells were incubated with 150 nM progesterone (or ethanol as a control) for 3h. Total RNAs from tumor tissues were extracted by QIAzol Reagent and purified by miRNAeasy kit (QIAGEN). The purified RNA concentration was measured by a NanoDrop 1000 Spectrophotometer (Thermo Fisher Scientific). The cDNAs were synthesized by using iScript cDNA Synthesis Kit (Bio-Rad Laboratories). Quantitative PCR was performed using the CFX384 Real-Time PCR Detection System (Bio-Rad Laboratories). Relative expression was calculated by delta-delta Ct method. TBP was used as a reference gene for normalization.

RNA-seq
Library preparation and sequencing were performed by Q2 Solutions on an pairs. Thus, sample pairs with similar counts per gene across all gene models will have a low distance score (corresponding to dark blue color).
For the GSEA analysis, we filtered the expression data to only include genes with a nonzero read count for all samples. We then performed a GSEA analysis, using the GSEA Java software version 2.1.0 and default parameters, with this expression data set and various gene lists obtained from ChIP-seq peaks. We associated the ChIP-seq peaks with genes by identifying the closest RefSeq TSS to each peak center. We excluded any peaks that were more than 50 kbp from the nearest TSS.

ATAC-seq
The sequence library for ATAC-seq (Buenrostro et al., 2013)  low temperature constant pressure simulation to assure a reasonable starting density, (4) minimization, (5) step-wise heating at constant volume, and (6) a 2 ns constant volume, constant temperature molecular dynamics run. The final structure was then subjected to a 10 ns constant temperature simulation to obtain a reasonable solvated structure for the mutant.

Data and Software Availability
Data pertinent to this study are available at Gene Expression Omnibus under Accession Number GSE99479. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE99479

Quantification and Statistical Analysis
Kaplan-Meier survival curves, box plots were generated by Prism 6 (GraphPad Software). P-values were calculated by Prism 6 (GraphPad Software). Unpaired ttests were applied to Fig. 2e, 5h, 6g, Supplementary fig. 6f, and 6g. Log rank tests were applied to Fig. 1c and Supplementary fig. 1g. Fisher's exact tests were applied to Supplementary fig. 1e and 7h. Mann-Whitney tests were applied to           Color key and histogram