Differential epigenetic reprogramming in response to specific endocrine therapies promotes cholesterol biosynthesis and cellular invasion

Endocrine therapies target the activation of the oestrogen receptor alpha (ERα) via distinct mechanisms, but it is not clear whether breast cancer cells can adapt to treatment using drug-specific mechanisms. Here we demonstrate that resistance emerges via drug-specific epigenetic reprogramming. Resistant cells display a spectrum of phenotypical changes with invasive phenotypes evolving in lines resistant to the aromatase inhibitor (AI). Orthogonal genomics analysis of reprogrammed regulatory regions identifies individual drug-induced epigenetic states involving large topologically associating domains (TADs) and the activation of super-enhancers. AI-resistant cells activate endogenous cholesterol biosynthesis (CB) through stable epigenetic activation in vitro and in vivo. Mechanistically, CB sparks the constitutive activation of oestrogen receptors alpha (ERα) in AI-resistant cells, partly via the biosynthesis of 27-hydroxycholesterol. By targeting CB using statins, ERα binding is reduced and cell invasion is prevented. Epigenomic-led stratification can predict resistance to AI in a subset of ERα-positive patients.

staining using human-specific antibodies that target a breast-cancer-specific pioneer 4 factor 1 . The right-hand panel contains human-derived breast cancer primary tumours.       identifiers for each signature are in Table S2. overall-and relapse-free survival analysis. The patients were stratified using the same CB 3 gene expression signature as in Fig. S8

Construction of transcriptomics databases 3
Breast cancer datasets were identified in GEO (http://www.ncbi.nlm.nih.gov/gds) using 4 the GEO platform IDs "GPL96" (for HG-U133A), "GPL570" (for HG-U133 Plus 2.0) 5 and the keywords "breast", "cancer" and "survival". The database quality control and 6 removal of duplicate samples was performed as described previously 3 . 660 patients have 7 we re-ran the pre-processing for all gene chips. Raw data was imported into R and 13 summarized using the beadarray package 5 . For annotation, the illuminaHumanv3 14 database was used. Quantile normalization was carried out using the preprocessCore 15 package 6 . 16

RNA-seq data for breast cancer patients 7 is published in The Cancer Genome 17
Atlas (TCGA) of the National Cancer Institute (http://cancergenome.nih.gov/) and we 18 downloaded the pre-processed level 3 data generated using the Illumina HiSeq 2000 19 RNA Sequencing Version 2 platform. 20 21

Survival analysis 22
Kaplan-Meier analysis was performed as described previously 8 . For the 1 expression of the genes, the median expression was used as the cutoff in a Cox regression 2 analysis. The Kaplan-Meier survival plot, and hazard ratio with 95% confidence intervals 3 and logrank P value were calculated and plotted in R using Bioconductor packages. For the ENCODE data as well as for the breast cancer lines, the coordinates of the 16 aligned reads were extended to 200 bps and a maximum of PCR duplicates of 2 was 17 allowed. CoverageBed 14 was then used to compute coverage, which was transformed to 18 RPKM for each region and sample. Using R, the correlation matrix among samples was 19 calculated (Spearman's Rank Correlation Coefficient) and hierarchically clustered using 20 the hclust function. 21

22
Comparative ChIP-seq analysis 23 Comparisons were performed using MACS v1.4 10 with default parameters. The H3K27ac 1 profile from each cell line was compared to MCF7 and vice-versa. A region was 2 considered as differentially acetylated if showing a p-value <= 1e -10 in the comparison 3 between IPs and, if overlapping, it was considered an acetylated region enriched versus 4 the corresponding cell-type-specific input DNA (p-value <= 1e -5 ). 5 6 H3K27ac co-variation analysis 7 Differentially acetylated regions identified in the previous paragraph were pooled and 8 their coordinates merged in case of overlap. CoverageBed 14 was then used to compute 9 coverage, which was transformed to RPKM for each region and sample. RPKM were 10 log2-transformed (after setting zeros to the lowest value except zero) and subjected to 11 hierarchical clustering (average linkage) using 1 -Pearson's correlation as a measure of 12 distance. This allowed us to group together active regulatory elements showing the same 13 pattern of variation across the six conditions under evaluation. In order to identify 14 discrete clusters from the resulting dendrogram, the dynamicTreeCut R package 15 was 15 used (method set to hybrid and minClusterSize to 100). 16 The resulting regions were also annotatedusing a custom scriptas TSS-proximal or 17 TSS-distal according to their proximity to the TSSs of RefSeq genes. All the regions 18 whose central coordinates lay within 2.5 kbps of annotated TSSs were considered 19 proximal. The remaining ones were considered distal. The proximal regions could be 20 unambiguously assigned to the gene they possibly regulated and we could then retrieve 21 the corresponding FPKMs from the RNA-seq data analysis. 22 For visualization purposes, the behaviours of clusters were shown as heatmaps, using the 1 mean of the log2-RPKM of the H3K27ac in each cell line as representative. Considering 2 the mRNA of the regulated genes, the median of the linear FPKM for each cluster and 3 cell line was also shown as a heatmap. Super-Enhancer (SE) regions were called separately in each of the six cell lines 7 considered, using a variation of a previously applied approach 16 . Regions within 2.5 kbps 8 of RefSeq TSSs were excluded and the remaining regions were merged together if less 9 than 12.5 kbps intervened in between. After that, coverageBed 14 was used to compute the 10 H3K27ac reads coverage of each region. These were then sorted by decreasing the total 11 number of reads, and the top 5% of regions were retained as SEs. 12 The lists of cell-type-specific SEs were then pooled and overlapping regions merged. 13 This list was used in all the following analyses. First of all, they were clustered applying 14 the same procedure as that described in the paragraph H3K27ac co-variation analysis, 15 except for the minClusterSize that was set to 20. To annotate the regulatory activity of 16 SE-clusters to RefSeq genes, two complementary approaches were used: 17 -all the transcriptional units overlapping the SE-cluster were defined as regulated 18 by the cluster; 19 -the nearest transcriptional unit not overlapping the SE-cluster was also 20 considered. 21 The retrieval of the FPKM values of these genes as well as the criteria used for 1 visualization followed the same approaches as those described in the paragraph H3K27ac 2 co-variation analysis. 3 4 Quantification of H3K27ac in Topologically Associated Domains 5 Topologically Associated Domains (TADs) from IMR90 cells 17 were downloaded from 6 http://chromosome.sdsc.edu/mouse/hi-c/download.html as hg18 coordinates, which were 7 lifted to hg19 using liftOver 13 . 8 Each gene in the human genome was assigned its TAD, and the H3K27ac levels in each 9 TAD were expressed as RPKM coverage. The Log2-ratio of H3K27ac among cell lines 10 could then be computed for each TAD.  The script findMotifsGenome.pl from the suite HOMER 19 was used to find enrichment 1 for known motifs. For each item of the list of regions subjected to analysis (either a 2 cluster of enhancer-promoters or of SE), we considered the valleys calculated on the 3 profile of the cell line showing the highest mean H3K27ac level (in the cluster). The coordinates of the genes were downloaded from the UCSC genome browser 13 on 7 2012, May 10 th . 8 9

Statistics and plots 10
All plots were drawn and statistics were performed using R. Heatmaps were drawn using 11 the heatmap.2 function of the gplots package. Statistics for expression data from 12 FEMARA trial were conducted using Prism (non-paramentric Mann-Whitney U test).