Promoter-level expression clustering identifies time development of transcriptional regulatory cascades initiated by ErbB receptors in breast cancer cells

The analysis of CAGE (Cap Analysis of Gene Expression) time-course has been proposed by the FANTOM5 Consortium to extend the understanding of the sequence of events facilitating cell state transition at the level of promoter regulation. To identify the most prominent transcriptional regulations induced by growth factors in human breast cancer, we apply here the Complexity Invariant Dynamic Time Warping motif EnRichment (CIDER) analysis approach to the CAGE time-course datasets of MCF-7 cells stimulated by epidermal growth factor (EGF) or heregulin (HRG). We identify a multi-level cascade of regulations rooted by the Serum Response Factor (SRF) transcription factor, connecting the MAPK-mediated transduction of the HRG stimulus to the negative regulation of the MAPK pathway by the members of the DUSP family phosphatases. The finding confirms the known primary role of FOS and FOSL1, members of AP-1 family, in shaping gene expression in response to HRG induction. Moreover, we identify a new potential regulation of DUSP5 and RARA (known to antagonize the transcriptional regulation induced by the estrogen receptors) by the activity of the AP-1 complex, specific to HRG response. The results indicate that a divergence in AP-1 regulation determines cellular changes of breast cancer cells stimulated by ErbB receptors.


Description of the CIDER analysis pipeline
CIDER [1] consists of two steps: first, the distinct expression patterns are identified by unsupervised clustering of time series. Then, motif enrichment analysis is performed on the clusters, to identify the transcription factors likely to induce the transcription of the genes with similar expression pattern ( Figure 3). In details: 1. Time series clustering: A hierarchical clustering approach is adopted to identify expression patterns from longitudinal molecular data. Paragraphs 1.A-1.C provide details on preprocessing, distance, and specific hierarchical clustering methods adopted in CIDER.  [4], The application of DTW on transcriptomics time series profiling has been already demonstrated on microarray platforms [5]. DTW alignment is constrained to align time-points within each replicate, with a Sakoe-Chiba band of width 1 as global constraint [6]. As a further refinement, we implemented in CIDER the Complexity Invariant (CID) correction for DTW (CIDDTW method), one of the latest installations of DTW, proposed as a generalization that accounts for signal complexity [7]. In details, the CIDDTW method introduces a correction term penalizing signals with low variability; the rationale is that for standard distance measures (Euclidean, correlation, non-corrected DTW) pairs of complex objects tend to be gauged further apart than pairs of simple objects, introducing errors in classification. The R package dtw [8] was extended to implement the CIDDTW, and the R package parallel used to parallelize the task on multiple processors. regions of the transcription start sites of a group of genes). Here we detail the specific enrichment technique and parameters used in the pruning phase for the MCF-7 analysis.

2.A Motif Enrichment:
The hierarchical clustering structure produced at step 1.C is used as starting point to perform a multi-level motif enrichment analysis. Each node in the cluster is tested for enriched motifs using the AME tool, part of the MEME software suite [10] with default parameters. The AME application is parallelized by using the GNU parallel utility [11].

2.B Enrichment-based pruning:
The enrichment associations computed by AME on the MCF-7 data were pruned to retain only the most relevant associations between motif activity and expression patterns, by applying the criterion typically used in AME and other reference motif enrichment tools [17][18][19]. The association between a cluster and a motif was deemed significant if the Fisher test P value < 0.01 (absolute thresholding) [17,18], and greater than

Comparison of CIDER and MARA results
We apply the CIDER analysis to the MCF-7 HRG time-course, which belongs to the showing that the main transcriptional modules identified by CIDER for the MCF-7 HRG timecourse are also supported by MARA.
In general, both methods are designed to infer the transcription factor activity from gene expression time series and TF binding motif data. CIDER aims at identifying the most prominent expression patterns and infer the TFs responsible for the coordinated gene expression. MARA, instead, has been developed to estimate the regulatory activity of each TF and its effect on the single genes. A comparison of the two algorithms is presented in Supplementary Table S3.
Despite different background methodologies and purposes, CIDER and MARA analyses of the MCF-7 HRG time-course generated two transcriptional regulatory networks (TRNs) whose overlap is not randomly distributed on sparse interactions, but is organized in 8 regulatory modules controlled by 12 TFs. In particular, three of the overlapping modules are regulated by SRF, ELK1 and AP-1 complex involved in the regulatory cascade triggered by the MAPK activation (Figures 1 and 5).
In details, the MCF-7 HRG transcriptional regulatory network produced by CIDER counts 8470 regulatory interactions between 2575 promoters (1409 genes). The MARA network, instead, is composed by 8861 interactions between 3005 genes. The 136 interactions in overlap between the two regulatory networks are organized in 6 regulatory modules (Supplementary Figure S5). In particular, the overlapping modules regulated by the AP-1 complex ( Figure S5D) and SRF ( Figure S5F) have a statistically significant overlap between MARA and CIDER (Supplementary Table S2).