Genome-wide profiling of active enhancers in colorectal cancer

Colorectal cancer (CRC) is one of the most common cancers in the world. Although genomic mutations and SNPs have been extensively studied, the epigenomic status in CRC patient tissues remains elusive. Here, we profiled active enhancers genome-widely in paired CRC patient tissues through H3K27ac ChIP-Seq, together with genomic and transcriptomic analysis. Totally we sequenced 73 pairs of CRC tissues and generated 147 H3K27ac ChIP-Seq, 144 RNA-Seq, 147 whole genome sequencing and 86 H3K4me3 ChIP-Seq files. Our analysis discovered 5590 gain variant enhancer loci (VEL) and 1100 lost VELs in CRC, and 334 gain variant super enhancer loci (VSEL) and 121 lost VSELs. Multiple key transcription factors in CRC were predicted with motif analysis and core regulatory circuitry analysis. Further experiments verified the functions of 6 super enhancers governing PHF19 , LIF , SLC7A5 , CYP2S1 , RNF43 and TBC1D16 in regulating cancer cell migration, and we identified KLF3 as a novel oncogenic transcription factor in CRC. Taken together, our work provides important epigenomic resource and novel functional factors for epigenetic studies in CRC.


Background
Binding of transcription factors (TFs) to enhancers is one of the critical steps in transcription activation. Recently, the development of epigenomics revealed the novel features of active and silent enhancers and shed light on the study of transcriptional regulation in multiple research fields 1-5 . Epigenetic marks on chromatin are important signatures for cell identification, which co-operate with transcription factors to regulate transcription 2,6,7 . Histone modifications mark enhancers on chromatin and are critical for their activity. H3K4me1 is the mark for primed enhancers 7,8 ; H3K27ac for active enhancers and H3K27me3 for poised enhancers 1 . Though the initial discovery was concluded from ChIP-Seq of mediator subunits, now H3K27ac in the intergenic chromatin is widely used for identification of active enhancers 6,9,10 .
Moreover, it was discovered that many genes are often regulated by multiple enhancers and the state of these enhancers varies in different cell types 1,4 . Therefore, it has emerged as critical questions for many fields how enhancer activity is regulated for signaling pathways and selective gene transcription.
Pioneer studies hypothesized that gain of enhancer activity is one of the common features for cancers 6,[11][12][13] , which is supported by some recent studies in patients and animal models 10,14,15 . However, it is still not clear whether it is a common feature for all the cancers or just a portion of them. Interestingly, many genes related with epigenetic regulation of enhancer activity are frequently mutated in cancer, such as MLL3/4, p300/CBP, UTX and KDM5C 12,[16][17][18][19][20] . Moreover, inhibitors for BRD4, one reader of H3K27ac on enhancer, were shown to be effective in cancer treatment 21 . It is then urgent to clarify the roles of enhancers in cancer and the underlying mechanisms.
It has been shown that the enhancers controlling the transcription of key oncogenic genes, such as MYC, distinguish in different types of cancers 22 . It is probably because the active transcription factors vary in different cancer types, which causes that different transcription factors activate and bind to different enhancers in response to variant genome mutations and upstream signals. Thus, enhancer profiling may reflect the feature of distinguished cancers and be used for classification 10,13,[23][24][25] .
Colorectal cancer (CRC) is one of the most common cancers in the world. Recent studies about aberrant DNA methylation have gain sight of the field, and epigenetic regulation becomes one of the critical regulatory factors for CRC [26][27][28][29][30][31] . Some groups have studied the genome wide distribution of active enhancers in CRC 23,32 . The early studies used H3K4me1 as a mark which was not suitable to identify functional active enhancers 23 . The recent study was mostly based on cell lines, only with very few patient tissues 32 . These data do not reflect the real clinical features of CRC patients and are not very helpful to enhancer studies in CRC.
To establish a comprehensive map for active enhancers in CRC, we performed H3K27ac ChIP-seq analysis for 73 pairs of CRC tissues (tumor tissues with paired adjacent native tissues), as well as the corresponding genomic and transcriptomic sequencing. We identified thousands of novel enhancers and multiple TFs involved in CRC, and a portion of them were experimentally verified. Our study provides important epigenomic resource and novel research candidates for future studies in CRC.

Genome-wide study of enhancer distribution in CRC patient tissues
To establish a comprehensive genome-wide view of active enhancers of CRC patient tissues, we totally collected 80 pairs of tissues (cancer and their adjacent tissues) from CRC patients. We optimized the ChIP-Seq protocol and performed H3K27ac ChIP-Seq for these samples, as well as the corresponding mRNA and input DNA sequencing. Some samples failed in the study, and eventually we got high quality of sequencing data from 74 CRC tissues and 73 native tissues, among which 73 were paired (Fig. 1A, Extended Data Fig. S1&Table 1). We also performed H3K4me3 ChIP -Seq for 43 pairs of tissues. Totally we generated 524 high quality sequencing   files, including 147 H3K27ac, 86 H3K4me3, 144 RNA-Seq and 147 genomic   sequencing files (Extended Data Table 2), which will provide important epigenomic information for related studies.  S2E). MYC is a well-known oncogene and H3K27ac track on its enhancer was shown as an example (Fig. 1E). In the adjacent native tissues, MYC expression was very low, and H3K27ac signal on its enhancer and eRNA were close to the background. In tumor tissues, MYC was highly expressed, together with the elevation of H3K27ac on its enhancer and eRNA level (Fig. 1E). One early study has reported active enhancers in multiple CRC cell lines and a few CRC samples 32 . Comparison of the two studies revealed that we identified 11796 new active enhancers in CRC, which was 32.4% of the total enhancers (Fig. 1F&G).

Identification of variant enhancer loci in tumor
To identify significant active enhancers specific in tumors, we first compared the enhancers of all samples and found that some tissues had relatively low number of H3K27ac peaks (less than 2,500) or variant enhancer loci (VEL, less than 500) compared with the corresponding adjacent tissues (Extended Data Fig. S3A&B). We considered it may due to the sampling problem and ruled them out in the following statistical analysis. We totally identified 6690 significant VELs, including 5590 gain VELs and 1100 lost VELs ( Fig. 2A, Extended Data Table 3 Fig. S6B-D). The above study indicates that CMS2 group is more homogenous than others; and it has more specific active enhancers, which might be a novel feature for it.
Then, we studied the function of the VEL-associated genes, and identified the enhancers and genes specifically activated in each subgroup. Some representative enhancers and genes for each group were shown (Extended Data Fig. S6E-L). For CMS2, we found its specific gain-VEL-associated genes are mainly involved in WNT signaling, cell migration and lipid metabolic process (Fig. 3G). Activation of WNT signaling and enhanced cell migration are expected, since APC is one of the most frequent mutated genes in CRC and cell migration is a hallmark for cancer cells 34,35 .
Lipid metabolism was linked with CRC but ambiguous results from different groups exist [36][37][38] . Our analysis suggested dysregulation of lipid metabolic homeostasis is possibly associated with certain CRC subgroup. Some VELs of CMS2 subgroup and their associated genes were shown, including those involved in lipid metabolism, such as CEL and DPEP1 (Fig. 3H, Extended data Fig. S7).

Analysis and verification of variant super enhancer loci
Activation of super enhancers associated with oncogenes is considered as one of the important features for cancer 6 . Using the similar approaches as VEL identification, The difference of proliferation was not very significant (Data not shown), however, quite a few cell lines exhibited attenuated migration ability, including PHF19, LIF, SLC7A5, CYP2S1, RNF43, VEGFA and TBC1D16 (Fig. 4E).

Predication and verification of functional transcription factors
To investigate the potential transcription factors (TFs) playing key roles in CRC, the DNA sequences of VELs was used for prediction by HOMER software. The top hits of gain and lost VELs were listed ( Fig. 5A and Extended Data Fig. S11A&B). The hypothesis of core regulatory circuitry was raised to identify core TFs in cells 40,41 . To improve TF prediction, we utilized the method to identify key TFs in CRC tissues ( Fig. 5B&C, Extended Data Fig. S11C). ASCL2 was predicted as a CRC specific TF with the highest score (Fig. 5C). The H3K27ac level of ASCL2 enhancer greatly increased in tumors (Extended Data Fig. S11D), and the gene expression analysis based on TCGA datasets suggested ASCL2 was highly expressed specific in colorectal cancer (Extended Data Fig. S11E). These suggest ASCL2 is a key TF in CRC and further experiments are required to verify our prediction.
Combining the above results and the published literatures, we selected 4 TFs for experimental verification, including KLF3 and MAFK, two novel TFs, MAZ and RUNX1, two recently reported TFs functioning in CRC but not well characterized [42][43][44][45] ,. The significance of the selected TFs in each patient was calculated (Extended Data  (Fig. 5B), and our experimental results did not find its role in proliferation or migration, suggesting MAFK not involved in CRC although it was predicted in DNA motif assay.

Discussion
CRC is one of the most common cancers in the world. Though early screening greatly improves the curative ratio, novel classification approaches and drugs are still urged to be developed. The current study provides a comprehensive map of H3K27ac and active enhancers in CRC patients. The early studies used cell lines to determine the enhancers 32 . Our study uses paired patient tissues, which provides much more reliable data and identifies many novel CRC specific enhancers. Moreover, we experimentally confirmed the roles of more than 10 SEs in CRC. These provide important information for future CRC research.
Our analysis predicted many TFs functioning in CRC, some of which have never been reported before, such as KLF3. We confirmed the function of KLF3 in HCT116 cells, which supported its oncogenic role in CRC. H3K27ac on ASCL2 is highly increased in tumor tissues and it is specific high expressed in CRC. Further studies about these TFs will provide important information for CRC research.
Interestingly, verification of the identified VSELs and TFs indicated that most of them were more related with cell migration, but less with proliferation. The CRC samples we collected are mostly at relative late stages (Extended Data Fig. S1A & Table 1), and the tumor cells were probably at the metastasis stage or ready for it. The difference of enhancers and genes governing migration was probably much more significant than other genes between paired tissues, and the chosen VSELs were all among the most significant ones.
It was believed that gain of H3K27ac on the oncogene enhancers is a common feature for cancers 21 . Our analysis indicated that only the CMS2 group has the obvious feature of enhancer activity elevation, and the other subgroups have much less gained VELs. Our data indicate that in CRC, global increase of active enhancers is an important feature for just one subgroup, not for all.
The homeostasis of lipid metabolism has been linked with CRC for many years, however, the detailed mechanisms is not clear and the use of statin analogues failed in CRC treatment 36 . We found that H3K27ac significantly increased on the enhancers of genes related with lipid metabolism. The bioinformatics analysis also pointed out that the gain VELs of CMS2 subgroup were enriched with genes involved in lipid metabolic processes. So, it is possible that the dysregulation of lipid homeostasis is only associated with CMS2 group, which should be explored by the future studies.
Protein G-Sepharose beads were from GE Healthcare. PCR primers were custom synthesized by BGI and siRNAs by GenePharma. HCT116 Cell line was purchased from Cell Bank of Chinese Academy and cultured under recommended conditions according to the manufacturer's instruction with 10% FBS.

ChIP assay and ChIP-sequencing
ChIP assay was performed as previously described 46 . Briefly, around sixty milligrams of each tissue were cut into 1 mm 3 pieces in PBS with protease inhibitor. Tissue

ChIP-seq data processing
The adaptor sequence was removed using Cutadapt (version 1.16) to clean ChIP-seq raw data. Cleaned reads were mapped into human reference genome (hg19) using BWA (version 0.7.15) with default settings. Peak calling for tissues was performed by MACS2 with a p-value threshold of 1E-10. The patients with the peak number less than 2,500 were excluded from further analysis, no matter in native or tumor tissue (Patient 20, 21, 22 and 24 were excluded).
We calculated the normalized RPM as the ChIP-seq signal in specific region. Briefly, ChIP-seq reads aligning to the region were extended by 200 bp and the density of reads per bp was calculated using Python package HTSeq (version 0.9.1). The density of reads in each region was normalized to the total number of million mapped reads, producing read density in units of reads per million mapped reads per bp (RPM per bp).

Plotting meta representation of ChIP-seq signal
Considering the sample number of our patient data, we utilized a way of calculating the mean to compactly display the integrated H3K27ac ChIP-seq signal in specific groups. For an individual region, we calculated the aligned read number per bp within this region using the R package HTSeq mentioned above, and then normalized to RPM. H3K27ac ChIP-seq signal is smoothed using a simple spline function and plotted as a translucent shape or a line in units of RPM per bp.

RNA-seq data processing and DEG identification
The adaptor sequence was removed using Cutadapt (version 1.16) to clean RNA-seq raw data. Cleaned reads were aligned to the human reference genome (hg19) using HISAT2 (2.1.0) with default settings. Uniquely aligned reads were counted at gene regions using the package featureCounts (version 1.4.6) based on Gencode v19 annotations. Differential gene expression analysis between native and tumor tissue was performed using the R/Bioconductor package DESeq2 (version 1.26.0) with contrast adjustment for multiple groups comparison. Genes whose log2FC < 1 and p.adj < 1E-2 were identified as differential expressed genes (DEGs).

Promoter, enhancer and super enhancer analysis
For both H3K4me3 and H3K27ac ChIP-seq data, peaks that could not be identified in at least two same kind of tissues were excluded from further analysis. H3K4me3 peaks located within the region surrounding ±2.5 kb of transcriptional start sites (TSS) were identified as promoters; and H3K27ac peaks away from the ±2.5 kb flank region of TSS were identified as enhancers. The promoters and enhancers of each samples were merged into one single set. Super enhancers were identified as following: firstly, super enhancers (ROSE) algorithm was used to classify and rank sets of two or more H3K27ac peaks (detected by MACS2, p-value < 1E-10) within 12.5 kb distance and further than 2.5 kb from a transcriptional start site; secondly, a plot was graphed and a tangent line of the curve was drawn with the slope value of 1; finally, the enhancers above the point of tangency were defined as super enhancers.
HOMER module annotatePeaks.pl was used to calculate the number of enhancers located in different chromatin elements.

Identification of VELs
To identify the significant VELs between native and tumor tissue, we first identified all VELs in paired native and tumor tissues. Individual sample VEL were defined as enhancers whose H3K27ac fold change (FC) was larger than > 2 between native and tumor tissues. The patients with VEL number (GAIN + LOST VELs) less than 500 were excluded from further analysis. We merged all VELs into one single coordinate file, and calculated the recurrence and significance (Benjamini-Hochberg corrected pvalue) for all VELs. We used recurrence of 14 and 19 as significance threshold for gain and lost VELs, respectively, because gain and lost VELs achieved the significant percentage cut-off (0.95) when recurrence larger than these numbers.

Identification of VSELs
For variant super enhancer loci (VSEL), the identifying procedure was similar as described above in "Identification of VELs". If the VSELs number in individual patient was less than 10, the patient would be excluded from further analysis (Patient 52 and 67 were excluded). And the significant percentage cut-off was changed to 0.9.

Identification of genes associated with VSELs
SE-associated Genes were identified by rose2 (https://github.com/linlabbcm/rose2) software and all these genes were merged into a single list. We considered the variation of a SE-associated gene by calculating the variant recurrence generated by its recurrence in tumor minus which in native tissue.

PCA analysis
We performed PCA analysis for gene expression, enhancer H3K27ac and promoter H3K4me3 in native and tumor tissues. For gene expression, we quantified sequencing fragments as reads per kilobase per million (FPKM) in each sample. And for two ChIP-seq signals, we used RPM. R package FactoMineR (version 2.3) was used to perform PCA analysis.

Human disease ontology and GO analysis
The coordinate file of GAIN and LOST VELs were submitted to GREAT website (version 3.0.0) and the results of human disease ontology and GO analysis (biological process) were obtained for plotting.

Identification of CMS subgroup specific gain VELs
For an individual gain VEL, if the H3K27ac signal on corresponding region in a CMS subgroup was 1.5 times higher than other 3 subgroups, we called it a specific gain VEL for this subgroup. For all CMS subgroups, significant GAIN VELs were identified as the procedure described above in "Identification of VELs" and the significant percentage cut-off was changed to 0.9.

Pathway analysis for CMS2 specific GAIN VELs
Functional characterization of CMS2 specific gain-VEL-associated genes was conducted using the ClueGO plugin for Cytoscape (version 3.8.0). These tested genes were queried against a compendium of gene sets from GO (Biological Process), KEGG and REACTOME to identify significantly enriched processes and pathways.
Analyses were performed using the GO Term Fusion option in ClueGO and only processes/pathways with a p-value < 0.01 (right-sided hypergeometric test) following p-value correction (Bonferroni step down) were visualized.

Core regulatory circuitry for super enhancer associated TFs
To quantify the interaction network of transcription factor regulation, we calculated the inward and outward binding degree of all SE-associated TFs. All SE-associated genes annotated to encode a transcription factor were considered as the node-list for network construction. For any given TFi, the IN degree was defined as the number of TFs with an enriched binding motif at the proximal SE or promoter of TFi. The OUT degree was defined as the number of TF-associated SEs containing an enriched binding site for TFi. The IN and OUT degree were generated by crc software (https://github.com/linlabcode/CRC) and the total degree was defined as IN degree plus OUT degree.

Reverse transcription and quantitative PCR
Cells were scraped down and collected with centrifugation. Total RNA was extracted with RNA extraction kit (Aidlab) according to the manufacturer's manual.
Approximately 1μg of total RNA was used for reverse transcription with a first strand cDNA synthesis kit (Toyobo). The resulted cDNA was then assayed with quantitative PCR. β-actin was used for normalization. The sequences of primers are in Extended Data Table 6. Assays were repeated at least three times. Data were shown as average values ± SD of at least three representative experiments. P value was calculated using student's t test.

Cell proliferation assay
The proliferation of colorectal cancer cells in vitro was measured using the MTT assay. Briefly, 1,000 cells were seeded into 96-well plate per well. Six well of each group were detected every day. MTT (0.25μg) was put into each well and incubated at 37℃ for 4 hours. The medium with the formazan sediment was dissolved in 50% DMF and 30% SDS (pH4.7). The absorbance was measured at 570 nm. Assays were repeated at least three times. Data were shown as average values ± SD of at least three representative experiments and p value was calculated using student's t test.
Transwell assay 1 × 10 5 HCT116 cells were plated in medium without serum or growth factors in the upper chamber with a Matrigel-coated membrane (24-well insert; pore size, 8 µm; BD Biosciences), and medium supplemented with 10% fetal bovine serum was used as a chemoattractant in the lower chamber. After 36 h of incubation, cells that did not invade through the membrane were removed by a cotton swab. Cells on the lower surface of the membrane were stained with crystal violet and counted. Assays were repeated at least three times. Data were shown as average values ± SD of at least three representative experiments and p value was calculated using student's t test.