Transcriptional, epigenetic and retroviral signatures identify regulatory regions involved in hematopoietic lineage commitment

Genome-wide approaches allow investigating the molecular circuitry wiring the genetic and epigenetic programs of human somatic stem cells. Hematopoietic stem/progenitor cells (HSPC) give rise to the different blood cell types; however, the molecular basis of human hematopoietic lineage commitment is poorly characterized. Here, we define the transcriptional and epigenetic profile of human HSPC and early myeloid and erythroid progenitors by a combination of Cap Analysis of Gene Expression (CAGE), ChIP-seq and Moloney leukemia virus (MLV) integration site mapping. Most promoters and transcripts were shared by HSPC and committed progenitors, while enhancers and super-enhancers consistently changed upon differentiation, indicating that lineage commitment is essentially regulated by enhancer elements. A significant fraction of CAGE promoters differentially expressed upon commitment were novel, harbored a chromatin enhancer signature, and may identify promoters and transcribed enhancers driving cell commitment. MLV-targeted genomic regions co-mapped with cell-specific active enhancers and super-enhancers. Expression analyses, together with an enhancer functional assay, indicate that MLV integration can be used to identify bona fide developmentally regulated enhancers. Overall, this study provides an overview of transcriptional and epigenetic changes associated to HSPC lineage commitment, and a novel signature for regulatory elements involved in cell identity.

activation of MAPKK activity epidermal growth factor receptor signaling pathway regulation of cell shape mRNA 3'-end processing membrane protein proteolysis nuclear export actin polymerization or depolymerization very long-chain fatty acid metabolic process regulation of protein homodimerization activity regulation of epidermal growth factor-activated receptor activity EPP-specific active enhancers MPP-specific active enhancers -log10 (Binomial p value) GO Biological Process regulation of defense response 10 12 14 16 18 20 22 24 regulation of cytokine production anti-apoptosis actin cytoskeleton organization inflammatory response regulation of innate immune response activation of innate immune response positive regulation of innate immune response regative regulation of cytokine production leukocyte migration stress-activated protein kinase signaling cascade regulation of translation regulation of cytokine biosynthetic process regulation of tumor necrosis factor production regulation of interleukin-6 production Ras protein signal transduction regulation of tumor necrosis factor biosynthetic process leukocyte chemotaxis unsaturated fatty acid biosynthetic process regulation of interleukin-6 biosynthetic process -log10 (Binomial p value)

CFU assay
We plated multipotent and lineage-committed progenitors cells at 1x10 3 cells/ml in methylcellulose medium (GFH4434, Stem Cell Technologies) under conditions supporting both erythroid and myelomonocytic differentiation. We scored BFU-E, CFU-GM and CFU-GEMM colonies after 14 days.

Erythroid and myeloid liquid culture
We performed in vitro erythroid differentiation as described by Roselli et al. 1 . For myeloid differentiation, we cultured CD34 + cells for 7 days as described above (see "Isolation of MPP"). At day 7, we maintained the cells for 7 days in IMDM medium (Lonza) containing 10% FBS (Hyclone) and supplemented with 100 ng/ml G-CSF (ITALFARMACO SpA).

Promoter construction
We defined level-1 promoters ("transcription start sites") by summing the weighted number of CAGE tags at each genome position. 567254, 560724 and 651829 TSSs were identified in HSPC, EPP and MPP, respectively. Then we clustered level-1 promoters into level-2 promoters ("CAGE promoters") if they were within 20 bp of each other on the same chromosomal strand. Level-2 clustering defined >13,000 promoters of similar average length (160 ±101 bp) in each cell type (13,852 in HSPC, 13,609 in MPP and 14,041 in EPP). We calculated the expression level for each level-1 and level-2 promoter by dividing the number of CAGE tags of each promoter in each experimental condition by the total number of mapped CAGE tags in that condition, and multiplying by 1,000,000 (tags-per-million, TPM). The expression of each level-2 promoter of at least 10 TPM in at least one experimental condition was imposed.
Promoter genomic annotation Using a custom R script, we annotated level-2 promoters using RefSeq genes, ENSEMBL ncRNA, ncRNA included in publicly available data sets 6 7 8 , Vertebrate Genome Annotation (Vega) pseudogenes 9 , Yale Gerstein Group pseudogenes 10 and GENCODE (release 14). For each dataset, the script performed the annotation considering the smallest distance to the CAGE-defined promoter on the same chromosome strand. We defined the distance as follows: 1. If the 3' end of the promoter was upstream of the 5' end of the annotation, then we used the distance between the 3' end of the promoter and the 5' end of the annotation. 2. If the 5' end of the promoter was downstream of the 5' end of the annotation, then we used the distance between the 5' of the annotation and the 5' end of the promoter. 3. If the promoter overlapped the 5' end of the annotation, we considered the distance to be zero. If this distance was less than 400 bp, we associated the promoters to the gene or transcript.

Statistical analyses
To compare CAGE and microarray expression levels, we calculated the Pearson correlation coefficients of CAGE log2 tpm counts and log2 expression values for the CAGE promoters and the Affymetrix probe sets annotated to the same gene. For genes associated to more than 1 CAGE promoter, we calculated the average expression level of all the CAGE promoters assigned to the gene. We used the chi-squared test and a threshold p-value <0.001 to analyze differentially expressed CAGE level 2 promoters 11 .

qRT-PCR
RNA was reverse-transcribed using Superscript II as per the manufacturer's instructions (Invitrogen). Quantitative reverse-transcription PCR (qRT-PCR) was performed using SYBR green (Applied Biosystems).

ChIP-seq ChIP assay
We prepared chromatin from EPP and MPP after cross-linking for 10' at RT with 1% formaldehyde-containing medium, using truChIP™ High Cell Chromatin Shearing Kit with SDS Shearing Buffer (Covaris). We sonicated nuclear extracts to obtain DNA fragments averaging 200 bp in length and immunoprecipitated the equivalent of 10 7 cells overnight with 10 µg of rabbit antibodies against H3K4me1 (ab8895, Abcam), H3K4me3 (ab8580, Abcam), and H3K27ac (ab4729, Abcam), as previously described 12,13 . We used real-time SYBR Green PCR to validate genomic regions enriched in H3K4me1, H3K4me3 and H3K27ac in 3 biological replicates. Four genomic regions were used as positive (HBG promoter, GATA1 binding site, CSF3R promoter) and negative (DEFB122) controls, as shown below.

Identification of cis-regulatory elements
We developed a custom R-workflow to identify promoters and enhancers. The pipeline analyzes the histone modification islands generated by SICER and includes three steps. In the first step, the R script invokes BEDtools 16 to detect regions where H3K4me1 overlap or do not overlap with H3K4me3. Regions marked exclusively with H3K4me3 or H3K4me1 were classified as putative promoters and enhancers, respectively. In the second step, H3K4me3 + H3K4me1and H3K4me3 -H3K4me1 + regions are classified as putative promoters and enhancers, respectively. In this step, the R script first normalizes the tag counts of H3K4me3 and H3K4me1 using the sequencing depths of both libraries and then calculates the log-ratios between H3K4me3 and H3K4me1 tag counts for H3K4me3 + H3K4me1 + regions. If the H3K4me3/H3K4me1 ratio is greater than 0, the region is defined as putative promoter, otherwise as putative enhancer. In addition, we defined putative enhancers as H3K4me3 -/me1 high and H3K4me3 low /me1 high regions 2kb far from TSSs of annotated CAGE promoters and RefSeq genes. Finally, we intersected putative promoters and enhancers with H3K27ac + regions to identify active chromatin regions.

Definition of super-enhancers
H3K27ac peaks were used as constituent enhancers for super-enhancers identification. Enhancers were stitched and super-enhancers were defined using ROSE code 21 , as already described 22 . Briefly, the algorithm stitches enhancers together if they lie within a certain distance and ranks the enhancers by their input-subtracted signal of H3K27ac. It then separates super-enhancers from typical enhancers by identifying an inflection point of H3K27ac signal versus enhancers rank. ROSE was run with stitching distance of 12,500 bp. In addition, all the enhancers wholly contained in a window ± 2,500 bp around an annotated transcriptional start site (RefSeq hg19) were excluded from stitching, allowing for a total 5,000 bp promoter exclusion zone. Super-enhancers were then assigned to the RefSeq gene whose TSS was the nearest to the center of the stitched enhancers.

Retroviral scanning
Sequencing and mapping of retroviral integration sites Retroviral integration sites were cloned by linker-mediated PCR (LM-PCR) adapted to the GS-FLX Genome Sequencer (Roche/454 Life Sciences, Branford, CT) pyrosequencing platform, as previously described 23 . Briefly, genomic DNA was extracted and digested with MseI and a second enzyme PstI to prevent amplification of internal 3' LTR fragments. An MseI double-stranded linker was then ligated and LM-PCR performed with nested primers specific for the linker and the 3' LTR, containing a bead-capture tag and a sequencing tag.Raw sequence reads were processed and mapped onto the human genome by an automated bioinformatics pipeline that eliminates viral and linker sequences, as previously described 23 . Unique sequences 20-bp or longer were mapped onto the human genome (UCSC Human Genome Project Working Draft, hg19) using Blat sequence alignment tool, requiring univocal match and a 95% identity over the entire sequence length.

Statistical definition of retroviral integration clusters in HSPC, EPP and MPP.
We plotted the percentage distribution of the distances (log scale on the x axis) between any insertion site and the second consecutive integration in the MLV datasets (red lines), together with the same number of control random sites (black lines). Control sites were re-sampled 1,000 times from the entire collection of 11,655,601 random sites. Dashed vertical lines identify the threshold in the normal site distribution containing 1% of the sites, representing the false discovery rate (FDR) for cluster definition. A bimodal distribution was observed for MLV, identifying two populations, one (left part) of highly clustered integration, and the other (right part) almost coinciding with the random curve. We chose the distance between three consecutive integrations and a 1% FDR as a threshold for cluster definition (12,598, 14921 and 10,169bp for HSPC, EPP and MPP, respectively) as the best trade-off between low background noise and good discrimination of the two MLV populations (as described in 23 ).