Identification and interrogation of the gene regulatory network of CEBPA-double mutant acute myeloid leukemia

Acute myeloid leukemia (AML) is a heterogeneous hematological malignancy caused by mutations in genes encoding transcriptional and epigenetic regulators together with signaling genes. It is characterized by a disturbance of differentiation and abnormal proliferation of hematopoietic progenitors. We have previously shown that each AML subtype establishes its own core gene regulatory network (GRN), consisting of transcription factors binding to their target genes and imposing a specific gene expression pattern that is required for AML maintenance. In this study, we integrate gene expression, open chromatin and ChIP data with promoter-capture Hi-C data to define a refined core GRN common to all patients with CEBPA-double mutant (CEBPAN/C) AML. These mutations disrupt the structure of a major regulator of myelopoiesis. We identify the binding sites of mutated C/EBPα proteins in primary cells, we show that C/EBPα, AP-1 factors and RUNX1 colocalize and are required for AML maintenance, and we employ single cell experiments to link important network nodes to the specific differentiation trajectory from leukemic stem to blast cells. Taken together, our study provides an important resource which predicts the specific therapeutic vulnerabilities of this AML subtype in human cells.


)-4 C E B P A N/ C -4 C E B P A N/ C -6 C E B P A N/ C -2 C E B P A N/ C -5 C E B P A N/ C -8 C E B P A N/ C -9 C E B P A N/ C -3 C E B P A N/ C -1 C E B P A N
Gene expression values (log2 FPKM + 1) in LSC and Blasts with LSC and Blast specific genes highlighted
KO52 cells expressing the inducible dnC/EBP and dnFOS were induced prior to the start of the time-course and GFP+ FACS sorted cells were used to set up the assay as described above.
After 7 days, cell viability was assessed by using Trypan Blue (Sigma-Aldrich).

RNA extraction and Real-Time RT-qPCR
Total RNA was extracted with NucleoSpin RNA (Machery-Nagel) and 1 ug of nucleic acid was retrotranscribed using SuperScriptTM II Reverse Transcriptase kit (Thermo Fisher) and 0.5 μg of Oligo(dT) primers (Promega), according to the manufacturer's instruction. Target gene expression was calculated using target specific standard curve obtaining a sample cDNA in water at the serial dilution 1:10, 1:50, 1:250, 1:1250 and 1:6250. Sample cDNA was diluted 1:100 in water prior to quantification. Real time PCRs were performed using a 2x dilution of SYBR® Green PCR Mix (Applied Biosystems) and 100 nM forward and reverse primers. 2.5 uL of standard or sample diluted as specified above were added in a final volume of 10 uL.
GAPDH was used as reference gene for relative gene expression calculation Analysis were performed on an ABI 7500 real-time PCR system using StepOneTM Plus software.

PCR conditions for the pre-amplification of transposed fragments
To determine the number of additional cycles needed for an optimal library amplification, 5 μL of pre-amplified library were used to set up a RT-qPCR reaction (  The following IP protocol is optimized for a maximum input of 5x10 6

DNase I-seq and ATAC-seq data analysis
Sequencing adapters and low-quality reads and were removed using Trimmomatic (8). Reads were then mapped to the human genome version hg38 using Bowtie v2.2.3 (9), and PCR duplicated reads were removed using Picard (http://broadinstitute.github.io/picard). Only reads aligned to unique genomic positions were retained using Samtools. The bamCoverage function from BEDTools (10) was used to generate read density profiles that were then visualized using the UCSC Genome Browser (11). DHSs, correspond to regions of open chromatin (peaks) were identified using MACS2 (12). Peaks included in the ENCODE hg38 blacklist (13) were removed. Peaks were annotated to their associated gene using Promoter Capture Hi-C. Where a peak did not occur in a restriction fragment and therefore could not be mapped to a gene promoter by Hi-C, the peak was annotated to the closest gene using the annotatePeaks.pl function in Homer (14). Peaks were further annotated as promoter proximal if within -/+2 kb of a gene transcription start site (TSS) and were classified as distal otherwise. If two peaks had summits within 200 bp of each other, these peaks were combined into a single peak with a new summit position defined as the mid-point between the summits of the original peaks.
To maximise the precision and accuracy of peak detection, a reference peak set was created using a merged dataset of all sequencing experiments from all primary samples. To do this, the aligned reads (bam files) of all the samples were merged into a single alignment using the merge function in samtools. Peaks were then identified using this merged alignment using MACS2 as described in section 2.2.31, and these peak positions were used as the reference peak position in all further downstream analysis. Tag counts were retrieved from these peaks using featureCounts (15). Counts were then normalized as count per million (CPM) using the DEseq2 (16) package in R and then further quantile normalized using quantile normalization.

AML subtype-specific hypersensitive sites analysis
To identify genomic regions of differential chromatin accessibility specifically enriched in CEBPA N/C , t(8;21) or PBSC blasts, the average CPM values for each hypersensitive site within each group were calculated and further log2-transformed as log2(CPM +1). A genomic site was considered to be differentially accessible if the fold-difference of the CPM between two subtype was greater than

Clustering of chromatin accessibility data from primary samples
To generate the heatmap in Fig. 1C, the Pearson correlation coefficients for each pair of samples were calculated using the log2 transformed CPM values. The resulting correlation matrix was then hierarchically clustered using Euclidean distance with complete linkage clustering in R Low-quality reads and sequencing adapters were removed using Trimmomatic (8) and aligned to the human genome version hg38 using HISAT2 (17). StringTie (18) was used to assemble reads into transcripts and calculate gene expression values as fragment per kilobase of transcript per million fragments (FPKM). Gene models from the Ensembl database were used as reference transcriptome (19). Only genes that were considered expressed with a FPKM > 1 in at least one sample were retained for further analysis. Values were log2-transformed as log2(FPKM +1

Clustering of RNA-seq data from primary samples
To generate the heatmap in Fig. 1B, the Pearson correlation coefficients for each pair of samples were calculated using the log2 transformed FPKM values and hierarchically clustered using Euclidean distance with complete linkage clustering in R

Differential gene expression analyses
Gene expression was initially measured as number of reads mapped to a transcript with featureCounts (15) using the hg38 assembly from the Ensembl database as reference transcriptome (19) (-p -s 2 parameters). Counts were filtered and normalized to Counts Per Million (CPM) using EdgeR (20). Differential gene expression analysis were carried out using the Limma package (21) in R. A gene was considered as differentially expressed between two conditions if it had a fold-change greater than 2 and a Benjamini-Hochberg adjusted p-value < 0.05. For primary samples analyses, samples within the PBSC, CEBPA N/C and t(8;21) blast groups were treated as biological replicates

KEGG pathway analysis
Kyoto Encyclopaedia for Genes and Genomes (KEGG) pathway enrichment analysis were performed using the ClueGo v3.8.2 plugin (22) for Cytoscape v2.5.7 (23) using a right-sided hypergeometric test with Benjamini-Hochberg adjusted p-value for multiple testing. A pathway was considered significantly enriched if the adjusted p-value was < 0.05.

Promoter Capture Hi-C data analysis
Promoter Capture Hi-C data were processed with the HiCUP pipeline (24). The paired-end sequencing reads, called di-tag, were separately mapped to the hg38 genome, repaired and filtered for experimental artifact (i.e., self-ligation and re-ligation products) and PCR duplicates. Mapped di-tags were then analysed with GOTHiC (25). This package uses a cumulative binomial test to assign a p-value to each interaction and detects di-tags significantly enriched compared to a background model of random interactions. The HOMER software package was used to determine statistically significant interactions, taking into account the p-value and false discovery rate (FDR) relative to a background model.

CEBPA N/C -specific gene regulatory network construction
Motif search within hypersensitive sites specific for CEBPA N/C and PBSCs was performed using the findMotifGenome.pl function in HOMER (14). Sites were linked to their associated gene promoters using Chi-C data where possible, or otherwise to the nearest expressed gene. The number of motifs was counted for each hypothetical target gene, and the significance of motif enrichment was determined by bootstrapping analyses. To do this, the number of transcription factor motifs in the peak set being examined were counted using the annotatePeaks.pl function in Homer. A random set of peaks equal in size to the peak set being examined was then randomly sampled from the complete set of peaks across all patients. This procedure was repeated 1000 times and produced a distribution of expected motif counts.
The mean () and standard deviation () of this distribution was then calculated, and were then used to calculate a Z-score as where x is the number of motifs in the actual peak set. Here, a positive Z-score suggests that the number of motifs in the peak set is greater than what could be expected by chance, representing an enrichment of that motif. Transcription factor family-specific motifs with positive Z-score values selected for further analysis and were then linked to their putative target genes using either CHi-C or closest gene as described above in section 2.2.35.

ChIP-Seq data analysis
Raw sequencing reads were trimmed to remove low-quality sequences and adaptors using Trimmomatic. Processed reads were then aligned to the human genome version hg38 using Bowtie2 v2.2.3, with only uniquely aligned reads retained for further analysis. PCR duplicated reads were identified and removed from the alignments using the MarkDuplicates function in Picards tools. Peaks were called using MACS2 with default settings. The resulting peaks were then filtered using the hg38 Blacklist from ENCODE to remove potential artefacts from the data. Only peaks that were found within open chromatin regions as defined by ATAC-Seq data were retained. Peaks were then annotated to their target genes based on the Chi-C data were available or by closest gene using the annotatePeaks.pl function in Homer

Motif co-localization analysis
The genomic co-ordinates for each transcription factor binding motifs were retrieved from within the KO52 ATAC-Seq peaks that were found to be bound by C/EBPα, FOS and RUNX1 using the annotatePeaks.pl function in Homer. Motif co-occurrence was then measured by counting the number of times a pair of motifs were found within 50bp of each other in this peak set. To measure the significance of this co-occurrence, we next applied a re-sampling analysis whereby we sampled a random set of ATAC peaks from the set of all ATAC peaks found in KO52 cells. The number of motif pairs in this random set was then counted. This procedure was repeated 1000 times and produced a distribution of co-occurrence counts for each pair of motifs. A Z-score could then be computed for each pair as where x is the number of motif pairs in the C/EBPα, FOS and RUNX1 bound sites,  is the average number of motif pairs in 1000 random peak sets and  is the standard deviation. The resulting z-score matrix was then hierarchically clustered using complete linkage clustering of the Euclidean distance in R and displayed as a heatmap.

Single Cell RNA-Seq data analysis
Reads from single-cell RNA-Seq experiments were aligned to the human genome (version hg38) and quantified using the count function in CellRanger v3.1.0 from 10x Genomics and using gene models from Ensembl as the reference transcriptome. Unique Molecular Identifier (UMI) count data was filtered for low quality cells by removing cells with less than 500 and more than 5000 detectable genes. Cells that had more than 15% of UMIs aligned to mitochondrial transcripts were also excluded from further analysis. UMI counts were normalized using the log-normalize method in the Seurat package v3. Approximation and Projection (UMAP). Cluster marker genes, corresponding to genes that are significantly higher expressed in a cluster compared to all other cells outside of that cluster, were identified using the FindAllMarkers function. Genes that had an average log2fold change of at least 0.25 with an adjusted p-value less than 0.05 were selected as marker genes.
In order to classify a single-cell cluster as either blast or LSC, we first identified blast and LSC specific gene expression signatures using bulk RNA-Seq from sorted cell populations. To do this, raw paired-end reads from bulk RNA-Seq experiments were trimmed to remove lowquality reads and sequencing adaptors using Trimmomatic v0.39 (8). Processed reads were then aligned to the human genome (version hg38) using Hisat2 v2.1.0 (17). Gene expression was then measured as Fragments Per Kilobase of transcript per Million mapped reads (FPKM) using Stringtie v2.1.1 (18). Only genes that were considered expressed with an FPKM greater than or equal to 1 in either LSCs or blasts were retained for further analysis. Gene expression