A genetic disorder reveals a hematopoietic stem cell regulatory network co-opted in leukemia

The molecular regulation of human hematopoietic stem cell (HSC) maintenance is therapeutically important, but limitations in experimental systems and interspecies variation have constrained our knowledge of this process. Here, we have studied a rare genetic disorder due to MECOM haploinsufficiency, characterized by an early-onset absence of HSCs in vivo. By generating a faithful model of this disorder in primary human HSCs and coupling functional studies with integrative single-cell genomic analyses, we uncover a key transcriptional network involving hundreds of genes that is required for HSC maintenance. Through our analyses, we nominate cooperating transcriptional regulators and identify how MECOM prevents the CTCF-dependent genome reorganization that occurs as HSCs differentiate. We show that this transcriptional network is co-opted in high-risk leukemias, thereby enabling these cancers to acquire stem cell properties. Collectively, we illuminate a regulatory network necessary for HSC self-renewal through the study of a rare experiment of nature.


Identification of direct interactions.
To find the interactions between genes and cisREs, we searched for all possible connections between gene and cisREs within 500 kb of gene TSS. We used two criteria to define the interactions which the cisRE could exert a direct effect on gene regulation: (1) experimental evidence of physical interaction in three-dimensional space or (2) a strong correlation between chromatin accessibility of cisRE and target gene expression. To this end, we annotated the nominated links to assess whether cisREs and target genes are spatially colocalized (i.e. in a chromatin loop). A published dataset spanning 17 hematopoietic cell types of promoter capture Hi-C (PCHi-C) data was used 28 and only loops with CHiCAGO score > 5 were considered. Next, we computed ATAC-seq reads falling within cisREs across the hematopoietic cell populations and performed normalization using the count per million (CPM) method. We calculated the Pearson correlation coefficient between chromatin accessibility of cisREs and gene expression across 16 hematopoietic cell types for each possible interaction pair. To determine the significance, we applied Fisher's z transformation to correlation coefficients. All the interactions with > 0.345 (equivalent to P value < 0.05) were kept. Finally, the nominated links that passed either of these two analyses were retained and a total of 1,218,933 direct interactions were identified.

Identification of indirect interactions.
A gene regulatory network is established by a chain of cisREs which connect to the target though direct or indirect manners. Previous studies 35 reported that a number of cooperative cisREs could associate with the promoter and other cisREs related in multi-way contacts in chromatin loops. Co-accessible chromatin has been reported to be highly connected and functionally related, which is useful to evaluate the connectivity between cisREs. To identify the indirect interactions, we first computed the co-accessibility across 18 cell types between cisREs (not including gene promoters) whose genomic distance less than 500 kb. By using the Pearson correlation measurement and Fisher's z transformation as described above, the co-accessible cisRE-cisRE links with a correlation coefficient > 0.362 (equivalent to P value < 0.05) were selected. Next, to find the shortest path between a cisRE and its target promoter, we constructed a regulatory network using the direct gene-cisRE interactions and co-accessible cisRE-cisRE links, and found the shortest paths between cisREs and genes in this network. Specifically, the network was built using the igraph R package with gene-cisRE interactions and cisRE-cisRE links. Dijkstra's algorithm is designed for searching for the shortest paths between nodes in a graph. In our network, we used this method to find all the potential indirect interactions mediated by the cisREs that have direct gene interactions identified in the first step of our analysis. Given that a smaller weight indicates a greater chance in participating in the shortest path found by the Dijkstra's method, we added the weight to each edge in the network: weight of a pseudo number of 1e-5 for direct gene-cisRE interactions and for cisRE-cisRE links, respectively. All of the gene-cisRE pairs that did not pass the direct interaction identification were analyzed by Dijkstra's method. The cisREs were filtered out if they were not linked to any gene. In total, 4,315,536 interaction pairs are included in HemeMap.

HSC specific regulatory network.
To define the strengths of cis-regulatory interactions in each cell type, we calculated the HemeMap score by using the geometric mean of ATAC-seq signal over all the cisREs involved in each interaction to avoid potential bias introduced by the outliers. To get the HSC-specific regulatory network, we used the cumulative Chi-Square distribution to determine an interaction strength threshold of greater than 8.91 which filtered out 95% of the interactions. The remaining interactions were used to build an HSC-specific regulation network containing 12,808 genes and 372,491 cisREs.

Benchmarking of HemeMap.
To validate the activity of cisREs, we used different epigenomic marks including those present active enhancers (H3K4me1 and DNase I) and those present repressive domains (H3K27me3) of HSPCs from the Roadmap Consortium 31 . We also employed genomic interaction data of HiC 30 and predicted interactions in HSPCs from the Roadmap Consortium 31 to validate that the nominated interactions are active in the HSPC compartment.

De novo motif discovery
To explore the MECOM mediated regulatory network, we retrieved all of the cisREs associated with MECOM network genes identified as differentially expressed after MECOM editing. We used the 200 bp sequences centered on cisREs, i.e. the genomic regions around summits of peaks or TSS, as input for the de novo motif discovery analysis. The MEME suite 61 was used and all the motifs with reported E value < 1e-20 were collected from results of DREME 62 and MEME. Similarity of de novo motifs and the putative TF motifs from a comprehensive collection of 401 human TFBS models (HOCOMOCO V11) 63 was performed using Tomtom 64 . We also correlated the similarity of the ETS family motif identified via de novo motif discovery with the EVI1 binding motif from a published dataset 11 by calculating the Pearson correlation coefficient of the Position Frequency Matrix (PFM) of the two motifs using universal motif R package.

TF Footprinting analysis
A TF footprint is a particular pattern of Tn5 enzyme cleavage sites generated by ATAC-seq data that enables analysis of chromatin occupancy at the base-pair resolution. There is a depletion of cleavage events at the specific site of TF binding on open chromatin, which allowed for the identification of TF binding events with the consensus motifs of interest from the de novo motif discovery analysis 65 . For each de novo motif, including ETS, RUNX, JUN, KLF, CTCF and GATA, we scanned all of the consensus motif sequences that occur within the cisREs in MECOMmediated regulatory network using the software FIMO 66 with default parameters, except for a significance threshold of 5e-4. To create a nucleotide resolution cleavage frequency profile for each TF, we used make_cut_matrix function (https://github.com/Parkerlab/atactk) to count the Tn5 enzyme cleavage frequency at the recognized motif sites and their flanking +/-250 bp sequences, using ATAC-seq data from HSCs. Then, we used CENTIPEDE 67 to build an unsupervised Bayesian mixture model with the cleavage frequency profile to generate a posterior probability value for each motif instance. A motif instance was considered a footprint that is bound by a particular TF when the posterior probability score was greater than 0.95. The plot of cleavage frequency around the footprints was created by aggregating both strands using a custom R script.

Footprint co-occurrence analysis
To explore how these TFs cooperate with each other via combinatorial binding on the cisREs of MECOM network genes, we evaluated the co-occurrence of the TF footprints. Specifically, a hypergeometric test was employed to determine the statistical significance of co-occurrence of two different footprints, as depicted by the following equation: where is the total number of cisREs, 1 and 2 are the number of cisREs containing footprints of each of the two tested TFs, respectively. P value measuring the significance of enrichment is the tail probability of observing ′ or more cisREs containing both TF footprints.