Single-cell sortChIC identifies hierarchical chromatin dynamics during hematopoiesis

Post-translational histone modifications modulate chromatin activity to affect gene expression. How chromatin states underlie lineage choice in single cells is relatively unexplored. We develop sort-assisted single-cell chromatin immunocleavage (sortChIC) and map active (H3K4me1 and H3K4me3) and repressive (H3K27me3 and H3K9me3) histone modifications in the mouse bone marrow. During differentiation, hematopoietic stem and progenitor cells (HSPCs) acquire active chromatin states mediated by cell-type-specifying transcription factors, which are unique for each lineage. By contrast, most alterations in repressive marks during differentiation occur independent of the final cell type. Chromatin trajectory analysis shows that lineage choice at the chromatin level occurs at the progenitor stage. Joint profiling of H3K4me1 and H3K9me3 demonstrates that cell types within the myeloid lineage have distinct active chromatin but share similar myeloid-specific heterochromatin states. This implies a hierarchical regulation of chromatin during hematopoiesis: heterochromatin dynamics distinguish differentiation trajectories and lineages, while euchromatin dynamics reflect cell types within lineages.


Supplementary Discussion
Technologies to profile histone modifications in single cells by sequencing are still in their infancy, but has the potential to unlock the spectrum of chromatin states in the genome of individual cells. The ideal assay strives to have high sensitivity, high throughput, and robustness in both active and repressive chromatin states. Current techniques to map histone modifications in single cells use one of three approaches: ChIP-based, pA-Tn5-based, and pA-MNase-based. ChIP-based strategies utilize microfluidics systems or combinatorial barcoding to overcome the low sensitivity of ChIP 21-23 . pA-Tn5-based strategies profile histone modifications with very high throughput, but due to the intrinsic affinity of Tn5 to open chromatin regions [29][30][31][32][33][34] , high specificity can so far only be achieved at the cost of some sensitivity. pA-MNase-based methods profile histone modifications with high sensitivity, and have robust detection of modifications associated with euchromatic regions as well as heterochromatic regions, but has generally less throughput compared with Tn5-based methods [26][27][28] . SortChIC is a unique single-cell method that combines cell enrichment to greatly enhance throughput of rare cells, while achieving high sensitivity and robustness to profile active and repressive chromatin states (Extended Data Fig. 2 This comprehensive profiling of rare progenitors and their multiple cell fates enables new systematic analyses, such as quantifying chromatin dynamics that are cell fateindependent during differentiation. This analysis reveals that cell fate-independent changes during differentiation occur frequently for repressive chromatin, while such changes for active chromatin are rare. Our strategy combines rare progenitor cell enrichment with comprehensive differentiated cell type profiling to allow systematic analysis of chromatin dynamics during differentiation into multiple cell fates.  We selected cells with more than 500 total unique cuts for H3K4me1 and H3K4me3, and more than 1000 total unique cuts for H3K27me3 and H3K9me3. Cells also needed to have more than 50% of their cuts occur in an "AT" context. We also counted cut fragments that map in 50 kb nonoverlapping bins genome-wide, and calculated the fraction of bins that contains exactly zero cuts. Cells with a small fraction of zero cuts relative to other cells are more likely to have unspecific cuts. For each mark, we removed cells with a fraction of zero cuts that was below 2 standard deviations from the mean across all cells.

Comparison of sortChIC data with other single-cell chromatin profiling assays
In order to perform a fair comparison of sortChIC data with other similar assays, we For the analysis of enrichment, we took the de-duplicated counts in 50kb bins, and subsetted for 1 sample per batch (Bartosovic: cell_line_mix_v1, brain_P15_GFP_pos; Grosselin: jurkatramos, HBXc-95, Wu21: GBM, sortChIC: BM-SL-5, K562-all). We then calculated the Fraction of reads in the top 25% bins (ranked using the mean signal of cells), and Gini scores per cell.
If Xf_i is the count for bin i , the gini score for cell j is defined as:

Dimensionality reduction based on multinomial models
We counted the number of cuts mapped to peaks across cells and applied the latent Dirichlet allocation (LDA) model 39 , which is a matrix factorization method that models discrete counts across predefined regions as a multinomial mixture model. LDA can be thought of as a discrete version of principal component analysis (PCA), replacing the normal likelihood with a multinomial one 76 .
LDA models the genomic distribution of cuts from a single cell using a hierarchy of multinomials: 1. "⃑ $~ Dirichlet( ) to specify the distribution over genomic regions for each topic k (length G genomic regions).
2. " "⃑ )~ Dirichlet( ) to specify the distribution over topics for a cell i (length K topics).
To generate the genomic location of the jth read in cell i: 3. Choose a topic ),7~ Multinomial( " "⃑ 7 , 1) 4. Choose a genomic region ),7~M ultinomial( "⃑ : ;,< , 1) We used the LDA model implemented by the topicmodels R package 77 , to infer the cell-totopic matrix (analogous to the scores matrix in PCA) and topic-to-region matrix (analogous to the loadings matrix in PCA) using Gibbs sampling with hyperparameters =50/K, =0.1, where K is the number of topics. We used K=30 topics for all of our analyses.

Defining eight sets of blood cell type-specific genes for cell typing
We defined cell type-specific genes for cell type calling by counting reads at +/-5 kb centered

Batch correction in dimensionality reduction
Initial LDA of the count matrix revealed batch effects in H3K4me1 and H3K9me3 between cell types of plates that contained only one sorted type (i.e., entire plate was either unenriched, lineage-negative, or LSK cells, referred to as "single-type") and cell types from plates that contained a mixture of unenriched and non-mature cells (referred to as "balanced"). We corrected batch effects in H3K4me1, H3K4me3, and H3K9me3. Since H3K27me3 did not have single-type plates, we did not correct batch effects in H3K27me3. We considered balanced plates as the reference for differences between cell types, and corrected deviations in single-type plates to match the balanced plates. We used the imputed sortChIC-seq signal inferred from LDA as a denoised signal Y for each genomic region g for every cell c: M is the intercept of the model. C,7 is the effect from a cell belonging to cell type j. R,7 is the interaction effect from a cell belonging to cell type j and being from batch s (singletype plate).
is Gaussian noise.
We inferred the effects for each genomic region using lm() in R with the formula syntax: ~1 + batch + celltype + batch:celltype and estimated the batch-corrected signal: ] =^ if cell from complete plate − H − R,7 if cell from single-type plate .
For cells that belong to complete plates, P ( ) = 0. Therefore, this batch-correction only corrects signal from cells belonging to single-type plates. In the bone marrow analysis this corresponds to nine plates for H3K4me1, H3K4me3, and H3K9me3.
The corrected signal is used to refine the cell-to-topic and topic-to-region matrix by GLM-PCA 40 . We applied SVD to the batch-corrected matrix to use as initializations for U and V, and included batch ID as cell-specific covariates (in glmpca R package: glmpca(), with fam="poi", minibatch="stochastic", optimizer="avagrad", niterations = 500). The batch-corrected U matrix is then visualized with uniform manifold approximation and projection (UMAP).

Differential histone mark levels analysis
To calculate the fold change in histone mark levels at a genomic region between a cell type versus HSPCs, we modeled the discrete counts Y across cells as a Poisson regression. We fitted a null model, which is independent of cell type, and a full model, which depends on the cell type and compared their deviances to predict whether a region was "changing" or "dynamic" across cell types. We implemented the model in R using glm().
We used G as a deviance test statistic: where the deviance is two times the log-likelihood, which for Poisson is: For the full model, the logarithm of the expected value is: While for the null model, it is:

log( ) = M + H P ,
We fitted the model such that the estimated log2 fold change of a cell type j, u v w,< xyz(C) , is always relative to HSPCs.
Under the null hypothesis, G is chi-squared distributed with degrees of freedom equal to the difference in the number of parameters in the two models. We use this test statistic to estimate a p-value and infer whether a 50kb bin is "changing" or "dynamic" across cell types. For H3K4me1, H3K4me3, and H3K27me3, we used a Benjamini-Hochberg adjusted p-value of q<10 -50 . For H3K9me3, where fold changes were generally smaller, we used q<10 -9 . This separate cutoff for H3K9me3 allowed comparable number of differential bins for downstream analysis.

Calculating bins that change independent of cell type
We defined "changing bins" or "dynamic bins" using the deviance test statistic, detailed above in "Differential histone mark analysis", using a q-value<10 -50 for H3K4me1, H3K4me3, and H3K27me3, and q<10 -9 for H3K9me3. We defined these bins to be changing in a cell fateindependent manner if the estimated cell type effect, { C,7 , was either greater than 0 for all cell types (gained relative to HSPCs) or less than 0 for all cell types (lost relative to HSPCs).

Predicting Activities of Transcription Factors in Single Cells
We adapted MARA (Motif Activity Response Analysis) described in 42 to accommodate the sortChIC data. Briefly, we model the log imputed sortChIC-seq signal as a linear combination of TF binding sites and activities of TF motifs using a ridge regression framework: ] >,? = E >,} },?
• }GH + Where ] >,? is the batch-corrected sortChIC-seq signal in genomic region g in cell c; >,} is the number of TF binding sites in region g for TF motif m; },? is the activity of TF motif m in cell c; is Gaussian noise. The L2 penalty for ridge regression was determined automatically using a 80/20 cross-validation scheme. Z-scores of motifs greater than 0.7 were kept as statistically significant motifs.
The single-cell motif activity, },? , is then overlaid onto the UMAP to show cell typespecific activities.
For H3K4me1, we defined genomic regions based on peak calling from hiddenDomains. For repressive marks, where domains can be larger, we used 50 kb bins that were significantly changing across cell types as genomic regions.

Creating the TF binding site matrix
We predicted the TF binding site count occurrence under each peak using the mm10 Swiss Regulon database of 680 motifs (http://swissregulon.unibas.ch/sr/downloads). We used the Motevo method to predict transcription factor binding sites. Posterior probabilities < 0.1 are rounded down to zero.

Joint H3K4me1 and H3K9me3 analysis by double incubation
To simultaneously infer the H3K4me1 and H3K9me3 cluster from single-cell double-incubated cuts, we focused on regions that were most informative to distinguish between clusters in H3K4me1 and in H3K9me3. For H3K9me3, we used 6085 statistically significant changing bins (q<10 -9 , Poisson regression). For H3K4me1, we used regions near cell type-specific genes that were used to determine cell types from the data (811 regions). Since H3K4me1 had strong signal at both the TSS and gene bodies, we defined regions for each gene from transcription start site (TSS) to either its end site or 50 kb downstream of the TSS, whichever is smaller. We counted cuts mapped to these 6896 regions for H3K4me1, H3K9me3, as well as H3K4me1+H3K9me3 cells.
For a single cell, we assumed that the vector of H3K4me1+H3K9me3 counts ⃗ was generated by drawing N reads from a mixture of two multinomials, one from a cell type c from H3K4me1 (parametrized by relative frequencies ⃗ ? ) and one from a lineage l from H3K9me3 (parametrized by relative frequencies ⃗ n ): where w is the fraction of H3K4me1 that was mixed with H3K9me3.
Genomic region probabilities ⃗ ? and ⃗ n were inferred by the single-incubated data by averaging the imputed signal across cell types: where n is the set of cells that belong to lineage . V and U are estimated from LDA.
The log-likelihood for the H3K4me1+H3K9me3 counts coming from cluster pair (c, l), can be defined as: where g is a genomic region.
To assign a cluster pair to a double-incubated single cell, we calculated the loglikelihood for each possible pair (we had four lineages from H3K9me3 and eight clusters from H3K4me1, creating a 32 possible pairs) and selected the pair with the highest log-likelihood.
We used the Brent method implemented in R (optim) to infer w that maximizes the loglikelihood for each pair.
We also add basophils identified from our initial unsupervised clustering analysis as a reference cell type. We assume the vector of counts ⃗ of total counts N generated from a cell c belonging to cell type s is multinomial distributed, ( ⃗ ? | ? = ) ∼ Multinomial( , ⃗ P ) Where ⃗ are event probabilities (∑ ) = 1).
We use outputs from LDA to infer > | = for gene g and cell s by taking the mean probabilities across cells that belong to cell type s, {S}. Once we have the ⃗ P for each cell type s, we can calculate the log-likelihood of an unlabeled cell to belong to cell type s given a raw count vector ⃗, log Pr( ⃗| = ) = P ∝ E > log >,P . >∈{Ž} We can then calculate the probability of a cell c to belong to cell type s: We assign each cell to the most probable cell type label. Subroutines for this method is implemented as an R package: https://github.com/jakeyeung/annotatecelltypes

Inferring pseudotime across different differentiation trajectories
We manually selected two principal components (PCs) for each cell type trajectory, selecting components that show large variation from progenitors (HSCs, LT, ST, MPPs), committed progenitors (e.g., CMPs, MEPs), to mature cell types (e.g., neutrophils, DCs, basophils, monocytes, pDCs, NK cells, B cells) of interest. The selected cells are then fit along a onedimensional vector on the 2D space by finding the main eigenvector along the trajectory. For all trajectories, we used HSCs, LTs, STs, and MPPs to define the root. For myeloid cell types (neutrophils, DCs, basophils, and monocytes), we also included CMPs as committed progenitors. For erythroblasts, we included MEPs as commited progenitors. This eigenvector defines the pseudotime, with the end closest to HSCs defined as the root (pseudotime = 0) and the end closest to the mature cell type defined as the end state (pseudotime = 1).

Chromatin velocity in each histone modifications
After assigning each cell to a trajectory (cells annotated as HSCs, LTs, STs, and MPPs are annotated to all trajectories, while each mature cell type is assigned to one trajectory), we take each dynamic bin and fit a trajectory-specific cubic spline across pseudotime using the mgcv R package with the formula: mgcv::gam(formula = signal ~ s(pseudotime, k = 4, bs = "cs", by = trajectory), gamma = 10, method = "REML") We calculate the derivatives of this cubic spline for each bin along pseudotime using finite differences, as implemented by the gratia R package.
For each cell and for each bin, we take the derivatives of the cubic spline function with respective to pseudotime and project to a future pseudotime step of 0.01. Cells that are assigned to multiple trajectories (e.g., HSCs are assigned to all trajectories) take the mean of the future sortChIC signal across different trajectories. The high-dimensional predicted sortChIC signal at the future pseudotime step is then projected onto the PCA. We use the velocity grid flow visualization as implemented in velocity 78 to visualize the velocity vectors on the PCA space.