Unsupervised clustering and epigenetic classification of single cells

Characterizing epigenetic heterogeneity at the cellular level is a critical problem in the modern genomics era. Assays such as single cell ATAC-seq (scATAC-seq) offer an opportunity to interrogate cellular level epigenetic heterogeneity through patterns of variability in open chromatin. However, these assays exhibit technical variability that complicates clear classification and cell type identification in heterogeneous populations. We present scABC, an R package for the unsupervised clustering of single-cell epigenetic data, to classify scATAC-seq data and discover regions of open chromatin specific to cell identity.

Notably, scABC determined that the total number of clusters K is equal to one ( Supplementary Fig. 3), however, we set K = 4 to further investigate the robustness of our method to batch effects. Similarity is defined as the Spearman correlations of cells and landmarks (details in Method). For a better comparison of landmarks, throughout figures, we also normalize the correlations between each cell and landmarks using the mean of their absolute values. The majority of cells are split into two similar clusters that are not specific to any replicates, suggesting that scABC is not affected by batch biases. Although clusters 2 and 3 are quite analogous, we observe small variations between these two clusters that could indicate the heterogeneity of GM12878 cell line (see Supplementary Fig. 12 for detailed analysis). (c) Heat map for the read counts across peaks and cells. Each entry shows the number of reads in a peak (column) per cell (row), with rows colored by batch. Despite the variations in sequencing depth (see part a), Spearman based hierarchical clustering on landmark peaks does not reveal batch specific patterns. Supplementary Figure 5: scABC sensitivity for detection of small sub-populations. Each sub-population (a cell line from the in silico mixture) was separately downsampled, with cells removed at random, and then clustered using scABC. We calculated the cluster accuracy as the probability of downsampled cell sub-population forming a distinct cluster. A cluster is distinct provided that (i) at least 85% of the downsampled cell sub-population are clustered together, and (ii) within this cluster, at least 85% of the cells are from the downsampled cell sub-population. Cluster accuracy is then defined as (number of runs that results in a distinct cluster)/(total number of runs), where the total number of runs was set to 50. We observe that scABC identifies sub-populations that are as low as 1% of the total population. We also note that the performance is cell line specific, as we would expect dissimilar populations to be more easily recognized while similar populations are difficult to distinguish without sufficient representation.  Figure 6: scABC sensitivity for detection of similar sub-populations. We randomly picked 100 cells from each cell line of the in silico mixture, where sampling with replacement was performed for HL-60, TF-1, and H1 since the total number of available cells was less than 100. We then randomly divided the 100 cells into two groups of 50 cells and considered an arbitrary subset of peaks. For each cell in one group, we used a BJ cell with comparable sample depth (h i ) and replaced their read counts within the selected peaks. Cells from the two groups were then clustered using scABC and the cluster accuracy for cells with no replaced peak was defined similar to Supplementary Figure 5. Overall, scABC achieves 0.7 cluster accuracy when 30-50 percent of peaks are replaced (50%-70% cell line similarity). We emphasize that the six cell lines have a large number of peaks in common (see Fig. 1c) and thus, a small portion of replaced peaks derives dissimilarity between the two groups. This suggests that our analysis underestimates the true capability of scABC in distinguishing similar sub-populations. To better demonstrate our method performance, we also evaluated scABC clustering results on the RA-treated mESCs, a true biological mixture with highly similar cells compared to the in silico mixture (see Results).  , calculated using all narrow peaks from the in silico mixture of six cell lines (rows), with cell line assignments shown on the left. Notably, the well known H1 specific TF, POU5F1 (or other POU factors with similar motifs), is not detected while cluster specific narrow peaks capture the activity of POU motifs (Fig. 2a). (b) chromVAR t-SNE plot for the six clusters obtained by scABC, calculated using cluster specific narrow peaks, with cluster assignments shown on the right. Clear separation of the clusters indicates that the cluster specific peaks are responsible for the intrinsic variation in transcription factor motifs while using only 15% of all narrow peaks. (c) chromVAR t-SNE plot, calculated using all narrow peaks, with cluster assignments shown on the right. Here, separation of clusters is also comparable to part b. The cluster specific peaks for clusters 1, 2, and 3 are sparse, indicating that a few cells influence these clusters. (c) chromVAR deviations for the top 20 TF motifs (columns) and cells (rows), calculated using cluster specific peaks, shows that cluster 4 is enriched for cluster specific TF motifs (ATF4 and XBP1). Similar motifs are enriched in clusters 1 and 2, but, this was expected since they both have analogous landmarks. Dysregulation of JUN and JUNB were previously shown to be essential for leukemic stem cell function [4], which may explain the low activity of AP-1 in cluster 3. For each in silico cell line, we took 10 random subsamples of size approximately equal to 1%, 2%, . . . , 9% of the total population. We clustered the cells using scABC and K-medoids with Spearman dissimilarity. The cluster accuracy is defined as the proportion of cells in each cell line that are assigned to a cluster that is dominated by that cell line. The points are the average cluster accuracy and the error bars defined as two times the standard error of the mean.  HNF1A  HNF1B  GATA3  GATA2  GATA5  TEAD1  TEAD3  TEAD4  FOXC2  FOXA1  FOXP2  FOXP3  FOXO6  FOXO4  FOXD2  FOXI1  FOXL1  SOX9  SOX10  RFX5  RFX4  RFX2  RFX3  CTCF  POU2F1  POU5F1B  POU3F3  POU1F1  POU3F1  We clustered the RA-treated mESCs using K-medoids with Spearman dissimilarity. Here, we show chromVAR deviations for the 50 most variable TF motifs, calculated using all narrow peaks, with K-medoids cluster assignments depicted on the left. TFs are colored based on previous studies (Results, see [5] for RFX3). Motifs enriched in cluster 1 are partially enriched in cluster 2, suggesting that the clustering given by naive K-medoids is not biologically meaningful.

Supplementary Tables
DNase-seq peaks top 50,000 condition 1 condition 2       Table 7: The number of cells from the in silico mixture designated to the six clusters defined by naive K-medoids with Spearman dissimilarity measure. K-medoids arrives at 10 misclassifications, slightly worse than scABC (Supplementary Table 2).  Table 9: We binned the reads into large genomic intervals (100 kb) based on the windowCounts function in the csaw bioconductor package [6] and removed regions that had no counts across cells. We then used these large intervals to cluster cells using naive K-medoids with Spearman dissimilarity measure. Here, we show the number of cells from the six cell line mixture corresponding to the six clusters defined by K-medoids. In total 34 cells are misclassified, higher than the 10 misclassifications by naive K-medoids on the unbinned counts (Supplementary Table 7     Supplementary Table 13: Clustering results for RA-treated mESCs obtained from scABC and naive K-medoids with Spearman dissimilarity with K = 2 for both methods. K-medoids divides the mixture into two groups with the similar number of cells. These two groups are not in agreement with the analysis of TF enrichment ( Supplementary Fig. 16).  Table 14: scABC, applied to the in silico mixture of six cell lines, is robust to ranges of various and c in the weighting scheme. Rows and columns correspond to and c, respectively. Entries represent the number of cells that are correctly clustered (out of the 966 total cells), assuming the majority of cluster's cells with the same type shows the cluster identity.

Data processing
Single cell chromatin accessibility data The scATAC-seq data from [2] was used to construct a in silico heterogeneous population, publicly available in the Gene Expression Omnibus (GEO) under the accession number GSE65360. Single-cell data for LSC, blast, LMPP, and monocyte cells are available under GEO accession GSE74310 [7]. GSE68103 also provides sequencing data for the mixtures of GM12878/HEK293T and GM12878/HL-60 cell lines [1]. In addition, we generated 96 RA-treated mESCs (see Results and Methods), available at GEO with the accession GSE107651.

Read alignment and peak calling
We used the Kundaje pipeline (https://github.com/kundajelab/atac_dnase_pipelines) to process the raw scATAC-seq reads, aligning reads to hg19 and removing duplicates (mm9 for RA-treated mESCs). Alternative schemes may improve data preprocessing, however, our main objective is to determine cell identities using only the aligned sequencing reads as inputs. Similar to previous studies, such as [2], we employed MACS2 to perform peak calling on accessibility data aggregated across all cells [8]. MACS2 generates i) narrow peaks generally called for transcription factor binding sites and ii) broad peaks suitable for detection of functional regulatory elements such as promoters and enhancers. We utilized both narrow and broad regions, called gapped peaks, to better classify cell identities.
For the GM12878/HEK293T and GM12878/HL-60 mixtures, sequenced through combinatorial cellular indexing, we used the processed data available through GEO, accession number GSE68103.

Reliable peaks
Given the sparse nature of single-cell data, we suspected that open chromatin peaks, identified using aggregated data, may differ from the bulk data. To characterize the variability between bulk and merged scATAC-seq data, we focused on K562 myeloid leukemia cell line because it has numerous publicly available bulk datasets. Specifically, we compared aggregated single cell accessibility peaks to DNase hypersensitivity regions retrieved from the ENCODE project, that we refer to as bulk peaks. As indicated in Supplementary Table 1, roughly 30% of the top 50,000 peaks from aggregated data have overlaps with the bulk peaks. We assume that peaks A and B intersect if B overlaps at least 10% of A, or vice versa. Peaks that appear in multiple cells are naturally more reliable than peaks that appear in only one single cell, similar to SNP calling in single cell analysis [9]. Therefore, we focused on the subset of peaks that are present (1 or more reads) in at least 10 single cells, resulting in 19,331 peaks. The peak intersections between the selected subset and bulk data was significantly improved compared to the top 50,000 peaks of the aggregated data (columns 3 in Supplementary Table 1). Increasing the minimum number of reads to 2 slightly reduces peaks to 17,484 and slightly improves the intersections (columns 4 in Table 1). We also observed that an additional growth in the number of required reads or cells notably decreases reliable peaks.
For the analysis of in silico cell lines, GM12878/HEK293T, and GM12878/HL-60 mixtures, we used peaks that are presented in 10 cells with 2 reads. We obtained 94,934 peaks for the in silico cell lines, 12,938 for GM12878/HEK293T mixture, and 10,431 for GM12878/HL-60 mixture. However, for the leukemia mixture and RA-treated mESCs, we considered peaks with minimum 2 reads in at least 5 cells since the leukemia mixture has less than 91 extremely sparse samples for each cell type and only 95 RA-treated mESCs are available. In total, we arrived at 52,336 peaks for the leukemia mixture and 9,661 for RA-treated mESCs.

Sample quality and depth
We only considered scATAC-seq samples that have at least min(500, number of reliable peaks/50) read counts across reliable peaks, leading to 966 cells for the in silico mixture of six cell lines, 95 for the RA-treated mESCs, and 390 for the leukemia, LMPP, and monocyte mixture. However, we did not apply this thresholding to GM12878/HEK293T (748 cells) and GM12878/HL-60 (700 cells) mixtures due to the extreme sparsity of these datasets (Supplementary Figure 1). We define background regions as 500,000 base pairs upstream and downstream from reliable peak centers. We use the median of background reads for each cell, denoted as h i in Method, to represent sample depth.

Single cell expression data and analysis
Raw sequencing files were downloaded from SRA and mapped with STAR [10] to hg38 using the GENCODE v24 transcriptome annotation. Gene counts were obtained with HTSeq [11] using the intersection-nonempty mode and converted to transcripts per million (TPM) for subsequent analysis. We corrected for batch effects with the R package scran [12].

Choosing the number of clusters
To systematically determine the number of clusters for the weighted K-medoids algorithm, we take advantage of the gap statistic [13] with some modifications. Let W K represent a measure for within-cluster dispersion, which is a function of the cluster number K. The gap statistic compares the curve log W K between the data and the expectation under the null reference distribution of the data [13]. The number of clusters K can be determined using the following procedure: Step 1: Perform the weighted K-medoids clustering algorithm on the observed data, varying the total number of clusters from 1 to the maximum number of clusters N , K = 1, · · · , N, and use the weighted K-medoids optimized objective function in the last iteration as W K .
Step 2: Generate B reference data sets, by performing permutation for each feature (peak). Let Y denote the cell⇥feature data matrix, the permutation shuffles the entries in each column. Perform the weighted K-medoids clustering algorithm on the B reference data sets, varying the total number of clusters, and calculate the weighted K-medoids objective W ⇤ Kb , for b = 1, · · · , B and K = 1, · · · , N.
Step 3: compute the estimated gap statistics and the standard deviation Finally we choose the number of clusters bŷ K = smallest K such that Gap(K) Gap(K + 1) sd(K + 1) We made two major modifications compared with the originally proposed gap statistic [13]. First, we use the objective function of the weighted K-medoids algorithm as the within cluster dispersion W K . The original version uses the within cluster sum of pairwise distance. This modification accounts for the presence of noisy samples with low sequencing depth. Second, we perform permutation to generate the null reference distribution. The originally version uses simulated data uniformly distributed over a rectangle containing the data. This modification accounts for the sparsity in single cell data.
In practice, we found that if the data matrix is excessively sparse, even performing permutation may not work well. Therefore we only use the top 5, 000 peaks, those with the highest read counts when summing reads across all cells, to select the number of clusters.

Computation of the maximum likelihood estimate of regression coefficient
The maximum likelihood estimate mle can be computed using the Fisher scoring method [14]. Let X denote the n ⇥ K design matrix, where n is the number of cells and K is the number of clusters, with entry x ik = 1 if cell i belongs to cluster k and x ik = 0 otherwise. The computation can be implemented separately for each peak. Let Y r and r represent the read counts for each peak and coefficients for cell r, respectively.
In Fisher scoring method, r is updated iteratively. At the (t + 1)th iteration, the update for r is r is a column vector with the ith entry µ r is a n ⇥ n diagonal matrix with entries (µ (t) ri ) i=1,··· ,n on the diagonal. We initialize the algorithm from a vector of all zeros and iterate until convergence. In the iterations, the matrix X T W (t) r X can be close to singular for some peaks, especially for the peaks with low read counts. We do not incorporate those peaks in estimating the empirical prior r .
To evaluate our unsupervised clustering algorithm on experimental mixtures, scABC was applied to the GM12878/HEK293T and GM12878/HL-60 cells, where K = 2 clusters were detected for both mixtures (Supplementary Fig. 3). As illustrated in Supplementary Figure 10, landmarks distinguish GM12878 from HEK293T cells with a few misclassifications compared to the cell line assignments of [1]. Note that our mispredictions could be correct since the ground truth is not known and the computationally determined cell lines in [1] may differ from the ground truth. In addition, our procedure provides cell line specific peaks, which were determined by bulk data in the previous analysis. In a similar fashion, Supplementary Figure 11 exhibits our clustering results on GM12878/HL-60 cells, which are in a good agreement with [1]. Notably, cells are well clustered considering that GM12878 and HL-60 are two analogous cell lines and this mixture has lower sequencing depth compared to GM12878/HEK293T.

The evaluation of cluster specific peaks
We used scABC and SCDE [18] to obtain cell type specific peaks for the experimental mixtures of GM12878/HEK293T and GM12878/HL-60 (cell types were defined based on scABC clustering results). SCDE is a Bayesian approach that identifies differentially expressed genes between two groups of samples only (here, we replace genes with peaks). Therefore we did not consider the in silico mixture of six cell types. To evaluate the accuracy of cell type specific peaks, we relied on the bulk DNase-seq from the ENCODE project (www.encodeproject.org) with accession numbers ENCSR000EMT, ENCSR000EJR, and ENCSR000ENU for GM12878, HEK293T, and HL-60, respectively. In particular, chromatin regions that are differentially accessible between GM12878 and HEK293T (also GM12878 and HL-60) bulk samples were treated as the gold standard. To determine these regions, we were not able to take advantage of existing methods such as DESeq2 [19] due to the lack of biological replicates for the DNase-seq samples. Instead, we applied the following heuristic method. Given a peak with length L, we define R as the number of read counts mapped to the peak. We also define R 0 as the number of read counts within 500,000 base pairs upstream and downstream from the peak center (background). We then calculate the bulk accessibility of the peak as R R 0 1000, 000 L , which is read counts fold change between the peak and background, normalized by their lengths. We assume that scABC (or SCDE) correctly identifies a peak, for instance, specific to HEK293T in the GM12878/HEK293T mixture, if the peak has i) the bulk accessibility > 2 in HEK293T and ii) bulk accessibility in HEK293T bulk accessibility in GM12878 > T We set T to 2 and 3 in Supplementary Figures 17 and 18, respectively. Condition i examines if the peak is open in the bulk data and condition ii tests if the open peak is differential in the mixture. Applying the above conditions, we found that scABC calculated cell type specific peaks have 10%-60% higher overlaps with bulk differentially open regions, compared to SCDE (Supplementary Figs. 17 and 18). In addition, the top cell type specific peaks identified by scABC (i.e. the smallest p-values) are almost distinct across the cell types ( Supplementary Figs. 10 and 11) while SCDE's top peaks (i.e. the largest/smallest z-scores) are not well separated, especially in the GM12878/HL-60 mixture (Supplementary Fig. 19).

The evolution of acute myeloid leukemia
To examine the capability of scABC in characterizing cancer progression, we drew our attention to leukemic evolution. Corces et al. [7] performed scATAC-seq on LSC and blast cells from two patient with leukemia as well as scATAC-seq on healthy LMPP and monocyte cells. In that study, bulk ATAC-seq data across myeloid developmental stages was obtained to build principle components (PC) that represent myeloid progression. Single cells were then projected onto myeliod PCs to determine the relationship between leukemia cells and myelopoietic differentiation.
We used scABC to cluster the mixture of LSC, blast, LMPP and monocyte cells without the use of bulk data. The sequencing depth of these cells is low (median: 13,937 unique mapped reads) compared to the previously analyzed mixture of six lines (median: 65,186 unique mapped reads), presenting significant computational challenges ( Supplementary Fig. 1). As suggested by previous studies [20,21], we hypothesized that distal regulatory elements better capture blood cell identities. Hence, we collected predicted human enhancers from FANTOM5 (http://fantom.gsc.riken.jp/5/), ENCODE [22], and the Yue Lab (http://promoter.bx.psu. edu/ENCODE/download.html) and filtered out peaks that are not placed within the enhancers, leading to 7,602 peaks. We applied scABC to the total mixture of blood cells, resulting in K = 2 clusters (Supplementary Fig. 3). Supplementary Table 5 shows the clustering results and Supplementary Figure 13 depicts landmarks separability, heat map of cluster specific peaks, and chromVAR results. We observed that LMPPs and monocytes are clustered separately with the majority of blasts coupled with monocytes and LSCs mostly associated with LMPP. Furthermore, chromVAR identifies cluster specific active transcription factors through motif enrichment analysis. Such clear separation was not achievable when we did not incorporate enhancers into our analysis.
We also suspected that intermediate stages between LMPP and monocyte might be missing and thus, increased the number of cluster to four. As demonstrated in Supplementary Table 6 and Figure 14, one cluster dominated by LSCs and blasts might correspond to intermediate stages while others reflect LMPP and monocyte stages. In both scenarios, our conclusion is consistent with Corces et al. [7].