Single-cell ATAC-seq (scATAC) yields sparse data that make conventional analysis challenging. We developed chromVAR (http://www.github.com/GreenleafLab/chromVAR), an R package for analyzing sparse chromatin-accessibility data by estimating gain or loss of accessibility within peaks sharing the same motif or annotation while controlling for technical biases. chromVAR enables accurate clustering of scATAC-seq profiles and characterization of known and de novo sequence motifs associated with variation in chromatin accessibility.
Thurman, R.E. et al. Nature 489, 75–82 (2012).
Tang, F. et al. Cell Stem Cell 6, 468–478 (2010).
Jaitin, D.A. et al. Science 343, 776–779 (2014).
Buenrostro, J.D. et al. Nature 523, 486–490 (2015).
Jin, W. et al. Nature 528, 142–146 (2015).
Cusanovich, D.A. et al. Science 348, 910–914 (2015).
Fan, J. et al. Nat. Methods 13, 241–244 (2016).
Corces, M.R. et al. Nat. Genet. 48, 1193–1203 (2016).
Farlik, M. et al. Cell Rep. 10, 1386–1397 (2015).
Weirauch, M.T. et al. Cell 158, 1431–1443 (2014).
van der Maaten, L. & Hinton, G. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Fujiwara, Y., Browne, C.P., Cunniff, K., Goff, S.C. & Orkin, S.H. Proc. Natl. Acad. Sci. USA 93, 12355–12358 (1996).
Ramos-Mejía, V. et al. Blood 124, 3065–3075 (2014).
Nerlov, C. & Graf, T. Genes Dev. 12, 2403–2412 (1998).
Gordon, S.M. et al. Immunity 36, 55–67 (2012).
Goardon, N. et al. Cancer Cell 19, 138–152 (2011).
Zhang, P. et al. Immunity 21, 853–863 (2004).
Roadmap Epigenomics Consortium et al. Nature 518, 317–330 (2015).
Lavin, Y. et al. Cell 159, 1312–1326 (2014).
Martin, M. EMBnet.journal 17, 10–12 (2011).
Langmead, B. & Salzberg, S.L. Nat. Methods 9, 357–359 (2012).
Zhang, Y. et al. Genome Biol. 9, R137 (2008).
Mathelier, A. et al. Nucleic Acids Res. 44, D110–D115 (2015).
Heinz, S. et al. Mol. Cell 38, 576–589 (2010).
Kheradpour, P. & Kellis, M. Nucleic Acids Res. 42, 2976–2987 (2014).
Korhonen, J., Martinmäki, P., Pizzi, C., Rastas, P. & Ukkonen, E. Bioinformatics 25, 3181–3182 (2009).
Danon, L., Díaz-Guilera, A., Duch, J. & Arenas, A. J. Stat. Mech. 2005, P09008 (2005).
This work was supported by National Institutes of Health (NIH) P50HG007735 (to W.J.G.), U19AI057266 (to W.J.G.), HG00943601 (to W.J.G.), the Rita Allen Foundation (to W.J.G.), and the Baxter Foundation Faculty Scholar Grant and the Human Frontiers Science Program (to W.J.G.). W.J.G. is a Chan Zuckerberg Biohub investigator. J.D.B. acknowledges support from the Harvard Society of Fellows and Broad Institute Fellowship. A.N.S. acknowledges support from the National Science Foundation (NSF) GRFP (DGE-114747). We thank C. Lareau for valuable suggestions for improvements on the package as well as members of the Greenleaf and Buenrostro labs for useful discussions.
J.D.B. and W.J.G. are listed as inventors on a patent (PCT/US2014/038825) for the ATAC-seq method. W.J.G. is a scientific cofounder of Epinomics.
Integrated supplementary information
(a) First, 1) the number of fragments per peak is determined for each cell, then 2) motifs or annotations of interest are assigned to peaks, and 3) the expected fragment count per peak per cell is determined assuming identical read probability per peak for each cell with a sequencing depth matched to that cell’s observed sequencing depth. (b) A “raw deviation” is calculated for each motif or annotation feature by summing the fragments in all peaks that contain that feature, then subtracting the expected number of fragments in all peaks containing that feature and dividing by that expected number. (c) A “raw deviation” is also computed for “background sets” of peaks matched in GC content and fragment count to the sets of peaks containing the features of interest. (d) The raw deviations for the background sets are used to compute a bias-corrected deviation and deviation Z-score for each feature. (e) Bias-corrected deviations and Z-scores can be used for a variety of downstream applications.
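The steps in panels (a) to (d) can be illustrated with a minimal Python sketch. This is not the chromVAR implementation (chromVAR is an R package); the function names and toy data are hypothetical, and the correction step (subtract the mean of the background raw deviations, then divide by their standard deviation for the Z-score) is an assumption consistent with the caption rather than a quotation of the package code.

```python
from statistics import mean, pstdev


def expected_counts(counts):
    """counts[i][j] = fragments in peak j for cell i.

    Assumes every cell shares the same per-peak read probability (the
    pooled fraction of fragments in that peak), scaled to the cell's depth.
    """
    total = sum(sum(row) for row in counts)
    n_peaks = len(counts[0])
    peak_fraction = [sum(row[j] for row in counts) / total for j in range(n_peaks)]
    return [[frac * sum(row) for frac in peak_fraction] for row in counts]


def raw_deviation(observed_row, expected_row, peak_set):
    """(observed - expected) / expected, summed over the peaks in peak_set."""
    obs = sum(observed_row[j] for j in peak_set)
    exp = sum(expected_row[j] for j in peak_set)
    return (obs - exp) / exp


def corrected_deviation_and_z(raw, background_raws):
    """Assumed correction: centre on the background mean, scale by its SD."""
    mu = mean(background_raws)
    sd = pstdev(background_raws)
    return raw - mu, (raw - mu) / sd


# Toy data: 2 cells x 4 peaks. The feature covers peak 0, and the two
# hypothetical background sets are peak 1 and peak 2.
counts = [[4, 0, 2, 2], [1, 3, 2, 2]]
expected = expected_counts(counts)
raw = raw_deviation(counts[0], expected[0], [0])
background = [raw_deviation(counts[0], expected[0], s) for s in ([1], [2])]
deviation, z_score = corrected_deviation_and_z(raw, background)
```

In practice many background sets would be drawn per feature, and the background peaks would be matched in GC content and fragment count rather than picked by hand as they are here.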
(a) Raw accessibility deviations for GM12878 cells for peak sets based on GC content. Peaks were grouped into 10 bins ordered from lowest to highest GC content. Two cells are highlighted in orange and purple. (b) Same as (a) except bins defined by average accessibility across the cells. (c) GC content versus average accessibility for a sample of peaks. (d) Same data from (c) after Mahalanobis transformation. Peaks are placed into “bins” based on the values of this transformed data; the grid lines show example bins. (e) For the peak indicated with the yellow diamond, the probability of selecting a peak from within a given bin as its background peak. (f) Similar to (e), but showing the probability of selecting an individual peak as the background peak for the indicated peak in terms of the untransformed space. (g-i) Distribution of GC content per peak for all peaks (grey), the given motif of interest (black), and a background set for that motif (red). (j-l) Distribution of average fragment count per peak (log10) for all peaks (grey), the given motif of interest (black), and a background set for that motif (red).
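The background-peak selection described in panels (c) to (f) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the package's code: the function names are hypothetical, the grid spans the range of the transformed data, and the bin-to-bin weighting is assumed to be a Gaussian kernel of width `w` measured in bin units, with each bin's weight shared equally among the peaks it contains.

```python
import math
from collections import Counter


def mahalanobis_transform(points):
    """Whiten 2-D points so they have zero mean and identity covariance."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    var_x = sum((x - mx) ** 2 for x, _ in points) / n
    var_y = sum((y - my) ** 2 for _, y in points) / n
    cov_xy = sum((x - mx) * (y - my) for x, y in points) / n
    # Cholesky factor of the 2x2 covariance matrix.
    l11 = math.sqrt(var_x)
    l21 = cov_xy / l11
    l22 = math.sqrt(var_y - l21 ** 2)
    out = []
    for x, y in points:
        u = (x - mx) / l11
        v = ((y - my) - l21 * u) / l22
        out.append((u, v))
    return out


def assign_bins(points, bs):
    """Place each point on a bs x bs grid spanning the data range."""
    per_dim = []
    for dim in range(2):
        lo = min(p[dim] for p in points)
        hi = max(p[dim] for p in points)
        width = (hi - lo) / bs
        per_dim.append([min(int((p[dim] - lo) / width), bs - 1) for p in points])
    return list(zip(per_dim[0], per_dim[1]))


def background_probabilities(bins, target, w):
    """Probability of drawing each peak as background for peak `target`."""
    occupancy = Counter(bins)
    tx, ty = bins[target]
    weights = []
    for bx, by in bins:
        d2 = (bx - tx) ** 2 + (by - ty) ** 2
        # Assumed kernel: Gaussian in bin distance, split across the bin's peaks.
        weights.append(math.exp(-d2 / (2 * w ** 2)) / occupancy[(bx, by)])
    total = sum(weights)
    return [wt / total for wt in weights]


# Toy peaks as (GC content, log10 average accessibility); values are made up.
peaks = [(0.40, 1.0), (0.45, 1.4), (0.50, 1.1), (0.55, 1.9), (0.60, 1.6), (0.65, 2.3)]
white = mahalanobis_transform(peaks)
bins = assign_bins(white, bs=3)
probs = background_probabilities(bins, target=0, w=1.0)
```

The sketch does not exclude the target peak from its own candidate pool, and it fails if the two covariates are perfectly collinear or constant; both cases would need handling in real use.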
Supplementary Figure 3 Effect of background peak selection parameters on variability of motifs and control sets.
All panels show effect of varying two parameters involved in background peak selection: the number of bins in each dimension (bs) and the smoothing parameter (w). (a) Variability of all motifs in JASPAR database using different parameter choices, (b) Variability of motif sets chosen to match JASPAR motifs in GC content and average accessibility, (c) Variability of sets of peaks chosen at random from specific quantiles of the GC content and/or average accessibility distributions.
Supplementary Figure 4 Effect of background peak selection parameters on similarity between motif sets and their background peak sets.
All panels show effect of varying two parameters involved in background peak selection: the number of bins in each dimension (bs) and the smoothing parameter (w). (a) Difference in GC content distribution for motif sets and their background peak sets as measured by Kolmogorov-Smirnov (KS) test statistic. (b) Difference in average accessibility (log) distribution for motif sets and their background peak sets as measured by Kolmogorov-Smirnov (KS) test statistic. (c) Fraction of peaks in background peak sets that contain the motif. For each panel, all motifs from the human core JASPAR database are shown. Peak sets are based on all single-cell ATAC-seq data sets shown in Figure 2. Points are colored based on the fraction of all peaks that contain the motif.
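For readers unfamiliar with the KS statistic used in panels (a) and (b), a minimal standard-library sketch of the two-sample version follows. The function name and the toy GC values are hypothetical; the statistic is the largest absolute gap between the two empirical cumulative distribution functions.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all x."""
    a_sorted = sorted(sample_a)
    b_sorted = sorted(sample_b)
    largest_gap = 0.0
    # The ECDFs only change at observed values, so checking those suffices.
    for x in a_sorted + b_sorted:
        ecdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        ecdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(ecdf_a - ecdf_b))
    return largest_gap


# Toy GC-content values for a motif peak set and its background set.
motif_gc = [0.1, 0.2, 0.3, 0.4]
background_gc = [0.3, 0.4, 0.5, 0.6]
gap = ks_statistic(motif_gc, background_gc)
```

A value near 0 indicates that the background set closely follows the motif set's distribution; a value of 1 indicates no overlap at all.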
Supplementary Figure 5 Effects of varying number of background iterations, sequence bias correction, and peak width.
(a) RMSD and normalized RMSD of variability when using a given number of background sets relative to when using 1000 background sets. Normalized RMSD is RMSD divided by the range of the variability when using 1000 background sets. (b) Same as (a) but for normalized deviations instead of variability. (c) Same as (a) but for deviation Z-scores instead of variability. (d) Correlation between Tn5 bias model average and GC content for peaks used in scATAC-seq analysis. (e) Correlation, per motif, between bias-corrected deviations obtained using GC content and those obtained using the Tn5 bias model for background peak selection, plotted against variability (determined using GC content for bias). (f) Variability of motifs across cells using either Tn5 bias model or GC content for background peak selection. (g-h) chromVAR was run using peaks fixed to a width of 100 bp, 250 bp, 500 bp, 750 bp, or 1000 bp based on windows around MACS2 summits or using the peaks output directly from MACS2 (variable width). (g) Correlation of the bias-corrected deviations for the top 50 most variable TF motifs when using different peak width choices. (h) Correlation of the variability for different motifs when using different peak width choices. (i) Average variability for the top 50 most variable motifs when using different peak width choices. For determining the top 50 most variable motifs in (h) and (i), the top 50 from each peak width strategy were included.
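The RMSD and normalized RMSD defined in panel (a) can be written out as a short Python sketch. The function names and the toy variability values are hypothetical; the reference vector stands in for the result obtained with 1000 background sets.

```python
import math


def rmsd(values, reference):
    """Root-mean-square deviation between two equal-length vectors."""
    squared = [(v - r) ** 2 for v, r in zip(values, reference)]
    return math.sqrt(sum(squared) / len(squared))


def normalized_rmsd(values, reference):
    """RMSD divided by the range (max - min) of the reference vector."""
    return rmsd(values, reference) / (max(reference) - min(reference))


# Toy per-motif variability: reference run versus a run with fewer
# background sets (numbers are illustrative only).
reference_variability = [1.0, 2.0, 3.0, 5.0]
fewer_sets_variability = [1.0, 2.0, 3.0, 3.0]
error = rmsd(fewer_sets_variability, reference_variability)
relative_error = normalized_rmsd(fewer_sets_variability, reference_variability)
```

Dividing by the range of the reference makes the error comparable across quantities with different scales, such as variability versus deviation Z-scores.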
Supplementary Figure 6 Screenshot example of interactive data browsing platform enabled with chromVAR outputs.
The plot on the left can be colored based on an annotation chosen from the dropdown menu above it. The plot on the right can be colored based on the normalized deviations of a motif chosen from the table above it. The table may be sorted based on any of the columns or searched for a particular name using the search box. The chromVAR R package includes a function for generating this type of interactive application.
(a) Comparison of the distribution of fragments in peaks for scATAC-seq and bulk ATAC-seq data down-sampled to 10,000 fragments. For scATAC-seq, cells from the LMPP and Monocyte populations are shown, with fragments mapped to the same peak set as used in the bulk analysis. (b) Pearson correlation between chromVAR computed variability for down-sampled data versus full data set. Box plot shows results from 10 different down-sampling iterations. (c) Pearson correlation between chromVAR deviation Z-scores for down-sampled data versus full data set. Each point represents the median correlation for a different motif for the 10 down-sampling iterations; only motifs in the top 20% of variability in the full data set are shown. Several key regulators are highlighted. This panel is analogous to Figure 1b, except the correlation is shown for the deviation Z-scores instead of the bias-corrected deviations. (d) Correlation of bias-corrected deviations between chromVAR applied to data down-sampled to 10,000 reads and applied to full data for each motif versus the variability as determined from the down-sampled data. In the top panel, points (motifs) are colored by their variability in the full data, while in the bottom panel they are colored by the number of peaks containing the motif. (e) Clustering accuracy when clustering based on correlation of chromVAR bias-corrected deviations or a variety of alternative peak-based approaches. For all methods, the resulting distance matrix was used as input to hierarchical clustering with complete linkage; 13 clusters were identified by cutting the tree, and the similarity of those clusters to the real labels was determined using normalized mutual information.
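The normalized mutual information score used in panel (e) can be sketched as below. The normalization shown here, twice the mutual information divided by the sum of the two label entropies, is an assumption: it matches the definition in Danon et al. (2005) from the reference list, but the caption does not state which variant was used. The function name and labels are hypothetical.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_a, labels_b):
    """NMI = 2 * I(A; B) / (H(A) + H(B)) for two labelings of the same items."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual = sum(
        (nij / n) * math.log(nij * n / (count_a[a] * count_b[b]))
        for (a, b), nij in joint.items()
    )
    entropy_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * math.log(c / n) for c in count_b.values())
    if entropy_a + entropy_b == 0:
        return 1.0  # both labelings put every item in a single cluster
    return 2 * mutual / (entropy_a + entropy_b)


# Cluster ids are arbitrary, so a relabelled but identical partition scores 1.
true_labels = ["LMPP", "LMPP", "Mono", "Mono"]
cluster_ids = [1, 1, 0, 0]
score = normalized_mutual_information(true_labels, cluster_ids)
```

Because the score depends only on how items are grouped, it can compare cluster assignments from a cut tree against known cell-type labels without matching cluster ids to label names.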
Variability in chromatin accessibility across single cells shown in Figure 2 for peak sets defined by (a) motifs or (b) 7mers.
(a) tSNE performed for all cell types. (b) tSNE performed only on HL60, LMPP, Monocyte, and AML cells. When using correlation (Pearson or Correlation), one minus the correlation was used as a distance input into Rtsne. When using distance or PCA, counts were first normalized using the total counts in peaks for each cell. When using PCA + Euclidean distance, we used the default settings for Rtsne, which performs PCA and uses the first 50 PCs to compute a distance matrix. For each method, the same perplexity (25) was used.
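The conversion from correlation to a distance (one minus the correlation) can be sketched in Python as follows. The function names and toy profiles are hypothetical, and only the Pearson case is shown; the resulting matrix is the kind of input that would then be passed to a tSNE implementation that accepts precomputed distances.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_distance_matrix(profiles):
    """Pairwise distance defined as 1 - Pearson correlation."""
    return [[1.0 - pearson(p, q) for q in profiles] for p in profiles]


# Toy per-cell profiles (e.g. deviations across three motifs).
cells = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
distances = correlation_distance_matrix(cells)
```

With this definition, perfectly correlated cells are at distance 0 and perfectly anti-correlated cells at distance 2, so the scale of each profile does not affect the result.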
(a) Number of peaks containing each kmer 1 mismatch away from ‘AGATAAG’. (b) The average accessibility of peaks containing each kmer 1 mismatch away from ‘AGATAAG’. For (a) and (b), the gray dotted line indicates the value for the ‘AGATAAG’ kmer itself. (c) Variability (top) and shared variability of 7mers with 1 bp mismatch (bottom) for the top 25 most variable kmers for which no kmer differing by a single base pair had greater variability. For each kmer, the shared variability with each kmer differing by a single base pair is shown, with the point colored based on the position of the mismatch within the kmer.
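The set of kmers "1 mismatch away" from a seed such as ‘AGATAAG’ can be enumerated with a few lines of Python. The function name is hypothetical, and this sketch does not collapse reverse complements, which an actual kmer analysis may do.

```python
def one_mismatch_neighbors(kmer, alphabet="ACGT"):
    """Return (position, neighbor) pairs differing from kmer at one base."""
    neighbors = []
    for pos, base in enumerate(kmer):
        for alt in alphabet:
            if alt != base:
                neighbors.append((pos, kmer[:pos] + alt + kmer[pos + 1:]))
    return neighbors


# A 7mer has 7 positions x 3 alternative bases = 21 single-mismatch neighbors.
gata_neighbors = one_mismatch_neighbors("AGATAAG")
```

Keeping the mismatch position alongside each neighbor makes it straightforward to color points by position, as done in panel (c).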
(a) Schematic of de novo motif assembly method. (b) Motifs identified de novo by chromVAR from accessibility deviations for 7mers. Each de novo motif is shown with the closest matching known motif below it. (c) The variability for both the de novo motif and the known motif. (d) Motif similarity score between the de novo motif and the known motif (see Methods). (e) The correlation between the normalized deviations of the de novo motif and the known motif.
Supplementary Figure 12 DNase accessibility variation and footprint at de novo motifs identified by chromVAR
(a) Comparison of accessibility variation between single cell ATAC-seq and DNase data for de novo motifs identified by chromVAR. Each panel corresponds to a different de novo motif and shows the DNase deviation Z-score along the x-axis and the median scATAC-seq deviation Z-score along the y-axis. (b) DNase footprints at de novo motifs identified by chromVAR. Each panel shows the DNase profile aggregated around all de novo motifs in the cell type with the highest accessibility deviation.
Supplementary Figure 13 chromVAR identifies relevant TFs distinguishing cell types based on scATAC-seq generated via combinatorial indexing protocol.
(a) Difference in bias-corrected deviations for motifs between GM12878 and HEK293T cells versus the p-value for the difference. Some of the most significant motifs are labelled. Identification of cells as being of one cell type or another was based on the labels previously inferred (Cusanovich et al. 2015). (b) Difference in variability of chromatin accessibility associated with motifs between GM12878 and HEK293T cells. (c) tSNE visualization of cells using chromVAR as the input. In the top panel, cells are labelled based on the cell type assignments previously inferred (Cusanovich et al. 2015). In the middle and bottom panels, cells are colored based on the deviation Z-scores for IRF8 and RELA, respectively, as examples of motifs with greater accessibility (IRF8 and RELA) and greater variability (RELA) in GM12878 cells.
Supplementary Figure 14 chromVAR identifies TF motifs associated with chromatin accessibility variation between macrophages resident in different tissues.
ATAC-seq data from Lavin et al. Sample labels for macrophages are based on the tissue of origin, while Monocytes and Neutrophils are labelled based on their cell type. (a) Bias-corrected deviations for variable TFs using data down-sampled to approximately 10,000 fragments per sample. (b) Bias-corrected deviations for variable TFs using full data. For both panels, TFs were chosen based on the variability using the down-sampled data, with highly correlated motifs omitted. The row and column ordering were based on clustering of the motifs and samples using the down-sampled data.
Supplementary Figure 15 chromVAR identifies TF motifs associated with chromatin accessibility variation across different tissues profiled using DNase-seq by the Roadmap Epigenomics Project.
Motifs are based on the collection published by the ENCODE consortium and used for analysis by the Roadmap Epigenomics Project. (a) Bias-corrected deviations for variable TFs using data down-sampled to approximately 10,000 fragments per sample. (b) Bias-corrected deviations for variable TFs using full data. (c) Enrichment scores computed by the Roadmap Epigenomics Consortium for each cell type based on the enrichment of motifs in various clusters of enhancers and the activity of each cluster in each tissue. For all panels, TFs were chosen based on the variability using the down-sampled data, with highly correlated motifs omitted. The row and column ordering were based on clustering of the motifs and samples using the down-sampled data.
About this article
Cite this article
Schep, A., Wu, B., Buenrostro, J. et al. chromVAR: inferring transcription-factor-associated accessibility from single-cell epigenomic data. Nat Methods 14, 975–978 (2017). https://doi.org/10.1038/nmeth.4401