chromVAR: inferring transcription-factor-associated accessibility from single-cell epigenomic data

Schep, Alicia N; Wu, Beijing; Buenrostro, Jason D; Greenleaf, William J

doi:10.1038/nmeth.4401

Brief Communication
Published: 21 August 2017

Alicia N Schep^1,2,
Beijing Wu^1,2,
Jason D Buenrostro^3,4 &
…
William J Greenleaf ORCID: orcid.org/0000-0003-1409-3095^1,2,5,6

Nature Methods volume 14, pages 975–978 (2017)

Abstract

Single-cell ATAC-seq (scATAC) yields sparse data that make conventional analysis challenging. We developed chromVAR (http://www.github.com/GreenleafLab/chromVAR), an R package for analyzing sparse chromatin-accessibility data by estimating gain or loss of accessibility within peaks sharing the same motif or annotation while controlling for technical biases. chromVAR enables accurate clustering of scATAC-seq profiles and characterization of known and de novo sequence motifs associated with variation in chromatin accessibility.

**Figure 1: chromVAR enables interpretable analysis of sparse chromatin-accessibility data.**

**Figure 2: chromVAR can be used to cluster single cells and interpret motifs underlying chromatin-accessibility variation.**

**Figure 3: chromVAR identifies *de novo* motifs associated with chromatin-accessibility variation in single cells.**

Accession codes

Primary accessions

Gene Expression Omnibus

References

Thurman, R.E. et al. Nature 489, 75–82 (2012).
Tang, F. et al. Cell Stem Cell 6, 468–478 (2010).
Jaitin, D.A. et al. Science 343, 776–779 (2014).
Buenrostro, J.D. et al. Nature 523, 486–490 (2015).
Jin, W. et al. Nature 528, 142–146 (2015).
Cusanovich, D.A. et al. Science 348, 910–914 (2015).
Fan, J. et al. Nat. Methods 13, 241–244 (2016).
Corces, M.R. et al. Nat. Genet. 48, 1193–1203 (2016).
Farlik, M. et al. Cell Rep. 10, 1386–1397 (2015).
Weirauch, M.T. et al. Cell 158, 1431–1443 (2014).
van der Maaten, L. & Hinton, G. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Fujiwara, Y., Browne, C.P., Cunniff, K., Goff, S.C. & Orkin, S.H. Proc. Natl. Acad. Sci. USA 93, 12355–12358 (1996).
Ramos-Mejía, V. et al. Blood 124, 3065–3075 (2014).
Nerlov, C. & Graf, T. Genes Dev. 12, 2403–2412 (1998).
Gordon, S.M. et al. Immunity 36, 55–67 (2012).
Goardon, N. et al. Cancer Cell 19, 138–152 (2011).
Zhang, P. et al. Immunity 21, 853–863 (2004).
Roadmap Epigenomics Consortium et al. Nature 518, 317–330 (2015).
Lavin, Y. et al. Cell 159, 1312–1326 (2014).
Martin, M. EMBnet.journal 17, 10–12 (2011).
Langmead, B. & Salzberg, S.L. Nat. Methods 9, 357–359 (2012).
Zhang, Y. et al. Genome Biol. 9, R137 (2008).
Mathelier, A. et al. Nucleic Acids Res. 44, D110–D115 (2015).
Heinz, S. et al. Mol. Cell 38, 576–589 (2010).
Kheradpour, P. & Kellis, M. Nucleic Acids Res. 42, 2976–2987 (2014).
Korhonen, J., Martinmäki, P., Pizzi, C., Rastas, P. & Ukkonen, E. Bioinformatics 25, 3181–3182 (2009).
Danon, L., Díaz-Guilera, A., Duch, J. & Arenas, A. J. Stat. Mech. 2005, P09008 (2005).

Acknowledgements

This work was supported by National Institutes of Health (NIH) P50HG007735 (to W.J.G.), U19AI057266 (to W.J.G.), HG00943601 (to W.J.G.), the Rita Allen Foundation (to W.J.G.), and the Baxter Foundation Faculty Scholar Grant and the Human Frontiers Science Program (to W.J.G.). W.J.G. is a Chan Zuckerberg Biohub investigator. J.D.B. acknowledges support from the Harvard Society of Fellows and Broad Institute Fellowship. A.N.S. acknowledges support from the National Science Foundation (NSF) GRFP (DGE-114747). We thank C. Lareau for valuable suggestions for improvements on the package as well as members of the Greenleaf and Buenrostro labs for useful discussions.

Author information

Authors and Affiliations

Department of Genetics, Stanford University School of Medicine, Stanford, California, USA
Alicia N Schep, Beijing Wu & William J Greenleaf
Center for Personal Dynamic Regulomes, Stanford University, Stanford, California, USA
Alicia N Schep, Beijing Wu & William J Greenleaf
Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
Jason D Buenrostro
Harvard Society of Fellows, Harvard University, Cambridge, Massachusetts, USA
Jason D Buenrostro
Department of Applied Physics, Stanford University, Stanford, California, USA
William J Greenleaf
Chan Zuckerberg Biohub, San Francisco, California, USA
William J Greenleaf

Contributions

A.N.S., J.D.B., and W.J.G. conceived the project and wrote the manuscript. A.N.S. wrote the chromVAR R package and performed the analyses with input from J.D.B. and W.J.G. B.W. generated the scATAC-seq data.

Corresponding authors

Correspondence to Jason D Buenrostro or William J Greenleaf.

Ethics declarations

Competing interests

J.D.B. and W.J.G. are listed as inventors on a patent (PCT/US2014/038825) for the ATAC-seq method. W.J.G. is a scientific cofounder of Epinomics.

Integrated supplementary information

Supplementary Figure 1 chromVAR workflow

(a) First, (1) the number of fragments per peak is determined for each cell, then (2) motifs or annotations of interest are assigned to peaks, and (3) the expected fragment count per peak per cell is determined, assuming an identical read probability per peak for each cell and a sequencing depth matched to that cell’s observed sequencing depth. (b) A “raw deviation” is calculated for each motif or annotation feature by summing the fragments in all peaks that contain that feature, then subtracting the expected number of fragments in all peaks containing that feature and dividing by that same expected number. (c) A “raw deviation” is also computed for “background sets” of peaks matched in GC content and fragment count to the sets of peaks containing the features of interest. (d) The raw deviations for the background sets are used to compute a bias-corrected deviation and deviation Z-score for each feature. (e) Bias-corrected deviations and Z-scores can be used for a variety of downstream applications.
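The workflow in panels (a) to (d) can be sketched in a few lines of plain Python. This is an illustrative sketch, not the chromVAR R implementation: the function names and the toy count matrix are hypothetical, and the bias correction shown here (subtracting the mean of the background raw deviations, then dividing by their sample standard deviation to obtain the Z-score) is an assumption about how the background sets are used.

```python
from statistics import mean, stdev

def expected_counts(counts):
    """counts[c][p] = fragments of cell c in peak p.

    Every cell is assumed to share the same per-peak read probability,
    scaled to that cell's own total fragment count.
    """
    total = sum(sum(row) for row in counts)
    peak_totals = [sum(col) for col in zip(*counts)]
    return [[sum(row) * pt / total for pt in peak_totals] for row in counts]

def raw_deviation(counts, expected, cell, peaks):
    """(observed - expected) / expected over the peaks carrying a feature."""
    observed = sum(counts[cell][p] for p in peaks)
    exp = sum(expected[cell][p] for p in peaks)
    return (observed - exp) / exp

def corrected_deviation(counts, expected, cell, peaks, background_sets):
    """Return (bias-corrected deviation, Z-score) for one cell and feature.

    Assumption: correction uses the mean and sample standard deviation of
    the raw deviations of the background peak sets.
    """
    raw = raw_deviation(counts, expected, cell, peaks)
    background = [raw_deviation(counts, expected, cell, b) for b in background_sets]
    deviation = raw - mean(background)
    return deviation, deviation / stdev(background)

# Toy data: 2 cells x 4 peaks; the feature is present in peak 0 only.
counts = [[4, 0, 1, 3],
          [0, 4, 3, 1]]
expected = expected_counts(counts)
deviation, z_score = corrected_deviation(counts, expected, 0, [0], [[2], [3]])
```

The sketch needs at least two background sets with differing raw deviations, and a feature with a non-zero expected count; a real analysis would use many background iterations.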

Supplementary Figure 2 Background peak set selection.

(a) Raw accessibility deviations for GM12878 cells for peak sets based on GC content. Peaks were grouped into 10 bins from lowest to highest GC content. Two cells are highlighted in orange and purple. (b) Same as (a) except bins defined by average accessibility across the cells. (c) GC content versus average accessibility for a sample of peaks. (d) Same data from (c) after Mahalanobis transformation. Peaks are placed into “bins” based on the values of this transformed data; the grid lines show example bins. (e) For the peak indicated with the yellow diamond, the probability of selecting a peak from within a given bin as its background peak. (f) Similar to (e), but showing the probability of selecting an individual peak as the background peak for the indicated peak in terms of the untransformed space. (g-i) Distribution of GC content per peak for all peaks (grey), the given motif of interest (black), and a background set for that motif (red). (j-l) Distribution of average fragment count per peak (log₁₀) for all peaks (grey), the given motif of interest (black), and a background set for that motif (red).
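The transformation and binning in panels (c) and (d) can be illustrated with a small sketch. It assumes that the Mahalanobis transformation amounts to whitening the two peak covariates (GC content and average accessibility) with a Cholesky factor of their sample covariance, and that bins are equal-width intervals along each transformed axis. Both choices, and the names `whiten_2d` and `grid_bin`, are assumptions made for illustration; the probabilistic selection of background peaks in panels (e) and (f) is not reproduced here.

```python
import math

def whiten_2d(points):
    """Centre 2-D points and rescale them to identity sample covariance."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Cholesky factor of the covariance: [[l11, 0], [l21, l22]].
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    l22 = math.sqrt(syy - l21 ** 2)
    whitened = []
    for x, y in points:
        u = (x - mx) / l11
        v = ((y - my) - l21 * u) / l22  # forward substitution
        whitened.append((u, v))
    return whitened

def grid_bin(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (0-based)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical peaks as (GC content, log10 average accessibility) pairs.
peaks = [(0.40, 1.0), (0.50, 2.0), (0.60, 2.5), (0.45, 3.0), (0.70, 4.0)]
transformed = whiten_2d(peaks)
bins = list(zip(grid_bin([p[0] for p in transformed], 3),
                grid_bin([p[1] for p in transformed], 3)))
```

Each peak ends up with a pair of bin indices, one per transformed dimension, which is the kind of grid cell the caption refers to.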

Supplementary Figure 3 Effect of background peak selection parameters on variability of motifs and control sets.

All panels show effect of varying two parameters involved in background peak selection: the number of bins in each dimension (bs) and the smoothing parameter (w). (a) Variability of all motifs in JASPAR database using different parameter choices, (b) Variability of motif sets chosen to match JASPAR motifs in GC content and average accessibility, (c) Variability of sets of peaks chosen at random from specific quantiles of the GC content and/or average accessibility distributions.

Supplementary Figure 4 Effect of background peak selection parameters on similarity between motif sets and their background peak sets.

All panels show effect of varying two parameters involved in background peak selection: the number of bins in each dimension (bs) and the smoothing parameter (w). (a) Difference in GC content distribution for motif sets and their background peak sets as measured by Kolmogorov-Smirnov (KS) test statistic. (b) Difference in average accessibility (log) distribution for motif sets and their background peak sets as measured by Kolmogorov-Smirnov (KS) test statistic. (c) Fraction of peaks in background peak sets that contain motif. For each panel, all motifs from human core JASPAR database are shown. Peak sets are based on all single cell ATAC-seq data sets shown in Figure 2. Points are colored based on the fraction of all peaks that contain the motif.
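For reference, the two-sample KS statistic used in panels (a) and (b) is the largest absolute gap between two empirical cumulative distribution functions. The following is a minimal sketch in plain Python; the function name and the toy GC-content values are hypothetical.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)|."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    # The gap can only change at observed values, so checking those suffices.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )

# Hypothetical GC content of a motif peak set and of its background set.
motif_gc = [0.1, 0.2, 0.3, 0.4]
background_gc = [0.3, 0.4, 0.5, 0.6]
statistic = ks_statistic(motif_gc, background_gc)
```

A value near 0 means the background set closely matches the motif set in that covariate; a value near 1 means the two distributions barely overlap.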

Supplementary Figure 5 Effects of varying number of background iterations, sequence bias correction, and peak width.

(a) RMSD and normalized RMSD of variability when using a given number of background sets relative to when using 1000 background sets. Normalized RMSD is RMSD divided by the range of the variability when using 1000 background sets. (b) Same as (a) but for normalized deviations instead of variability. (c) Same as (a) but for deviation Z-scores instead of variability. (d) Correlation between Tn5 bias model average versus GC content for peaks used in scATAC-seq analysis. (e) Correlation in bias-corrected deviations per motif using GC content versus Tn5 bias model for background peak selection versus variability (determined using GC content for bias). (f) Variability of motifs across cells using either Tn5 bias model or GC content for background peak selection. (g-h) chromVAR was run using peaks fixed to a width of 100bp, 250bp, 500bp, 750bp, or 1000bp based on windows around MACS2 summits or using the peaks output directly from MACS2 (variable width). (g) Correlation of the bias corrected deviations for the top 50 most variable TF motifs when using different peak width choices. (h) Correlation of the variability for different motifs when using different peak width choices. (i) Average variability for the top 50 most variable motifs when using different peak width choices. For determining the top 50 most variable motifs in (h) and (i), the top 50 from each peak width strategy were included.
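The normalized RMSD defined in panel (a) can be written out directly. This is a minimal sketch with hypothetical function names and toy numbers; `reference` stands in for the values obtained with 1000 background sets.

```python
import math

def rmsd(values, reference):
    """Root-mean-square deviation between paired values."""
    if len(values) != len(reference) or not values:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((v - r) ** 2 for v, r in zip(values, reference))
    return math.sqrt(squared / len(values))

def normalized_rmsd(values, reference):
    """RMSD divided by the range (max - min) of the reference values."""
    return rmsd(values, reference) / (max(reference) - min(reference))

# Toy variability values: a reduced number of background sets vs. the reference.
few_sets = [1.0, 3.0]
reference = [2.0, 4.0]
error = normalized_rmsd(few_sets, reference)
```

Dividing by the range of the reference puts the error on a scale comparable across variability, normalized deviations, and Z-scores.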

Supplementary Figure 6 Screenshot example of interactive data browsing platform enabled with chromVAR outputs.

The plot on the left can be colored based on an annotation chosen from the dropdown menu above it. The plot on the right can be colored based on the normalized deviations of a motif chosen from the table above it. The table may be sorted based on any of the columns or searched for a particular name using the search box. The chromVAR R package includes a function for generating this type of interactive application.

Supplementary Figure 7 chromVAR results are robust to down-sampling and compare favorably to PCA.

(a) Comparison of the distribution of fragments in peaks for scATAC-seq and bulk ATAC-seq data down-sampled to 10,000 fragments. For scATAC-seq, cells from the LMPP and Monocyte populations are shown, with fragments mapped to the same peak set as used in the bulk analysis. (b) Pearson correlation between chromVAR computed variability for down-sampled data versus full data set. Box plot shows results from 10 different down-sampling iterations. (c) Pearson correlation between chromVAR deviation Z-scores for down-sampled data versus full data set. Each point represents the median correlation for a different motif for the 10 downsampling iterations; only motifs in the top 20% of variability in the full data set are shown. Several key regulators are highlighted. This panel is analogous to Figure 1b, except the correlation is shown for the deviation z-scores instead of the bias-corrected deviations. (d) Correlation of bias-corrected deviations between chromVAR applied to data down-sampled to 10,000 reads and applied to full data for each motif versus the variability as determined from the down-sampled data. In the top panel, points (motifs) are colored by their variability in the full data, while in the bottom panel they are colored by the number of peaks containing the motif. (e) Clustering accuracy when clustering based on correlation of chromVAR bias corrected deviations or a variety of alternative peak-based approaches. For all methods, the resulting distance matrix was used as input to hierarchical clustering with complete linkage; 13 clusters were identified by cutting the tree, and the similarity of those clusters to the real labels was determined using normalized mutual information.
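The cluster-versus-label comparison in panel (e) relies on normalized mutual information. The sketch below assumes the common normalization 2·I(A;B) / (H(A) + H(B)); other normalizations exist, and which one was used is not stated in this caption. The function name and the toy labels are hypothetical.

```python
import math
from collections import Counter

def normalized_mutual_information(labels_a, labels_b):
    """NMI between two labelings, normalized as 2*I / (H_a + H_b)."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("labelings must be non-empty and of equal length")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual = sum(
        (nij / n) * math.log(nij * n / (count_a[a] * count_b[b]))
        for (a, b), nij in joint.items()
    )
    entropy_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * math.log(c / n) for c in count_b.values())
    if entropy_a + entropy_b == 0:
        return 1.0  # both labelings put every item in a single group
    return 2 * mutual / (entropy_a + entropy_b)

# Cluster IDs that reproduce the true cell types up to renaming score 1.
true_types = ["LMPP", "LMPP", "Monocyte", "Monocyte"]
clusters = [1, 1, 0, 0]
score = normalized_mutual_information(true_types, clusters)
```

Because the score depends only on co-membership, the arbitrary numbering of clusters produced by cutting a tree does not affect it.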

Supplementary Figure 8 chromVAR variability in single cells.

Variability in chromatin accessibility across single cells shown in Figure 2 for peak sets defined by (a) motifs or (b) 7mers.

Supplementary Figure 9 tSNE for single cell ATAC-seq using chromVAR or peak-based strategies.

(a) tSNE performed for all cell types. (b) tSNE performed only on HL60, LMPP, Monocyte, and AML cells. When using correlation (Pearson or Correlation), one minus the correlation was used as a distance input into Rtsne. When using distance or PCA, counts were first normalized using the total counts in peaks for each cell. When using PCA + Euclidean distance, we used the default settings for Rtsne, which performs PCA and uses the first 50 PCs to compute a distance matrix. For each method, the same perplexity (25) was used.
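The "one minus the correlation" distance mentioned above can be sketched as follows. This only illustrates the distance itself in plain Python, not the Rtsne call; the function names and the toy profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_distance_matrix(profiles):
    """Pairwise distances defined as 1 - Pearson correlation."""
    return [[1.0 - pearson(p, q) for q in profiles] for p in profiles]

# Toy per-cell profiles (for example, deviations for three motifs).
cells = [[1.0, 2.0, 3.0],
         [2.0, 4.0, 6.0],
         [3.0, 2.0, 1.0]]
distances = correlation_distance_matrix(cells)
```

With this definition, perfectly correlated cells sit at distance 0 and perfectly anti-correlated cells at distance 2; profiles with zero variance would need separate handling.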

Supplementary Figure 10 Effects of mismatches on kmer frequency, accessibility, and variability.

(a) Number of peaks containing each kmer 1 mismatch away from ‘AGATAAG’. (b) The average accessibility of peaks containing each kmer 1 mismatch away from ‘AGATAAG’. For (a) and (b), the gray dotted line indicates the value for the ‘AGATAAG’ kmer itself. (c) Variability (top) and shared variability of 7mers with 1 bp mismatch (bottom) for the top 25 most variable kmers for which no kmer differing by a single base pair had greater variability. For each kmer, the shared variability of each kmer with a single base pair mismatch is shown, with the point colored based on the position of the mismatch within the kmer.
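The set of kmers one mismatch away from a seed such as ‘AGATAAG’ is easy to enumerate: a 7mer has 7 positions with 3 alternative bases each, giving 21 neighbors. A minimal sketch follows, with a hypothetical function name; it does not collapse reverse complements, which a real kmer analysis may choose to do.

```python
def one_mismatch_neighbors(kmer, alphabet="ACGT"):
    """Return (position, neighbor) pairs differing from kmer at one base."""
    neighbors = []
    for position, base in enumerate(kmer):
        for alternative in alphabet:
            if alternative != base:
                mutated = kmer[:position] + alternative + kmer[position + 1:]
                neighbors.append((position, mutated))
    return neighbors

neighbors = one_mismatch_neighbors("AGATAAG")
```

Keeping the mismatch position alongside each neighbor mirrors panel (c), where points are colored by the position of the mismatch within the kmer.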

Supplementary Figure 11 De novo motif assembly.

(a) Schematic of de novo motif assembly method. (b) Motifs identified de novo by chromVAR from accessibility deviations for 7mers. Each de novo motif is shown with the closest matching known motif below it. (c) The variability for both the de novo motif and the known motif. (d) Motif similarity score between the de novo motif and the known motif (see Methods). (e) The correlation between the normalized deviations of the de novo motif and the known motif.

Supplementary Figure 12 DNase accessibility variation and footprint at de novo motifs identified by chromVAR

(a) Comparison of accessibility variation between single cell ATAC-seq and DNase data for de novo motifs identified by chromVAR. Each panel corresponds to a different de novo motif and shows the DNase deviation Z-score along the x-axis and the median scATAC-seq deviation Z-score along the y-axis. (b) DNase footprints at de novo motifs identified by chromVAR. Each panel shows the DNase profile aggregated around all de novo motifs in the cell type with the highest accessibility deviation.

Supplementary Figure 13 chromVAR identifies relevant TFs distinguishing cell types based on scATAC-seq generated via combinatorial indexing protocol.

(a) Difference in bias corrected deviations for motifs between GM12878 and HEK293T cells versus the p-value for the difference. Some of the most significant motifs are labelled. Identification of cells as being of one cell type or another was based on the labels previously inferred (Cusanovich et al. 2015). (b) Difference in variability of chromatin accessibility associated with motifs between GM12878 and HEK293T cells. (c) tSNE visualization of cells using chromVAR as the input. In the top panel, cells are labelled based on the cell type assignments previously inferred (Cusanovich et al. 2015). In the middle and bottom panel, cells are colored based on the deviation z-scores for IRF8 and RELA, respectively, as examples of motifs with greater accessibility (IRF8 and RELA) and greater variability (RELA) in GM12878 cells.

Supplementary Figure 14 chromVAR identifies TF motifs associated with chromatin accessibility variation between macrophages resident in different tissues.

ATAC-seq data from Lavin et al. Sample labels for macrophages are based on the tissue of origin, while Monocytes and Neutrophils are labelled based on their cell type. (a) Bias corrected deviations for variable TFs using data down-sampled to approximately 10,000 fragments per sample. (b) Bias corrected deviations for variable TFs using full data. For both panels, TFs were chosen based on the variability using the down-sampled data, with highly correlated motifs omitted. The row and column ordering were based on clustering of the motifs and samples using the down-sampled data.

Supplementary Figure 15 chromVAR identifies TF motifs associated with chromatin accessibility variation across different tissues profiled using DNase-seq by the Roadmap Epigenomics Project.

Motifs are based on the collection published by the ENCODE consortium and used for analysis by the Roadmap Epigenomics Project. (a) Bias corrected deviations for variable TFs using data down-sampled to approximately 10,000 fragments per sample. (b) Bias corrected deviations for variable TFs using full data. (c) Enrichment scores computed by the Roadmap Epigenomics Consortium for each cell type based on the enrichment of motifs in various clusters of enhancers and the activity of each cluster in each tissue. For all panels, TFs were chosen based on the variability using the down-sampled data, with highly correlated motifs omitted. The row and column ordering were based on clustering of the motifs and samples using the down-sampled data.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–15 and Supplementary Notes 1–4.

Life Sciences Reporting Summary

Life Sciences Reporting Summary.

Supplementary Software 1

chromVAR software. See www.github.com/GreenleafLab/chromVAR for version updates.

About this article

Cite this article

Schep, A., Wu, B., Buenrostro, J. et al. chromVAR: inferring transcription-factor-associated accessibility from single-cell epigenomic data. Nat Methods 14, 975–978 (2017). https://doi.org/10.1038/nmeth.4401

Received: 21 February 2017
Accepted: 21 July 2017
Published: 21 August 2017
Issue Date: 01 October 2017
DOI: https://doi.org/10.1038/nmeth.4401
