Abstract

We present cisTopic, a probabilistic framework used to simultaneously discover coaccessible enhancers and stable cell states from sparse single-cell epigenomics data (http://github.com/aertslab/cistopic). Using a compendium of single-cell ATAC-seq datasets from differentiating hematopoietic cells, brain and transcription factor perturbations, we demonstrate that topic modeling can be exploited for robust identification of cell types, enhancers and relevant transcription factors. cisTopic provides insight into the mechanisms underlying regulatory heterogeneity in cell populations.

Data availability

The data generated for this study have been deposited in NCBI’s Gene Expression Omnibus and are accessible through GEO Series accession number GSE114557.

Code availability

cisTopic is available as an R package at http://github.com/aertslab/cistopic.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Fiers, M. W. E. J. et al. Brief Funct. Genomics 17, 246–254 (2018).
  2. Cusanovich, D. A. et al. Cell 174, 1309–1324 (2018).
  3. Cusanovich, D. A. et al. Science 348, 910–914 (2015).
  4. Zamanighomi, M. et al. Nat. Commun. 9, 2410 (2018).
  5. Schep, A. N., Wu, B., Buenrostro, J. D. & Greenleaf, W. J. Nat. Methods 14, 975–978 (2017).
  6. de Boer, C. G. & Regev, A. BMC Bioinformatics 19, 253 (2018).
  7. Ji, Z., Zhou, W. & Ji, H. Bioinformatics 33, 2930–2932 (2017).
  8. Blei, D. M., Ng, A. Y. & Jordan, M. I. J. Mach. Learn. Res. 3, 993–1022 (2003).
  9. Griffiths, T. L. & Steyvers, M. Proc. Natl Acad. Sci. USA 101, 5228–5235 (2004).
  10. Buenrostro, J. D. et al. Cell 173, 1535–1548 (2018).
  11. Vilagos, B. et al. J. Exp. Med. 209, 775–792 (2012).
  12. Cisse, B. et al. Cell 135, 37–48 (2008).
  13. Gupta, P., Gurudutta, G. U., Saluja, D. & Tripathi, R. P. J. Cell. Mol. Med. 13, 4349–4363 (2009).
  14. Elagib, K. E. Blood 101, 4333–4341 (2003).
  15. Nottingham, W. T. et al. Blood 110, 4188–4197 (2007).
  16. Lake, B. B. et al. Nat. Biotechnol. 36, 70–80 (2017).
  17. Preissl, S. et al. Nat. Neurosci. 21, 432–439 (2018).
  18. Aibar, S. et al. Nat. Methods 14, 1083–1086 (2017).
  19. Gray, L. T. et al. eLife 6, e21883 (2017).
  20. Verfaillie, A. et al. Nat. Commun. 6, 6683 (2015).
  21. Gembarska, A. et al. Nat. Med. 18, 1239–1247 (2012).
  22. Bernd, A. et al. Melanoma Res. 4, 287–291 (1994).
  23. Iozumi, K., Hoganson, G. E., Pennella, R., Everett, M. A. & Fuller, B. B. J. Invest. Dermatol. 100, 806–811 (1993).
  24. Buac, K. et al. Pigment Cell Melanoma Res. 22, 773–784 (2011).
  25. Laurette, P. et al. eLife 4, e06857 (2015).
  26. Corces, M. R. et al. Nat. Genet. 48, 1193–1203 (2016).
  27. Kundaje, A. et al. mod/mouse/humanENCODE: blacklisted genomic regions for functional genomics analysis. The Kundaje Lab https://sites.google.com/site/anshulkundaje/projects/blacklists (2014).
  28. Chang, J. lda: collapsed Gibbs sampling methods for topic models. R package version 1.2.3 http://CRAN.R-project.org/package=lda (2015).
  29. Krijthe, J. & van der Maaten, L. Package ‘Rtsne’. R package version 0.13 https://github.com/jkrijthe/Rtsne (2017).
  30. McInnes, L. & Healy, J. arXiv Preprint at https://arxiv.org/abs/1802.03426v1 (2018).
  31. Angerer, P. et al. Bioinformatics 32, 1241–1243 (2016).
  32. Yu, G., Wang, L.-G. & He, Q.-Y. Bioinformatics 31, 2382–2383 (2015).
  33. Gu, Z. rGREAT: client for GREAT analysis. R package version 3.7 https://github.com/jokergoo/rGREAT (2018).
  34. Imrichová, H., Hulselmans, G., Kalender Atak, Z., Potier, D. & Aerts, S. Nucleic Acids Res. 43, W57–W64 (2015).
  35. Portales-Casamar, E. et al. Nucleic Acids Res. 38, D105–D110 (2010).
  36. Weirauch, M. T. et al. Cell 158, 1431–1443 (2014).
  37. Kulakovskiy, I. V. et al. Nucleic Acids Res. 46, D252–D259 (2018).
  38. Janky, R. et al. PLoS Comput. Biol. 10, e1003731 (2014).
  39. Grün, B. & Hornik, K. J. Stat. Softw. 40, 1–30 (2011).
  40. Taddy, M. On estimation and selection for topic models. In Proc. 15th International Conference on Artificial Intelligence and Statistics (eds. Lawrence, N. D. & Girolami, M.) 1184–1193 (PMLR, 2012).
  41. Pliner, H. A. et al. Mol. Cell 71, 858–871 (2018).
  42. Langfelder, P., Zhang, B. & Horvath, S. Bioinformatics 24, 719–720 (2008).
  43. Finak, G. et al. Genome Biol. 16, 278 (2015).
  44. Herrmann, C., Van De Sande, B., Potier, D. & Aerts, S. Nucleic Acids Res. 40, e114 (2012).
  45. Heinz, S. et al. Mol. Cell 38, 576–589 (2010).
  46. Thomas-Chollier, M. et al. Nat. Protoc. 6, 1860–1869 (2011).
  47. Thomas-Chollier, M. et al. Nucleic Acids Res. 40, e31 (2012).
  48. Frith, M. C., Li, M. C. & Weng, Z. Nucleic Acids Res. 31, 3666–3668 (2003).
  49. Mahony, S. & Benos, P. V. Nucleic Acids Res. 35, W253–W258 (2007).
  50. Luo, C. et al. Science 357, 600–604 (2017).
  51. Edgar, R. C. Nucleic Acids Res. 32, 1792–1797 (2004).
  52. Aerts, S. et al. Nucleic Acids Res. 31, 1753–1764 (2003).
  53. Aerts, S. et al. Nucleic Acids Res. 33, W393–W396 (2005).
  54. Corces, M. R. et al. Nat. Methods 14, 959–962 (2017).
  55. Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J. Nat. Methods 10, 1213–1218 (2013).
  56. Buenrostro, J. D. et al. Nature 523, 486–490 (2015).
  57. Mo, A. et al. Neuron 86, 1369–1384 (2015).
  58. Su, Y. et al. Nat. Neurosci. 20, 476–483 (2017).
  59. Habib, N. et al. Nat. Methods 14, 955–958 (2017).
  60. Kozlenkov, A. et al. Sci. Adv. 4, eaau6190 (2018).

Acknowledgements

This work was funded by an ERC Consolidator Grant to S. Aerts (no. 724226_cis-CONTROL), and by the KU Leuven (grant no. C14/18/092 to S. Aerts), the Harry J. Lloyd Charitable Trust, the Foundation Against Cancer (grant no. 2016-070 to S. Aerts), the FWO (grant no. G0B5619N), PhD fellowships from the FWO (C.B.G.-B., no. 11F1519N; L.M., no. 1S03317N; D.P., no. 1S75219N) and a postdoctoral research fellowship from Kom op tegen Kanker (Stand up to Cancer), the Flemish Cancer Society (J.W.). Computing was performed at the Vlaams Supercomputer Center. Single-cell infrastructure was funded by the Hercules Foundation (grant no. AKUL/13/41). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. The authors thank J.-C. Marine, F. Rambow and M. Dewaele for helpful discussions; and B. Lake and K. Zhang for the information provided regarding the human brain data. The authors also thank members of various groups that make curated position weight matrices publicly available, including T. Hughes (cis-bp), M. Bulyk (Uniprobe), A. Mathelier (Jaspar), V. Makeev (Hocomoco) and many others.

Author information

Author notes

  1. These authors contributed equally: C. Bravo González-Blas, L. Minnoye.

Affiliations

  1. VIB Center for Brain & Disease Research, Leuven, Belgium

    • Carmen Bravo González-Blas
    • Liesbeth Minnoye
    • Dafni Papasokrati
    • Sara Aibar
    • Gert Hulselmans
    • Valerie Christiaens
    • Kristofer Davie
    • Jasper Wouters
    • Stein Aerts

  2. Department of Human Genetics, KU Leuven, Leuven, Belgium

    • Carmen Bravo González-Blas
    • Liesbeth Minnoye
    • Dafni Papasokrati
    • Sara Aibar
    • Gert Hulselmans
    • Valerie Christiaens
    • Kristofer Davie
    • Jasper Wouters
    • Stein Aerts

Contributions

S. Aerts, C.B.G.-B. and L.M. conceived the study. C.B.G.-B. developed cisTopic and implemented the R package. L.M. performed the experimental work with the help of V.C., K.D. and J.W. C.B.G.-B. and L.M. analyzed the data with the help of D.P., S. Aibar, G.H. and K.D. C.B.G.-B., L.M. and S. Aerts wrote the manuscript.

Competing interests

The authors declare no competing interests.

Corresponding author

Correspondence to Stein Aerts.

Integrated supplementary information

  1. Supplementary Figure 1 Comparison of parameter estimation methods for latent Dirichlet allocation.

    Comparison of parameter estimation (region–topic and topic–cell distributions) algorithms for LDA, namely, collapsed Gibbs sampling (as implemented by Chang (http://CRAN.R-project.org/package=lda) or Grün and Hornik (https://doi.org/10.18637/jss.v040.i13)), variational expectation maximization (VEM; https://doi.org/10.18637/jss.v040.i13) and maximum a posteriori (MAP; Proc. Fifteenth International Conference on Artificial Intelligence and Statistics (eds. Lawrence, N. D. & Girolami, M.) Vol 22, 1184–1193 (PMLR, 2012)), based on their capability to cluster simulated single cells from bulk ATAC-seq data from hematopoietic populations (Nat. Genet. 48, 1193–1203 (2016)). Fifty single cells per bulk profile were simulated, using a different read depth in each experiment (50,000, 10,000 or 3,000 reads per cell), resulting in 650 single-cell profiles per experiment. a, t-SNE based on topic–cell contributions using each of the mentioned parameter estimation methods. b, Adjusted Rand index (ARI) based on cell clustering on the t-SNE projections of the topic–cell contributions. c, Running time in hours of the different parameter-estimation approaches in each experiment. Collapsed Gibbs sampling (as implemented by Chang; http://CRAN.R-project.org/package=lda) is the most accurate, most robust and fastest approach for the estimation of topic distributions. *Models with 50 and 100 topics could not be obtained because of memory issues. **Model with 100 topics could not be obtained because of memory issues.

  2. Supplementary Figure 2 Benchmark of single-cell ATAC-seq analysis methods using simulated single cells from bulk ATAC-seq profiles from the hematopoietic system.

    Comparison of scATAC-seq analysis methods, namely, BROCKMAN (BMC Bioinformatics 19, 253 (2018)), chromVAR (Nat. Methods 14, 975–978 (2017)), SCRAT (Bioinformatics 33, 2930–2932 (2017)), latent semantic indexing (Science 348, 910–914 (2015); Cell 174, 1309–1324 (2018); Nature 555, 538–542 (2018)), scABC (Nat. Commun. 9, 2410 (2018)), Cicero (Mol. Cell 71, 858–871 (2018)) and cisTopic (this work), based on their capability to cluster simulated single cells from bulk ATAC-seq data from hematopoietic populations (Nat. Genet. 48, 1193–1203 (2016)). Fifty single cells per bulk profile were simulated, using a different read depth in each experiment (50,000, 10,000 or 3,000 reads per cell), resulting in 650 single-cell profiles per experiment. a, t-SNE coordinates are provided by the method itself (BROCKMAN) or obtained by running t-SNE on the deviation scores for chromVAR, the LSI matrix for LSI, the cell-to-landmark correlation matrix for scABC, the enrichment of co-regulated DNase I hypersensitive sites from ENCODE for SCRAT, the Z-score per chromatin hub for Cicero and the topic–cell distributions for cisTopic. b, Adjusted Rand index (ARI), based on clustering on t-SNE projections, for each method in each simulation. c, Model selection and convergence curves for cisTopic in each simulation. Models with 21, 14 and 12 topics were selected per experiment based on the highest log-likelihood in the last iteration per model (top), and the convergence of the Markov chain Monte Carlo (MCMC) during sampling (after a burn-in) is assessed based on the stabilization of the log-likelihood (bottom). For the method comparison (a,b), scABC and Cicero were run with minor adaptations to their original workflow (see Methods).

  3. Supplementary Figure 3 cisTopic on single-cell ATAC-seq data simulated from bulk ATAC-seq profiles from the hematopoietic system.

    cisTopic accurately clusters single cells and simultaneously reveals bona fide regulatory topics that recapitulate the patterns observed in the bulk profiles of origin when using 650 simulated single cells with high coverage (50,000 reads per cell) from bulk ATAC-seq data from hematopoietic populations (Nat. Genet. 48, 1193–1203 (2016)). a, Heat map based on the normalized topic contributions per cell. cisTopic correctly clusters cells by cell type, while simultaneously identifying general topics, branch-specific topics and cell-state-specific topics. b, Enrichment of the different region classes within the topics. General topics are enriched for promoter regions. c, Enrichment of ATAC-seq signatures derived from the bulk FACS-sorted cell types. The bulk ATAC-seq peaks match with the topics. d, Coverage of the bulk ATAC-seq profiles on the top regions of representative topics of each class. General topic regions are accessible across all cell types, while branch-specific and cell-type-specific regions are uniquely accessible in the cell types expected from the topic contributions on the single cells. e, Average topic contribution per cell type. cisTopic identified branch-specific topics for which motif enrichment reveals known master regulators in the corresponding lineage. Only the top motif per topic and its normalized enrichment score (NES) are shown. f, Coverage heat maps of bulk ATAC-seq data per cell type to validate the predicted regions per topic. Top regions per topic are represented between dashed horizontal lines, ordered by descending topic score. Enrichment patterns match with those predicted with cisTopic from the simulated single cells.

  4. Supplementary Figure 4 Benchmark of single-cell ATAC-seq analysis methods using simulated single cells from bulk H3K27Ac profiles from melanoma cell lines.

    Comparison of scATAC-seq analysis methods, namely, BROCKMAN (BMC Bioinformatics 19, 253 (2018)), chromVAR (Nat. Methods 14, 975–978 (2017)), SCRAT (Bioinformatics 33, 2930–2932 (2017)), latent semantic indexing (Science 348, 910–914 (2015); Cell 174, 1309–1324 (2018); Nature 555, 538–542 (2018)), scABC (Nat. Commun. 9, 2410 (2018)), Cicero (Mol. Cell 71, 858–871 (2018)) and cisTopic (this work), based on their capability to cluster simulated single cells from 14 bulk H3K27Ac profiles from different melanoma cell lines using different coverages. In each experiment, 50 cells were simulated per bulk profile, resulting in 700 single cells per experiment. a, t-SNE coordinates are provided by the method itself (BROCKMAN), or obtained by running t-SNE on the deviation scores for chromVAR, the LSI matrix for LSI, the cell-to-landmark correlation matrix for scABC, the enrichment of co-regulated DNase I hypersensitive sites from ENCODE for SCRAT, the Z-score per chromatin hub for Cicero and the topic–cell distributions for cisTopic. b, Adjusted Rand index (ARI), based on clustering on t-SNE projections, for each method in each simulation. c, Heat map based on the normalized topic contributions per cell. cisTopic correctly clusters cells by cell type, while simultaneously identifying general topics, phenotypic topics (mesenchymal-like and melanocyte-like), cell-line-specific topics and low-contribution topics. d, Coverage heat maps with the bulk H3K27Ac ChIP-seq data to validate the predicted regions per topic. Top regions per topic are represented between dashed horizontal lines, ordered by descending topic score. Topic regions show the expected patterns in the bulk data (expected patterns are surrounded by squares). Key motifs found enriched in the phenotype-specific regions by RcisTarget are shown (right), revealing SOX and E-box as key motifs for melanocyte regions and AP-1 and TEAD motifs for mesenchymal-like regions, as validated by literature (Nat. Commun. 6, 6683 (2015)). e, Enrichment of bulk-specific signatures of each melanoma cell line within the topics. The enrichment patterns of the topics within the bulk H3K27Ac profiles match those found in the simulated cells. f, Enrichment of the different region classes within the topics. General topics are enriched for promoter regions. scABC and Cicero were run with minor adaptations to their original workflow (see Methods).

  5. Supplementary Figure 5 Benchmark of single-cell ATAC-seq analysis methods for finding rare subpopulations using simulated single cells from bulk H3K27Ac profiles from melanoma cell lines.

    Simulated single-cell epigenomes were generated from 14 bulk H3K27Ac profiles from different melanoma cell lines using different coverages. In each experiment, 50 cells from each cell line were simulated except for three (A375, MM001, MM099), for which only 5 cells were simulated, acting as rare subpopulations. In total, each experiment includes 565 simulated cells. a, t-SNE coordinates are provided by the method itself (BROCKMAN) or obtained by running t-SNE on the deviation scores for chromVAR, the LSI matrix for LSI, the cell-to-landmark correlation matrix for scABC, the enrichment of co-regulated DNase I hypersensitive sites from ENCODE for SCRAT, the Z-score per chromatin hub for Cicero and the topic–cell distributions for cisTopic. b, Adjusted Rand index (ARI), based on clustering on t-SNE projections, for each method in each simulation. c, Precision of the different methods tested for identifying the rare subpopulations, based on cell clustering on the t-SNE projections. d, Recall of the different methods tested for identifying the rare subpopulations, based on cell clustering on the t-SNE projections. scABC and Cicero were run with minor adaptations to their original workflow (see Methods).

  6. Supplementary Figure 6 cisTopic analysis of single-cell DNA methylation data from the human brain.

    cisTopic analysis of single-cell DNA methylation data (snmC-seq) from the human brain with 2,764 neurons (Science 357, 600–604 (2017)). a, Cell t-SNE by cisTopic, based on the topic–cell contributions, clusters the neurons into their respective class (i.e., inhibitory and excitatory neurons) and subtypes, as annotated by the original authors. b, cisTopic is robust to biases related to the number of counts, overall methylation level and GC content.

  7. Supplementary Figure 7 cisTopic analysis of the sciATAC-seq Mouse Cell Atlas.

    cisTopic analysis of the Mouse Cell Atlas dataset, which contains 80,254 cells and 436,206 genomic regions. a,b, t-SNE based on cisTopic's topic–cell contributions coloured by (a) tissue of origin and (b) cell type. cisTopic clusters cells by cell type, overcoming tissue-specific batch effects. c, Cell normalized topic scores for representative topics from brain-related (top) and immune-system-related (bottom) cell types, accompanied by top enriched motifs (and NES) per topic. For example, the oligodendrocyte topic (top-left) is enriched for Sox motifs, while an excitatory neuron topic (top-center) is enriched for Mef and Egr motifs (among others) and an inhibitory neuron topic (top-right) is enriched for Maf motifs. Three other examples are included from topics enriched in immune cell types, which show enrichment in Pu.1, Ets-like and Irf motifs.

  8. Supplementary Figure 8 cisTopic reconstructs the chromatin accessibility landscape during hematopoietic differentiation.

    Comparison of scATAC-seq analysis methods, namely BROCKMAN (BMC Bioinformatics 19, 253 (2018)), chromVAR (Nat. Methods 14, 975–978 (2017)), SCRAT (Bioinformatics 33, 2930–2932 (2017)), latent semantic indexing (Science 348, 910–914 (2015); Cell 174, 1309–1324 (2018); Nature 555, 538–542 (2018)), scABC (Nat. Commun. 9, 2410 (2018)), Cicero (Mol. Cell 71, 858–871 (2018)) and cisTopic (this work), based on their capability to detect FACS-sorted cell types from 2,755 single-cell ATAC-seq profiles from the hematopoietic lineage from ref. 10. a, t-SNE coordinates are provided by the method itself (BROCKMAN and Cicero) or obtained by running t-SNE on the deviation scores for chromVAR, the LSI matrix for LSI, the cell-to-landmark correlation matrix for scABC, the enrichment of co-regulated DNase I hypersensitive sites from ENCODE for SCRAT and the topic–cell distributions for cisTopic. b, Based on the highest log-likelihood in the last iteration, a model with 17 topics is selected. c, The convergence of the Markov chain Monte Carlo (MCMC) during sampling (after a burn-in) is assessed based on the stabilization of the log-likelihood. d, Topic 7 is enriched in a patient-specific subset of hematopoietic stem cells, linked to AP-1 factors (top motif with normalized enrichment score (NES) is shown). Removal of this topic, followed by t-SNE on the remaining topic–cell contributions, corrects for the patient batch effect. e, Example of 4 of the 17 topics found by the analysis of FACS-sorted populations from the hematopoietic system. Top: t-SNE based on topic–cell distributions colored by the normalized topic contribution in each cell. Middle: t-SNE based on the region–topic distributions colored by the topic normalized region score. Bottom: top enriched motifs in each topic with normalized enrichment score (NES). f, cisTopic does not show biases to fraction of reads in peaks, number of counts, number of regions or GC content. The higher GC content in regions linked to promoters is related to the biology of promoters.

  9. Supplementary Figure 9 cisTopic reveals cell-state-specific transcription factor cistromes.

    cisTopic reveals cell-state-specific GATA cistromes across the different stages of differentiation from hematopoietic stem cells (HSC) to megakaryocyte-erythroid progenitors (MEP) in a dataset with 2,755 FACS-sorted single-cell profiles from hematopoietic populations (ref. 10). a, Three different topics, enriched in different stages from HSC to MEP, show enrichment for GATA motifs and GATA2 binding sites (as shown by enrichment of GATA2 ChIP-seq regions (GEO GSE32465) in the topics). Top: cisTopic cell-t-SNE based on topic–cell distributions, colored by cell normalized topic score. Middle: cisTopic region-t-SNE based on region–topic distributions, colored by topic normalized region score. Bottom: top enriched motifs in each of the topics, accompanied by their normalized enrichment score (NES). Coloring of GATA2 ChIP-seq regions in the region-t-SNE matches with the GATA-enriched topics. b, Differential motif enrichment between the GATA-enriched regions in each of these topics reveals cell-state-specific GATA cofactors, such as PU.1 in HSC and RUNX during the differentiation (refs. 13–15). c, While cisTopic reveals cell-state-specific GATA cistromes, region-based tools such as chromVAR, which first groups regions into cistromes based on motif enrichment, define a unique GATA cistrome pattern.

  10. Supplementary Figure 10 cisTopic and SCENIC analysis on scTHS-seq and snDrop-seq data from the human brain.

    a, cisTopic analysis on a dataset with 34,520 scTHS-seq profiles from different areas of the human brain resulted in 23 cis-regulatory topics as selected based on the highest log-likelihood in the last iteration. b, The convergence of the Markov chain Monte Carlo (MCMC) during sampling (after a burn-in) is assessed based on the stabilization of the log-likelihood. c, Correspondence between cell-type-specific topics and cell-type-specific signatures derived by the authors using a gradient-boosted regression model in combination with scRNA-seq. d, SCENIC (ref. 18) regulon-based t-SNE based on 15,884 single-cell RNA-seq profiles from the same tissue. e, Correspondence between topics and SCENIC regulons. Genes were linked to their closest regions (see Methods), and the resulting region-sets were used as input for AUCell together with the topic rankings. The motifs shown are found both in the matching regulon and the topic (based on gene-based and region-based RcisTarget, respectively).

  11. Supplementary Figure 11 cisTopic analysis on scATAC-seq data from the mouse prefrontal cortex.

    a, cisTopic analysis on a dataset with 3,034 cells from the mouse prefrontal cortex resulted in 23 cis-regulatory topics as selected based on the highest log-likelihood in the last iteration. b, The convergence of the Markov chain Monte Carlo (MCMC) during sampling (after a burn-in) is assessed based on the stabilization of the log-likelihood. c, cisTopic cell clustering is not biased by the number of peaks, promoter coverage or number of reads in peaks. d, Correspondence of mouse neuronal topics with previously published cell-type-specific signatures from interneurons (Neuron 86, 1369–1384 (2015)), cortical layers (eLife 6, e21883 (2017)), and dentate gyrus (Nat. Neurosci. 20, 476–483 (2017)). f, Excitatory neuron (18,799 cells) scRNA-seq data retrieved as a .loom file from mousebrain.org, and visualized using Scope (http://scope.aertslab.org). Expression of Egr4, Rorb and Fezf2 is marked for each of the cortical layers (ExL23, ExL4 and ExL56, respectively).

  12. Supplementary Figure 12 cisTopic reveals conservation between oligodendrocyte and astrocyte topics and their regulatory code.

    a, cisTopic t-SNEs (cell- and region-based) for the analyses on human (34,520 cells, 287,381 regions) and mouse (3,034 cells, 139,504 regions) brains, colored by species topic score, cross-species topic score, transcriptome signature enrichment (with gene-to-region mapping based on linking the gene to the closest region) and enrichment of cell-type-specific methylation signatures, which together validate the oligodendrocyte and astrocyte topics. b, Example of conservation of enhancer architectures between conserved regions. In oligodendrocytes, SOX and E-box motifs are conserved in position in the example region (probably representing SOX10 and OLIG1/2 (Trends Neurosci. 28, 583–588 (2005); Cell 152, 248–261 (2013))), while SOX and NFI are in the astrocyte example region (Neuron 74, 79–94 (2012); J. Neurosci. 37, 4493–4507 (2017)).

  13. Supplementary Figure 13 Validation of scATAC-seq data on SOX10 knockdown.

    Aggregate (scATAC) and bulk (OmniATAC) ATAC-seq profiles on two melanoma cell lines over different time points after SOX10 KD. Profiles of bulk ATAC-seq and aggregated scATAC-seq per condition resemble each other in the same condition. The region highlighted in gray corresponds to a SOX10 target site (Pigment Cell Melanoma Res. 22, 773–784 (2011)) that decreases in accessibility after SOX10 KD.

  14. Supplementary Figure 14 Identification of cell-type-specific SOX cofactors.

    a, Overlap of melanoma, oligodendrocyte and astrocyte SOX cistromes, derived from the human brain analysis (ref. 16) and the SOX10 KD in melanoma cell lines. b, GO terms enriched in the melanoma, oligodendrocyte and astrocyte SOX topics, based on GREAT analysis, which applies a one-sided binomial test over the regions per annotation for assessing their significance (Nat. Biotechnol. 28, 495–501 (2010)). c, Ternary plot showing enrichment scores of motifs found in the melanoma, oligodendrocyte and astrocyte SOX regions. Each corner represents a cell-type-specific SOX topic, dots represent enriched motifs, and axes represent average enrichment scores for each topic. The colors of the dots are used to indicate which motifs belong to the same transcription factor (based on STAMP clustering (Nucleic Acids Res. 35, W253–W258 (2007))).

  15. Supplementary Figure 15 cisTopic predicts drop-outs on scATAC-seq data.

    Left: Heat map based on the binary accessibility matrix from a low-coverage simulation using the bulk ATAC-seq profiles from ref. 26. Middle: Heat map colored based on the region probability derived from the multiplication of the topic–cell and the region–topic distributions obtained by the low-coverage simulation. Right: Heat map based on the binary accessibility matrix from the high-coverage simulation using the bulk ATAC-seq profiles from ref. 26. In the three heat maps, regions are clustered based on the predictive distribution values (from the low-coverage simulation).
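
The region probability described for Supplementary Figure 15 (the product of the region–topic and topic–cell distributions) can be illustrated with a minimal Python sketch. This is not the cisTopic implementation or its API: the function names, the matrix layout, the toy values and the 0.4 threshold are all assumptions made for illustration.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def predictive_distribution(region_topic: Matrix, topic_cell: Matrix) -> Matrix:
    """Return P(region | cell) = sum over topics of P(region | topic) * P(topic | cell).

    Assumed layout for this sketch: region_topic[r][k] is P(region r | topic k)
    and topic_cell[k][c] is P(topic k | cell c).
    """
    n_topics = len(topic_cell)
    n_cells = len(topic_cell[0])
    return [
        [sum(row[k] * topic_cell[k][c] for k in range(n_topics)) for c in range(n_cells)]
        for row in region_topic
    ]


def likely_dropouts(binary: List[List[int]], predicted: Matrix,
                    threshold: float) -> List[Tuple[int, int]]:
    """List (region, cell) pairs observed as closed (0) but predicted as probable.

    The threshold is a hypothetical cut-off chosen for the toy data below.
    """
    return [
        (r, c)
        for r, row in enumerate(binary)
        for c, observed in enumerate(row)
        if observed == 0 and predicted[r][c] >= threshold
    ]


# Toy model: 4 regions, 2 topics, 3 cells (hypothetical values).
region_topic = [
    [0.5, 0.0],
    [0.5, 0.0],
    [0.0, 0.5],
    [0.0, 0.5],
]
topic_cell = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
]
# Observed binary accessibility; region 1 is missing in cell 0.
binary = [
    [1, 0, 1],
    [0, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
]

pred = predictive_distribution(region_topic, topic_cell)
dropouts = likely_dropouts(binary, pred, threshold=0.4)
print(pred)
print(dropouts)
```

In this toy case, cell 0 belongs entirely to the first topic, so both regions of that topic receive a high probability even though only one of them was observed as accessible. The unobserved one is flagged as a candidate drop-out.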

Supplementary information

  1. Supplementary Information

    Supplementary Figs. 1–15, Supplementary Notes 1 and 2, and Supplementary Tables 1 and 2

  2. Reporting Summary

About this article

DOI

https://doi.org/10.1038/s41592-019-0367-1
