Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling

Abstract

Single-cell RNA sequencing has enabled the decomposition of complex tissues into functionally distinct cell types. Often, investigators wish to assign cells to cell types through unsupervised clustering followed by manual annotation or via ‘mapping’ to existing data. However, manual interpretation scales poorly to large datasets, mapping approaches require purified or pre-annotated data and both are prone to batch effects. To overcome these issues, we present CellAssign, a probabilistic model that leverages prior knowledge of cell-type marker genes to annotate single-cell RNA sequencing data into predefined or de novo cell types. CellAssign automates the process of assigning cells in a highly scalable manner across large datasets while controlling for batch and sample effects. We demonstrate the advantages of CellAssign through extensive simulations and analysis of tumor microenvironment composition in high-grade serous ovarian cancer and follicular lymphoma.

Fig. 1: Overview of CellAssign.
Fig. 2: Performance of CellAssign on simulated data.
Fig. 3: CellAssign infers the composition of the HGSC microenvironment.
Fig. 4: CellAssign infers the composition of the follicular lymphoma microenvironment.
Fig. 5: Temporal changes in nonmalignant cells in the follicular lymphoma microenvironment.

Data availability

Raw sequencing data for all experiments in this paper are available from the European Genome-phenome Archive (accession no. EGAD00001004585).

Code availability

CellAssign is available as an R package at www.github.com/irrationone/cellassign.

References

  1. Schaum, N. et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 562, 367–372 (2018).

  2. Kiselev, V. Y. et al. SC3: consensus clustering of single-cell RNA-seq data. Nat. Methods 14, 483–486 (2017).

  3. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).

  4. Žurauskienė, J. & Yau, C. pcaReduce: hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics 17, 140 (2016).

  5. Levine, J. H. et al. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell 162, 184–197 (2015).

  6. Duò, A., Robinson, M. D. & Soneson, C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Res. 7, 1141 (2018).

  7. Freytag, S., Tian, L., Lönnstedt, I., Ng, M. & Bahlo, M. Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data. F1000Res. 7, 1297 (2018).

  8. Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 273–282 (2019).

  9. Zhang, X. et al. CellMarker: a manually curated resource of cell markers in human and mouse. Nucleic Acids Res. 47, D721–D728 (2019).

  10. Kiselev, V. Y., Yiu, A. & Hemberg, M. scmap: projection of single-cell RNA-seq data across data sets. Nat. Methods 15, 359–362 (2018).

  11. Hicks, S. C., Townes, F. W., Teng, M. & Irizarry, R. A. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 19, 562–578 (2018).

  12. Koh, P. W. et al. An atlas of transcriptional, chromatin accessibility, and surface marker changes in human mesoderm development. Sci. Data 3, 160109 (2016).

  13. Tian, L. et al. scRNA-seq mixology: towards better benchmarking of single cell RNA-seq protocols and analysis methods. Preprint at bioRxiv https://doi.org/10.1101/433102 (2018).

  14. Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).

  15. Ding, J., Shah, S. & Condon, A. densityCut: an efficient and versatile topological approach for automatic clustering of biological data. Bioinformatics 32, 2567–2576 (2016).

  16. Langfelder, P., Zhang, B. & Horvath, S. Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics 24, 719–720 (2008).

  17. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).

  18. Zhang, Z. et al. SCINA: A semi-supervised subtyping algorithm of single cells and bulk samples. Genes 10, 531 (2019).

  19. MacParland, S. A. et al. Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations. Nat. Commun. 9, 4383 (2018).

  20. McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at https://arxiv.org/pdf/1802.03426.pdf (2018).

  21. Zhang, A. W. et al. Interfaces of malignant and immunologic clonal dynamics in ovarian cancer. Cell 173, 1755–1769.e22 (2018).

  22. Kristiansen, G. et al. CD24 is expressed in ovarian cancer and is a new independent prognostic marker of patient survival. Am. J. Pathol. 161, 1215–1221 (2002).

  23. Hylander, B. et al. Expression of Wilms tumor gene (WT1) in epithelial ovarian cancer. Gynecol. Oncol. 101, 12–17 (2006).

  24. Andor, N. et al. Single-cell RNA-Seq of lymphoma cancers reveals malignant B-cell types and coexpression of T-cell immune checkpoints. Blood 133, 1119–1129 (2019).

  25. Jefferis, R. & Lefranc, M.-P. Human immunoglobulin allotypes: possible implications for immunogenicity. MAbs 1, 332–338 (2009).

  26. Ding, J., Condon, A. & Shah, S. P. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat. Commun. 9, 2002 (2018).

  27. Hermine, O. et al. Prognostic significance of bcl-2 protein expression in aggressive non-Hodgkin’s lymphoma. Groupe d’Etude des Lymphomes de l’Adulte (GELA). Blood 87, 265–272 (1996).

  28. Gu, K. et al. t(14;18)-negative follicular lymphomas are associated with a high frequency of BCL6 rearrangement at the alternative breakpoint region. Mod. Pathol. 22, 1251–1257 (2009).

  29. Hatzi, K. & Melnick, A. Breaking bad in the germinal center: how deregulation of BCL6 contributes to lymphomagenesis. Trends Mol. Med. 20, 343–352 (2014).

  30. Fabregat, A. et al. The Reactome Pathway Knowledgebase. Nucleic Acids Res. 46, D649–D655 (2018).

  31. Freeman, B. E., Hammarlund, E., Raué, H. P. & Slifka, M. K. Regulation of innate CD8+ T-cell activation mediated by cytokines. Proc. Natl Acad. Sci. USA 109, 9971–9976 (2012).

  32. Hwang, B., Lee, J. H. & Bang, D. Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp. Mol. Med. 50, 96 (2018).

  33. Eling, N., Richard, A. C., Richardson, S., Marioni, J. C. & Vallejos, C. A. Correcting the mean-variance dependency for differential variability testing using single-cell RNA sequencing data. Cell Syst. 7, 284–294.e12 (2018).

  34. Kingma, D. P. & Ba, J. L. Adam: a method for stochastic optimization. Preprint at https://arxiv.org/pdf/1412.6980.pdf (2014).

  35. Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. Preprint at https://arxiv.org/pdf/1603.04467.pdf (2015).

  36. Sinha, D., Kumar, A., Kumar, H., Bandyopadhyay, S. & Sengupta, D. dropClust: efficient clustering of ultra-large scRNA-seq data. Nucleic Acids Res. 46, e36 (2018).

  37. Lun, A. T., McCarthy, D. J. & Marioni, J. C. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res. 5, 2122 (2016).

  38. Adam, M., Potter, A. S. & Potter, S. S. Psychrophilic proteases dramatically reduce single-cell RNA-seq artifacts: a molecular atlas of kidney development. Development 144, 3625–3632 (2017).

  39. O’Flanagan, C. H. et al. Dissociation of solid tumour tissues with cold active protease for single-cell RNA-seq minimizes conserved collagenase associated stress responses. Preprint at bioRxiv https://doi.org/10.1101/683227 (2019).

  40. Schelker, M. et al. Estimation of immune cell content in tumour tissue using single-cell RNA-seq data. Nat. Commun. 8, 2032 (2017).

  41. Scialdone, A. et al. Computational assignment of cell-cycle stage from single-cell transcriptome data. Methods 85, 54–61 (2015).

  42. Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods 12, 453–457 (2015).

  43. Shih, A. J. et al. Identification of grade and origin specific cell populations in serous epithelial ovarian cancer by single cell RNA-seq. PLoS ONE 13, e0206785 (2018).

  44. Liberzon, A. et al. The Molecular Signatures Database hallmark gene set collection. Cell Syst. 1, 417–425 (2015).

  45. Uhlen, M. et al. A pathology atlas of the human cancer transcriptome. Science 357, eaan2507 (2017).

  46. Perisic Matic, L. et al. Phenotypic modulation of smooth muscle cells in atherosclerosis is associated with downregulation of LMOD1, SYNPO2, PDLIM7, PLN, and SYNM. Arterioscler. Thromb. Vasc. Biol. 36, 1947–1961 (2016).

  47. Espagnolle, N. et al. CD146 expression on mesenchymal stem cells is associated with their vascular smooth muscle commitment. J. Cell. Mol. Med. 18, 104–114 (2014).

  48. Rocnik, E., Saward, L. & Pickering, J. G. HSP47 expression by smooth muscle cells is increased during arterial development and lesion formation and is inhibited by fibrillar collagen. Arterioscler. Thromb. Vasc. Biol. 21, 40–46 (2001).

  49. Mura, M. et al. Identification and angiogenic role of the novel tumor endothelial marker CLEC14A. Oncogene 31, 293–305 (2012).

  50. Deenick, E. K. & Ma, C. S. The regulation and role of T follicular helper cells in immunity. Immunology 134, 361–367 (2011).

  51. Payne, D., Drinkwater, S., Baretto, R., Duddridge, M. & Browning, M. J. Expression of chemokine receptors CXCR4, CXCR5 and CCR7 on B and T lymphocytes from patients with primary antibody deficiency. Clin. Exp. Immunol. 156, 254–262 (2009).

Acknowledgements

We thank V. Svensson for his feedback on this manuscript. We also thank W. W. Wasserman, B. H. Nelson, P. T. Hamilton and A. Miranda for helpful discussions. A.W.Z. is funded by scholarships from the Canadian Institutes of Health Research (CIHR) (Vanier Canada Graduate Scholarship, Michael Smith Foreign Study Supplement) and a BC Children’s Hospital (UBC) MD/PhD studentship. K.R.C. is funded by postdoctoral fellowships from the CIHR (Banting) no. 01353-000, the Canadian Statistical Sciences Institute and the UBC Data Science Institute no. 201803. S.P.S. is a Susan G. Komen scholar. We acknowledge the generous funding support provided by the BC Cancer Foundation. In addition, S.P.S. receives operating funds from the CIHR (grant no. FDN-143246), Terry Fox Research Institute (grant nos. 1021 and 1061) and the Canadian Cancer Society (grant no. 705636). This work was supported by Cancer Research UK (grant no. C31893/A25050 to S.A. and S.P.S.). S.P.S. is supported by the Nicholls-Biondi endowed chair and the Cycle for Survival benefitting Memorial Sloan Kettering Cancer Center. C.S. is an Allen Distinguished Investigator supported by the Allen Frontiers Group no. 12829.

Author information

A.W.Z., K.R.C. and S.P.S. designed the study. A.W.Z., K.R.C. and S.P.S. wrote the manuscript. A.W.Z., C.O.F., E.A.C., J.L.P.L., A. McPherson, A. Mottok, N.C., L.C., M.W., T.A., A.P.W., J.N.M., S.A., C. Steidl, K.R.C. and S.P.S. reviewed the manuscript. A.W.Z., S.A., C. Steidl, K.R.C. and S.P.S. interpreted the data. B.H., D.L., L.C. and C. Sarkozy curated the data. A.W.Z., K.R.C., N.C., M.W., P.W., T.C. and X.W. analyzed the data. A.W.Z., K.R.C. and S.P.S. developed the model. C.O.F., E.A.C. and J.L.P.L. performed the single-cell processing. A. Mottok, J.N.M., C. Steidl and C. Sarkozy performed the case identification. K.R.C., S.P.S., C. Steidl and S.A. supervised the study.

Correspondence to Kieran R. Campbell or Sohrab P. Shah.

Ethics declarations

Competing interests

S.P.S. and S.A. are founders, shareholders and consultants of Contextual Genomics.

Additional information

Peer review information: Lei Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Simulation performance across a range of proportions of differentially expressed genes, using differential expression parameters derived from comparing naïve CD8+ and naïve CD4+ T cells.

(a) Accuracy and cell-level F1 score (Methods) for varying proportions of differentially expressed genes per cell type. All methods were provided with expression data for the same set of marker genes. *, **, *** denote FDR-adjusted p-values (Wilcoxon signed-rank test) for pairwise comparisons between CellAssign and other methods < 0.05, 0.01, 0.001 respectively. Nine simulated datasets were generated for each parameter setting. Dotted lines separate marker-based, unsupervised, and supervised methods. (b) Correspondence between true simulated log fold change values and log fold change (δ) values inferred by CellAssign. R refers to the Pearson correlation between true and inferred logFC values for CellAssign. (c and d) Performance of CellAssign where a certain proportion of entries in the marker gene matrix are flipped at random, using (c) 5 and (d) 20 marker genes per cell type. Nine simulated datasets were generated for each parameter setting. In all boxplots, lower and upper hinges denote the 1st and 3rd quartiles, with whiskers extending to the largest value less than 1.5 × the inter-quartile range.
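The perturbation described for panels c and d can be illustrated with a short sketch. This is not the authors' code: it assumes the marker gene matrix is binary (1 where a gene is a marker for a cell type, 0 otherwise), and the function name `flip_marker_entries`, the rounding rule and the seed argument are hypothetical choices made for illustration.

```python
import random


def flip_marker_entries(marker_matrix, proportion, seed=0):
    """Return a copy of a binary marker matrix with a share of entries flipped.

    marker_matrix: list of rows (genes), each a list of 0/1 flags (cell types).
    proportion: fraction of all entries to flip, between 0 and 1.
    The input matrix is left unchanged.
    """
    rng = random.Random(seed)
    positions = [
        (i, j)
        for i in range(len(marker_matrix))
        for j in range(len(marker_matrix[i]))
    ]
    # Assumption: the number of flipped entries is the rounded proportion.
    n_flip = round(proportion * len(positions))
    flipped = [row[:] for row in marker_matrix]
    # Sample positions without replacement so each entry flips at most once.
    for i, j in rng.sample(positions, n_flip):
        flipped[i][j] = 1 - flipped[i][j]
    return flipped


if __name__ == "__main__":
    markers = [[1, 0], [0, 1], [1, 1], [0, 0]]
    print(flip_marker_entries(markers, proportion=0.25, seed=1))
```

Sampling without replacement keeps the realized proportion of corrupted entries equal to the requested one, which makes results comparable across simulated datasets.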

Supplementary Figure 2 Simulation performance across a range of proportions of differentially expressed genes, using differential expression parameters derived from comparing B and CD8+ T cells.

(a) Accuracy and cell-level F1 score (Methods) for varying proportions of differentially expressed genes per cell type. CellAssign was provided with a set of marker genes (Methods); all other methods were provided with all genes. Asterisks indicate FDR-adjusted statistical significance (Wilcoxon signed-rank test) for pairwise comparisons between CellAssign and other methods. Nine simulated datasets were generated for each parameter setting. (b) Accuracy and cell-level F1 score for varying proportions of differentially expressed genes per cell type. All methods were provided with the same set of marker genes. Nine simulated datasets were generated for each parameter setting. *, **, *** denote FDR-adjusted p-values (Wilcoxon signed-rank test) for pairwise comparisons between CellAssign and other methods < 0.05, 0.01, 0.001 respectively. (c) Correspondence between true simulated log fold change values and log fold change (δ) values inferred by CellAssign. R refers to the Pearson correlation between true and inferred logFC values for CellAssign. n = 1000 single-cells were simulated. (d) Performance of CellAssign where a certain proportion of entries in the marker gene matrix are flipped at random. Nine simulated datasets were generated for each parameter setting. Dotted lines separate marker-based, unsupervised, and supervised methods. In all boxplots, lower and upper hinges denote the 1st and 3rd quartiles, with whiskers extending to the largest value less than 1.5 × the inter-quartile range.

Supplementary Figure 3 Simulation performance (accuracy and cell-level F1 score) across a range of proportions of differentially expressed genes, where clusters from “unsupervised” algorithms are assigned to respective cell types through correlation to the transcriptomes of purified cell types.

Dotted lines separate marker-based, unsupervised, and supervised methods. (a) & (b) Simulating data based on parameters for B and CD8+ T cells, and running algorithms for whole-transcriptome and marker gene data only, respectively. (c) & (d) Simulating data based on parameters for naïve CD4+ and CD8+ T cells, and running algorithms for whole-transcriptome and marker gene data only, respectively. In all boxplots, lower and upper hinges denote the 1st and 3rd quartiles, with whiskers extending to the largest value less than 1.5 × the inter-quartile range. *, **, *** denote FDR-adjusted p-values (two-sided Wilcoxon signed-rank test) for pairwise comparisons between CellAssign and other methods < 0.05, 0.01, 0.001 respectively. In all cases, n = 1000 single-cells were simulated.
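The boxplot convention used throughout these captions can be made concrete with a small sketch. It reads the caption as the common Tukey convention, in which each whisker reaches the most extreme data point within 1.5 × the inter-quartile range of its hinge; that reading, the function name `boxplot_stats` and the choice of quartile interpolation (`method="inclusive"`) are assumptions, not details confirmed by the source.

```python
import statistics


def boxplot_stats(values):
    """Compute hinges and whisker ends for a Tukey-style boxplot.

    Hinges are the 1st and 3rd quartiles. Each whisker extends to the most
    extreme observation lying within 1.5 x IQR of the corresponding hinge.
    """
    data = sorted(values)
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "lower_hinge": q1,
        "upper_hinge": q3,
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


if __name__ == "__main__":
    print(boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```

Points beyond the whiskers are reported separately as outliers rather than stretching the whisker, which is why a single extreme dataset does not distort the box.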

Supplementary Figure 4 Performance (accuracy and cell type-level F1 score, Methods) of CellAssign and the best-performing clustering methods evaluated in ref. 6 on FACS-purified H7 human embryonic stem cells in various stages of differentiation.

t-SNE plots of (a) ground-truth FACS annotations; (b) CellAssign-derived annotations; (c) SCINA-derived annotations; (d) SC3 clusters (using all genes); (e) Seurat clusters (resolution = 0.8, using all genes); (f) Seurat clusters (resolution = 0.8, using the same marker gene set used by CellAssign); (g) Seurat clusters (resolution = 1.2, using all genes); (h) Seurat clusters (resolution = 1.2, using the same marker gene set used by CellAssign).

Supplementary Figure 5 Expression of select marker genes in HGSC single cell RNA-seq data.

(a) Expression (log normalized counts) of PECAM1 (for endothelial cells), CD3D (for T cells), CD79A (for B cells), KLHDC8A (for ovary-derived cells), ACTA2 (for myofibroblasts and smooth muscle), MYH11 (for smooth muscle), and MCAM (for vascular cell types including endothelial cells, vascular smooth muscle, and pericytes). (b) Expression (log normalized counts) of marker genes expressed in epithelial ovarian cancers but not in normal ovarian tissue. Expression values were winsorized between 0 and 4.
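As a brief illustration of the display transform mentioned in this caption, the sketch below clamps expression values to a fixed range. It assumes that "winsorized between 0 and 4" means values below 0 are set to 0 and values above 4 are set to 4; the function name `winsorize` is hypothetical.

```python
def winsorize(values, lower=0.0, upper=4.0):
    """Clamp each value to the closed interval [lower, upper].

    Assumption: fixed bounds are used (as in the caption), rather than
    bounds derived from percentiles of the data.
    """
    return [min(max(v, lower), upper) for v in values]


if __name__ == "__main__":
    log_counts = [-0.2, 0.0, 1.7, 3.9, 6.3]
    print(winsorize(log_counts))
```

Clamping only affects the color scale of the plot: a few very highly expressing cells no longer compress the visible range for the remaining cells.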

Supplementary Figure 6 Comparison of clusters from CellAssign and state-of-the-art unsupervised clustering approaches (ref. 6) on HGSC single cell RNA-seq data.

(a) Expression (log normalized counts) of key marker genes of hematopoietic subpopulations CD3D (for T cells), CD79A (for B cells), and CD14 (for monocytes/macrophages). Expression values were winsorized between 0 and 4. UMAP plots of (b) CellAssign-derived annotations; (c) SC3 clusters (using all genes); (d) Seurat clusters (resolution = 0.8, using all genes); (e) Seurat clusters (resolution = 0.8, using the same marker gene set used by CellAssign); (f) Seurat clusters (resolution = 1.2, using all genes); (g) Seurat clusters (resolution = 1.2, using the same marker gene set used by CellAssign).

Supplementary Figure 7 Cluster-specific HLA expression in HGSC epithelial cells.

(a) Expression (log normalized counts) of HLA class I genes in all HGSC cells. Expression values were clipped from 0 to 8. (b) Expression of HLA class I genes across cell types in all HGSC cells. Epithelial (1): epithelial cells from cluster 1. Epithelial (other): epithelial cells from all other clusters. Lower and upper hinges denote the 1st and 3rd quartiles, with whiskers extending to the largest value less than 1.5 × the inter-quartile range. n = 4847 single-cells total (15 B cells, 24 T cells, 99 Monocyte/Macrophage, 361 endothelial, 300 vascular smooth muscle cells, 438 ovarian myofibroblast, 809 ovarian stromal, 750 epithelial (1), 2051 epithelial (other)).

Supplementary Figure 8 Cluster-specific phenotypes in HGSC epithelial cells.

(a) Hallmark pathway enrichment results for epithelial clusters 3 (n = 161 single cells) vs. 1 (n = 750 single cells) from the left ovary sample. (b) Gene-level differential expression for epithelial clusters 3 vs. 1, with statistical testing performed using the findMarkers function in the scran R package. (c and d) Hallmark pathway enrichment results for epithelial clusters (c) 2 vs. 0; and (d) 2 vs. 4. (e) Expression (log normalized counts) of select hypoxia-associated markers in HGSC epithelial cells. Expression values were winsorized between 0 and 4.

Supplementary Figure 9 Cell-type specific expression in follicular lymphoma.

(a) Expression (log normalized counts) of select marker genes CD2 (for T cells), MS4A1 (for B cells), CD8A and GZMA (for CD8+ T cells), CD4 (for T follicular helper cells and other CD4+ T cells) and CXCR5 and ICA1 (for T follicular helper cells). Expression values were winsorized between 0 and 3. (b) Heatmap of marker gene expression, labeled by maximum probability CellAssign-inferred cell types.

Supplementary Figure 10 Comparison of clusters from CellAssign and state-of-the-art unsupervised clustering approaches (ref. 6) on follicular lymphoma single cell RNA-seq data (showing only T cell subtypes).

Supplementary Figure 11 Expression (log normalized counts) of κ and λ light chain constant region genes in nonmalignant B cells.

Class assignments were determined by CellAssign (Methods).

Supplementary Figure 12 Expression (log normalized counts) of selected marker genes (CD2, CD3D, and CD3E for T cells; CD79A, MS4A1, and CD19 for B cells) in scvis embedding of reactive lymph node data.

Expression values were winsorized between 0 and 3.

Supplementary Figure 13 Differential gene regulation for FL1018 and FL2001.

Differential expression results using scran’s findMarkers for malignant vs. nonmalignant B cells in (a) FL1018 and (b) FL2001. Comparisons were performed accounting for timepoint and potential interactions between malignant status and timepoint using a multivariate linear model described in Methods. Genes upregulated among malignant cells have logFC values > 0. P-values were adjusted with the Benjamini-Hochberg method. Significantly enriched Reactome pathways (BH-adjusted P-value ≤ 0.05) among the top 50 most highly upregulated genes (ranked by log fold change) in (c) FL1018 and (d) FL2001. Up to 30 pathways are shown in either plot (Methods). Differentially expressed genes (found using scran’s findMarkers) for (e) T follicular helper and (f) other CD4 T cells between T2 vs. T1. Genes upregulated in T2 have log fold change values > 0. The activation marker CD69 is highlighted. P-values were adjusted with the Benjamini-Hochberg method.
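For readers unfamiliar with the Benjamini-Hochberg adjustment named in this caption, a minimal pure-Python sketch of the standard step-up procedure follows. It is an illustration of the general method, not the code used in the study (which reports using scran's findMarkers); the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Each p-value is scaled by m / rank (rank 1 = smallest p-value), then a
    running minimum taken from the largest rank downward enforces that
    adjusted values never decrease with rank. Results are capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]
    adj = benjamini_hochberg(raw)
    # Genes (or pathways) passing a 0.05 threshold after adjustment.
    print([i for i, q in enumerate(adj) if q <= 0.05])
```

A threshold such as the caption's "BH-adjusted P-value ≤ 0.05" is then applied to the adjusted values, not to the raw ones.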

Supplementary Figure 14 Fitting single cell RNA-seq simulation models to the Zheng PBMC 68k dataset, using cell type annotations provided in ref. 51 (n = 66205 single cells), and to FACS-purified data from Koh et al. (n = 369 single cells).

(a) Log fold change values computed from differential expression analysis between naïve CD8+ and naïve CD4+ T cells. (b) ‘Null’ log fold change values computed by randomly splitting naïve CD8+ T cells into equally sized halves 10 times. (c) Quantile-quantile (QQ) plot comparing observed log fold change values between naïve CD8+ and naïve CD4+ T cells and posterior predictive samples from the splatter model (Methods). (d) Quantile-quantile (QQ) plot comparing observed log fold change values between naïve CD8+ and naïve CD4+ T cells and posterior predictive samples from the modified model (Methods). (e) Log fold change values computed from differential expression analysis between human embryonic stem cells (hESCs) and day 3 somite cells (ESMT). (f) ‘Null’ log fold change values computed by randomly splitting anterior primitive streak cells into equally sized halves 10 times. (g) Quantile-quantile (QQ) plot comparing observed log fold change values between hESC and ESMT cells and posterior predictive samples from the splatter model (Methods). (h) Quantile-quantile (QQ) plot comparing observed log fold change values between hESC and ESMT cells and posterior predictive samples from the modified model (Methods).
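The "null" log fold change construction in panels b and f can be sketched as follows for a single gene. The caption specifies only the random split into equally sized halves, repeated 10 times; the log2 ratio of mean expression, the pseudocount and the function name `null_log_fold_changes` are assumptions made for illustration and may differ from the differential expression procedure in the Methods.

```python
import math
import random


def null_log_fold_changes(expression, n_splits=10, pseudocount=1.0, seed=0):
    """Null log2 fold changes for one gene within a single cell population.

    expression: per-cell expression values for one gene.
    Each repetition shuffles the cells, splits them into two equally sized
    halves (dropping one cell if the count is odd) and compares the means.
    """
    rng = random.Random(seed)
    half = len(expression) // 2
    results = []
    for _ in range(n_splits):
        shuffled = list(expression)
        rng.shuffle(shuffled)
        first, second = shuffled[:half], shuffled[half:2 * half]
        mean_first = sum(first) / half
        mean_second = sum(second) / half
        # Assumed statistic: log2 ratio of pseudocount-shifted means.
        ratio = (mean_first + pseudocount) / (mean_second + pseudocount)
        results.append(math.log2(ratio))
    return results


if __name__ == "__main__":
    counts = [0, 3, 1, 0, 2, 5, 1, 0, 4, 2]
    print(null_log_fold_changes(counts))
```

Because both halves come from the same population, the spread of these values reflects sampling noise alone and gives a baseline against which real between-type fold changes can be judged.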

Supplementary Figure 15 Benchmarking results for CellAssign across a range of simulated data set sizes (number of cells), number of cell types being inferred, and number of marker genes per cell type.

(a) Runtime (to convergence, defined as a relative change in log-likelihood < 10^−3 between successive iterations) as a function of data set size and the number of marker genes used per cell type, on simulated data (Methods). Two cell types were used. (b) Runtime (to convergence, defined as a relative change in log-likelihood < 10^−3 between successive iterations) as a function of the number of cell types and the number of marker genes used per cell type, on simulated data. One thousand cells were used. n = 5 simulated datasets were generated for each parameter setting. In all boxplots, lower and upper hinges denote the 1st and 3rd quartiles, with whiskers extending to the largest value less than 1.5 × the inter-quartile range.
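The stopping rule quoted in this caption can be sketched as a simple check over a sequence of log-likelihood values. This is an illustration under stated assumptions: the relative change is taken as the absolute difference divided by the absolute value of the previous log-likelihood, and the function name `first_converged_iteration` is hypothetical rather than part of the CellAssign package.

```python
def first_converged_iteration(log_likelihoods, tol=1e-3):
    """Return the first iteration index at which the relative change in
    log-likelihood from the previous iteration falls below tol.

    Returns None if the sequence never meets the criterion.
    """
    for i in range(1, len(log_likelihoods)):
        prev, curr = log_likelihoods[i - 1], log_likelihoods[i]
        # Skip a zero denominator rather than dividing by it.
        if prev != 0 and abs(curr - prev) / abs(prev) < tol:
            return i
    return None


if __name__ == "__main__":
    trace = [-1000.0, -900.0, -899.5, -899.4]
    print(first_converged_iteration(trace))
```

A relative criterion makes the tolerance independent of dataset size, since the log-likelihood magnitude grows with the number of cells and genes.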

Supplementary information

Supplementary Information

Supplementary Figs. 1–15 and Supplementary Note.

Reporting Summary

Supplementary Table 1

Performance measures on simulated data

Supplementary Table 2

Marker gene matrices used in analysis

Supplementary Table 3

Pathway enrichment results for follicular lymphoma and HGSC data
