Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling

Abstract

Single-cell RNA sequencing has enabled the decomposition of complex tissues into functionally distinct cell types. Often, investigators wish to assign cells to cell types through unsupervised clustering followed by manual annotation or via ‘mapping’ to existing data. However, manual interpretation scales poorly to large datasets, mapping approaches require purified or pre-annotated data and both are prone to batch effects. To overcome these issues, we present CellAssign, a probabilistic model that leverages prior knowledge of cell-type marker genes to annotate single-cell RNA sequencing data into predefined or de novo cell types. CellAssign automates the process of assigning cells in a highly scalable manner across large datasets while controlling for batch and sample effects. We demonstrate the advantages of CellAssign through extensive simulations and analysis of tumor microenvironment composition in high-grade serous ovarian cancer and follicular lymphoma.

Fig. 1: Overview of CellAssign.
Fig. 2: Performance of CellAssign on simulated data.
Fig. 3: CellAssign infers the composition of the HGSC microenvironment.
Fig. 4: CellAssign infers the composition of the follicular lymphoma microenvironment.
Fig. 5: Temporal changes in nonmalignant cells in the follicular lymphoma microenvironment.

Data availability

Raw sequencing data for all experiments in this paper are available from the European Genome-phenome Archive (accession no. EGAD00001004585).

Code availability

CellAssign is available as an R package at www.github.com/irrationone/cellassign.

References

  1. Schaum, N. et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 562, 367–372 (2018).

  2. Kiselev, V. Y. et al. SC3: consensus clustering of single-cell RNA-seq data. Nat. Methods 14, 483–486 (2017).

  3. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).

  4. Žurauskienė, J. & Yau, C. pcaReduce: hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics 17, 140 (2016).

  5. Levine, J. H. et al. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell 162, 184–197 (2015).

  6. Duò, A., Robinson, M. D. & Soneson, C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Res. 7, 1141 (2018).

  7. Freytag, S., Tian, L., Lönnstedt, I., Ng, M. & Bahlo, M. Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data. F1000Res. 7, 1297 (2018).

  8. Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 273–282 (2019).

  9. Zhang, X. et al. CellMarker: a manually curated resource of cell markers in human and mouse. Nucleic Acids Res. 47, D721–D728 (2019).

  10. Kiselev, V. Y., Yiu, A. & Hemberg, M. scmap: projection of single-cell RNA-seq data across data sets. Nat. Methods 15, 359–362 (2018).

  11. Hicks, S. C., Townes, F. W., Teng, M. & Irizarry, R. A. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 19, 562–578 (2018).

  12. Koh, P. W. et al. An atlas of transcriptional, chromatin accessibility, and surface marker changes in human mesoderm development. Sci. Data 3, 160109 (2016).

  13. Tian, L. et al. scRNA-seq mixology: towards better benchmarking of single cell RNA-seq protocols and analysis methods. Preprint at bioRxiv https://doi.org/10.1101/433102 (2018).

  14. Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).

  15. Ding, J., Shah, S. & Condon, A. densityCut: an efficient and versatile topological approach for automatic clustering of biological data. Bioinformatics 32, 2567–2576 (2016).

  16. Langfelder, P., Zhang, B. & Horvath, S. Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics 24, 719–720 (2008).

  17. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).

  18. Zhang, Z. et al. SCINA: A semi-supervised subtyping algorithm of single cells and bulk samples. Genes 10, 531 (2019).

  19. MacParland, S. A. et al. Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations. Nat. Commun. 9, 4383 (2018).

  20. McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at https://arxiv.org/pdf/1802.03426.pdf (2018).

  21. Zhang, A. W. et al. Interfaces of malignant and immunologic clonal dynamics in ovarian cancer. Cell 173, 1755–1769.e22 (2018).

  22. Kristiansen, G. et al. CD24 is expressed in ovarian cancer and is a new independent prognostic marker of patient survival. Am. J. Pathol. 161, 1215–1221 (2002).

  23. Hylander, B. et al. Expression of Wilms tumor gene (WT1) in epithelial ovarian cancer. Gynecol. Oncol. 101, 12–17 (2006).

  24. Andor, N. et al. Single-cell RNA-Seq of lymphoma cancers reveals malignant B-cell types and coexpression of T-cell immune checkpoints. Blood 133, 1119–1129 (2019).

  25. Jefferis, R. & Lefranc, M.-P. Human immunoglobulin allotypes: possible implications for immunogenicity. MAbs 1, 332–338 (2009).

  26. Ding, J., Condon, A. & Shah, S. P. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat. Commun. 9, 2002 (2018).

  27. Hermine, O. et al. Prognostic significance of bcl-2 protein expression in aggressive non-Hodgkin’s lymphoma. Groupe d’Etude des Lymphomes de l’Adulte (GELA). Blood 87, 265–272 (1996).

  28. Gu, K. et al. t(14;18)-negative follicular lymphomas are associated with a high frequency of BCL6 rearrangement at the alternative breakpoint region. Mod. Pathol. 22, 1251–1257 (2009).

  29. Hatzi, K. & Melnick, A. Breaking bad in the germinal center: how deregulation of BCL6 contributes to lymphomagenesis. Trends Mol. Med. 20, 343–352 (2014).

  30. Fabregat, A. et al. The Reactome Pathway Knowledgebase. Nucleic Acids Res. 46, D649–D655 (2018).

  31. Freeman, B. E., Hammarlund, E., Raué, H. P. & Slifka, M. K. Regulation of innate CD8+ T-cell activation mediated by cytokines. Proc. Natl Acad. Sci. USA 109, 9971–9976 (2012).

  32. Hwang, B., Lee, J. H. & Bang, D. Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp. Mol. Med. 50, 96 (2018).

  33. Eling, N., Richard, A. C., Richardson, S., Marioni, J. C. & Vallejos, C. A. Correcting the mean-variance dependency for differential variability testing using single-cell RNA sequencing data. Cell Syst. 7, 284–294.e12 (2018).

  34. Kingma, D. P. & Ba, J. L. Adam: a method for stochastic optimization. Preprint at https://arxiv.org/pdf/1412.6980.pdf (2014).

  35. Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. Preprint at https://arxiv.org/pdf/1603.04467.pdf (2015).

  36. Sinha, D., Kumar, A., Kumar, H., Bandyopadhyay, S. & Sengupta, D. dropClust: efficient clustering of ultra-large scRNA-seq data. Nucleic Acids Res. 46, e36 (2018).

  37. Lun, A. T., McCarthy, D. J. & Marioni, J. C. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res. 5, 2122 (2016).

  38. Adam, M., Potter, A. S. & Potter, S. S. Psychrophilic proteases dramatically reduce single-cell RNA-seq artifacts: a molecular atlas of kidney development. Development 144, 3625–3632 (2017).

  39. O’Flanagan, C. H. et al. Dissociation of solid tumour tissues with cold active protease for single-cell RNA-seq minimizes conserved collagenase associated stress responses. Preprint at bioRxiv https://doi.org/10.1101/683227 (2019).

  40. Schelker, M. et al. Estimation of immune cell content in tumour tissue using single-cell RNA-seq data. Nat. Commun. 8, 2032 (2017).

  41. Scialdone, A. et al. Computational assignment of cell-cycle stage from single-cell transcriptome data. Methods 85, 54–61 (2015).

  42. Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods 12, 453–457 (2015).

  43. Shih, A. J. et al. Identification of grade and origin specific cell populations in serous epithelial ovarian cancer by single cell RNA-seq. PLoS ONE 13, e0206785 (2018).

  44. Liberzon, A. et al. The Molecular Signatures Database hallmark gene set collection. Cell Syst. 1, 417–425 (2015).

  45. Uhlen, M. et al. A pathology atlas of the human cancer transcriptome. Science 357, eaan2507 (2017).

  46. Perisic Matic, L. et al. Phenotypic modulation of smooth muscle cells in atherosclerosis is associated with downregulation of LMOD1, SYNPO2, PDLIM7, PLN, and SYNM. Arterioscler. Thromb. Vasc. Biol. 36, 1947–1961 (2016).

  47. Espagnolle, N. et al. CD146 expression on mesenchymal stem cells is associated with their vascular smooth muscle commitment. J. Cell. Mol. Med. 18, 104–114 (2014).

  48. Rocnik, E., Saward, L. & Pickering, J. G. HSP47 expression by smooth muscle cells is increased during arterial development and lesion formation and is inhibited by fibrillar collagen. Arterioscler. Thromb. Vasc. Biol. 21, 40–46 (2001).

  49. Mura, M. et al. Identification and angiogenic role of the novel tumor endothelial marker CLEC14A. Oncogene 31, 293–305 (2012).

  50. Deenick, E. K. & Ma, C. S. The regulation and role of T follicular helper cells in immunity. Immunology 134, 361–367 (2011).

  51. Payne, D., Drinkwater, S., Baretto, R., Duddridge, M. & Browning, M. J. Expression of chemokine receptors CXCR4, CXCR5 and CCR7 on B and T lymphocytes from patients with primary antibody deficiency. Clin. Exp. Immunol. 156, 254–262 (2009).

Acknowledgements

We thank V. Svensson for his feedback on this manuscript. We also thank W. W. Wasserman, B. H. Nelson, P. T. Hamilton and A. Miranda for helpful discussions. A.W.Z. is funded by scholarships from the Canadian Institutes of Health Research (CIHR) (Vanier Canada Graduate Scholarship, Michael Smith Foreign Study Supplement) and a BC Children’s Hospital (UBC) MD/PhD studentship. K.R.C. is funded by postdoctoral fellowships from the CIHR (Banting) no. 01353-000, the Canadian Statistical Sciences Institute and the UBC Data Science Institute no. 201803. S.P.S. is a Susan G. Komen scholar. We acknowledge the generous funding support provided by the BC Cancer Foundation. In addition, S.P.S. receives operating funds from the CIHR (grant no. FDN-143246), Terry Fox Research Institute (grant nos. 1021 and 1061) and the Canadian Cancer Society (grant no. 705636). This work was supported by Cancer Research UK (grant no. C31893/A25050 to S.A. and S.P.S.). S.P.S. is supported by the Nicholls-Biondi endowed chair and the Cycle for Survival benefitting Memorial Sloan Kettering Cancer Center. C.S. is an Allen Distinguished Investigator supported by the Allen Frontiers Group no. 12829.

Author information

Contributions

A.W.Z., K.R.C. and S.P.S. designed the study. A.W.Z., K.R.C. and S.P.S. wrote the manuscript. A.W.Z., C.O.F., E.A.C., J.L.P.L., A. McPherson, A. Mottok, N.C., L.C., M.W., T.A., A.P.W., J.N.M., S.A., C. Steidl, K.R.C. and S.P.S. reviewed the manuscript. A.W.Z., S.A., C. Steidl, K.R.C. and S.P.S. interpreted the data. B.H., D.L., L.C. and C. Sarkozy curated the data. A.W.Z., K.R.C., N.C., M.W., P.W., T.C. and X.W. analyzed the data. A.W.Z., K.R.C. and S.P.S. developed the model. C.O.F., E.A.C. and J.L.P.L. performed the single-cell processing. A. Mottok, J.N.M., C. Steidl and C. Sarkozy performed the case identification. K.R.C., S.P.S., C. Steidl and S.A. supervised the study.

Corresponding authors

Correspondence to Kieran R. Campbell or Sohrab P. Shah.

Ethics declarations

Competing interests

S.P.S. and S.A. are founders, shareholders and consultants of Contextual Genomics.

Additional information

Peer review information: Lei Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Simulation performance across a range of proportions of differentially expressed genes, using differential expression parameters derived from comparing naïve CD8+ and naïve CD4+ T cells.

(a) Accuracy and cell-level F1 score (Methods) for varying proportions of differentially expressed genes per cell type. All methods were provided with expression data for the same set of marker genes. *, **, *** denote FDR-adjusted p-values (Wilcoxon signed-rank test) for pairwise comparisons between CellAssign and other methods < 0.05, 0.01, 0.001 respectively. Nine simulated datasets were generated for each parameter setting. Dotted lines separate marker-based, unsupervised, and supervised methods. (b) Correspondence between true simulated log fold change values and log fold change (δ) values inferred by CellAssign. R refers to the Pearson correlation between true and inferred logFC values for CellAssign. (c and d) Performance of CellAssign where a certain proportion of entries in the marker gene matrix are flipped at random, using (c) 5 and (d) 20 marker genes per cell type. Nine simulated datasets were generated for each parameter setting. In all boxplots, lower and upper hinges denote the 1st and 3rd quartiles, with whiskers extending to the largest value less than 1.5 × the inter-quartile range.
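The perturbation in panels (c) and (d), flipping a proportion of marker gene matrix entries at random, could be sketched roughly as follows. This is only an illustration: the function name `flip_marker_entries`, the toy matrix, and the choice to sample flipped positions uniformly without replacement are assumptions, not details taken from the paper's Methods.

```python
import random

def flip_marker_entries(marker_matrix, proportion, seed=0):
    """Return a copy of a binary marker matrix with a share of entries flipped.

    marker_matrix: list of rows (one per gene), each a list of 0/1 values
    (one per cell type). proportion: fraction of all entries to flip.
    Assumption: positions are drawn uniformly without replacement.
    """
    rng = random.Random(seed)
    n_rows = len(marker_matrix)
    n_cols = len(marker_matrix[0])
    positions = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    n_flip = round(proportion * len(positions))
    flipped = [row[:] for row in marker_matrix]  # leave the input untouched
    for i, j in rng.sample(positions, n_flip):
        flipped[i][j] = 1 - flipped[i][j]  # 0 -> 1, 1 -> 0
    return flipped

# Hypothetical toy matrix: 4 genes x 2 cell types.
rho = [[1, 0], [1, 0], [0, 1], [0, 1]]
noisy = flip_marker_entries(rho, proportion=0.25)
n_changed = sum(a != b for r0, r1 in zip(rho, noisy) for a, b in zip(r0, r1))
```

With 8 entries and a proportion of 0.25, exactly two entries differ between `rho` and `noisy`; which two depends on the seed.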

Supplementary Figure 2 Simulation performance across a range of proportions of differentially expressed genes, using differential expression parameters derived from comparing B and CD8+ T cells.

(a) Accuracy and cell-level F1 score (Methods) for varying proportions of differentially expressed genes per cell type. CellAssign was provided with a set of marker genes (Methods); all other methods were provided with all genes. Asterisks indicate FDR-adjusted statistical significance (Wilcoxon signed-rank test) for pairwise comparisons between CellAssign and other methods. Nine simulated datasets were generated for each parameter setting. (b) Accuracy and cell-level F1 score for varying proportions of differentially expressed genes per cell type. All methods were provided with the same set of marker genes. Nine simulated datasets were generated for each parameter setting. *, **, *** denote FDR-adjusted p-values (Wilcoxon signed-rank test) for pairwise comparisons between CellAssign and other methods < 0.05, 0.01, 0.001 respectively. (c) Correspondence between true simulated log fold change values and log fold change (δ) values inferred by CellAssign. R refers to the Pearson correlation between true and inferred logFC values for CellAssign. n = 1000 single-cells were simulated. (d) Performance of CellAssign where a certain proportion of entries in the marker gene matrix are flipped at random. Nine simulated datasets were generated for each parameter setting. Dotted lines separate marker-based, unsupervised, and supervised methods. In all boxplots, lower and upper hinges denote the 1st and 3rd quartiles, with whiskers extending to the largest value less than 1.5 × the inter-quartile range.

Supplementary Figure 3 Simulation performance (accuracy and cell-level F1 score) across a range of proportions of differentially expressed genes, where clusters from “unsupervised” algorithms are assigned to respective cell types through correlation to the transcriptomes of purified cell types.

Dotted lines separate marker-based, unsupervised, and supervised methods. (a) & (b) Simulating data based on parameters for B and CD8+ T cells, and running algorithms for whole-transcriptome and marker gene data only, respectively. (c) & (d) Simulating data based on parameters for naïve CD4+ and CD8+ T cells, and running algorithms for whole-transcriptome and marker gene data only, respectively. In all boxplots, lower and upper hinges denote the 1st and 3rd quartiles, with whiskers extending to the largest value less than 1.5 × the inter-quartile range. *, **, *** denote FDR-adjusted p-values (two-sided Wilcoxon signed-rank test) for pairwise comparisons between CellAssign and other methods < 0.05, 0.01, 0.001 respectively. In all cases, n = 1000 single-cells were simulated.

Supplementary Figure 4 Performance (accuracy and cell type-level F1 score, Methods) of CellAssign and the best-performing clustering methods evaluated by ref. 6 on FACS-purified H7 human embryonic stem cells in various stages of differentiation.

t-SNE plots of (a) ground-truth FACS annotations; (b) CellAssign-derived annotations; (c) SCINA-derived annotations; (d) SC3 clusters (using all genes); (e) Seurat clusters (resolution = 0.8, using all genes); (f) Seurat clusters (resolution = 0.8, using the same marker gene set used by CellAssign); (g) Seurat clusters (resolution = 1.2, using all genes); (h) Seurat clusters (resolution = 1.2, using the same marker gene set used by CellAssign).

Supplementary Figure 5 Expression of select marker genes in HGSC single cell RNA-seq data.

(a) Expression (log normalized counts) of PECAM1 (for endothelial cells), CD3D (for T cells), CD79A (for B cells), KLHDC8A (for ovary-derived cells), ACTA2 (for myofibroblasts and smooth muscle), MYH11 (for smooth muscle), and MCAM (for vascular cell types including endothelial cells, vascular smooth muscle, and pericytes). (b) Expression (log normalized counts) of marker genes expressed in epithelial ovarian cancers but not in normal ovarian tissue. Expression values were winsorized between 0 and 4.
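For readers unfamiliar with the term, winsorizing expression values between fixed bounds, as done for these plots, amounts to clamping each value into the stated range. A minimal sketch follows, assuming fixed bounds of 0 and 4 as in the caption; the function name and example values are hypothetical.

```python
def winsorize(values, lower=0.0, upper=4.0):
    """Clamp each value into [lower, upper]; values inside the range are unchanged."""
    return [min(max(v, lower), upper) for v in values]

# Hypothetical log-normalized expression values for one gene.
clipped = winsorize([-0.5, 1.2, 3.9, 6.7])
# clipped is [0.0, 1.2, 3.9, 4.0]
```

Clamping keeps a few very high values from dominating the colour scale of an embedding plot.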

Supplementary Figure 6 Comparison of clusters from CellAssign and state-of-the-art unsupervised clustering approaches (ref. 6) on HGSC single cell RNA-seq data.

(a) Expression (log normalized counts) of key marker genes of hematopoietic subpopulations CD3D (for T cells), CD79A (for B cells), and CD14 (for monocytes/macrophages). Expression values were winsorized between 0 and 4. UMAP plots of (b) CellAssign-derived annotations; (c) SC3 clusters (using all genes); (d) Seurat clusters (resolution = 0.8, using all genes); (e) Seurat clusters (resolution = 0.8, using the same marker gene set used by CellAssign); (f) Seurat clusters (resolution = 1.2, using all genes); (g) Seurat clusters (resolution = 1.2, using the same marker gene set used by CellAssign).

Supplementary Figure 7 Cluster-specific HLA expression in HGSC epithelial cells.

(a) Expression (log normalized counts) of HLA class I genes in all HGSC cells. Expression values were clipped from 0 to 8. (b) Expression of HLA class I genes across cell types in all HGSC cells. Epithelial (1): epithelial cells from cluster 1. Epithelial (other): epithelial cells from all other clusters. Lower and upper hinges denote the 1st and 3rd quartiles, with whiskers extending to the largest value less than 1.5 × the inter-quartile range. n = 4847 single-cells total (15 B cells, 24 T cells, 99 Monocyte/Macrophage, 361 endothelial, 300 vascular smooth muscle cells, 438 ovarian myofibroblast, 809 ovarian stromal, 750 epithelial (1), 2051 epithelial (other)).

Supplementary Figure 8 Cluster-specific phenotypes in HGSC epithelial cells.

(a) Hallmark pathway enrichment results for epithelial clusters 3 (n = 161 single cells) vs. 1 (n = 750 single cells) from the left ovary sample. (b) Gene-level differential expression for epithelial clusters 3 vs. 1, with statistical testing performed using the findMarkers function in the scran R package. (c and d) Hallmark pathway enrichment results for epithelial clusters (c) 2 vs. 0; and (d) 2 vs. 4. (e) Expression (log normalized counts) of select hypoxia-associated markers in HGSC epithelial cells. Expression values were winsorized between 0 and 4.

Supplementary Figure 9 Cell-type specific expression in follicular lymphoma.

(a) Expression (log normalized counts) of select marker genes CD2 (for T cells), MS4A1 (for B cells), CD8A and GZMA (for CD8+ T cells), CD4 (for T follicular helper cells and other CD4+ T cells) and CXCR5 and ICA1 (for T follicular helper cells). Expression values were winsorized between 0 and 3. (b) Heatmap of marker gene expression, labeled by maximum probability CellAssign-inferred cell types.

Supplementary Figure 10 Comparison of clusters from CellAssign and state-of-the-art unsupervised clustering approaches (ref. 6) on follicular lymphoma single cell RNA-seq data (showing only T cell subtypes).

Supplementary Figure 11 Expression (log normalized counts) of κ and λ light chain constant region genes in nonmalignant B cells.

Class assignments were determined by CellAssign (Methods).

Supplementary Figure 12 Expression (log normalized counts) of selected marker genes (CD2, CD3D, and CD3E for T cells; CD79A, MS4A1, and CD19 for B cells) in scvis embedding of reactive lymph node data.

Expression values were winsorized between 0 and 3.

Supplementary Figure 13 Differential gene regulation for FL1018 and FL2001.

Differential expression results using scran’s findMarkers for malignant vs. nonmalignant B cells in (a) FL1018 and (b) FL2001. Comparisons were performed accounting for timepoint and potential interactions between malignant status and timepoint using a multivariate linear model described in Methods. Genes upregulated among malignant cells have logFC values > 0. P-values were adjusted with the Benjamini-Hochberg method. Significantly enriched Reactome pathways (BH-adjusted P-value ≤ 0.05) among the top 50 most highly upregulated genes (ranked by log fold change) in (c) FL1018 and (d) FL2001. Up to 30 pathways are shown in either plot (Methods). Differentially expressed genes (found using scran’s findMarkers) for (e) T follicular helper and (f) other CD4 T cells between T2 vs. T1. Genes upregulated in T2 have log fold change values > 0. The activation marker CD69 is highlighted. P-values were adjusted with the Benjamini-Hochberg method.
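The Benjamini-Hochberg adjustment mentioned in this caption can be illustrated with a short, self-contained sketch of the textbook step-up procedure. It is not the code used by scran or by the authors, and the function name and input p-values are made up for the example.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p-values for four genes.
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

Here the smallest raw p-value (0.01) is scaled by 4/1 to 0.04, and the second and third genes share the adjusted value 0.04 × 4/3 because of the monotonicity step.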

Supplementary Figure 14 Fitting single cell RNA-seq simulation models to the Zheng PBMC 68k dataset, using cell type annotations provided in ref. 51 (n = 66205 single cells), and to FACS-purified data from Koh et al. (n = 369 single cells).

(a) Log fold change values computed from differential expression analysis between naïve CD8+ and naïve CD4+ T cells. (b) ‘Null’ log fold change values computed by randomly splitting naïve CD8+ T cells into equally sized halves 10 times. (c) Quantile-quantile (QQ) plot comparing observed log fold change values between naïve CD8+ and naïve CD4+ T cells and posterior predictive samples from the splatter model (Methods). (d) Quantile-quantile (QQ) plot comparing observed log fold change values between naïve CD8+ and naïve CD4+ T cells and posterior predictive samples from the modified model (Methods). (e) Log fold change values computed from differential expression analysis between human embryonic stem cells (hESCs) and day 3 somite cells (ESMT). (f) ‘Null’ log fold change values computed by randomly splitting anterior primitive streak cells into equally sized halves 10 times. (g) Quantile-quantile (QQ) plot comparing observed log fold change values between hESC and ESMT cells and posterior predictive samples from the splatter model (Methods). (h) Quantile-quantile (QQ) plot comparing observed log fold change values between hESC and ESMT cells and posterior predictive samples from the modified model (Methods).

Supplementary Figure 15 Benchmarking results for CellAssign across a range of simulated data set sizes (number of cells), number of cell types being inferred, and number of marker genes per cell type.

(a) Runtime (to convergence, defined as a relative change in log-likelihood < 10⁻³ between successive iterations) as a function of data set size and the number of marker genes used per cell type, on simulated data (Methods). Two cell types were used. (b) Runtime (to convergence, defined as a relative change in log-likelihood < 10⁻³ between successive iterations) as a function of the number of cell types and the number of marker genes used per cell type, on simulated data. One thousand cells were used. n = 5 simulated datasets were generated for each parameter setting. In all boxplots, lower and upper hinges denote the 1st and 3rd quartiles, with whiskers extending to the largest value less than 1.5 × the inter-quartile range.
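The convergence criterion used for these runtimes, a relative change in log-likelihood below 10⁻³ between successive iterations, might be checked along the following lines. The function names, the normalization by the previous iteration's absolute log-likelihood, and the example trace are assumptions for illustration only; the package's actual implementation may differ.

```python
def has_converged(prev_ll, curr_ll, tol=1e-3):
    """True if the relative change in log-likelihood is below tol.

    Assumption: the change is normalized by |prev_ll|, which must be nonzero.
    """
    return abs(curr_ll - prev_ll) / abs(prev_ll) < tol

def iterations_to_convergence(log_likelihoods, tol=1e-3):
    """Return the index of the first iteration meeting the criterion, or None."""
    for i in range(1, len(log_likelihoods)):
        if has_converged(log_likelihoods[i - 1], log_likelihoods[i], tol):
            return i
    return None

# Hypothetical log-likelihood trace from an iterative fit.
trace = [-1500.0, -1200.0, -1100.0, -1099.5]
stop_at = iterations_to_convergence(trace)
```

In this trace the relative changes are 0.2, about 0.083, and about 0.00045, so the criterion is first met at index 3.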

Supplementary information

Supplementary Information

Supplementary Figs. 1–15 and Supplementary Note.

Reporting Summary

Supplementary Table 1

Performance measures on simulated data

Supplementary Table 2

Marker gene matrices used in analysis

Supplementary Table 3

Pathway enrichment results for follicular lymphoma and HGSC data

About this article

Cite this article

Zhang, A.W., O’Flanagan, C., Chavez, E.A. et al. Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling. Nat Methods 16, 1007–1015 (2019). https://doi.org/10.1038/s41592-019-0529-1

  • DOI: https://doi.org/10.1038/s41592-019-0529-1
