Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling

Zhang, Allen W.; O’Flanagan, Ciara; Chavez, Elizabeth A.; Lim, Jamie L. P.; Ceglia, Nicholas; McPherson, Andrew; Wiens, Matt; Walters, Pascale; Chan, Tim; Hewitson, Brittany; Lai, Daniel; Mottok, Anja; Sarkozy, Clementine; Chong, Lauren; Aoki, Tomohiro; Wang, Xuehai; Weng, Andrew P; McAlpine, Jessica N.; Aparicio, Samuel; Steidl, Christian; Campbell, Kieran R.; Shah, Sohrab P.

doi:10.1038/s41592-019-0529-1

Article
Published: 09 September 2019

Allen W. Zhang¹,²,³ (ORCID: orcid.org/0000-0002-7606-089X),
Ciara O’Flanagan¹,
Elizabeth A. Chavez⁴,
Jamie L. P. Lim¹,²,
Nicholas Ceglia²,
Andrew McPherson¹,
Matt Wiens¹,
Pascale Walters¹,
Tim Chan¹,
Brittany Hewitson¹,
Daniel Lai¹ (ORCID: orcid.org/0000-0001-9203-6323),
Anja Mottok⁴,⁵,
Clementine Sarkozy⁴,
Lauren Chong⁴,
Tomohiro Aoki⁴,⁶ (ORCID: orcid.org/0000-0001-6782-8361),
Xuehai Wang⁷,
Andrew P Weng⁷,
Jessica N. McAlpine⁸,
Samuel Aparicio¹,⁶ (ORCID: orcid.org/0000-0002-0487-9599),
Christian Steidl⁴,
Kieran R. Campbell¹,⁹,¹⁰ (ORCID: orcid.org/0000-0003-1981-5763) &
Sohrab P. Shah¹,²,⁶ (ORCID: orcid.org/0000-0001-6402-523X)

Nature Methods volume 16, pages 1007–1015 (2019)

Abstract

Single-cell RNA sequencing has enabled the decomposition of complex tissues into functionally distinct cell types. Often, investigators wish to assign cells to cell types through unsupervised clustering followed by manual annotation or via ‘mapping’ to existing data. However, manual interpretation scales poorly to large datasets, mapping approaches require purified or pre-annotated data and both are prone to batch effects. To overcome these issues, we present CellAssign, a probabilistic model that leverages prior knowledge of cell-type marker genes to annotate single-cell RNA sequencing data into predefined or de novo cell types. CellAssign automates the process of assigning cells in a highly scalable manner across large datasets while controlling for batch and sample effects. We demonstrate the advantages of CellAssign through extensive simulations and analysis of tumor microenvironment composition in high-grade serous ovarian cancer and follicular lymphoma.

**Fig. 2: Performance of CellAssign on simulated data.**

**Fig. 3: CellAssign infers the composition of the HGSC microenvironment.**

**Fig. 4: CellAssign infers the composition of the follicular lymphoma microenvironment.**

**Fig. 5: Temporal changes in nonmalignant cells in the follicular lymphoma microenvironment.**

Data availability

Raw sequencing data for all experiments in this paper are available from the European Genome-phenome Archive (accession no. EGAD00001004585).

Code availability

CellAssign is available as an R package at www.github.com/irrationone/cellassign.

References

Schaum, N. et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 562, 367–372 (2018).
Kiselev, V. Y. et al. SC3: consensus clustering of single-cell RNA-seq data. Nat. Methods 14, 483–486 (2017).
Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
Žurauskienė, J. & Yau, C. pcaReduce: hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics 17, 140 (2016).
Levine, J. H. et al. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell 162, 184–197 (2015).
Duò, A., Robinson, M. D. & Soneson, C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Res. 7, 1141 (2018).
Freytag, S., Tian, L., Lönnstedt, I., Ng, M. & Bahlo, M. Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data. F1000Res. 7, 1297 (2018).
Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 273–282 (2019).
Zhang, X. et al. CellMarker: a manually curated resource of cell markers in human and mouse. Nucleic Acids Res. 47, D721–D728 (2019).
Kiselev, V. Y., Yiu, A. & Hemberg, M. scmap: projection of single-cell RNA-seq data across data sets. Nat. Methods 15, 359–362 (2018).
Hicks, S. C., Townes, F. W., Teng, M. & Irizarry, R. A. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 19, 562–578 (2018).
Koh, P. W. et al. An atlas of transcriptional, chromatin accessibility, and surface marker changes in human mesoderm development. Sci. Data 3, 160109 (2016).
Tian, L. et al. scRNA-seq mixology: towards better benchmarking of single cell RNA-seq protocols and analysis methods. Preprint at bioRxiv https://doi.org/10.1101/433102 (2018).
Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).
Ding, J., Shah, S. & Condon, A. densityCut: an efficient and versatile topological approach for automatic clustering of biological data. Bioinformatics 32, 2567–2576 (2016).
Langfelder, P., Zhang, B. & Horvath, S. Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics 24, 719–720 (2008).
Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
Zhang, Z. et al. SCINA: A semi-supervised subtyping algorithm of single cells and bulk samples. Genes 10, 531 (2019).
MacParland, S. A. et al. Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations. Nat. Commun. 9, 4383 (2018).
McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at https://arxiv.org/pdf/1802.03426.pdf (2018).
Zhang, A. W. et al. Interfaces of malignant and immunologic clonal dynamics in ovarian cancer. Cell 173, 1755–1769.e22 (2018).
Kristiansen, G. et al. CD24 is expressed in ovarian cancer and is a new independent prognostic marker of patient survival. Am. J. Pathol. 161, 1215–1221 (2002).
Hylander, B. et al. Expression of Wilms tumor gene (WT1) in epithelial ovarian cancer. Gynecol. Oncol. 101, 12–17 (2006).
Andor, N. et al. Single-cell RNA-Seq of lymphoma cancers reveals malignant B-cell types and coexpression of T-cell immune checkpoints. Blood 133, 1119–1129 (2019).
Jefferis, R. & Lefranc, M.-P. Human immunoglobulin allotypes: possible implications for immunogenicity. MAbs 1, 332–338 (2009).
Ding, J., Condon, A. & Shah, S. P. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat. Commun. 9, 2002 (2018).
Hermine, O. et al. Prognostic significance of bcl-2 protein expression in aggressive non-Hodgkin’s lymphoma. Groupe d’Etude des Lymphomes de l’Adulte (GELA). Blood 87, 265–272 (1996).
Gu, K. et al. t(14;18)-negative follicular lymphomas are associated with a high frequency of BCL6 rearrangement at the alternative breakpoint region. Mod. Pathol. 22, 1251–1257 (2009).
Hatzi, K. & Melnick, A. Breaking bad in the germinal center: how deregulation of BCL6 contributes to lymphomagenesis. Trends Mol. Med. 20, 343–352 (2014).
Fabregat, A. et al. The Reactome Pathway Knowledgebase. Nucleic Acids Res. 46, D649–D655 (2018).
Freeman, B. E., Hammarlund, E., Raué, H. P. & Slifka, M. K. Regulation of innate CD8⁺ T-cell activation mediated by cytokines. Proc. Natl Acad. Sci. USA 109, 9971–9976 (2012).
Hwang, B., Lee, J. H. & Bang, D. Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp. Mol. Med. 50, 96 (2018).
Eling, N., Richard, A. C., Richardson, S., Marioni, J. C. & Vallejos, C. A. Correcting the mean-variance dependency for differential variability testing using single-cell RNA sequencing data. Cell Syst. 7, 284–294.e12 (2018).
Kingma, D. P. & Ba, J. L. Adam: a method for stochastic optimization. Preprint at https://arxiv.org/pdf/1412.6980.pdf (2014).
Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. Preprint at https://arxiv.org/pdf/1603.04467.pdf (2015).
Sinha, D., Kumar, A., Kumar, H., Bandyopadhyay, S. & Sengupta, D. dropClust: efficient clustering of ultra-large scRNA-seq data. Nucleic Acids Res. 46, e36 (2018).
Lun, A. T., McCarthy, D. J. & Marioni, J. C. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res. 5, 2122 (2016).
Adam, M., Potter, A. S. & Potter, S. S. Psychrophilic proteases dramatically reduce single-cell RNA-seq artifacts: a molecular atlas of kidney development. Development 144, 3625–3632 (2017).
O’Flanagan, C. H. et al. Dissociation of solid tumour tissues with cold active protease for single-cell RNA-seq minimizes conserved collagenase associated stress responses. Preprint at bioRxiv https://doi.org/10.1101/683227 (2019).
Schelker, M. et al. Estimation of immune cell content in tumour tissue using single-cell RNA-seq data. Nat. Commun. 8, 2032 (2017).
Scialdone, A. et al. Computational assignment of cell-cycle stage from single-cell transcriptome data. Methods 85, 54–61 (2015).
Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods 12, 453–457 (2015).
Shih, A. J. et al. Identification of grade and origin specific cell populations in serous epithelial ovarian cancer by single cell RNA-seq. PLoS ONE 13, e0206785 (2018).
Liberzon, A. et al. The Molecular Signatures Database hallmark gene set collection. Cell Syst. 1, 417–425 (2015).
Uhlen, M. et al. A pathology atlas of the human cancer transcriptome. Science 357, eaan2507 (2017).
Perisic Matic, L. et al. Phenotypic modulation of smooth muscle cells in atherosclerosis is associated with downregulation of LMOD1, SYNPO2, PDLIM7, PLN, and SYNM. Arterioscler. Thromb. Vasc. Biol. 36, 1947–1961 (2016).
Espagnolle, N. et al. CD146 expression on mesenchymal stem cells is associated with their vascular smooth muscle commitment. J. Cell. Mol. Med. 18, 104–114 (2014).
Rocnik, E., Saward, L. & Pickering, J. G. HSP47 expression by smooth muscle cells is increased during arterial development and lesion formation and is inhibited by fibrillar collagen. Arterioscler. Thromb. Vasc. Biol. 21, 40–46 (2001).
Mura, M. et al. Identification and angiogenic role of the novel tumor endothelial marker CLEC14A. Oncogene 31, 293–305 (2012).
Deenick, E. K. & Ma, C. S. The regulation and role of T follicular helper cells in immunity. Immunology 134, 361–367 (2011).
Payne, D., Drinkwater, S., Baretto, R., Duddridge, M. & Browning, M. J. Expression of chemokine receptors CXCR4, CXCR5 and CCR7 on B and T lymphocytes from patients with primary antibody deficiency. Clin. Exp. Immunol. 156, 254–262 (2009).

Acknowledgements

We thank V. Svensson for his feedback on this manuscript. We also thank W. W. Wasserman, B. H. Nelson, P. T. Hamilton and A. Miranda for helpful discussions. A.W.Z. is funded by scholarships from the Canadian Institutes of Health Research (CIHR) (Vanier Canada Graduate Scholarship, Michael Smith Foreign Study Supplement) and a BC Children’s Hospital (UBC) MD/PhD studentship. K.R.C. is funded by postdoctoral fellowships from the CIHR (Banting) no. 01353-000, the Canadian Statistical Sciences Institute and the UBC Data Science Institute no. 201803. S.P.S. is a Susan G. Komen scholar. We acknowledge the generous funding support provided by the BC Cancer Foundation. In addition, S.P.S. receives operating funds from the CIHR (grant no. FDN-143246), Terry Fox Research Institute (grant nos. 1021 and 1061) and the Canadian Cancer Society (grant no. 705636). This work was supported by Cancer Research UK (grant no. C31893/A25050 to S.A. and S.P.S.). S.P.S. is supported by the Nicholls-Biondi endowed chair and the Cycle for Survival benefitting Memorial Sloan Kettering Cancer Center. C.S. is an Allen Distinguished Investigator supported by the Allen Frontiers Group no. 12829.

Author information

Authors and Affiliations

Department of Molecular Oncology, British Columbia Cancer Research Centre, Vancouver, British Columbia, Canada
Allen W. Zhang, Ciara O’Flanagan, Jamie L. P. Lim, Andrew McPherson, Matt Wiens, Pascale Walters, Tim Chan, Brittany Hewitson, Daniel Lai, Samuel Aparicio, Kieran R. Campbell & Sohrab P. Shah
Computational Oncology, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
Allen W. Zhang, Jamie L. P. Lim, Nicholas Ceglia & Sohrab P. Shah
BC Children’s Hospital Research, Vancouver, British Columbia, Canada
Allen W. Zhang
Centre for Lymphoid Cancer, British Columbia Cancer Research Centre, Vancouver, British Columbia, Canada
Elizabeth A. Chavez, Anja Mottok, Clementine Sarkozy, Lauren Chong, Tomohiro Aoki & Christian Steidl
Institute of Human Genetics, Ulm University and Ulm University Medical Center, Ulm, Germany
Anja Mottok
Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, British Columbia, Canada
Tomohiro Aoki, Samuel Aparicio & Sohrab P. Shah
Terry Fox Laboratory, British Columbia Cancer Research Centre, Vancouver, British Columbia, Canada
Xuehai Wang & Andrew P Weng
Department of Obstetrics & Gynaecology, University of British Columbia, Vancouver, British Columbia, Canada
Jessica N. McAlpine
Department of Statistics, University of British Columbia, Vancouver, British Columbia, Canada
Kieran R. Campbell
UBC Data Science Institute, University of British Columbia, Vancouver, British Columbia, Canada
Kieran R. Campbell


Contributions

A.W.Z., K.R.C. and S.P.S. designed the study. A.W.Z., K.R.C. and S.P.S. wrote the manuscript. A.W.Z., C.O.F., E.A.C., J.L.P.L., A. McPherson, A. Mottok, N.C., L.C., M.W., T.A., A.P.W., J.N.M., S.A., C. Steidl, K.R.C. and S.P.S. reviewed the manuscript. A.W.Z., S.A., C. Steidl, K.R.C. and S.P.S. interpreted the data. B.H., D.L., L.C. and C. Sarkozy curated the data. A.W.Z., K.R.C., N.C., M.W., P.W., T.C. and X.W. analyzed the data. A.W.Z., K.R.C. and S.P.S. developed the model. C.O.F., E.A.C. and J.L.P.L. performed the single-cell processing. A. Mottok, J.N.M., C. Steidl and C. Sarkozy performed the case identification. K.R.C., S.P.S., C. Steidl and S.A. supervised the study.

Corresponding authors

Correspondence to Kieran R. Campbell or Sohrab P. Shah.

Ethics declarations

Competing interests

S.P.S. and S.A. are founders, shareholders and consultants of Contextual Genomics.

Additional information

Peer review information: Lei Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Simulation performance across a range of proportions of differentially expressed genes, using differential expression parameters derived from comparing naïve CD8+ and naïve CD4+ T cells.

(a) Accuracy and cell-level F1 score (Methods) for varying proportions of differentially expressed genes per cell type. All methods were provided with expression data for the same set of marker genes. *, **, *** denote FDR-adjusted p-values (Wilcoxon signed-rank test) for pairwise comparisons between CellAssign and other methods < 0.05, 0.01 and 0.001, respectively. Nine simulated datasets were generated for each parameter setting. Dotted lines separate marker-based, unsupervised, and supervised methods. (b) Correspondence between true simulated log fold change values and log fold change (δ) values inferred by CellAssign. R refers to the Pearson correlation between true and inferred logFC values for CellAssign. (c and d) Performance of CellAssign where a certain proportion of entries in the marker gene matrix are flipped at random, using (c) 5 and (d) 20 marker genes per cell type. Nine simulated datasets were generated for each parameter setting. In all boxplots, lower and upper hinges denote the 1st and 3rd quartiles, with whiskers extending to the largest value less than 1.5 × the inter-quartile range.
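The perturbation in panels c and d can be illustrated with a minimal Python sketch. It assumes the marker gene matrix is binary (1 = the gene is a marker for that cell type) and that "flipped" means swapping 0 and 1; the function name `flip_entries`, the seed, and the genes-by-cell-types layout are illustrative choices, not taken from the CellAssign code.

```python
import random


def flip_entries(matrix, proportion, seed=0):
    """Return a copy of a binary matrix with a given proportion of entries flipped.

    Assumption: entries are 0/1 and a flip swaps 0 <-> 1.
    """
    rng = random.Random(seed)
    positions = [(r, c) for r in range(len(matrix)) for c in range(len(matrix[0]))]
    n_flip = round(proportion * len(positions))
    flipped = [row[:] for row in matrix]  # copy so the input is left untouched
    for r, c in rng.sample(positions, n_flip):
        flipped[r][c] = 1 - flipped[r][c]
    return flipped


# Hypothetical layout: rows are genes, columns are cell types.
marker_matrix = [
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
]
perturbed = flip_entries(marker_matrix, proportion=0.25)

# Count how many entries differ from the original matrix.
n_changed = sum(
    a != b
    for row_a, row_b in zip(marker_matrix, perturbed)
    for a, b in zip(row_a, row_b)
)
```

Sampling positions without replacement guarantees that exactly the requested number of entries change, which keeps the corruption level comparable across simulated datasets.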

Supplementary Figure 2 Simulation performance across a range of proportions of differentially expressed genes, using differential expression parameters derived from comparing B and CD8+ T cells.

(a) Accuracy and cell-level F1 score (Methods) for varying proportions of differentially expressed genes per cell type. CellAssign was provided with a set of marker genes (Methods); all other methods were provided with all genes. Asterisks indicate FDR-adjusted statistical significance (Wilcoxon signed-rank test) for pairwise comparisons between CellAssign and other methods. Nine simulated datasets were generated for each parameter setting. (b) Accuracy and cell-level F1 score for varying proportions of differentially expressed genes per cell type. All methods were provided with the same set of marker genes. Nine simulated datasets were generated for each parameter setting. *, **, *** denote FDR-adjusted p-values (Wilcoxon signed-rank test) for pairwise comparisons between CellAssign and other methods < 0.05, 0.01 and 0.001, respectively. (c) Correspondence between true simulated log fold change values and log fold change (δ) values inferred by CellAssign. R refers to the Pearson correlation between true and inferred logFC values for CellAssign. n = 1000 single-cells were simulated. (d) Performance of CellAssign where a certain proportion of entries in the marker gene matrix are flipped at random. Nine simulated datasets were generated for each parameter setting. Dotted lines separate marker-based, unsupervised, and supervised methods. In all boxplots, lower and upper hinges denote the 1st and 3rd quartiles, with whiskers extending to the largest value less than 1.5 × the inter-quartile range.
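The boxplot convention stated in these legends can be made concrete with a short Python sketch. It assumes the common reading of that convention, in which each whisker reaches the most extreme observation lying within 1.5 × the inter-quartile range of its hinge; the quartile method (`inclusive`) and the nine example values are illustrative assumptions, not values from the paper.

```python
import statistics


def boxplot_stats(values):
    """Compute hinges and whisker ends for one box.

    Assumption: whiskers stop at the most extreme data point within
    1.5 * IQR of the corresponding hinge; points beyond are outliers.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    upper_whisker = max(v for v in values if v <= q3 + 1.5 * iqr)
    lower_whisker = min(v for v in values if v >= q1 - 1.5 * iqr)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_whisker": lower_whisker,
        "upper_whisker": upper_whisker,
    }


# Nine made-up values, mirroring the nine simulated datasets per setting.
values = [1, 2, 3, 4, 5, 6, 7, 8, 30]
stats = boxplot_stats(values)
```

Here the value 30 lies beyond the upper limit (7 + 1.5 × 4 = 13), so the upper whisker stops at 8 and 30 would be drawn as an outlier.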

Supplementary Figure 3 Simulation performance (accuracy and cell-level F1 score) across a range of proportions of differentially expressed genes, where clusters from “unsupervised” algorithms are assigned to respective cell types through correlation to the transcriptomes of purified cell types.

Dotted lines separate marker-based, unsupervised, and supervised methods. (a) & (b) Simulating data based on parameters for B and CD8+ T cells, and running algorithms for whole-transcriptome and marker gene data only, respectively. (c) & (d) Simulating data based on parameters for naïve CD4+ and CD8+ T cells, and running algorithms for whole-transcriptome and marker gene data only, respectively. In all boxplots, lower and upper hinges denote the 1st and 3rd quartiles, with whiskers extending to the largest value less than 1.5 × the inter-quartile range. *, **, *** denote FDR-adjusted p-values (two-sided Wilcoxon signed-rank test) for pairwise comparisons between CellAssign and other methods < 0.05, 0.01, 0.001 respectively. In all cases, n = 1000 single-cells were simulated.

Supplementary Figure 4 Performance (accuracy and cell type-level F1 score, Methods) of CellAssign and the best-performing clustering methods evaluated in ref. 6 on FACS-purified H7 human embryonic stem cells in various stages of differentiation.

t-SNE plots of (a) ground-truth FACS annotations; (b) CellAssign-derived annotations; (c) SCINA-derived annotations; (d) SC3 clusters (using all genes); (e) Seurat clusters (resolution = 0.8, using all genes); (f) Seurat clusters (resolution = 0.8, using the same marker gene set used by CellAssign); (g) Seurat clusters (resolution = 1.2, using all genes); (h) Seurat clusters (resolution = 1.2, using the same marker gene set used by CellAssign).

Supplementary Figure 5 Expression of select marker genes in HGSC single cell RNA-seq data.

(a) Expression (log normalized counts) of PECAM1 (for endothelial cells), CD3D (for T cells), CD79A (for B cells), KLHDC8A (for ovary-derived cells), ACTA2 (for myofibroblasts and smooth muscle), MYH11 (for smooth muscle), and MCAM (for vascular cell types including endothelial cells, vascular smooth muscle, and pericytes). (b) Expression (log normalized counts) of marker genes expressed in epithelial ovarian cancers but not in normal ovarian tissue. Expression values were winsorized between 0 and 4.
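Winsorizing between two fixed bounds, as described for these expression plots, amounts to clamping each value to the interval. A minimal Python sketch follows; the function name `winsorize` and the example values are illustrative, and the default bounds simply mirror the 0 to 4 range quoted in the legend.

```python
def winsorize(values, lower=0.0, upper=4.0):
    """Clamp each value to [lower, upper]; values inside the range are unchanged."""
    return [min(max(v, lower), upper) for v in values]


# Made-up log-normalized counts for one gene across four cells.
expression = [-0.5, 1.2, 4.0, 6.3]
clipped = winsorize(expression)
```

Clamping keeps a few very highly expressing cells from dominating the colour scale of an embedding plot.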

Supplementary Figure 6 Comparison of clusters from CellAssign and state-of-the-art unsupervised clustering approaches⁶ on HGSC single cell RNA-seq data.

(a) Expression (log normalized counts) of key marker genes of hematopoietic subpopulations CD3D (for T cells), CD79A (for B cells), and CD14 (for monocytes/macrophages). Expression values were winsorized between 0 and 4. UMAP plots of (b) CellAssign-derived annotations; (c) SC3 clusters (using all genes); (d) Seurat clusters (resolution = 0.8, using all genes); (e) Seurat clusters (resolution = 0.8, using the same marker gene set used by CellAssign); (f) Seurat clusters (resolution = 1.2, using all genes); (g) Seurat clusters (resolution = 1.2, using the same marker gene set used by CellAssign).

Supplementary Figure 7 Cluster-specific HLA expression in HGSC epithelial cells.

(a) Expression (log normalized counts) of HLA class I genes in all HGSC cells. Expression values were clipped from 0 to 8. (b) Expression of HLA class I genes across cell types in all HGSC cells. Epithelial (1): epithelial cells from cluster 1. Epithelial (other): epithelial cells from all other clusters. Lower and upper hinges denote the 1st and 3rd quartiles, with whiskers extending to the largest value less than 1.5 × the inter-quartile range. n = 4847 single-cells total (15 B cells, 24 T cells, 99 Monocyte/Macrophage, 361 endothelial, 300 vascular smooth muscle cells, 438 ovarian myofibroblast, 809 ovarian stromal, 750 epithelial (1), 2051 epithelial (other)).

Supplementary Figure 8 Cluster-specific phenotypes in HGSC epithelial cells.

(a) Hallmark pathway enrichment results for epithelial clusters 3 (n = 161 single cells) vs. 1 (n = 750 single cells) from the left ovary sample. (b) Gene-level differential expression for epithelial clusters 3 vs. 1, with statistical testing performed using the findMarkers function in the scran R package. (c and d) Hallmark pathway enrichment results for epithelial clusters (c) 2 vs. 0; and (d) 2 vs. 4. (e) Expression (log normalized counts) of select hypoxia-associated markers in HGSC epithelial cells. Expression values were winsorized between 0 and 4.

Supplementary Figure 9 Cell-type specific expression in follicular lymphoma.

(a) Expression (log normalized counts) of select marker genes CD2 (for T cells), MS4A1 (for B cells), CD8A and GZMA (for CD8+ T cells), CD4 (for T follicular helper cells and other CD4+ T cells) and CXCR5 and ICA1 (for T follicular helper cells). Expression values were winsorized between 0 and 3. (b) Heatmap of marker gene expression, labeled by maximum probability CellAssign-inferred cell types.

Supplementary Figure 10 Comparison of clusters from CellAssign and state-of-the-art unsupervised clustering approaches⁶ on follicular lymphoma single cell RNA-seq data (showing only T cell subtypes).

Supplementary Figure 11 Expression (log normalized counts) of κ and λ light chain constant region genes in nonmalignant B cells.

Class assignments were determined by CellAssign (Methods).

Supplementary Figure 12 Expression (log normalized counts) of selected marker genes (CD2, CD3D, and CD3E for T cells; CD79A, MS4A1, and CD19 for B cells) in scvis embedding of reactive lymph node data.

Expression values were winsorized between 0 and 3.

Supplementary Figure 13 Differential gene regulation for FL1018 and FL2001.

Differential expression results using scran’s findMarkers for malignant vs. nonmalignant B cells in (a) FL1018 and (b) FL2001. Comparisons were performed accounting for timepoint and potential interactions between malignant status and timepoint using a multivariate linear model described in Methods. Genes upregulated among malignant cells have logFC values > 0. P-values were adjusted with the Benjamini-Hochberg method. Significantly enriched Reactome pathways (BH-adjusted P-value ≤ 0.05) among the top 50 most highly upregulated genes (ranked by log fold change) in (c) FL1018 and (d) FL2001. Up to 30 pathways are shown in either plot (Methods). Differentially expressed genes (found using scran’s findMarkers) for (e) T follicular helper and (f) other CD4 T cells between T2 vs. T1. Genes upregulated in T2 have log fold change values > 0. The activation marker CD69 is highlighted. P-values were adjusted with the Benjamini-Hochberg method.
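For readers unfamiliar with the Benjamini-Hochberg adjustment mentioned in this legend, the following Python sketch implements the standard step-up procedure. It is a generic illustration rather than the scran implementation used in the paper, and the function name and example p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)

# Genes passing the BH-adjusted threshold of 0.05 quoted in the legend.
significant = [i for i, q in enumerate(adjusted) if q <= 0.05]
```

Each adjusted value is the smallest false discovery rate at which that gene would still be called significant, which is why the legend's pathway threshold is applied to adjusted rather than raw P-values.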

Supplementary Figure 14 Fitting single cell RNA-seq simulation models to the Zheng PBMC 68k dataset, using cell type annotations provided in ref. 51 (n = 66205 single cells), and to FACS-purified data from Koh et al. (n = 369 single cells).

(a) Log fold change values computed from differential expression analysis between naïve CD8+ and naïve CD4+ T cells. (b) ‘Null’ log fold change values computed by randomly splitting naïve CD8+ T cells into equally sized halves 10 times. (c) Quantile-quantile (QQ) plot comparing observed log fold change values between naïve CD8+ and naïve CD4+ T cells and posterior predictive samples from the splatter model (Methods). (d) Quantile-quantile (QQ) plot comparing observed log fold change values between naïve CD8+ and naïve CD4+ T cells and posterior predictive samples from the modified model (Methods). (e) Log fold change values computed from differential expression analysis between human embryonic stem cells (hESCs) and day 3 somite cells (ESMT). (f) ‘Null’ log fold change values computed by randomly splitting anterior primitive streak cells into equally sized halves 10 times. (g) Quantile-quantile (QQ) plot comparing observed log fold change values between hESC and ESMT cells and posterior predictive samples from the splatter model (Methods). (h) Quantile-quantile (QQ) plot comparing observed log fold change values between hESC and ESMT cells and posterior predictive samples from the modified model (Methods).
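The 'null' log fold changes in panels b and f come from repeatedly splitting one homogeneous population into two random halves. A minimal Python sketch of that idea for a single gene is given below. It assumes a simple mean-based log2 fold change with a pseudocount of 1; the paper derives its values from a differential expression analysis, so the function name, the pseudocount, and the log base are all illustrative assumptions.

```python
import math
import random


def null_log_fold_changes(expression, n_splits=10, pseudocount=1.0, seed=0):
    """Log2 fold changes between random equal halves of one cell population.

    Assumption: fold change is the ratio of mean expression (plus a
    pseudocount) between halves; with an odd cell count one cell is dropped.
    """
    rng = random.Random(seed)
    half = len(expression) // 2
    results = []
    for _ in range(n_splits):
        shuffled = expression[:]
        rng.shuffle(shuffled)
        first, second = shuffled[:half], shuffled[half:2 * half]
        mean_first = sum(first) / half
        mean_second = sum(second) / half
        results.append(
            math.log2(mean_first + pseudocount) - math.log2(mean_second + pseudocount)
        )
    return results


# Made-up counts for one gene in 20 cells of the same type.
counts = [0, 1, 0, 2, 3, 1, 0, 0, 4, 1, 2, 0, 1, 1, 0, 3, 2, 0, 1, 1]
null_values = null_log_fold_changes(counts)
```

Because both halves are drawn from the same population, any spread in these values reflects sampling noise alone, which gives a baseline against which real between-type fold changes can be judged.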

Supplementary Figure 15 Benchmarking results for CellAssign across a range of simulated data set sizes (number of cells), number of cell types being inferred, and number of marker genes per cell type.

(a) Runtime (to convergence, defined as a relative change in log-likelihood < 10⁻³ between successive iterations) as a function of data set size and the number of marker genes used per cell type, on simulated data (Methods). Two cell types were used. (b) Runtime (to convergence, defined as a relative change in log-likelihood < 10⁻³ between successive iterations) as a function of the number of cell types and the number of marker genes used per cell type, on simulated data. One thousand cells were used. n = 5 simulated datasets were generated for each parameter setting. In all boxplots, lower and upper hinges denote the 1st and 3rd quartiles, with whiskers extending to the largest value less than 1.5 × the inter-quartile range.
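The convergence rule quoted in this legend can be sketched in a few lines of Python. The sketch assumes the relative change is the absolute difference between successive log-likelihoods divided by the absolute value of the earlier one; the exact normalization used by CellAssign is not stated here, and the function names and the example trace are hypothetical.

```python
def has_converged(previous_ll, current_ll, tolerance=1e-3):
    """True when the relative change in log-likelihood falls below tolerance.

    Assumption: change is normalized by the magnitude of the previous value.
    """
    return abs(current_ll - previous_ll) / abs(previous_ll) < tolerance


def iterations_to_convergence(log_likelihoods, tolerance=1e-3):
    """Index of the first iteration meeting the criterion, or None if never met."""
    for i in range(1, len(log_likelihoods)):
        if has_converged(log_likelihoods[i - 1], log_likelihoods[i], tolerance):
            return i
    return None


# Made-up log-likelihood trace from an iterative fit.
trace = [-1000.0, -900.0, -890.0, -889.5]
stop_at = iterations_to_convergence(trace)
```

In this trace the relative changes are 0.1, about 0.011 and about 0.00056, so the criterion is first met at the third update.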

Supplementary information

Supplementary Information

Supplementary Figs. 1–15 and Supplementary Note.

Reporting Summary

Supplementary Table 1

Performance measures on simulated data

Supplementary Table 2

Marker gene matrices used in analysis

Supplementary Table 3

Pathway enrichment results for follicular lymphoma and HGSC data

About this article

Cite this article

Zhang, A.W., O’Flanagan, C., Chavez, E.A. et al. Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling. Nat Methods 16, 1007–1015 (2019). https://doi.org/10.1038/s41592-019-0529-1

Received: 15 January 2019
Accepted: 16 July 2019
Published: 09 September 2019
Issue Date: October 2019
DOI: https://doi.org/10.1038/s41592-019-0529-1
