We present the Single-Cell Clustering Assessment Framework (SCCAF), a method for the automated identification of putative cell types from single-cell RNA sequencing (scRNA-seq) data. By iteratively applying a machine learning approach to a given set of cells, we simultaneously identify distinct cell groups and a weighted list of feature genes for each group. The differentially expressed feature genes discriminate the given cell group from other cells. Each such group of cells corresponds to a putative cell type or state, characterized by the feature genes as markers. Benchmarking using expert-annotated scRNA-seq datasets shows that our method automatically identifies the ‘ground truth’ cell assignments with high accuracy.
The datasets together with their accession codes are as follows: pancreas (ref. 12), accession no. GSE84133; cortex (ref. 30), accession no. GSE60361; retinal bipolar neurons (ref. 11), accession no. GSE81904; pancreatic islets (ref. 13), accession no. E-MTAB-5061; visual cortex (ref. 31), accession no. GSE102827; hematopoiesis (ref. 35), accession no. GSE89754; hematopoiesis (ref. 36), accession no. GSE92575; cortex (ref. 51), accession no. GSE71585; cortex (ref. 33), accession no. GSE115746; liver (ref. 32), accession no. GSE124395; liver (ref. 50), accession no. GSE115469. Source data for Figs. 1–5 are included with this paper.
An open source implementation of SCCAF is available on GitHub (https://github.com/SCCAF/sccaf) and Zenodo (https://doi.org/10.5281/zenodo.3695975) under the MIT license. The release includes tutorials and example vignettes for reproducing the analyses presented in this article, as well as all preprocessed datasets considered in this study. The software version used to generate the results presented in this article is also available as Supplementary Software. SCCAF is also available from the Python Package Index (https://pypi.org/project/SCCAF/) and is implemented as a Galaxy tool in the Human Cell Atlas (https://humancellatlas.usegalaxy.eu/). The SCCAF Galaxy modules can be installed with a few clicks on any Galaxy instance through the main Galaxy Tool Shed at https://toolshed.g2.bx.psu.edu/view/ebi-gxa/suite_sccaf/.
Hooke, R. Micrographia: or Some Physiological Descriptions of Minute Bodies Made by Magnifying Glasses. With Observations and Inquiries Thereupon (J. Martyn and J. Allestry, 1665).
Arendt, D. et al. The origin and evolution of cell types. Nat. Rev. Genet. 17, 744–757 (2016).
Nagasawa, T. Microenvironmental niches in the bone marrow required for B-cell development. Nat. Rev. Immunol. 6, 107–116 (2006).
Trapnell, C. Defining cell types and states with single-cell genomics. Genome Res. 25, 1491–1498 (2015).
Rozenblatt-Rosen, O., Stubbington, M. J. T., Regev, A. & Teichmann, S. A. The Human Cell Atlas: from vision to reality. Nature 550, 451–453 (2017).
Regev, A. et al. The Human Cell Atlas. eLife 6, e27041 (2017).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 5233 (2019).
Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech: Theory Exp. 2008, P10008 (2008).
Shekhar, K. et al. Comprehensive classification of retinal bipolar neurons by single-cell transcriptomics. Cell 166, 1308–1323.e30 (2016).
Baron, M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 3, 346–360.e4 (2016).
Segerstolpe, Å. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 24, 593–607 (2016).
Han, X. et al. Mapping the Mouse Cell Atlas by Microwell-Seq. Cell 173, 1307 (2018).
Kiselev, V. Y. et al. SC3: consensus clustering of single-cell RNA-seq data. Nat. Methods 14, 483–486 (2017).
Zhang, J. M., Fan, J., Fan, H. C., Rosenfeld, D. & Tse, D. N. An interpretable framework for clustering single-cell RNA-Seq datasets. BMC Bioinformatics 19, 93 (2018).
de Kanter, J. K., Lijnzaad, P., Candelli, T., Margaritis, T. & Holstege, F. C. P. CHETAH: a selective, hierarchical cell type identification method for single-cell RNA sequencing. Nucleic Acids Res. 47, e95 (2019).
Xie, P. et al. SuperCT: a supervised-learning framework for enhanced characterization of single-cell transcriptomic profiles. Nucleic Acids Res. 47, e48 (2019).
Pliner, H. A., Shendure, J. & Trapnell, C. Supervised classification enables rapid annotation of cell atlases. Nat. Methods 16, 983–986 (2019).
Zhang, A. W. et al. Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling. Nat. Methods 16, 1007–1015 (2019).
Aran, D. et al. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat. Immunol. 20, 163–172 (2019).
Tan, Y. & Cahan, P. SingleCellNet: a computational tool to classify single cell RNA-seq data across platforms and across species. Cell Syst. 9, 207–213.e2 (2019).
Wagner, F. & Yanai, I. Moana: a robust and scalable cell type classification framework for single-cell RNA-Seq data. Preprint at bioRxiv https://doi.org/10.1101/456129 (2018).
Ma, F. & Pellegrini, M. ACTINN: automated identification of cell types in single cell RNA sequencing. Bioinformatics 36, 533–538 (2020).
Lin, Y. et al. scClassify: hierarchical classification of cells. Preprint at bioRxiv https://doi.org/10.1101/776948 (2019).
Ntranos, V., Yi, L., Melsted, P. & Pachter, L. A discriminative learning approach to differential expression analysis for single-cell RNA-seq. Nat. Methods 16, 163–166 (2019).
Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).
Dimitriadis, G., Neto, J. P. & Kampff, A. R. t-SNE visualization of large-scale neural recordings. Neural Comput. 30, 1750–1774 (2018).
Hubert, L. & Arabie, P. Comparing partitions. J. Classif. 2, 193–218 (1985).
Zeisel, A. et al. Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142 (2015).
Hrvatin, S. et al. Single-cell analysis of experience-dependent transcriptomic states in the mouse visual cortex. Nat. Neurosci. 21, 120–129 (2018).
Aizarani, N. et al. A human liver cell atlas reveals heterogeneity and epithelial progenitors. Nature 572, 199–204 (2019).
Tasic, B. et al. Shared and distinct transcriptomic cell types across neocortical areas. Nature 563, 72–78 (2018).
Tracy, C. A. & Widom, H. Level-spacing distributions and the Airy kernel. Comm. Math. Phys. 159, 151–174 (1994).
Tusi, B. K. et al. Population snapshots predict early haematopoietic and erythroid hierarchies. Nature 555, 54–60 (2018).
Giladi, A. et al. Single-cell characterization of haematopoietic progenitors and their trajectories in homeostasis and perturbed haematopoiesis. Nat. Cell Biol. 20, 836–846 (2018).
McInnes, L., Healy, J., Saul, N. & Großberger, L. UMAP: Uniform Manifold Approximation and Projection. J. Open Source Softw. 3, 861 (2018).
Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2019).
Konstantinides, N. et al. Phenotypic convergence: distinct transcription factors regulate common terminal features. Cell 174, 622–635.e13 (2018).
Gerber, T. et al. Single-cell analysis uncovers convergence of cell identities during axolotl limb regeneration. Science 362, eaaq0681 (2018).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Stehman, S. V. Selecting and interpreting measures of thematic classification accuracy. Remote Sens. Environ. 62, 77–89 (1997).
Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).
Hill, C. Learning Scientific Programming with Python 333–401 (Cambridge Univ. Press, 2016).
Svensson, V., Teichmann, S. A. & Stegle, O. SpatialDE: identification of spatially variable genes. Nat. Methods 15, 343–346 (2018).
Leek, J. T. svaseq: removing batch effects and other unwanted noise from sequencing data. Nucleic Acids Res. 42, e161 (2014).
Azizi, E., Prabhakaran, S., Carr, A. & Pe’er, D. Bayesian inference for single-cell clustering and imputing. Genom. Comput. Biol. 3, e46 (2017).
Athar, A. et al. ArrayExpress update—from bulk to single-cell expression data. Nucleic Acids Res. 47, D711–D715 (2019).
Allen Brain Atlas Data Portal. Cell types: overview of the data (Allen Institute, 2015); http://celltypes.brain-map.org
MacParland, S. A. et al. Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations. Nat. Commun. 9, 4383 (2018).
Tasic, B. et al. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat. Neurosci. 19, 335–346 (2016).
We thank all members of the Teichmann and Brazma labs for helpful discussions. We thank S. Aldridge for proofreading the text. Z.M. is supported by the Single Cell Gene Expression Atlas grant from the Wellcome Trust (no. 108437/Z/15/Z).
In the last three years S.A.T. has consulted for Biogen, Genentech and Roche, and is a member of the Scientific Advisory Board of Foresite Labs and of the Functional Genomics & AI Scientific Advisory Board of GlaxoSmithKline.
Peer review information: Nicole Rusk and Lin Tang were the primary editors on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended Data Fig. 1 Self-projection accuracy comparison between the ground truth annotation and the clustering with under-clustering or over-clustering.
This test measures the self-projection accuracy under three conditions: (1) the “ground truth” clustering as annotated by human experts (marked as ‘correct-clustering’); (2) over-clustering; and (3) under-clustering. The violin plots in the left column show the distributions of self-projection accuracy (cross-validation in red, test set in green) for these three conditions in all the datasets, obtained by repeating the random sampling 100 times. These plots demonstrate that the “ground truth” clustering corresponds to the highest self-projection accuracy in almost all cases. The results on the datasets Hrvatin (48,266 cells), Tasic2018 (21,874 cells), Shekhar (26,830 cells), Segerstolpe (2,108 cells), Zeisel (3,005 cells), Baron (Mouse, 1,886 cells) and Baron (Human, 8,199 cells) indicate that self-projection can serve as a clustering consistency test for identifying the best clustering. As for any classifier, it is always easier to perform well on fewer clusters. Thus, if two clusterings show a similar level of self-projection accuracy (for example, Baron (Mouse)), the clustering with more clusters should be preferred. Source data
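The selection rule described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pick_clustering`, the tolerance `tol` that decides when two accuracies count as "similar", and the numbers in the usage example are all hypothetical and do not come from the paper.

```python
from typing import Dict, Tuple


def pick_clustering(results: Dict[str, Tuple[int, float]], tol: float = 0.01) -> str:
    """Choose a clustering from {name: (number_of_clusters, accuracy)}.

    The clustering with the highest self-projection accuracy wins; candidates
    whose accuracy lies within `tol` of the best are treated as tied, and the
    tie is broken in favour of the clustering with more clusters.
    `tol` is an assumed threshold, not a value taken from the paper.
    """
    best_accuracy = max(accuracy for _, accuracy in results.values())
    # Keep every candidate that is "similar" to the best one.
    tied = {
        name: n_clusters
        for name, (n_clusters, accuracy) in results.items()
        if best_accuracy - accuracy <= tol
    }
    # Among similar candidates, prefer the finer clustering.
    return max(tied, key=tied.get)


# Hypothetical accuracies, for illustration only.
example = {"under": (5, 0.970), "expert": (13, 0.965), "over": (25, 0.800)}
print(pick_clustering(example))
```

With a tolerance of zero the rule reduces to picking the single most accurate clustering, which would favour coarse partitions; the tolerance is what encodes the preference for more clusters.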
Five machine learning models were tested on the five ground truth datasets (Baron mouse cells (1,886 cells), Baron human cells (8,199 cells), Shekhar (26,830 cells), Segerstolpe (2,108 cells), Zeisel (3,005 cells)). The data were randomly split into a training set and a test set for self-projection, and this process was repeated 100 times. The distributions of the self-projection accuracies and of the mean cross-validation accuracy on the training set are shown as violin plots. Source data
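The repeated split-and-score procedure can be illustrated with a minimal sketch that uses only the Python standard library. Several assumptions apply: a nearest-centroid classifier is swapped in purely to keep the example dependency-free and is not claimed to be one of the five models tested; the 50/50 split, the function names and the two-cluster toy data are likewise hypothetical.

```python
import random


def fit_centroids(X, y):
    """Return {label: mean vector} computed from the training cells."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        total = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            total[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids, X):
    """Assign each cell to the label of its nearest centroid (squared Euclidean)."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(centroids, key=lambda label: dist2(row, centroids[label])) for row in X]


def self_projection(X, y, test_fraction=0.5, repeats=100, seed=0):
    """Repeat a random train/test split and return one test accuracy per repeat."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    n_test = int(len(X) * test_fraction)
    accuracies = []
    for _ in range(repeats):
        rng.shuffle(indices)
        test, train = indices[:n_test], indices[n_test:]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        predicted = predict(model, [X[i] for i in test])
        correct = sum(p == y[i] for p, i in zip(predicted, test))
        accuracies.append(correct / len(test))
    return accuracies


def toy_cells(n_per_cluster=100, seed=1):
    """Two well-separated Gaussian clusters standing in for two cell types."""
    rng = random.Random(seed)
    X, y = [], []
    for label, (cx, cy) in enumerate([(0.0, 0.0), (5.0, 5.0)]):
        for _ in range(n_per_cluster):
            X.append([rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)])
            y.append(label)
    return X, y


X, y = toy_cells()
accuracies = self_projection(X, y, repeats=20)
print(f"mean self-projection accuracy: {sum(accuracies) / len(accuracies):.3f}")
```

The list returned by `self_projection` is the kind of per-repeat accuracy distribution that a violin plot would summarize.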
Performance on 1,000 BC1A cells from the mouse retina dataset (Shekhar et al.). When the cells are randomly assigned to two clusters, logistic regression shows no predictive ability in self-projection. When the 1,000 BC1A cells are split into two clusters based on the first principal component (PC1), logistic regression shows some, but far from ideal, predictive ability in self-projection. Performance on 500 BC2 cells and 500 BC1A cells: self-projection shows high predictive ability. When the 500 BC2 cells are over-clustered into two clusters based on PC1, confusion occurs almost exclusively between the two over-clustered clusters and hardly ever between BC1A cells and BC2 cells. Source data
The mouse retina data of Shekhar et al. include 26,830 cells. a, t-SNE plots of the initial clustering (Round0) and of the clustering during the four rounds (Round1 to Round4) of SCCAF optimization. b, The self-projection results. c, The consistency (self-projection accuracy) between the clustering assignment and the self-projection results. Source data
Extended Data Fig. 5 The self-projection-based clustering optimization achieves clustering identical to human expert annotation.
The eight expert-annotated datasets (a, Baron mouse cells (1,886 cells); b, Baron human cells (8,199 cells); c, Shekhar (26,830 cells); d, Segerstolpe (2,108 cells); e, Zeisel (3,005 cells); f, Hrvatin (48,266 cells); g, Aizarani (10,305 cells); h, Tasic2018 (21,874 cells)) are used to compare the SCCAF clustering result with the human expert annotation. Source data
Extended Data Fig. 6 Adjusted Rand Index evaluation of the SCCAF results compared with Louvain clustering.
The Adjusted Rand Index (ARI) is calculated between the clustering results and the human expert annotation. The blue dots show the ARI of SCCAF, while the orange dots show the ARI of the initial Louvain clustering before SCCAF optimization (the initial clustering). Source data
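For readers unfamiliar with the metric, the Adjusted Rand Index of Hubert and Arabie can be computed from pair counts in the contingency table of two labelings. The sketch below is a plain-Python illustration, not the code used to produce the figure; the function name and the convention of returning 1.0 for degenerate inputs are assumptions made here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same cells")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: treated as perfect agreement (assumed convention)
    # Pairs of cells that share a cluster in both labelings.
    together_in_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of cells that share a cluster within each labeling separately.
    together_in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_in_a * together_in_b / total_pairs
    maximum = (together_in_a + together_in_b) / 2
    if maximum == expected:
        return 1.0  # both labelings are trivial, e.g. a single cluster each
    return (together_in_both - expected) / (maximum - expected)


# Cluster names do not matter, only the grouping of cells.
print(adjusted_rand_index([0, 0, 1, 1], ["b", "b", "a", "a"]))
```

Because the index is corrected for chance, it stays near zero for unrelated labelings and reaches one only when the two partitions group the cells identically.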
a, SCCAF clustering of the mouse hematopoiesis data (4,016 cells) from Tusi et al. b, The potential of each of the 4,016 cells to develop into different cell lineages (Er, erythrocytes; Gr, granulocytes; Ly, lymphocytes; Mk, megakaryocytes; Mo, monocytes; Ba, basophils or mast cells) is colored using the Viridis color map. Source data
Genes upregulated and downregulated during erythrocyte development are colored on the SPRING plot and the UMAP plot of the Tusi dataset (4,016 cells) (a) and the Giladi dataset (20,202 cells) (b). Source data
Human brain single-nucleus RNA-seq data (http://celltypes.brain-map.org/rnaseq) from Middle Temporal Gyrus, Primary Visual Cortex, Anterior Cingulate Cortex and Lateral Geniculate (33,782 cells in total) were analyzed together. a, In the t-SNE plots, cells are colored according to cortical area and cell class. b, Projection-based annotation approaches (logistic regression and CHETAH) were used to annotate the dataset using the mouse brain data from Tasic et al., considering the orthologous genes between human and mouse; the results are colored in the t-SNE plots. SCCAF was also used to identify the discriminative cell clusters, which resulted in 38 clusters. c, Each cell cluster is annotated according to the top-ranked feature genes extracted from the SCCAF model. d, Self-projection accuracy and ROC curves are compared for these three annotation approaches. Source data
a, The t-SNE plot shows the cell types in the Hrvatin dataset (48,266 cells). Three cell types under the main cell type “Interneurons” (350 cells) cluster together in the red circle. The variance among these three cell types is not dominant when considering the whole dataset. b, The first round of SCCAF clustering of the Hrvatin dataset (48,266 cells) cannot find such subpopulations, because (c) they are clustered together from the initial state. When looking only at these three clusters (350 cells) (d), Louvain clustering (e) may achieve a clustering result similar to the manual annotation. Leiden clustering (f) can also identify the three cell types but differs in the assignment of the center cells. Source data
Raw data for the simulated one cell type and two cell types.
Raw data for the simulated six cell types, number of recapitulated genes.
Raw data for the t-SNE plots, river plot.
Raw data for the simulated six cell types.
Raw data for the violin plots and t-SNE plots.
Raw data for the violin plots and t-SNE plots.
Raw data for the violin plots.
Raw data for the PCA plots.
Raw data for the violin plots and t-SNE plots.
Raw data for the t-SNE plots.
Raw data for the dot plots.
Raw data for the SPRING plots.
Raw data for the SPRING plots and UMAP plots.
Raw data for the t-SNE plots.
Raw data for the UMAP plots and t-SNE plots.
Cite this article
Miao, Z., Moreno, P., Huang, N. et al. Putative cell type discovery from single-cell gene expression data. Nat Methods 17, 621–628 (2020). https://doi.org/10.1038/s41592-020-0825-9