Putative cell type discovery from single-cell gene expression data

Abstract

We present the Single-Cell Clustering Assessment Framework, a method for the automated identification of putative cell types from single-cell RNA sequencing (scRNA-seq) data. By iteratively applying a machine learning approach to a given set of cells, we simultaneously identify distinct cell groups and a weighted list of feature genes for each group. The differentially expressed feature genes discriminate the given cell group from other cells. Each such group of cells corresponds to a putative cell type or state, characterized by the feature genes as markers. Benchmarking using expert-annotated scRNA-seq datasets shows that our method automatically identifies the ‘ground truth’ cell assignments with high accuracy.
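The iterative procedure can be pictured with a minimal sketch. It is not the SCCAF implementation. It assumes a nearest-centroid classifier as a stand-in for the machine learning model, assumes that each round merges the pair of clusters most often confused in self-projection, and runs on simulated two-gene data. The names self_projection, optimize and min_accuracy are hypothetical.

```python
import random
from collections import defaultdict


def centroids(X, labels):
    """Mean expression vector of each cluster."""
    sums, counts = {}, defaultdict(int)
    for x, lab in zip(X, labels):
        sums.setdefault(lab, [0.0] * len(x))
        for j, v in enumerate(x):
            sums[lab][j] += v
        counts[lab] += 1
    return {lab: [v / counts[lab] for v in s] for lab, s in sums.items()}


def nearest(cents, x):
    """Label of the centroid closest to cell x (squared Euclidean distance)."""
    return min(cents, key=lambda lab: sum((a - b) ** 2 for a, b in zip(cents[lab], x)))


def self_projection(X, labels, rng, test_fraction=0.5):
    """Train on a random part of the cells and predict the held-out rest.

    Returns the held-out accuracy and a count of confusions per cluster pair.
    """
    idx = list(range(len(X)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    cents = centroids([X[i] for i in train], [labels[i] for i in train])
    confusion, correct = defaultdict(int), 0
    for i in test:
        pred = nearest(cents, X[i])
        if pred == labels[i]:
            correct += 1
        else:
            confusion[frozenset((pred, labels[i]))] += 1
    return correct / n_test, confusion


def optimize(X, labels, min_accuracy=0.9, max_rounds=20, seed=0):
    """Merge the most-confused cluster pair until accuracy is high enough."""
    rng = random.Random(seed)
    labels = list(labels)
    acc, confusion = self_projection(X, labels, rng)
    rounds = 0
    while acc < min_accuracy and confusion and rounds < max_rounds:
        keep, drop = sorted(max(confusion, key=confusion.get))
        labels = [keep if lab == drop else lab for lab in labels]
        acc, confusion = self_projection(X, labels, rng)
        rounds += 1
    return labels, acc


# Simulated data (an assumption of this sketch): two well-separated
# cell types, 100 cells each, measured on two genes.
data_rng = random.Random(1)
type_a = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(100)]
type_b = [[data_rng.gauss(6, 1), data_rng.gauss(6, 1)] for _ in range(100)]
cells = type_a + type_b
# Over-clustered start: type A is split at random into clusters 0 and 1.
start = [data_rng.randint(0, 1) for _ in type_a] + [2] * len(type_b)

final_labels, final_accuracy = optimize(cells, start)
print(sorted(set(final_labels)), round(final_accuracy, 2))
```

In this toy setting the two artificial halves of type A cannot be told apart on held-out cells, so they are merged, while the genuinely distinct type B is left alone.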

Fig. 1: A self-projection approach.
Fig. 2: Using the connection graph to optimize clustering.
Fig. 3: Self-projection-based clustering optimization compared with ground truth.
Fig. 4: Self-projection accuracy indicates optimal clustering during clustering optimization.
Fig. 5: SCCAF captures the key stages in mouse hematopoiesis.

Data availability

The datasets together with the accession codes are as follows: pancreas (ref. 12), accession no. GSE84133; cortex (ref. 30), accession no. GSE60361; retinal bipolar neurons (ref. 11), accession no. GSE81904; pancreatic islets (ref. 13), accession no. E-MTAB-5061; visual cortex (ref. 31), accession no. GSE102827; hematopoiesis (ref. 35), accession no. GSE89754; hematopoiesis (ref. 36), accession no. GSE92575; cortex (ref. 51), accession no. GSE71585; cortex (ref. 33), accession no. GSE115746; liver (ref. 32), accession no. GSE124395; liver (ref. 50), accession no. GSE115469. Source data for Figs. 1–5 are included with this paper.

Code availability

An open source implementation of SCCAF is available on GitHub (https://github.com/SCCAF/sccaf) and on Zenodo (https://doi.org/10.5281/zenodo.3695975) under the MIT license. The release includes tutorials and example vignettes for reproducing the analyses presented in this article, as well as all preprocessed datasets considered in this study. The software version used to generate the results presented in this article is also available as Supplementary Software. SCCAF is also available from the Python Package Index (https://pypi.org/project/SCCAF/) and is implemented as a Galaxy tool in the Human Cell Atlas Galaxy instance (https://humancellatlas.usegalaxy.eu/). The SCCAF Galaxy modules can be installed with a few clicks on any Galaxy instance through the main Galaxy Tool Shed at https://toolshed.g2.bx.psu.edu/view/ebi-gxa/suite_sccaf/.

References

  1. Hooke, R. Micrographia: or Some Physiological Descriptions of Minute Bodies Made by Magnifying Glasses. With Observations and Inquiries Thereupon (J. Martyn and J. Allestry, 1665).

  2. Arendt, D. et al. The origin and evolution of cell types. Nat. Rev. Genet. 17, 744–757 (2016).

  3. Nagasawa, T. Microenvironmental niches in the bone marrow required for B-cell development. Nat. Rev. Immunol. 6, 107–116 (2006).

  4. Trapnell, C. Defining cell types and states with single-cell genomics. Genome Res. 25, 1491–1498 (2015).

  5. Rozenblatt-Rosen, O., Stubbington, M. J. T., Regev, A. & Teichmann, S. A. The Human Cell Atlas: from vision to reality. Nature 550, 451–453 (2017).

  6. Regev, A. et al. The Human Cell Atlas. eLife 6, e27041 (2017).

  7. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).

  8. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).

  9. Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 5233 (2019).

  10. Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech: Theory Exp. 2008, P10008 (2008).

  11. Shekhar, K. et al. Comprehensive classification of retinal bipolar neurons by single-cell transcriptomics. Cell 166, 1308–1323.e30 (2016).

  12. Baron, M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 3, 346–360.e4 (2016).

  13. Segerstolpe, Å. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 24, 593–607 (2016).

  14. Han, X. et al. Mapping the Mouse Cell Atlas by Microwell-Seq. Cell 173, 1307 (2018).

  15. Kiselev, V. Y. et al. SC3: consensus clustering of single-cell RNA-seq data. Nat. Methods 14, 483–486 (2017).

  16. Zhang, J. M., Fan, J., Fan, H. C., Rosenfeld, D. & Tse, D. N. An interpretable framework for clustering single-cell RNA-Seq datasets. BMC Bioinformatics 19, 93 (2018).

  17. de Kanter, J. K., Lijnzaad, P., Candelli, T., Margaritis, T. & Holstege, F. C. P. CHETAH: a selective, hierarchical cell type identification method for single-cell RNA sequencing. Nucleic Acids Res. 47, e95 (2019).

  18. Xie, P. et al. SuperCT: a supervised-learning framework for enhanced characterization of single-cell transcriptomic profiles. Nucleic Acids Res. 47, e48 (2019).

  19. Pliner, H. A., Shendure, J. & Trapnell, C. Supervised classification enables rapid annotation of cell atlases. Nat. Methods 16, 983–986 (2019).

  20. Zhang, A. W. et al. Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling. Nat. Methods 16, 1007–1015 (2019).

  21. Aran, D. et al. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat. Immunol. 20, 163–172 (2019).

  22. Tan, Y. & Cahan, P. SingleCellNet: a computational tool to classify single cell RNA-seq data across platforms and across species. Cell Syst. 9, 207–213.e2 (2019).

  23. Wagner, F. & Yanai, I. Moana: a robust and scalable cell type classification framework for single-cell RNA-Seq data. Preprint at bioRxiv https://doi.org/10.1101/456129 (2018).

  24. Ma, F. & Pellegrini, M. ACTINN: automated identification of cell types in single cell RNA sequencing. Bioinformatics 36, 533–538 (2020).

  25. Lin, Y. et al. scClassify: hierarchical classification of cells. Preprint at bioRxiv https://doi.org/10.1101/776948 (2019).

  26. Ntranos, V., Yi, L., Melsted, P. & Pachter, L. A discriminative learning approach to differential expression analysis for single-cell RNA-seq. Nat. Methods 16, 163–166 (2019).

  27. Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).

  28. Dimitriadis, G., Neto, J. P. & Kampff, A. R. t-SNE visualization of large-scale neural recordings. Neural Comput. 30, 1750–1774 (2018).

  29. Hubert, L. & Arabie, P. Comparing partitions. J. Classif. 2, 193–218 (1985).

  30. Zeisel, A. et al. Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142 (2015).

  31. Hrvatin, S. et al. Single-cell analysis of experience-dependent transcriptomic states in the mouse visual cortex. Nat. Neurosci. 21, 120–129 (2018).

  32. Aizarani, N. et al. A human liver cell atlas reveals heterogeneity and epithelial progenitors. Nature 572, 199–204 (2019).

  33. Tasic, B. et al. Shared and distinct transcriptomic cell types across neocortical areas. Nature 563, 72–78 (2018).

  34. Tracy, C. A. & Widom, H. Level-spacing distributions and the Airy kernel. Comm. Math. Phys. 159, 151–174 (1994).

  35. Tusi, B. K. et al. Population snapshots predict early haematopoietic and erythroid hierarchies. Nature 555, 54–60 (2018).

  36. Giladi, A. et al. Single-cell characterization of haematopoietic progenitors and their trajectories in homeostasis and perturbed haematopoiesis. Nat. Cell Biol. 20, 836–846 (2018).

  37. McInnes, L., Healy, J., Saul, N. & Großberger, L. UMAP: Uniform Manifold Approximation and Projection. J. Open Source Softw. 3, 861 (2018).

  38. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2019).

  39. Konstantinides, N. et al. Phenotypic convergence: distinct transcription factors regulate common terminal features. Cell 174, 622–635.e13 (2018).

  40. Gerber, T. et al. Single-cell analysis uncovers convergence of cell identities during axolotl limb regeneration. Science 362, eaaq0681 (2018).

  41. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  42. Stehman, S. V. Selecting and interpreting measures of thematic classification accuracy. Remote Sens. Environ. 62, 77–89 (1997).

  43. Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).

  44. Hill, C. Learning Scientific Programming with Python 333–401 (Cambridge Univ. Press, 2016).

  45. Svensson, V., Teichmann, S. A. & Stegle, O. SpatialDE: identification of spatially variable genes. Nat. Methods 15, 343–346 (2018).

  46. Leek, J. T. svaseq: removing batch effects and other unwanted noise from sequencing data. Nucleic Acids Res. 42, e161 (2014).

  47. Azizi, E., Prabhakaran, S., Carr, A. & Pe’er, D. Bayesian inference for single-cell clustering and imputing. Genom. Comput. Biol. 3, e46 (2017).

  48. Athar, A. et al. ArrayExpress update—from bulk to single-cell expression data. Nucleic Acids Res. 47, D711–D715 (2019).

  49. Allen Brain Atlas Data Portal. Cell types: overview of the data (Allen Institute, 2015); http://celltypes.brain-map.org

  50. MacParland, S. A. et al. Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations. Nat. Commun. 9, 4383 (2018).

  51. Tasic, B. et al. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat. Neurosci. 19, 335–346 (2016).

Acknowledgements

We thank all members of the Teichmann and Brazma labs for helpful discussions. We thank S. Aldridge for proofreading the text. Z.M. is supported by the Single Cell Gene Expression Atlas grant from the Wellcome Trust (no. 108437/Z/15/Z).

Author information

Contributions

Z.M. conceived the method, implemented the algorithm and website, conducted the analyses, created the figures and contributed to the manuscript. P.M., N.H. and I.P. packaged the algorithm and implemented it as a Galaxy tool. A.B. and S.A.T. supervised the work and contributed to the manuscript.

Corresponding authors

Correspondence to Alvis Brazma or Sarah A. Teichmann.

Ethics declarations

Competing interests

In the last three years S.A.T. has consulted for Biogen, Genentech and Roche, and is a member of the Scientific Advisory Board of Foresite Labs and of the Functional Genomics & AI Scientific Advisory Board of GlaxoSmithKline.

Additional information

Peer review information: Nicole Rusk and Lin Tang were the primary editors on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Self-projection accuracy comparison between the ground truth annotation and the clustering with under-clustering or over-clustering.

This test measures the self-projection accuracy under three conditions: (1) the “ground truth” clustering as annotated by human experts (marked as ‘correct-clustering’); (2) over-clustering; and (3) under-clustering. The violin plots in the left column show the self-projection accuracy distributions (cross-validation in red, test set in green) for these three conditions in all the datasets, obtained by repeating the random sampling 100 times. These plots demonstrate that the “ground truth” clustering corresponds to the highest self-projection accuracy in almost all cases. The results on the datasets Hrvatin (48,266 cells), Tasic2018 (21,874 cells), Shekhar (26,830 cells), Segerstolpe (2,108 cells), Zeisel (3,005 cells), Baron (Mouse, 1,886 cells) and Baron (Human, 8,199 cells) indicate that it is possible to identify the best clustering using self-projection as the clustering consistency test. As for any classifier, it is always easier to perform well on fewer clusters. Thus, if two clusterings show a similar level of self-projection accuracy, as for Baron (Mouse), the clustering with more clusters should be chosen.
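The tie-breaking rule in this legend can be written as a short selection function. This is a sketch only: the function name choose_clustering and the tolerance that defines a "similar level" of accuracy are assumptions, and the accuracies in the example are made up for illustration rather than taken from the paper.

```python
def choose_clustering(candidates, tolerance=0.01):
    """Pick one clustering from (name, n_clusters, accuracy) tuples.

    Clusterings whose self-projection accuracy lies within `tolerance`
    of the best one are treated as tied, and the tie is broken in favor
    of the clustering with more clusters.
    """
    if not candidates:
        raise ValueError("no candidate clusterings given")
    best = max(acc for _, _, acc in candidates)
    tied = [c for c in candidates if best - c[2] <= tolerance]
    return max(tied, key=lambda c: c[1])


# Made-up accuracies for illustration only; not results from the paper.
candidates = [
    ("under-clustering", 8, 0.971),
    ("correct-clustering", 13, 0.968),
    ("over-clustering", 20, 0.842),
]
chosen = choose_clustering(candidates)
print(chosen[0])
```

A fixed tolerance is the simplest choice. Comparing the full accuracy distributions from repeated sampling would be a natural refinement.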

Extended Data Fig. 2 Testing machine learning methods on ground truth datasets.

Five machine learning models were tested on the five ground truth datasets (Baron mouse cells (1,886 cells), Baron human cells (8,199 cells), Shekhar (26,830 cells), Segerstolpe (2,108 cells), Zeisel (3,005 cells)). The data were randomly split into a training set and a test set for self-projection, and this process was repeated 100 times. The distributions of the self-projection accuracies and the mean accuracy of cross-validation in the training were plotted as violin plots.

Extended Data Fig. 3 Self-projection can assess over-clustering on real data.

Performance on 1,000 BC1A cells from the mouse retina dataset (Shekhar et al.): when the cells are randomly assigned to two clusters, logistic regression cannot demonstrate any predictive ability in self-projection. When the 1,000 BC1A cells are split into two clusters based on the first principal component (PC1), logistic regression shows some, but not ideal, predictive ability in self-projection. Performance on 500 BC2 cells and 500 BC1A cells: self-projection shows high predictive ability. When the 500 BC2 cells are over-clustered into two clusters based on PC1, confusion occurs between the two over-clustered clusters but hardly ever between BC1A cells and BC2 cells.
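The contrast between a random split and a genuine two-population split can be reproduced on simulated data. The sketch below assumes Gaussian toy "cells" with five genes instead of the retina data, and a small hand-written logistic regression (batch gradient descent on centered features) in place of a library classifier. All function names and parameter values are hypothetical.

```python
import math
import random


def fit_logistic(X, y, lr=0.1, epochs=200):
    """Binary logistic regression by batch gradient descent on centered features."""
    n, d = len(X), len(X[0])
    means = [sum(x[j] for x in X) / n for j in range(d)]
    Xc = [[x[j] - means[j] for j in range(d)] for x in X]
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, target in zip(Xc, y):
            z = sum(wj * xj for wj, xj in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clip to keep exp() well behaved
            err = 1.0 / (1.0 + math.exp(-z)) - target
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return means, w, b


def predict(model, X):
    """Predicted cluster (0 or 1) for each cell."""
    means, w, b = model
    return [
        1 if sum(wj * (xj - mj) for wj, xj, mj in zip(w, x, means)) + b > 0 else 0
        for x in X
    ]


def self_projection_accuracy(X, labels, rng, test_fraction=0.5):
    """Train on a random part of the cells and score on the held-out rest."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    model = fit_logistic([X[i] for i in train], [labels[i] for i in train])
    pred = predict(model, [X[i] for i in test])
    return sum(p == labels[i] for p, i in zip(pred, test)) / n_test


rng = random.Random(0)


def simulate(n_cells, mean, n_genes=5):
    """Toy expression matrix: independent Gaussian genes with a shared mean."""
    return [[rng.gauss(mean, 1.0) for _ in range(n_genes)] for _ in range(n_cells)]


# Case 1: one homogeneous population, cells assigned to two clusters at random.
one_type = simulate(400, 0.0)
random_labels = [rng.randint(0, 1) for _ in one_type]
acc_random = self_projection_accuracy(one_type, random_labels, rng)

# Case 2: two distinct populations, labelled by their true origin.
two_types = simulate(200, 0.0) + simulate(200, 2.0)
true_labels = [0] * 200 + [1] * 200
acc_distinct = self_projection_accuracy(two_types, true_labels, rng)

print(f"random split: {acc_random:.2f}, distinct types: {acc_distinct:.2f}")
```

The random split should score near chance because the labels carry no information about expression, whereas the two simulated populations are recovered almost perfectly on held-out cells.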

Extended Data Fig. 4 SCCAF clustering optimization on mouse retina data.

The mouse retina data of Shekhar et al. include 26,830 cells. a, t-SNE plots of the initial clustering (Round0) and of the clustering during the four rounds (Round1 to Round4) of SCCAF optimization. b, The self-projection results. c, The consistency (self-projection accuracy) between the clustering assignment and the self-projection results.

Extended Data Fig. 5 The self-projection-based clustering optimization achieves clustering identical to human expert annotation.

The eight expert-annotated datasets (a, Baron mouse cells (1,886 cells); b, Baron human cells (8,199 cells); c, Shekhar (26,830 cells); d, Segerstolpe (2,108 cells); e, Zeisel (3,005 cells); f, Hrvatin (48,266 cells); g, Aizarani (10,305 cells); h, Tasic2018 (21,874 cells)) are used to compare the SCCAF clustering result with the human expert annotation.

Extended Data Fig. 6 Adjusted Rand Index evaluation of the SCCAF results compared with Louvain clustering.

The Adjusted Rand Index (ARI) is calculated between the clustering results and the human expert annotation. The blue dots show the ARI of SCCAF, while the orange dots show the ARI of the initial Louvain clustering before SCCAF optimization (the initial clustering).
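For reference, the ARI between two label assignments can be computed from pair counts, following the standard Hubert and Arabie formulation (ref. 29). The sketch below uses only the Python standard library. The function name adjusted_rand_index and the toy labels are illustrative and are not taken from the paper's data.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two partitions of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pairs of cells placed together in both partitions.
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each partition separately.
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (together - expected) / (maximum - expected)


# Toy labels for illustration: an "expert" annotation with three cell types.
expert = ["alpha"] * 4 + ["beta"] * 4 + ["delta"] * 4
relabelled = [0] * 4 + [1] * 4 + [2] * 4          # same grouping, other names
over_clustered = [0, 0, 1, 1] + [2] * 4 + [3] * 4  # alpha split in two

ari_same = adjusted_rand_index(expert, relabelled)
ari_over = adjusted_rand_index(expert, over_clustered)
print(ari_same, ari_over)
```

The index depends only on which cells are grouped together, so renaming clusters leaves it at 1, while splitting one annotated type into two lowers it.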

Extended Data Fig. 7 SCCAF clustering compared with published annotation.

a, SCCAF clustering of the mouse hematopoiesis data (4,016 cells) from Tusi et al. b, The potential of the 4,016 cells to develop into different cell lineages (Er, erythrocytes; Gr, granulocytes; Ly, lymphocytes; Mk, megakaryocytes; Mo, monocytes; Ba, basophils or mast cells), colored using the Viridis color map.

Extended Data Fig. 8 Upregulated and downregulated genes in erythrocyte development.

The genes upregulated and downregulated during erythrocyte development are colored on the SPRING plot and the UMAP plot of the Tusi dataset (4,016 cells) (a) and the Giladi dataset (20,202 cells) (b).

Extended Data Fig. 9 SCCAF helps in annotating a new unannotated dataset of the human brain.

Human brain single-nucleus RNA-seq data (http://celltypes.brain-map.org/rnaseq) from Middle Temporal Gyrus, Primary Visual Cortex, Anterior Cingulate Cortex and Lateral Geniculate (33,782 cells in total) were analyzed together. a, In the t-SNE plots, cells are colored according to the cortical area and the cell classes. b, Projection-based annotation approaches (logistic regression and CHETAH) were used to annotate the dataset with the mouse brain data from Tasic et al. as the reference, considering the orthologous genes between human and mouse; the results are colored in the t-SNE plots. SCCAF was also used to identify the discriminative cell clusters and resulted in 38 clusters. c, Each cell cluster is annotated according to the top-ranked feature genes extracted from the SCCAF model. d, Self-projection accuracy and ROC curves are compared for these three annotation approaches.

Extended Data Fig. 10 A hierarchical approach to cluster mouse visual cortex data.

a, The t-SNE plot shows the cell types in the Hrvatin dataset (48,266 cells). Three cell types under the main cell type “Interneurons” (350 cells) cluster together in the red circle. The variance among these three cell types is not dominant when the whole dataset is considered. b, The first round of SCCAF clustering of the Hrvatin dataset (48,266 cells) cannot find such subpopulations because (c) they are clustered together from the initial state. When only these three clusters (350 cells) are considered (d), Louvain clustering (e) may achieve a clustering result similar to the manual annotation. Leiden clustering (f) can also identify the three cell types but shows a difference in the center cells.

Supplementary information

Supplementary Information

Supplementary Discussion, Supplementary Figs. 1–18, Supplementary Table 1

Reporting Summary

Supplementary Software

Source data

Source Data Fig. 1

Raw data for the simulated one cell type and two cell types.

Source Data Fig. 2

Raw data for the simulated six cell types, number of recapitulated genes.

Source Data Fig. 3

Raw data for the t-SNE plots, river plot.

Source Data Fig. 4

Raw data for the simulated six cell types.

Source Data Fig. 5

Raw data for the violin plots and t-SNE plots.

Source Data Extended Data Fig. 1

Raw data for the violin plots and t-SNE plots.

Source Data Extended Data Fig. 2

Raw data for the violin plots.

Source Data Extended Data Fig. 3

Raw data for the PCA plots.

Source Data Extended Data Fig. 4

Raw data for the violin plots and t-SNE plots.

Source Data Extended Data Fig. 5

Raw data for the t-SNE plots.

Source Data Extended Data Fig. 6

Raw data for the dot plots.

Source Data Extended Data Fig. 7

Raw data for the SPRING plots.

Source Data Extended Data Fig. 8

Raw data for the SPRING plots and UMAP plots.

Source Data Extended Data Fig. 9

Raw data for the t-SNE plots.

Source Data Extended Data Fig. 10

Raw data for the UMAP plots and t-SNE plots.

About this article

Cite this article

Miao, Z., Moreno, P., Huang, N. et al. Putative cell type discovery from single-cell gene expression data. Nat Methods 17, 621–628 (2020). https://doi.org/10.1038/s41592-020-0825-9

  • DOI: https://doi.org/10.1038/s41592-020-0825-9
