Putative cell type discovery from single-cell gene expression data

Miao, Zhichao; Moreno, Pablo; Huang, Ni; Papatheodorou, Irene; Brazma, Alvis; Teichmann, Sarah A.

doi:10.1038/s41592-020-0825-9

Article
Published: 18 May 2020

Nature Methods volume 17, pages 621–628 (2020)

Abstract

We present the Single-Cell Clustering Assessment Framework, a method for the automated identification of putative cell types from single-cell RNA sequencing (scRNA-seq) data. By iteratively applying a machine learning approach to a given set of cells, we simultaneously identify distinct cell groups and a weighted list of feature genes for each group. The differentially expressed feature genes discriminate the given cell group from other cells. Each such group of cells corresponds to a putative cell type or state, characterized by the feature genes as markers. Benchmarking using expert-annotated scRNA-seq datasets shows that our method automatically identifies the ‘ground truth’ cell assignments with high accuracy.

**Fig. 2: Using the connection graph to optimize clustering.**

**Fig. 3: Self-projection-based clustering optimization compared with ground truth.**

**Fig. 4: Self-projection accuracy indicates optimal clustering during clustering optimization.**

**Fig. 5: SCCAF captures the key stages in mouse hematopoiesis.**

Data availability

The datasets and their accession codes are as follows: pancreas¹², accession no. GSE84133; cortex³⁰, accession no. GSE60361; retinal bipolar neurons¹¹, accession no. GSE81904; pancreatic islets¹³, accession no. E-MTAB-5061; visual cortex³¹, accession no. GSE102827; hematopoiesis³⁵, accession no. GSE89754; hematopoiesis³⁶, accession no. GSE92575; cortex⁵¹, accession no. GSE71585; cortex³³, accession no. GSE115746; liver³², accession no. GSE124395; liver⁵⁰, accession no. GSE115469. Source data for Figs. 1–5 are included with this paper.

Code availability

An open source implementation of SCCAF is available on GitHub (https://github.com/SCCAF/sccaf) and Zenodo (https://doi.org/10.5281/zenodo.3695975) under the MIT license. The release includes tutorials and example vignettes for reproducing the analyses presented in this article, as well as all preprocessed datasets considered in this study. The software version used to generate the results presented in this article is also available as Supplementary Software. SCCAF is also accessible from the Python Package Index (https://pypi.org/project/SCCAF/) and is implemented as a Galaxy tool in the Human Cell Atlas (https://humancellatlas.usegalaxy.eu/). The SCCAF Galaxy modules can be installed on any Galaxy instance through the main Galaxy Tool Shed at https://toolshed.g2.bx.psu.edu/view/ebi-gxa/suite_sccaf/.

References

1. Hooke, R. Micrographia: or Some Physiological Descriptions of Minute Bodies Made by Magnifying Glasses. With Observations and Inquiries Thereupon (J. Martyn and J. Allestry, 1665).
2. Arendt, D. et al. The origin and evolution of cell types. Nat. Rev. Genet. 17, 744–757 (2016).
3. Nagasawa, T. Microenvironmental niches in the bone marrow required for B-cell development. Nat. Rev. Immunol. 6, 107–116 (2006).
4. Trapnell, C. Defining cell types and states with single-cell genomics. Genome Res. 25, 1491–1498 (2015).
5. Rozenblatt-Rosen, O., Stubbington, M. J. T., Regev, A. & Teichmann, S. A. The Human Cell Atlas: from vision to reality. Nature 550, 451–453 (2017).
6. Regev, A. et al. The Human Cell Atlas. eLife 6, e27041 (2017).
7. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
8. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
9. Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 5233 (2019).
10. Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech: Theory Exp. 2008, P10008 (2008).
11. Shekhar, K. et al. Comprehensive classification of retinal bipolar neurons by single-cell transcriptomics. Cell 166, 1308–1323.e30 (2016).
12. Baron, M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 3, 346–360.e4 (2016).
13. Segerstolpe, Å. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 24, 593–607 (2016).
14. Han, X. et al. Mapping the Mouse Cell Atlas by Microwell-Seq. Cell 173, 1307 (2018).
15. Kiselev, V. Y. et al. SC3: consensus clustering of single-cell RNA-seq data. Nat. Methods 14, 483–486 (2017).
16. Zhang, J. M., Fan, J., Fan, H. C., Rosenfeld, D. & Tse, D. N. An interpretable framework for clustering single-cell RNA-Seq datasets. BMC Bioinformatics 19, 93 (2018).
17. de Kanter, J. K., Lijnzaad, P., Candelli, T., Margaritis, T. & Holstege, F. C. P. CHETAH: a selective, hierarchical cell type identification method for single-cell RNA sequencing. Nucleic Acids Res. 47, e95 (2019).
18. Xie, P. et al. SuperCT: a supervised-learning framework for enhanced characterization of single-cell transcriptomic profiles. Nucleic Acids Res. 47, e48 (2019).
19. Pliner, H. A., Shendure, J. & Trapnell, C. Supervised classification enables rapid annotation of cell atlases. Nat. Methods 16, 983–986 (2019).
20. Zhang, A. W. et al. Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling. Nat. Methods 16, 1007–1015 (2019).
21. Aran, D. et al. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat. Immunol. 20, 163–172 (2019).
22. Tan, Y. & Cahan, P. SingleCellNet: a computational tool to classify single cell RNA-seq data across platforms and across species. Cell Syst. 9, 207–213.e2 (2019).
23. Wagner, F. & Yanai, I. Moana: a robust and scalable cell type classification framework for single-cell RNA-Seq data. Preprint at bioRxiv https://doi.org/10.1101/456129 (2018).
24. Ma, F. & Pellegrini, M. ACTINN: automated identification of cell types in single cell RNA sequencing. Bioinformatics 36, 533–538 (2020).
25. Lin, Y. et al. scClassify: hierarchical classification of cells. Preprint at bioRxiv https://doi.org/10.1101/776948 (2019).
26. Ntranos, V., Yi, L., Melsted, P. & Pachter, L. A discriminative learning approach to differential expression analysis for single-cell RNA-seq. Nat. Methods 16, 163–166 (2019).
27. Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).
28. Dimitriadis, G., Neto, J. P. & Kampff, A. R. t-SNE visualization of large-scale neural recordings. Neural Comput. 30, 1750–1774 (2018).
29. Hubert, L. & Arabie, P. Comparing partitions. J. Classif. 2, 193–218 (1985).
30. Zeisel, A. et al. Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142 (2015).
31. Hrvatin, S. et al. Single-cell analysis of experience-dependent transcriptomic states in the mouse visual cortex. Nat. Neurosci. 21, 120–129 (2018).
32. Aizarani, N. et al. A human liver cell atlas reveals heterogeneity and epithelial progenitors. Nature 572, 199–204 (2019).
33. Tasic, B. et al. Shared and distinct transcriptomic cell types across neocortical areas. Nature 563, 72–78 (2018).
34. Tracy, C. A. & Widom, H. Level-spacing distributions and the Airy kernel. Comm. Math. Phys. 159, 151–174 (1994).
35. Tusi, B. K. et al. Population snapshots predict early haematopoietic and erythroid hierarchies. Nature 555, 54–60 (2018).
36. Giladi, A. et al. Single-cell characterization of haematopoietic progenitors and their trajectories in homeostasis and perturbed haematopoiesis. Nat. Cell Biol. 20, 836–846 (2018).
37. McInnes, L., Healy, J., Saul, N. & Großberger, L. UMAP: Uniform Manifold Approximation and Projection. J. Open Source Softw. 3, 861 (2018).
38. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2019).
39. Konstantinides, N. et al. Phenotypic convergence: distinct transcription factors regulate common terminal features. Cell 174, 622–635.e13 (2018).
40. Gerber, T. et al. Single-cell analysis uncovers convergence of cell identities during axolotl limb regeneration. Science 362, eaaq0681 (2018).
41. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
42. Stehman, S. V. Selecting and interpreting measures of thematic classification accuracy. Remote Sens. Environ. 62, 77–89 (1997).
43. Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).
44. Hill, C. Learning Scientific Programming with Python 333–401 (Cambridge Univ. Press, 2016).
45. Svensson, V., Teichmann, S. A. & Stegle, O. SpatialDE: identification of spatially variable genes. Nat. Methods 15, 343–346 (2018).
46. Leek, J. T. svaseq: removing batch effects and other unwanted noise from sequencing data. Nucleic Acids Res. 42, e161 (2014).
47. Azizi, E., Prabhakaran, S., Carr, A. & Pe’er, D. Bayesian inference for single-cell clustering and imputing. Genom. Comput. Biol. 3, e46 (2017).
48. Athar, A. et al. ArrayExpress update—from bulk to single-cell expression data. Nucleic Acids Res. 47, D711–D715 (2019).
49. Allen Brain Atlas Data Portal. Cell types: overview of the data (Allen Institute, 2015); http://celltypes.brain-map.org
50. MacParland, S. A. et al. Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations. Nat. Commun. 9, 4383 (2018).
51. Tasic, B. et al. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat. Neurosci. 19, 335–346 (2016).

Acknowledgements

We thank all members of the Teichmann and Brazma labs for helpful discussions. We thank S. Aldridge for proofreading the text. Z.M. is supported by the Single Cell Gene Expression Atlas grant from the Wellcome Trust (no. 108437/Z/15/Z).

Author information

Authors and Affiliations

European Bioinformatics Institute, European Molecular Biology Laboratory, Wellcome Genome Campus, Cambridge, UK
Zhichao Miao, Pablo Moreno, Ni Huang, Irene Papatheodorou & Alvis Brazma
Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, United Kingdom
Zhichao Miao, Ni Huang & Sarah A. Teichmann
Department of Physics, Cavendish Laboratory, University of Cambridge, Cambridge, UK
Sarah A. Teichmann

Contributions

Z.M. conceived the method, implemented the algorithm and website, conducted the analyses, created the figures and contributed to the manuscript. P.M., N.H. and I.P. packaged the algorithm and implemented it as a Galaxy tool. A.B. and S.A.T. supervised the work and contributed to the manuscript.

Corresponding authors

Correspondence to Alvis Brazma or Sarah A. Teichmann.

Ethics declarations

Competing interests

In the last three years S.A.T. has consulted for Biogen, Genentech and Roche, and is a member of the Scientific Advisory Board of Foresite Labs and of the Functional Genomics & AI Scientific Advisory Board of GlaxoSmithKline.

Additional information

Peer review information: Nicole Rusk and Lin Tang were the primary editors on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Self-projection accuracy comparison between the ground truth annotation and the clustering with under-clustering or over-clustering.

This test measures the self-projection accuracy under three conditions: (1) the “ground truth” clustering as annotated by human experts (marked as ‘correct-clustering’); (2) over-clustering; and (3) under-clustering. The violin plots in the left column show the distributions of self-projection accuracy (cross-validation in red, test set in green) for these three conditions in all the datasets, obtained by repeating the random sampling 100 times. These plots demonstrate that the “ground truth” clustering corresponds to the highest self-projection accuracy in almost all cases. The results on the datasets Hrvatin (48,266 cells), Tasic2018 (21,874 cells), Shekhar (26,830 cells), Segerstolpe (2,108 cells), Zeisel (3,005 cells), Baron (Mouse, 1,886 cells) and Baron (Human, 8,199 cells) indicate that self-projection can serve as a clustering consistency test for identifying the best clustering. As with any classifier, it is easier to perform well on fewer clusters. Thus, if two clusterings show a similar level of self-projection accuracy (for example, in Baron (Mouse)), the clustering with more clusters should be preferred.
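The selection rule in the last sentence of this caption can be sketched as a small helper. This is an illustration only: the names `Candidate` and `pick_clustering`, the `tolerance` that defines a "similar level" of accuracy, and the accuracy values are all assumptions made for the example and are not taken from the paper.

```python
from typing import NamedTuple, Sequence


class Candidate(NamedTuple):
    name: str
    n_clusters: int
    accuracy: float  # self-projection accuracy on held-out cells


def pick_clustering(candidates: Sequence[Candidate], tolerance: float = 0.01) -> Candidate:
    """Prefer the highest accuracy; among near-ties, prefer more clusters."""
    if not candidates:
        raise ValueError("no candidate clusterings supplied")
    best_accuracy = max(c.accuracy for c in candidates)
    # Keep every clustering whose accuracy is within `tolerance` of the best.
    near_best = [c for c in candidates if best_accuracy - c.accuracy <= tolerance]
    # Break the tie in favour of the finer clustering.
    return max(near_best, key=lambda c: (c.n_clusters, c.accuracy))


# Hypothetical accuracies, chosen only to exercise the tie-break.
candidates = [
    Candidate("under-clustering", 8, 0.962),
    Candidate("correct-clustering", 13, 0.958),
    Candidate("over-clustering", 20, 0.871),
]
chosen = pick_clustering(candidates)
print(chosen.name)
```

With a tolerance of zero the helper falls back to plain accuracy ranking, which would favour the coarser clustering in this toy input.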

Extended Data Fig. 2 Testing machine learning methods on ground truth datasets.

Five machine learning models were tested on the five ground truth datasets (Baron mouse cells (1,886 cells), Baron human cells (8,199 cells), Shekhar (26,830 cells), Segerstolpe (2,108 cells) and Zeisel (3,005 cells)). The data were randomly split into a training set and a test set for self-projection, and this process was repeated 100 times. The distributions of the self-projection accuracies and of the mean cross-validation accuracy on the training set are shown as violin plots.

Extended Data Fig. 3 Self-projection can assess over-clustering on real data.

Performance on 1,000 BC1A cells from the mouse retina dataset (Shekhar et al.). When the cells are randomly assigned to two clusters, logistic regression shows no predictive ability in self-projection. When the 1,000 BC1A cells are split into two clusters based on the first principal component (PC1), logistic regression shows some, but not ideal, predictive ability in self-projection. Performance on 500 BC2 cells and 500 BC1A cells: self-projection shows high predictive ability. When the 500 BC2 cells are over-clustered into two clusters based on PC1, confusion occurs between the two over-clustered clusters but hardly ever between BC1A and BC2 cells.
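The contrast described in this caption can be reproduced on toy data. The sketch below is not the SCCAF implementation: it swaps logistic regression for a nearest-centroid classifier written in plain Python so that it runs without third-party packages, and it uses synthetic Gaussian "cells" in place of the BC1A and BC2 populations. All function names, sizes and parameters are assumptions made for illustration.

```python
import random
from statistics import fmean


def nearest_centroid_fit(rows, labels):
    """Return {label: centroid} computed from the training cells."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return {label: [fmean(col) for col in zip(*members)]
            for label, members in groups.items()}


def nearest_centroid_predict(centroids, row):
    """Assign a cell to the label of the closest centroid (squared Euclidean)."""
    def sq_dist(centroid):
        return sum((x - c) ** 2 for x, c in zip(row, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def self_projection_accuracy(rows, labels, rng, train_fraction=0.5):
    """Train on a random subset of cells, predict the rest, return accuracy."""
    order = list(range(len(rows)))
    rng.shuffle(order)
    cut = int(len(order) * train_fraction)
    train, test = order[:cut], order[cut:]
    centroids = nearest_centroid_fit([rows[i] for i in train],
                                     [labels[i] for i in train])
    hits = sum(nearest_centroid_predict(centroids, rows[i]) == labels[i]
               for i in test)
    return hits / len(test)


rng = random.Random(0)
n_genes, n_cells = 5, 400
# Two synthetic populations with shifted mean expression (stand-ins for two cell types).
type_a = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_cells)]
type_b = [[rng.gauss(3.0, 1.0) for _ in range(n_genes)] for _ in range(n_cells)]

# Case 1: cluster labels follow the two real populations.
acc_distinct = self_projection_accuracy(
    type_a + type_b, ["A"] * n_cells + ["B"] * n_cells, rng)

# Case 2: one homogeneous population split at random into two "clusters".
random_labels = [rng.choice(["A1", "A2"]) for _ in type_a]
acc_random = self_projection_accuracy(type_a, random_labels, rng)

print(f"distinct populations: {acc_distinct:.2f}")
print(f"random split:         {acc_random:.2f}")
```

The first accuracy should be close to 1 and the second close to chance for two clusters, which mirrors the qualitative behaviour reported for the real cells.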

Extended Data Fig. 4 SCCAF clustering optimization on mouse retina data.

The mouse retina dataset of Shekhar et al. includes 26,830 cells. a, t-SNE plots of the initial clustering (Round0) and of the clustering during the four rounds (Round1 to Round4) of SCCAF optimization. b, The self-projection results. c, The consistency (self-projection accuracy) between the clustering assignment and the self-projection results.
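One way to picture a single optimization round is to merge clusters that the classifier frequently confuses with each other. The sketch below is a loose illustration under stated assumptions: the row-normalised confusion rate, the `threshold` value, the union-find merge and the toy labels are all hypothetical choices, and the procedure used in the paper may differ in every one of these details.

```python
from collections import Counter


def confusion_rates(assigned, predicted):
    """Row-normalised confusion: fraction of cells in cluster i predicted as j."""
    counts = Counter(zip(assigned, predicted))
    totals = Counter(assigned)
    return {(i, j): n / totals[i] for (i, j), n in counts.items()}


def merge_confused_clusters(assigned, predicted, threshold=0.2):
    """Merge cluster pairs whose confusion in either direction exceeds `threshold`."""
    parent = {label: label for label in set(assigned) | set(predicted)}

    def find(label):
        while parent[label] != label:
            parent[label] = parent[parent[label]]  # path halving
            label = parent[label]
        return label

    for (i, j), rate in confusion_rates(assigned, predicted).items():
        if i != j and rate > threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                # Keep the smaller label as representative for reproducibility.
                keep, drop = sorted((root_i, root_j))
                parent[drop] = keep
    return [find(label) for label in assigned]


# Toy self-projection result for 30 cells in three clusters.
assigned = ["c0"] * 10 + ["c1"] * 10 + ["c2"] * 10
predicted = (["c0"] * 6 + ["c1"] * 4      # c0 is often mistaken for c1
             + ["c1"] * 7 + ["c0"] * 3    # and c1 for c0
             + ["c2"] * 10)               # c2 is predicted cleanly
merged = merge_confused_clusters(assigned, predicted)
print(sorted(set(merged)))
```

Repeating such a step until the self-projection accuracy stops improving would give a sequence of rounds comparable in spirit to Round0 through Round4.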

Extended Data Fig. 5 The self-projection-based clustering optimization achieves clustering identical to human expert annotation.

The eight expert-annotated datasets (a, Baron mouse cells (1,886 cells); b, Baron human cells (8,199 cells); c, Shekhar (26,830 cells); d, Segerstolpe (2,108 cells); e, Zeisel (3,005 cells); f, Hrvatin (48,266 cells); g, Aizarani (10,305 cells); h, Tasic2018 (21,874 cells)) are used to compare the SCCAF clustering result with the human expert annotation.

Extended Data Fig. 6 Adjusted Rand Index evaluation of the SCCAF results compared with Louvain clustering.

The Adjusted Rand Index (ARI) is calculated between the clustering results and the human expert annotation. The blue dots show the ARI of SCCAF, while the orange dots show the ARI of the initial Louvain clustering before SCCAF optimization (the initial clustering).
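For readers unfamiliar with the metric, the following is a minimal, dependency-free sketch of the Adjusted Rand Index in the pair-counting form of Hubert and Arabie. The function name and the toy label vectors are invented for illustration; in practice a library routine would normally be used instead.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two partitions of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same cells")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of cells that share a cluster in both, the first, or the second partition.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / total_pairs
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions put every cell in one cluster
    return (sum_cells - expected) / (maximum - expected)


expert = ["alpha", "alpha", "beta", "beta"]
louvain = [0, 0, 1, 2]        # over-clustered: "beta" is split in two
optimized = [5, 5, 9, 9]      # same partition as the expert labels, different names
print(round(adjusted_rand_index(expert, louvain), 3))   # 0.571
print(adjusted_rand_index(expert, optimized))           # 1.0
```

The index depends only on which cells are grouped together, not on the cluster names, so a relabelled but identical partition scores 1.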

Extended Data Fig. 7 SCCAF clustering compared with published annotation.

a, SCCAF clustering of the mouse hematopoiesis data (4,016 cells) from Tusi et al. b, The potential of the 4,016 cells to develop into different cell lineages (Er, erythrocytes; Gr, granulocytes; Ly, lymphocytes; Mk, megakaryocytes; Mo, monocytes; Ba, basophils or mast cells), colored using the Viridis color map.

Extended Data Fig. 8 Upregulated and downregulated genes in erythrocyte development.

The genes upregulated and downregulated during erythrocyte development are colored on the SPRING plot and the UMAP plot of the Tusi dataset (4,016 cells) (a) and the Giladi dataset (20,202 cells) (b).

Extended Data Fig. 9 SCCAF helps in annotating a new unannotated dataset of the human brain.

Human brain single-nucleus RNA-seq data (http://celltypes.brain-map.org/rnaseq) from Middle Temporal Gyrus, Primary Visual Cortex, Anterior Cingulate Cortex and Lateral Geniculate (33,782 cells in total) were analyzed together. a, t-SNE plots with cells colored according to cortical area and cell class. b, Projection-based annotation approaches (logistic regression and CHETAH) were used to annotate the dataset, with the mouse brain data from Tasic et al. as the reference and considering the orthologous genes between human and mouse; the results are colored in the t-SNE plots. SCCAF was also used to identify discriminative cell clusters, resulting in 38 clusters. c, Each cell cluster is annotated according to the top-ranked feature genes extracted from the SCCAF model. d, Self-projection accuracy and ROC curves are compared for these three annotation approaches.

Extended Data Fig. 10 A hierarchical approach to cluster mouse visual cortex data.

a, The t-SNE plot shows the cell types in the Hrvatin dataset (48,266 cells). Three cell types under the main cell type “Interneurons” (350 cells) cluster together in the red circle. The variance among these three cell types is not dominant when the whole dataset is considered. b, The first round of SCCAF clustering of the Hrvatin dataset (48,266 cells) cannot find these subpopulations, because (c) they are clustered together in the initial state. When only these three clusters (350 cells) are considered (d), Louvain clustering (e) can achieve a clustering result similar to the manual annotation. Leiden clustering (f) can also identify the three cell types but differs in the assignment of the cells at the center.

About this article

Cite this article

Miao, Z., Moreno, P., Huang, N. et al. Putative cell type discovery from single-cell gene expression data. Nat Methods 17, 621–628 (2020). https://doi.org/10.1038/s41592-020-0825-9

Received: 15 July 2019
Accepted: 02 April 2020
Published: 18 May 2020
Issue Date: June 2020
DOI: https://doi.org/10.1038/s41592-020-0825-9
