Gene signature extraction and cell identity recognition at the single-cell level with Cell-ID

Abstract

Because of the stochasticity associated with high-throughput single-cell sequencing, current methods for exploring cell-type diversity rely on clustering-based computational approaches in which heterogeneity is characterized at the level of cell subpopulations rather than at full single-cell resolution. Here we present Cell-ID, a clustering-free multivariate statistical method for the robust extraction of per-cell gene signatures from single-cell sequencing data. We applied Cell-ID to data from multiple human and mouse samples, including blood cells, pancreatic islets and airway, intestinal and olfactory epithelium, as well as to comprehensive mouse cell atlas datasets. We demonstrate that Cell-ID signatures are reproducible across different donors, tissues of origin, species and single-cell omics technologies, and can be used for automatic cell-type annotation and cell matching across datasets. Cell-ID improves biological interpretation at the level of individual cells, enabling discovery of previously uncharacterized rare cell types or cell states. Cell-ID is distributed as an open-source R software package.

Fig. 1: Overview of the Cell-ID approach.
Fig. 2: Cell-ID cell-type prediction of human CBMCs using preestablished marker lists.
Fig. 3: Performance of Cell-ID cell matching across scRNA-seq datasets from the same or different tissue of origin, within and across species.
Fig. 4: Performance of Cell-ID cell-to-cell matching across independent datasets from different single-cell omics technologies: scRNA-seq and scATAC-seq.

Data availability

All single-cell datasets used in this paper are publicly available (Supplementary Table 7). scRNA-seq datasets for human blood cells profiled by CITE-seq (ref. 17) and REAP-seq (ref. 18) were downloaded from the Gene Expression Omnibus (GEO) under accession numbers GSE100866 and GSE100501, respectively. Cell-type labels for these two datasets were obtained by following the Multimodal Analysis vignette of the Seurat (ref. 33) R package (https://satijalab.org/seurat/multimodal_vignette.html). The pancreas scRNA-seq datasets from Baron (ref. 20), Muraro (ref. 22) and Segerstolpe (ref. 21), together with their associated cell-type annotations, were downloaded via the scRNAseq (ref. 59) R package as SingleCellExperiment objects. The Plasschaert (ref. 23) mouse and human and the Montoro (ref. 24) mouse airway epithelium scRNA-seq datasets, and their annotations, were downloaded from GEO (GSE102580, GSE103354). The Haber (ref. 34) intestinal epithelium scRNA-seq dataset was downloaded from GEO under accession code GSE92332. The olfactory epithelium scRNA-seq datasets from Fletcher (ref. 36) and Wu (ref. 35) were downloaded from GEO (GSE95601, GSE120199), and their cell-type annotations were obtained from the associated online repositories: https://github.com/rufletch/p63-HBC-diff and https://www.stowers.org/research/publications/odr for Fletcher and Wu, respectively. The Tabula Muris (ref. 39) 10X and Smart-seq mouse scRNA-seq datasets were downloaded from https://tabula-muris.ds.czbiohub.org/. Gene activity score matrices from the mouse sci-ATAC-seq atlas datasets of Cusanovich (ref. 40) were obtained from http://atlas.gs.washington.edu/mouse-atac/data/, as provided by the authors; these scores result from aggregating information across all differentially accessible chromatin sites linked to a target gene.

Code availability

Cell-ID is implemented as an R package and is available on GitHub (https://github.com/RausellLab/CelliD) under the GPL-3 open-source license. Complete documentation is provided with step-by-step procedures for MCA dimensionality reduction, per-cell gene signature extraction, cell-type prediction, label transferring across datasets and functional enrichment analysis. A development version of Cell-ID software is also available in Bioconductor (devel branch 3.13): https://bioconductor.org/packages/CelliD. In addition, R scripts to reproduce all figures in the paper are available on a dedicated GitHub repository (https://github.com/RausellLab/CellIDPaperScript).
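As a rough illustration of the cell-type prediction step listed above, the sketch below scores the overlap between one cell's gene signature and a set of marker lists with a hypergeometric test. It is written in plain Python rather than R, it does not call or reproduce the CelliD package, and the choice of a hypergeometric test, the function names `hypergeom_sf` and `predict_cell_type`, the `alpha` cut-off and the toy gene lists are all assumptions made for the example. No multiple-testing correction is applied here.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n genes from N, of which K are markers."""
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def predict_cell_type(signature, marker_lists, n_genes, alpha=0.01):
    """Return (label, p_value) for the marker list best enriched in the signature.

    signature    : genes in the per-cell signature (hypothetical input)
    marker_lists : dict mapping a cell-type label to its marker genes
    n_genes      : size of the gene universe
    alpha        : illustrative significance cut-off; cells above it stay unassigned
    """
    sig = set(signature)
    best_label, best_p = "unassigned", 1.0
    for label, markers in marker_lists.items():
        marker_set = set(markers)
        overlap = len(sig & marker_set)
        p = hypergeom_sf(overlap, n_genes, len(marker_set), len(sig))
        if p < best_p:
            best_label, best_p = label, p
    if best_p > alpha:
        return "unassigned", best_p
    return best_label, best_p


# Toy example with made-up inputs: a five-gene signature and two marker lists.
toy_markers = {
    "T cell": ["CD3E", "CD3D", "CD2", "CD4", "CD8A"],
    "B cell": ["CD19", "MS4A1", "CD79A"],
}
toy_signature = ["CD3E", "CD3D", "CD2", "IL7R", "LCK"]
label, p_value = predict_cell_type(toy_signature, toy_markers, n_genes=1000)
print(label, p_value)
```

A cell whose signature shares no genes with any marker list gets a p-value of 1 for every label and is left unassigned, which is one simple way to avoid forcing a label onto an unfamiliar cell.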

References

  1. Teichmann, S. et al. The Human Cell Atlas. eLife 6, e27041 (2017).

  2. National Institutes of Health. The Human BioMolecular Atlas Program: HuBMAP NIH Common Fund Program https://commonfund.nih.gov/HuBMAP (2021).

  3. The LifeTime Initiative LifeTime FET Flagship https://lifetime-fetflagship.eu/ (2021).

  4. Lähnemann, D. et al. Eleven grand challenges in single-cell data science. Genome Biol. 21, 31 (2020).

  5. Sun, S., Zhu, J., Ma, Y. & Zhou, X. Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis. Genome Biol. 20, 269 (2019).

  6. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2019).

  7. Duò, A., Robinson, M. D. & Soneson, C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Res. 7, 1141 (2018).

  8. Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 273–282 (2019).

  9. Greenacre, M. J. Theory and Applications of Correspondence Analysis (Academic Press, 1984).

  10. Greenacre, M. & Blasius, J. (eds). Multiple Correspondence Analysis and Related Methods (Chapman & Hall/CRC, 2006).

  11. Aşan, Z. & Greenacre, M. Biplots of fuzzy coded data. Fuzzy Set. Syst. 183, 57–71 (2011).

  12. Rausell, A., Juan, D., Pazos, F. & Valencia, A. Protein interactions and ligand binding: from protein subfamilies to functional specificity. Proc. Natl Acad. Sci. USA 107, 1995–2000 (2010).

  13. Gabriel, K. R. The biplot graphic display of matrices with application to principal component analysis. Biometrika 58, 453–467 (1971).

  14. Greenacre, M. Biplots in Practice Ch. 8, 79–88 (Foundation BBVA, Rubes Editorial, 2010).

  15. Aibar, S. et al. SCENIC: single-cell regulatory network inference and clustering. Nat. Methods 14, 1083–1086 (2017).

  16. Aran, D., Hu, Z. & Butte, A. J. xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome Biol. 18, 220 (2017).

  17. Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865–868 (2017).

  18. Peterson, V. M. et al. Multiplexed quantification of proteins and transcripts in single cells. Nat. Biotechnol. 35, 936–939 (2017).

  19. Zhang et al. SCINA: semi-supervised analysis of single cells in silico. Genes 10, 531 (2019).

  20. Baron, M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Systems 3, 346–360 (2016).

  21. Segerstolpe, Å. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 24, 593–607 (2016).

  22. Muraro, M. J. et al. A single-cell transcriptome atlas of the human pancreas. Cell Systems 3, 385–394.e3 (2016).

  23. Plasschaert, L. W. et al. A single-cell atlas of the airway epithelium reveals the CFTR-rich pulmonary ionocyte. Nature 560, 377–381 (2018).

  24. Montoro, D. T. et al. A revised airway epithelial hierarchy includes CFTR-expressing ionocytes. Nature 560, 319–324 (2018).

  25. Kiselev, V. Y., Yiu, A. & Hemberg, M. scmap: projection of single-cell RNA-seq data across data sets. Nat. Methods 15, 359–362 (2018).

  26. Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).

  27. De Kanter, J. K., Lijnzaad, P., Candelli, T., Margaritis, T. & Holstege, F. C. P. CHETAH: a selective, hierarchical cell type identification method for single-cell RNA sequencing. Nucleic Acids Res. 47, e95 (2019).

  28. Lieberman, Y., Rokach, L. & Shay, T. CaSTLe–classification of single cells by transfer learning: harnessing the power of publicly available single cell RNA sequencing experiments to annotate new experiments. PLoS ONE 13, e0205499 (2018).

  29. Boufea, K., Seth, S. & Batada, N. N. scID uses discriminant analysis to identify transcriptionally equivalent cell types across single-cell RNA-seq data with batch effect. iScience 23, 100914 (2020).

  30. Tan, Y. & Cahan, P. SingleCellNet: a computational tool to classify single cell RNA-seq data across platforms and across species. Cell Systems 9, 207–213.e2 (2019).

  31. Alquicira-Hernandez, J., Sathe, A., Ji, H. P., Nguyen, Q. & Powell, J. E. ScPred: accurate supervised method for cell-type classification from single-cell RNA-seq data. Genome Biol. 20, 264 (2019).

  32. Aran, D. et al. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat. Immunol. 20, 163–172 (2019).

  33. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e21 (2019).

  34. Haber, A. L. et al. A single-cell survey of the small intestinal epithelium. Nature 551, 333–339 (2017).

  35. Wu, Y. et al. A population of navigator neurons is essential for olfactory map formation during the critical period. Neuron 100, 1066–1082.e6 (2018).

  36. Fletcher, R. B. et al. Deconstructing olfactory stem cell trajectories at single-cell resolution. Cell Stem Cell 20, 817–830.e8 (2017).

  37. Ualiyeva, S. et al. Airway brush cells generate cysteinyl leukotrienes through the ATP sensor P2Y2. Science Immunol. 5, eaax7224 (2020).

  38. Bankova, L. G. et al. The cysteinyl leukotriene 3 receptor regulates expansion of IL-25–producing airway brush cells leading to type 2 inflammation. Science Immunol. 3, eaat9453 (2018).

  39. Schaum, N. et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 562, 367–372 (2018).

  40. Cusanovich, D. A. et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell 174, 1309–1324.e18 (2018).

  41. Franzén, O., Gan, L.-M. & Björkegren, J. L. M. PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database 2019, baz046 (2019).

  42. Zhang, X. et al. CellMarker: a manually curated resource of cell markers in human and mouse. Nucleic Acids Res. 47, D721–D728 (2019).

  43. Liberzon, A. et al. The molecular signatures database hallmark gene set collection. Cell Systems 1, 417–425 (2015).

  44. Gene Ontology Consortium. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 32, D258–D261 (2004).

  45. Jassal, B. et al. The reactome pathway knowledgebase. Nucleic Acids Res. 48, D498–D503 (2020).

  46. Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, 457–462 (2015).

  47. Slenter, D. N. et al. WikiPathways: a multifaceted pathway database bridging metabolomics to other omics research. Nucleic Acids Res. 46, D661–D667 (2018).

  48. Efremova, M. & Teichmann, S. A. Computational methods for single-cell omics across modalities. Nat. Methods 17, 14–17 (2020).

  49. Hao, Y. et al. Integrated analysis of multimodal single-cell data. Preprint at bioRxiv https://doi.org/10.1101/2020.10.12.335331 (2020).

  50. Argelaguet, R. et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol. 21, 111 (2020).

  51. Zerbino, D. R. et al. Ensembl 2018. Nucleic Acids Res. 46, D754–D761 (2018).

  52. Durinck, S., Spellman, P. T., Birney, E. & Huber, W. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat. Protoc. 4, 1184–1191 (2009).

  53. Lebart, L., Morineau, A. & Warwick, K. M. Multivariate Descriptive Statistical Analysis: Correspondence Analysis and Related Techniques for Large Matrices (John Wiley & Sons, 1984).

  54. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Royal Stat. Soc. B. (Methodological) 57, 289–300 (1995).

  55. Pagès, J. Multiple Factor Analysis by Example Using R (CRC Press, 2014).

  56. Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).

  57. Chen, E. Y. et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinform. 14, 128 (2013).

  58. Kobak, D. & Berens, P. The art of using t-SNE for single-cell transcriptomics. Nat. Commun. 10, 5416 (2019).

  59. Risso, D. & Cole, M. scRNAseq: Collection of public single-cell RNA-Seq datasets. R package v.2.4.0 http://bioconductor.org/packages/scRNAseq/ (Bioconductor, 2020).

Acknowledgements

We thank the Laboratory of Clinical Bioinformatics and the Laboratory of Human Lymphohematopoiesis for helpful discussions and support. The Laboratory of Clinical Bioinformatics was partly supported by the French National Research Agency (ANR) ‘Investissements d’Avenir’ Program (grant no. ANR-10-IAHU-01). The Laboratory of Clinical Bioinformatics and the Laboratory of Human Lymphohematopoiesis were partly supported by Christian Dior Couture, Dior. We also thank G. Fuentes from The Visual Thinker LLP for the creation of the illustrations in Figs. 1–4.

Author information

Contributions

A.C. and A.R. conceived and designed the research. A.C. performed the research. A.C. and L.M. contributed materials and analysis tools. A.C. and A.R. analyzed data. A.C., E.S. and A.R. interpreted results. A.C., E.S. and A.R. wrote the paper. All authors read and approved the final draft of the paper.

Corresponding author

Correspondence to Antonio Rausell.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Notes 1–9, Figs. 1–14 and table legends.

Reporting Summary

Supplementary Tables

Supplementary Tables 1–12.

About this article

Cite this article

Cortal, A., Martignetti, L., Six, E. et al. Gene signature extraction and cell identity recognition at the single-cell level with Cell-ID. Nat Biotechnol 39, 1095–1102 (2021). https://doi.org/10.1038/s41587-021-00896-6

  • DOI: https://doi.org/10.1038/s41587-021-00896-6
