Single-cell manifold-preserving feature selection for detecting rare cell populations

A Publisher Correction to this article was published on 01 June 2021

This article has been updated


A key challenge in studying organisms and diseases is to detect rare molecular programs and rare cell populations that drive development, differentiation and transformation. Molecular features, such as genes and proteins, defining rare cell populations are often unknown and are difficult to detect from unenriched single-cell data using conventional dimensionality reduction and clustering-based approaches. Here, we propose an unsupervised approach, SCMER (‘single-cell manifold-preserving feature selection’), which selects a compact set of molecular features with definitive meanings that preserve the manifold of the data. We apply SCMER in the context of hematopoiesis, lymphogenesis, tumorigenesis and drug resistance and response. We find that SCMER can identify non-redundant features that sensitively delineate both common cell lineages and rare cellular states. SCMER can be used for discovering molecular features in a high-dimensional dataset, designing targeted, cost-effective assays for clinical applications and facilitating multi-modality integration.
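The idea of selecting features that preserve the manifold can be illustrated with a small sketch. The code below is not the SCMER implementation: it assumes a fixed-bandwidth Gaussian neighbour model and swaps in a plain greedy forward search, purely for illustration. All names (`neighbor_probs`, `manifold_loss`, `greedy_select`, `make_toy_data`) and the toy data are hypothetical.

```python
import numpy as np


def neighbor_probs(X, sigma=1.0):
    """Row-normalised Gaussian similarities between cells (rows of X)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    logits = -d2 / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)            # a cell is not its own neighbour
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)


def manifold_loss(X, features, sigma=1.0, eps=1e-12):
    """KL divergence between neighbour distributions built from all
    features (P) and from the selected features only (Q)."""
    P = neighbor_probs(X, sigma)
    Q = neighbor_probs(X[:, list(features)], sigma)
    return float(np.sum(P * (np.log(P + eps) - np.log(Q + eps))))


def greedy_select(X, k, sigma=1.0):
    """Assumed stand-in search: add, one at a time, the feature that
    lowers the manifold loss the most."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(k):
        best = min(remaining,
                   key=lambda j: manifold_loss(X, selected + [j], sigma))
        selected.append(best)
        remaining.remove(best)
    return selected


def make_toy_data(seed=0, cells_per_cluster=30):
    """Three clusters defined by columns 0 and 1; columns 2-5 are noise."""
    rng = np.random.default_rng(seed)
    centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
    informative = np.vstack(
        [c + 0.5 * rng.standard_normal((cells_per_cluster, 2)) for c in centers]
    )
    noise = 0.5 * rng.standard_normal((informative.shape[0], 4))
    return np.hstack([informative, noise])
```

Calling `greedy_select(make_toy_data(), k=2)` scores each candidate subset by how well its cell-to-cell neighbourhoods match those computed from all features. A subset made of the cluster-defining columns should give a lower loss than a subset of noise columns, which is the sense in which a compact feature set can "preserve the manifold".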


Fig. 1: The SCMER approach.
Fig. 2: Results for the data of melanoma patients.
Fig. 3: Results for the ileum lamina propria immunocytes data.
Fig. 4: Results for the A549 lung cancer cell line data.
Fig. 5: Results for the CITE-seq bone marrow mononuclear cell data.

Data availability

All original datasets are accessible through the original publications (refs. 34–41), including the melanoma data (GSE72056), pan-cancer cell line data, immune cell subtypes data, hematopoiesis data (GSE116256), A549 data (GSE128639), CITE-seq data (GSE128639 and GSE100866) and CyTOF data. Source data are provided with this paper.

Code availability

The open-source implementation of SCMER is available under an MIT License. Scripts for reproducing all the results are deposited in Code Ocean (ref. 64).

References


  1. Merrell, A. J. & Stanger, B. Z. Adult cell plasticity in vivo: de-differentiation and transdifferentiation are back in style. Nat. Rev. Mol. Cell Biol. 17, 413–425 (2016).

  2. Setty, M. et al. Characterization of cell fate probabilities in single-cell data with Palantir. Nat. Biotechnol. 37, 451–460 (2019).

  3. Wang, Z. et al. Sarcomatoid renal cell carcinoma has a distinct molecular pathogenesis, driver mutation profile and transcriptional landscape. Clin. Cancer Res. 23, 6686–6696 (2017).

  4. Conant, J. L., Peng, Z., Evans, M. F., Naud, S. & Cooper, K. Sarcomatoid renal cell carcinoma is an example of epithelial–mesenchymal transition. J. Clin. Pathol. 64, 1088–1092 (2011).

  5. Lytle, N. K. et al. A multiscale map of the stem cell state in pancreatic adenocarcinoma. Cell 177, 572–586 (2019).

  6. Sanada, Y. et al. Histopathologic evaluation of stepwise progression of pancreatic carcinoma with immunohistochemical analysis of gastric epithelial transcription factor SOX2: comparison of expression patterns between invasive components and cancerous or nonneoplastic intraductal components. Pancreas 32, 164–170 (2006).

  7. Herreros-Villanueva, M. et al. SOX2 promotes dedifferentiation and imparts stem cell-like features to pancreatic cancer cells. Oncogenesis 2, e61 (2013).

  8. Luecken, M. D. & Theis, F. J. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol. Syst. Biol. 15, e8746 (2019).

  9. Soneson, C. & Robinson, M. D. Bias, robustness and scalability in single-cell differential expression analysis. Nat. Methods 15, 255–261 (2018).

  10. Finak, G. et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 16, 278 (2015).

  11. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).

  12. Vargo, A. H. S. & Gilbert, A. C. A rank-based marker selection method for high throughput scRNA-seq data. BMC Bioinformatics 21, 477 (2020).

  13. Delaney, C. et al. Combinatorial prediction of marker panels from single-cell transcriptomic data. Mol. Syst. Biol. 15, e9005 (2019).

  14. Trapnell, C. Defining cell types and states with single-cell genomics. Genome Res. 25, 1491–1498 (2015).

  15. Jerby-Arnon, L. & Regev, A. Mapping multicellular programs from single-cell profiles. Preprint at bioRxiv (2020).

  16. Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 37, 547–554 (2019).

  17. Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381–386 (2014).

  18. Ghazanfar, S. et al. Investigating higher-order interactions in single-cell data with scHOT. Nat. Methods 17, 799–806 (2020).

  19. Wang, F., Liang, S., Kumar, T., Navin, N. & Chen, K. SCMarker: ab initio marker selection for single cell transcriptome profiling. PLoS Comput. Biol. 15, e1007445 (2019).

  20. Travaglini, K. J. et al. A molecular cell atlas of the human lung from single-cell RNA sequencing. Nature 587, 619–625 (2020).

  21. Xiao, Z., Dai, Z. & Locasale, J. W. Metabolic landscape of the tumor microenvironment at single cell resolution. Nat. Commun. 10, 3763 (2019).

  22. Liu, B. et al. An entropy-based metric for assessing the purity of single cell populations. Nat. Commun. 11, 3155 (2020).

  23. Tsoucas, D. & Yuan, G.-C. GiniClust2: a cluster-aware, weighted ensemble clustering method for cell-type detection. Genome Biol. 19, 58 (2018).

  24. Sun, X., Liu, Y. & An, L. Ensemble dimensionality reduction and feature gene extraction for single-cell RNA-seq data. Nat. Commun. 11, 5853 (2020).

  25. Wegmann, R. et al. CellSIUS provides sensitive and specific detection of rare cell populations from complex single-cell RNA-seq data. Genome Biol. 20, 142 (2019).

  26. Angermueller, C. et al. Parallel single-cell sequencing links transcriptional and epigenetic heterogeneity. Nat. Methods 13, 229–232 (2016).

  27. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).

  28. McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at (2020).

  29. Dorrity, M. W., Saunders, L. M., Queitsch, C., Fields, S. & Trapnell, C. Dimensionality reduction by UMAP to visualize physical and genetic interactions. Nat. Commun. 11, 1537 (2020).

  30. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32, 8026–8037 (2019).

  31. Andrew, G. & Gao, J. Scalable training of L1-regularized log-linear models. In Proc. 24th International Conference on Machine Learning (ed. Ghahramani, Z.) 33–40 (ACM, 2007).

  32. Karamitros, D. et al. Single-cell analysis reveals the continuum of human lympho-myeloid progenitor cells. Nat. Immunol. 19, 85–97 (2018).

  33. McFaline-Figueroa, J. L. et al. A pooled single-cell genetic screen identifies regulatory checkpoints in the continuum of the epithelial-to-mesenchymal transition. Nat. Genet. 51, 1389–1398 (2019).

  34. Tirosh, I. et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 352, 189–196 (2016).

  35. Kinker, G. S. et al. Pan-cancer single-cell RNA-seq identifies recurring programs of cellular heterogeneity. Nat. Genet. 52, 1208–1218 (2020).

  36. Martin, J. C. et al. Single-cell analysis of Crohn’s disease lesions identifies a pathogenic cellular module associated with resistance to anti-TNF therapy. Cell 178, 1493–1508 (2019).

  37. van Galen, P. et al. Single-cell RNA-seq reveals AML hierarchies relevant to disease progression and immunity. Cell 176, 1265–1281 (2019).

  38. Cao, J. et al. Joint profiling of chromatin accessibility and gene expression in thousands of single cells. Science 361, 1380–1385 (2018).

  39. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).

  40. Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865–868 (2017).

  41. Levine, J. H. et al. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell 162, 184–197 (2015).

  42. Marjanovic, N. D. et al. Emergence of a high-plasticity cell state during lung cancer evolution. Cancer Cell 38, 229–246 (2020).

  43. Anaya, J. OncoLnc: linking TCGA survival data to mRNAs, miRNAs and lncRNAs. PeerJ Comput. Sci. 2, e67 (2016).

  44. Chen, J., Bardes, E. E., Aronow, B. J. & Jegga, A. G. ToppGene Suite for gene list enrichment analysis and candidate gene prioritization. Nucleic Acids Res. 37, W305–W311 (2009).

  45. Dixit, A. et al. Perturb-seq: dissecting molecular circuits with scalable single cell RNA profiling of pooled genetic screens. Cell 167, 1853–1866 (2016).

  46. Schep, A. N., Wu, B., Buenrostro, J. D. & Greenleaf, W. J. chromVAR: inferring transcription-factor-associated accessibility from single-cell epigenomic data. Nat. Methods 14, 975–978 (2017).

  47. Nixon, P. A., Washburn, L. K., Schechter, M. S. & O’Shea, T. M. Follow-up study of a randomized controlled trial of postnatal dexamethasone therapy in very low birth weight infants: effects on pulmonary outcomes at age 8 to 11 years. J. Pediatr. 150, 345–350 (2007).

  48. Srivastava, S. et al. ETS proteins bind with glucocorticoid receptors: relevance for treatment of Ewing sarcoma. Cell Rep. 29, 104–117 (2019).

  49. Zannas, A. S., Wiechmann, T., Gassen, N. C. & Binder, E. B. Gene–stress–epigenetic regulation of FKBP5: clinical and translational implications. Neuropsychopharmacology 41, 261–274 (2016).

  50. O’Leary, J. C., Zhang, B., Koren, J., Blair, L. & Dickey, C. A. The role of FKBP5 in mood disorders: action of FKBP5 on steroid hormone receptors leads to questions about its evolutionary importance. CNS Neurol. Disord. Drug Targets 12, 1157–1162 (2013).

  51. Tieu, E. W., Tang, E. K. Y. & Tuckey, R. C. Kinetic analysis of human CYP24A1 metabolism of vitamin D via the C24-oxidation pathway. FEBS J. 281, 3280–3296 (2014).

  52. Andrews, T. S. & Hemberg, M. M3Drop: dropout-based feature selection for scRNASeq. Bioinformatics 35, 2865–2867 (2019).

  53. Ma, Y., McKay, D. J. & Buttitta, L. Changes in chromatin accessibility ensure robust cell cycle exit in terminally differentiated cells. PLoS Biol. 17, e3000378 (2019).

  54. Vogel, C. & Marcotte, E. M. Insights into the regulation of protein abundance from proteomic and transcriptomic analyses. Nat. Rev. Genet. 13, 227–232 (2012).

  55. Regev, A. et al. The Human Cell Atlas. eLife 6, 1–30 (2017).

  56. Snyder, M. P. et al. The human body at cellular resolution: the NIH Human Biomolecular Atlas program. Nature 574, 187–192 (2019).

  57. Spira, A. et al. PreCancer Atlas to drive precision prevention trials. Cancer Res. 77, 1510–1541 (2017).

  58. Rozenblatt-Rosen, O. et al. The Human Tumor Atlas network: charting tumor transitions across space and time at single-cell resolution. Cell 181, 236–249 (2020).

  59. Lonsdale, J. et al. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).

  60. Wei, X. & Yu, P. S. Unsupervised feature selection by preserving stochastic neighbors. In Proc. 19th International Conference on Artificial Intelligence and Statistics Vol 51 (eds. Gretton, A. & Robert, C. C.) 995–1003 (PMLR, 2016).

  61. Liu, D. C. & Nocedal, J. On the limited memory BFGS method for large scale optimization. Math. Program. 45, 503–528 (1989).

  62. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).

  63. Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 5233 (2019).

  64. Liang, S. et al. SCMER: single-cell manifold preserving feature selection. Code Ocean (2021).

Acknowledgements


We thank H. Abbas, Y. Wang and L. Wang for their comments. We acknowledge the support of the High Performance Computing for Research facility at the University of Texas MD Anderson Cancer Center for providing computational resources that contributed to the research results reported in this Article. This project has been made possible in part by Human Cell Atlas Seed Network grants (nos. CZF2019-002432 and CZF2019-02425) to K.C. from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation; grants RP180248 (K.C.) and RP200520 (W.P.) from Cancer Prevention & Research Institute of Texas; grants U01CA247760 (K.C.) and U24CA211006 (L.D.) and Cancer Center Support Grant P30 CA016672 (P.P.) from the National Cancer Institute.

Author information

Contributions



S.L., M.M., W.P., L.D. and K.C. conceptualized the project. S.L. designed the SCMER algorithm and implemented the software. All authors collectively designed the experiments and analyzed the results. All authors drafted the manuscript. All authors have read and approved the manuscript.

Corresponding author

Correspondence to Ken Chen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Computational Science thanks Ting Chen, Kevin Menden and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Jie Pan was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Notes 1–9, Results 1–4, Tables 1–11 and Figs. 1–21.

Supplementary Data 1

Human phenotype pathways enriched in novel markers found for melanoma.

Supplementary Data 2

Biological process pathways enriched in novel markers found for melanoma.

Supplementary Data 3

Markers for 198 pan-cancer cell lines.

Supplementary Data 4

Biological process pathways enriched in novel markers found for Crohn’s disease immune cells.

Supplementary Data 5

Biological process pathways enriched in novel markers found for hematopoietic cells.

Supplementary Data 6

Pathways for novel markers found for hematopoietic cells.

Supplementary Data 7

Biological process pathways enriched in markers uncorrelated with NR3C1 TF.

Source data

Source Data Fig. 1

UMAP embedding and feature values.

Source Data Fig. 2

UMAP embedding, cell labels, gene expression, method comparison results and survival analysis data.

Source Data Fig. 3

UMAP embedding, cell labels, gene expression and method comparison results.

Source Data Fig. 4

UMAP embedding, cell labels, gene expression and ATAC peak levels.

Source Data Fig. 5

UMAP embedding, cell labels, gene expression and protein levels.

About this article

Cite this article

Liang, S., Mohanty, V., Dou, J. et al. Single-cell manifold-preserving feature selection for detecting rare cell populations. Nat Comput Sci 1, 374–384 (2021).
