Abstract
A key challenge in studying organisms and diseases is to detect rare molecular programs and rare cell populations that drive development, differentiation and transformation. Molecular features, such as genes and proteins, defining rare cell populations are often unknown and are difficult to detect from unenriched single-cell data using conventional dimensionality reduction and clustering-based approaches. Here, we propose an unsupervised approach, SCMER (‘single-cell manifold-preserving feature selection’), which selects a compact set of molecular features with definitive meanings that preserve the manifold of the data. We apply SCMER in the context of hematopoiesis, lymphogenesis, tumorigenesis and drug resistance and response. We find that SCMER can identify non-redundant features that sensitively delineate both common cell lineages and rare cellular states. SCMER can be used for discovering molecular features in a high-dimensional dataset, designing targeted, cost-effective assays for clinical applications and facilitating multi-modality integration.
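The abstract does not state the objective that SCMER optimizes, so the following is only an illustrative sketch of manifold-preserving feature selection in general, not the SCMER implementation. It assumes t-SNE-style Gaussian pair affinities with a fixed bandwidth `sigma`, non-negative per-feature weights `w`, an L1 penalty of strength `lam`, and plain proximal gradient descent swapped in as the optimizer for simplicity. All function names, parameter values and the synthetic data are hypothetical.

```python
import numpy as np


def pairwise_sq_dists(X):
    """Squared Euclidean distance between every pair of rows of X."""
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.maximum(D, 0.0)  # clip tiny negatives caused by round-off


def affinities(X, sigma=1.0):
    """Gaussian pair affinities, normalised to sum to 1 over all pairs i != j."""
    K = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(K, 0.0)
    return K / K.sum()


def select_features(X, lam=0.05, sigma=1.0, lr=0.2, n_iter=500):
    """Toy manifold-preserving feature selection (assumed objective).

    Minimises KL(P || Q(w)) + lam * sum(w) subject to w >= 0, where P holds
    the affinities of the full data and Q(w) those of the reweighted data X * w.
    """
    P = affinities(X, sigma)
    w = np.ones(X.shape[1])  # start from the full feature set
    X2 = X ** 2
    for _ in range(n_iter):
        A = P - affinities(X * w, sigma)  # symmetric, zero diagonal
        # sum_ij A_ij * (x_ik - x_jk)^2, evaluated for every feature k at once
        s = 2.0 * (A.sum(axis=1) @ X2 - np.sum(X * (A @ X), axis=0))
        grad = w * s / sigma ** 2  # d KL / d w_k
        # proximal step: gradient move, L1 shrinkage, projection onto w >= 0
        w = np.maximum(w - lr * (grad + lam), 0.0)
    return w


# Synthetic data: three clusters separated along features 0 and 1,
# plus four low-amplitude noise features that carry no cluster structure.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
informative = np.repeat(centers, 20, axis=0) + rng.normal(0.0, 0.5, (60, 2))
noise = rng.normal(0.0, 0.1, (60, 4))
X = np.hstack([informative, noise])

weights = select_features(X)
print(np.round(weights, 2))
```

In this toy setting the features with non-zero weight form the selected panel. A weight that reaches zero stays at zero, because the data-fit gradient is proportional to the weight itself; this is how the penalty yields a compact feature set in the sketch.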
Data availability
All original datasets are accessible through the original publications (refs. 34–41), including the melanoma data (GSE72056), pan-cancer cell line data (https://singlecell.broadinstitute.org/single_cell/study/SCP542), immune cell subtypes data (https://singlecell.broadinstitute.org/single_cell/study/SCP359), hematopoiesis data (GSE116256), A549 data (GSE128639), CITE-seq data (GSE128639 and GSE100866) and CyTOF data (https://cytobank.org/nolanlab/reports/Levine2015.html). Source data are provided with this paper.
Code availability
The open-source implementation of SCMER is available at https://github.com/KChen-lab/SCMER under an MIT License. Scripts for reproducing all the results are deposited in Code Ocean (ref. 64).
Change history
01 June 2021
A Correction to this paper has been published: https://doi.org/10.1038/s43588-021-00091-2
References
Merrell, A. J. & Stanger, B. Z. Adult cell plasticity in vivo: de-differentiation and transdifferentiation are back in style. Nat. Rev. Mol. Cell Biol. 17, 413–425 (2016).
Setty, M. et al. Characterization of cell fate probabilities in single-cell data with Palantir. Nat. Biotechnol. 37, 451–460 (2019).
Wang, Z. et al. Sarcomatoid renal cell carcinoma has a distinct molecular pathogenesis, driver mutation profile and transcriptional landscape. Clin. Cancer Res. 23, 6686–6696 (2017).
Conant, J. L., Peng, Z., Evans, M. F., Naud, S. & Cooper, K. Sarcomatoid renal cell carcinoma is an example of epithelial–mesenchymal transition. J. Clin. Pathol. 64, 1088–1092 (2011).
Lytle, N. K. et al. A multiscale map of the stem cell state in pancreatic adenocarcinoma. Cell 177, 572–586 (2019).
Sanada, Y. et al. Histopathologic evaluation of stepwise progression of pancreatic carcinoma with immunohistochemical analysis of gastric epithelial transcription factor SOX2: comparison of expression patterns between invasive components and cancerous or nonneoplastic intraductal components. Pancreas 32, 164–170 (2006).
Herreros-Villanueva, M. et al. SOX2 promotes dedifferentiation and imparts stem cell-like features to pancreatic cancer cells. Oncogenesis 2, e61 (2013).
Luecken, M. D. & Theis, F. J. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol. Syst. Biol. 15, e8746 (2019).
Soneson, C. & Robinson, M. D. Bias, robustness and scalability in single-cell differential expression analysis. Nat. Methods 15, 255–261 (2018).
Finak, G. et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 16, 278 (2015).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Vargo, A. H. S. & Gilbert, A. C. A rank-based marker selection method for high throughput scRNA-seq data. BMC Bioinformatics 21, 477 (2020).
Delaney, C. et al. Combinatorial prediction of marker panels from single-cell transcriptomic data. Mol. Syst. Biol. 15, e9005 (2019).
Trapnell, C. Defining cell types and states with single-cell genomics. Genome Res. 25, 1491–1498 (2015).
Jerby-Arnon, L. & Regev, A. Mapping multicellular programs from single-cell profiles. Preprint at bioRxiv https://doi.org/10.1101/2020.08.11.245472 (2020).
Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 37, 547–554 (2019).
Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381–386 (2014).
Ghazanfar, S. et al. Investigating higher-order interactions in single-cell data with scHOT. Nat. Methods 17, 799–806 (2020).
Wang, F., Liang, S., Kumar, T., Navin, N. & Chen, K. SCMarker: ab initio marker selection for single cell transcriptome profiling. PLoS Comput. Biol. 15, e1007445 (2019).
Travaglini, K. J. et al. A molecular cell atlas of the human lung from single-cell RNA sequencing. Nature 587, 619–625 (2020).
Xiao, Z., Dai, Z. & Locasale, J. W. Metabolic landscape of the tumor microenvironment at single cell resolution. Nat. Commun. 10, 3763 (2019).
Liu, B. et al. An entropy-based metric for assessing the purity of single cell populations. Nat. Commun. 11, 3155 (2020).
Tsoucas, D. & Yuan, G.-C. GiniClust2: a cluster-aware, weighted ensemble clustering method for cell-type detection. Genome Biol. 19, 58 (2018).
Sun, X., Liu, Y. & An, L. Ensemble dimensionality reduction and feature gene extraction for single-cell RNA-seq data. Nat. Commun. 11, 5853 (2020).
Wegmann, R. et al. CellSIUS provides sensitive and specific detection of rare cell populations from complex single-cell RNA-seq data. Genome Biol. 20, 142 (2019).
Angermueller, C. et al. Parallel single-cell sequencing links transcriptional and epigenetic heterogeneity. Nat. Methods 13, 229–232 (2016).
van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at https://arxiv.org/pdf/1802.03426.pdf (2020).
Dorrity, M. W., Saunders, L. M., Queitsch, C., Fields, S. & Trapnell, C. Dimensionality reduction by UMAP to visualize physical and genetic interactions. Nat. Commun. 11, 1537 (2020).
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32, 8026–8037 (2019).
Andrew, G. & Gao, J. Scalable training of L1-regularized log-linear models. In Proc. 24th International Conference on Machine Learning (ed. Ghahramani, Z.) 33–40 (ACM, 2007).
Karamitros, D. et al. Single-cell analysis reveals the continuum of human lympho-myeloid progenitor cells. Nat. Immunol. 19, 85–97 (2018).
McFaline-Figueroa, J. L. et al. A pooled single-cell genetic screen identifies regulatory checkpoints in the continuum of the epithelial-to-mesenchymal transition. Nat. Genet. 51, 1389–1398 (2019).
Tirosh, I. et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 352, 189–196 (2016).
Kinker, G. S. et al. Pan-cancer single-cell RNA-seq identifies recurring programs of cellular heterogeneity. Nat. Genet. 52, 1208–1218 (2020).
Martin, J. C. et al. Single-cell analysis of Crohn’s disease lesions identifies a pathogenic cellular module associated with resistance to anti-TNF therapy. Cell 178, 1493–1508 (2019).
van Galen, P. et al. Single-cell RNA-seq reveals AML hierarchies relevant to disease progression and immunity. Cell 176, 1265–1281 (2019).
Cao, J. et al. Joint profiling of chromatin accessibility and gene expression in thousands of single cells. Science 361, 1380–1385 (2018).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).
Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865–868 (2017).
Levine, J. H. et al. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell 162, 184–197 (2015).
Marjanovic, N. D. et al. Emergence of a high-plasticity cell state during lung cancer evolution. Cancer Cell 38, 229–246 (2020).
Anaya, J. OncoLnc: linking TCGA survival data to mRNAs, miRNAs and lncRNAs. PeerJ Comput. Sci. 2, e67 (2016).
Chen, J., Bardes, E. E., Aronow, B. J. & Jegga, A. G. ToppGene Suite for gene list enrichment analysis and candidate gene prioritization. Nucleic Acids Res. 37, W305–W311 (2009).
Dixit, A. et al. Perturb-seq: dissecting molecular circuits with scalable single cell RNA profiling of pooled genetic screens. Cell 167, 1853–1866 (2016).
Schep, A. N., Wu, B., Buenrostro, J. D. & Greenleaf, W. J. chromVAR: inferring transcription-factor-associated accessibility from single-cell epigenomic data. Nat. Methods 14, 975–978 (2017).
Nixon, P. A., Washburn, L. K., Schechter, M. S. & O’Shea, T. M. Follow-up study of a randomized controlled trial of postnatal dexamethasone therapy in very low birth weight infants: effects on pulmonary outcomes at age 8 to 11 years. J. Pediatr. 150, 345–350 (2007).
Srivastava, S. et al. ETS proteins bind with glucocorticoid receptors: relevance for treatment of Ewing sarcoma. Cell Rep. 29, 104–117 (2019).
Zannas, A. S., Wiechmann, T., Gassen, N. C. & Binder, E. B. Gene–stress–epigenetic regulation of FKBP5: clinical and translational implications. Neuropsychopharmacology 41, 261–274 (2016).
O’Leary, J. C., Zhang, B., Koren, J., Blair, L. & Dickey, C. A. The role of FKBP5 in mood disorders: action of FKBP5 on steroid hormone receptors leads to questions about its evolutionary importance. CNS Neurol. Disord. Drug Targets 12, 1157–1162 (2013).
Tieu, E. W., Tang, E. K. Y. & Tuckey, R. C. Kinetic analysis of human CYP24A1 metabolism of vitamin D via the C24-oxidation pathway. FEBS J. 281, 3280–3296 (2014).
Andrews, T. S. & Hemberg, M. M3Drop: dropout-based feature selection for scRNASeq. Bioinformatics 35, 2865–2867 (2019).
Ma, Y., McKay, D. J. & Buttitta, L. Changes in chromatin accessibility ensure robust cell cycle exit in terminally differentiated cells. PLoS Biol. 17, e3000378 (2019).
Vogel, C. & Marcotte, E. M. Insights into the regulation of protein abundance from proteomic and transcriptomic analyses. Nat. Rev. Genet. 13, 227–232 (2012).
Regev, A. et al. The Human Cell Atlas. eLife 6, 1–30 (2017).
Snyder, M. P. et al. The human body at cellular resolution: the NIH Human Biomolecular Atlas program. Nature 574, 187–192 (2019).
Spira, A. et al. PreCancer Atlas to drive precision prevention trials. Cancer Res. 77, 1510–1541 (2017).
Rozenblatt-Rosen, O. et al. The Human Tumor Atlas network: charting tumor transitions across space and time at single-cell resolution. Cell 181, 236–249 (2020).
Lonsdale, J. et al. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).
Wei, X. & Yu, P. S. Unsupervised feature selection by preserving stochastic neighbors. In Proc. 19th International Conference on Artificial Intelligence and Statistics Vol 51 (eds. Gretton, A. & Robert, C. C.) 995–1003 (PMLR, 2016).
Liu, D. C. & Nocedal, J. On the limited memory BFGS method for large scale optimization. Math. Program. 45, 503–528 (1989).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 5233 (2019).
Liang, S. et al. SCMER: single-cell manifold preserving feature selection. Code Ocean https://doi.org/10.24433/CO.6781338.v1 (2021).
Acknowledgements
We thank H. Abbas, Y. Wang and L. Wang for their comments. We acknowledge the support of the High Performance Computing for Research facility at the University of Texas MD Anderson Cancer Center for providing computational resources that contributed to the research results reported in this Article. This project has been made possible in part by Human Cell Atlas Seed Network grants (nos. CZF2019-002432 and CZF2019-02425) to K.C. from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation; grants RP180248 (K.C.) and RP200520 (W.P.) from the Cancer Prevention & Research Institute of Texas; and grants U01CA247760 (K.C.) and U24CA211006 (L.D.) and Cancer Center Support Grant P30 CA016672 (P.P.) from the National Cancer Institute.
Author information
Contributions
S.L., M.M., W.P., L.D. and K.C. conceptualized the project. S.L. designed the SCMER algorithm and implemented the software. All authors collectively designed the experiments and analyzed the results. All authors drafted the manuscript. All authors have read and approved the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Computational Science thanks Ting Chen, Kevin Menden and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Jie Pan was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Notes 1–9, Results 1–4, Tables 1–11 and Figs. 1–21.
Supplementary Data 1
Human phenotype pathways enriched in novel markers found for melanoma.
Supplementary Data 2
Biological process pathways enriched in novel markers found for melanoma.
Supplementary Data 3
Markers for 198 pan-cancer cell lines.
Supplementary Data 4
Biological process pathways enriched in novel markers found for Crohn’s disease immune cells.
Supplementary Data 5
Biological process pathways enriched in novel markers found for hematopoietic cells.
Supplementary Data 6
Pathways for novel markers found for hematopoietic cells.
Supplementary Data 7
Biological processes pathways enriched in markers uncorrelated with NR3C1 TF.
Source data
Source Data Fig. 1
UMAP embedding and feature values.
Source Data Fig. 2
UMAP embedding, cell labels, gene expression, method comparison results and survival analysis data.
Source Data Fig. 3
UMAP embedding, cell labels, gene expression and method comparison results.
Source Data Fig. 4
UMAP embedding, cell labels, gene expression and ATAC peak levels.
Source Data Fig. 5
UMAP embedding, cell labels, gene expression and protein levels.
About this article
Cite this article
Liang, S., Mohanty, V., Dou, J. et al. Single-cell manifold-preserving feature selection for detecting rare cell populations. Nat Comput Sci 1, 374–384 (2021). https://doi.org/10.1038/s43588-021-00070-7
DOI: https://doi.org/10.1038/s43588-021-00070-7