Supervised learning of high-confidence phenotypic subpopulations from single-cell data

Ren, Tao; Chen, Canping; Danilov, Alexey V.; Liu, Susan; Guan, Xiangnan; Du, Shunyi; Wu, Xiwei; Sherman, Mara H.; Spellman, Paul T.; Coussens, Lisa M.; Adey, Andrew C.; Mills, Gordon B.; Wu, Ling-Yun; Xia, Zheng

doi:10.1038/s42256-023-00656-y

Article
Published: 08 May 2023

Nature Machine Intelligence volume 5, pages 528–541 (2023)

An Author Correction to this article was published on 06 June 2023

This article has been updated

A preprint version of the article is available at bioRxiv.

Abstract

Accurately identifying phenotype-relevant cell subsets from heterogeneous cell populations is crucial for delineating the underlying mechanisms driving biological or clinical phenotypes. Here by deploying a Learning with Rejection strategy, we developed a novel supervised learning framework called PENCIL to identify subpopulations associated with categorical or continuous phenotypes from single-cell data. By embedding a feature selection function into this flexible framework, for the first time, we were able to simultaneously select informative features and identify cell subpopulations, enabling accurate identification of phenotypic subpopulations otherwise missed by methods incapable of concurrent gene selection. Furthermore, the regression mode of PENCIL presents a novel ability for supervised phenotypic trajectory learning of subpopulations from single-cell data. We conducted comprehensive simulations to evaluate PENCIL’s versatility in simultaneous gene selection, subpopulation identification and phenotypic trajectory prediction. PENCIL is fast and scalable to analyse one million cells within 1 h. Using the classification mode, PENCIL detected T-cell subpopulations associated with melanoma immunotherapy outcomes. Moreover, when applied to single-cell RNA sequencing of a patient with mantle cell lymphoma with drug treatment across multiple timepoints, the regression mode of PENCIL revealed a transcriptional treatment response trajectory. Collectively, our work introduces a scalable and flexible infrastructure to accurately identify phenotype-associated subpopulations from single-cell data.
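The learning-with-rejection idea named in the abstract can be illustrated with the generic 0-1 rejection loss from the literature cited in the reference list (Cortes et al.). The sketch below is not PENCIL's training objective, which is not shown in this preview. It assumes binary labels in {-1, +1}, a prediction score h, a confidence score r and a rejection cost c; all names and numbers are hypothetical.

```python
import numpy as np

def rejection_loss(h, r, y, c):
    """Mean 0-1 loss with a reject option (generic form, assumed here).

    h: prediction scores; the sign gives the predicted class
    r: confidence scores; r <= 0 means the sample is rejected
    y: true labels in {-1, +1}
    c: cost charged for each rejected sample (typically below 0.5)
    """
    h, r, y = (np.asarray(a, dtype=float) for a in (h, r, y))
    accepted = r > 0
    misclassified = y * h <= 0
    # Accepted samples pay 1 if misclassified, rejected samples pay c.
    per_sample = np.where(accepted, misclassified.astype(float), c)
    return per_sample.mean()

# Hypothetical scores for four cells.
h = [2.0, -1.5, 0.3, -0.2]
r = [1.0, 0.8, -0.4, 0.5]
y = [1, -1, -1, 1]
loss = rejection_loss(h, r, y, c=0.3)
```

Here two cells are accepted and correct, one is rejected at cost 0.3 and one is accepted but misclassified, so the model is rewarded for rejecting cells it cannot label confidently rather than guessing.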

**Fig. 1: The workflow of PENCIL and its main functions.**

**Fig. 2: Evaluation of PENCIL’s classification mode for simultaneously selecting genes and cells in simulations.**

**Fig. 3: Evaluation of regression mode of PENCIL on the simulated datasets.**

**Fig. 4: The running time and memory usages of PENCIL against the number of cells.**

**Fig. 5: PENCIL analysis of T-cell subpopulations associated with melanoma immunotherapy outcomes.**

**Fig. 6: Regression mode of PENCIL analysis of scRNA-seq malignant B cells across three timepoints from a patient with MCL.**

Data availability

Publicly available scRNA-seq studies can be accessed via the following accession numbers or the link provided: GSE120575 (ref. 6), GSE159251 (ref. 32), GSE134388 (ref. 33) and https://zenodo.org/record/7761954 (ref. 34). A more detailed description of these datasets can be found in the Supplementary Material.

Code availability

The open-source PENCIL program and its tutorials are freely available at GitHub (https://github.com/cliffren/PENCIL) and Zenodo (https://doi.org/10.5281/zenodo.7762054).

Change history

06 June 2023
A Correction to this paper has been published: https://doi.org/10.1038/s42256-023-00681-x

References

1. Miao, Y. et al. Adaptive immune resistance emerges from tumor-initiating stem cells. Cell 177, 1172–1186 e1114 (2019).
2. Wagner, J. et al. A single-cell atlas of the tumor and immune ecosystem of human breast cancer. Cell 177, 1330–1345 e1318 (2019).
3. Trapnell, C. Defining cell types and states with single-cell genomics. Genome Res. 25, 1491–1498 (2015).
4. Stephenson, E. et al. Single-cell multi-omics analysis of the immune response in COVID-19. Nat. Med. 27, 904–916 (2021).
5. Ekiz, H. A. et al. MicroRNA-155 coordinates the immunological landscape within murine melanoma and correlates with immunity in human cancers. JCI Insight 4, e126543 (2019).
6. Sade-Feldman, M. et al. Defining T cell states associated with response to checkpoint immunotherapy in melanoma. Cell 175, 998–1013 e1020 (2018).
7. Eksi, S. E. et al. Epigenetic loss of heterogeneity from low to high grade localized prostate tumours. Nat. Commun. 12, 7292 (2021).
8. Lun, A. T. L., Richard, A. C. & Marioni, J. C. Testing for differential abundance in mass cytometry data. Nat. Methods 14, 707–709 (2017).
9. Zhao, J. et al. Detection of differentially abundant cell subpopulations in scRNA-seq data. Proc. Natl Acad. Sci. USA 118, e2100293118 (2021).
10. Dann, E., Henderson, N. C., Teichmann, S. A., Morgan, M. D. & Marioni, J. C. Differential abundance testing on single-cell data using k-nearest neighbor graphs. Nat. Biotechnol. 40, 245–253 (2022).
11. Burkhardt, D. B. et al. Quantifying the effect of experimental perturbations at single-cell resolution. Nat. Biotechnol. 39, 619–629 (2021).
12. Sheng, J. & Li, W. V. Selecting gene features for unsupervised analysis of single-cell gene expression data. Brief. Bioinform. 22, bbab295 (2021).
13. Townes, F. W., Hicks, S. C., Aryee, M. J. & Irizarry, R. A. Feature selection and dimension reduction for single-cell RNA-seq based on a multinomial model. Genome Biol. 20, 295 (2019).
14. Farrell, J. A. et al. Single-cell reconstruction of developmental trajectories during zebrafish embryogenesis. Science 360, eaar3131 (2018).
15. Zhong, S. et al. A single-cell RNA-seq survey of the developmental landscape of the human prefrontal cortex. Nature 555, 524–528 (2018).
16. Baran-Gale, J. et al. Ageing compromises mouse thymus function and remodels epithelial cell differentiation. eLife 9, e56221 (2020).
17. Qiu, X. et al. Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods 14, 979–982 (2017).
18. Chen, H. et al. Single-cell trajectories reconstruction, exploration and mapping of omics data with STREAM. Nat. Commun. 10, 1903 (2019).
19. Street, K. et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics 19, 477 (2018).
20. Lange, M. et al. CellRank for directed single-cell fate mapping. Nat. Methods 19, 159–170 (2022).
21. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. https://doi.org/10.1038/nbt.4314 (2018).
22. Cannoodt, R., Saelens, W., Deconinck, L. & Saeys, Y. Spearheading future omics analyses using dyngen, a multi-modal simulator of single cells. Nat. Commun. 12, 3942 (2021).
23. Chen, W. et al. A multicenter study benchmarking single-cell RNA sequencing technologies using reference samples. Nat. Biotechnol. 39, 1103–1114 (2021).
24. Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).
25. Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587 e3529 (2021).
26. Ruan, X. et al. Progenitor cell diversity in the developing mouse neocortex. Proc. Natl Acad. Sci. USA 118, e2018866118 (2021).
27. Van den Berge, K. et al. Trajectory-based differential expression analysis for single-cell sequencing data. Nat. Commun. 11, 1201 (2020).
28. Qiu, X. et al. Single-cell mRNA quantification and differential analysis with Census. Nat. Methods 14, 309–315 (2017).
29. Cao, J. et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature 566, 496–502 (2019).
30. Li, H. et al. Dysfunctional CD8 T cells form a proliferative, dynamically regulated compartment within human melanoma. Cell 176, 775–789 e718 (2019).
31. Scott, A. C. et al. TOX is a critical regulator of tumour-specific T cell differentiation. Nature 571, 270–274 (2019).
32. Pauken, K. E. et al. Single-cell analyses identify circulating anti-tumor CD8 T cells and markers for their enrichment. J. Exp. Med. 218, e20200920 (2021).
33. Li, N. et al. ALKBH5 regulates anti-PD-1 therapy response by modulating lactate and suppressive immune cell accumulation in tumor microenvironment. Proc. Natl Acad. Sci. USA 117, 20159–20170 (2020).
34. Torka, P. et al. Pevonedistat, a Nedd8-activating enzyme inhibitor, in combination with ibrutinib in patients with relapsed/refractory B-cell non-Hodgkin lymphoma. Blood Cancer J. 13, 9 (2023).
35. Tickle, T., Tirosh, I., Georgescu, C., Brown, M. & Haas, B. inferCNV of the Trinity CTAT Project. Klarman Cell Observatory, Broad Institute of MIT and Harvard. https://github.com/broadinstitute/inferCNV (2019).
36. Hartmann, E. M. et al. Pathway discovery in mantle cell lymphoma by integrated analysis of high-resolution gene expression and copy number profiling. Blood 116, 953–961 (2010).
37. Mathas, S. et al. Aberrantly expressed c-Jun and JunB are a hallmark of Hodgkin lymphoma cells, stimulate proliferation and synergize with NF-kappa B. EMBO J. 21, 4104–4113 (2002).
38. Papoudou-Bai, A. et al. The expression levels of JunB, JunD and p-c-Jun are positively correlated with tumor cell proliferation in diffuse large B-cell lymphomas. Leuk. Lymphoma 57, 143–150 (2016).
39. Balaji, S. et al. NF-kappaB signaling and its relevance to the treatment of mantle cell lymphoma. J. Hematol. Oncol. 11, 83 (2018).
40. Godbersen, J. C. et al. The Nedd8-activating enzyme inhibitor MLN4924 thwarts microenvironment-driven NF-kappaB activation and induces apoptosis in chronic lymphocytic leukemia B cells. Clin. Cancer Res. 20, 1576–1589 (2014).
41. Mulqueen, R. M. et al. Highly scalable generation of DNA methylation profiles in single cells. Nat. Biotechnol. 36, 428–431 (2018).
42. Cao, J. et al. Joint profiling of chromatin accessibility and gene expression in thousands of single cells. Science 361, 1380–1385 (2018).
43. Chen, S., Lake, B. B. & Zhang, K. High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 37, 1452–1457 (2019).
44. Bartlett, P. L. & Wegkamp, M. H. Classification with a reject option using a hinge loss. J. Mach. Learn. Res. 9, 1823–1840 (2008).
45. Cortes, C., DeSalvo, G. & Mohri, M. Learning with Rejection. Lect. Notes Artif. Intell. 9925, 67–82 (2016).
46. Herbei, R. & Wegkamp, M. H. Classification with reject option. Can. J. Stat. 34, 709–721 (2006).
47. Asif, A. & Minhas, F. U. A. Generalized neural framework for learning with rejection. International Joint Conference on Neural Networks (IJCNN). https://doi.org/10.1109/IJCNN48605.2020.9206612 (IEEE, 2020).
48. Charoenphakdee, N., Cui, Z. H., Zhang, Y. A. & Sugiyama, M. Classification with rejection based on cost-sensitive classification. Proc. Mach. Learn. Res. 139, 1507–1517 (2021).
49. Misra, D. Mish: a self regularized non-monotonic activation function. Preprint at arXiv https://doi.org/10.48550/arXiv.1908.08681 (2019).
50. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
51. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 e1821 (2019).

Acknowledgements

This work was supported by the following funding: the National Key Research and Development Program of China 2020YFA0712400 (to T.R. and L.-Y.W.); NIH 1R21HL145426 (to Z.X.); Department of Defense Idea Development Award W81XWH2110539 (to Z.X.); Breast Cancer Research Foundation and NIH U01CA253472 and U01CA217842 (to G.B.M.); NIH 1R01CA244576 (to A.V.D.); NIH R35GM124704 (to A.C.A.); NIH R01CA250917 (to M.H.S.). We thank J. Zeng (University of Macau) and all the members of his bioinformatics team for generously sharing their experience and code. We thank W. Anderson for helping edit the manuscript.

Author information

Authors and Affiliations

Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
Tao Ren & Ling-Yun Wu
School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
Tao Ren & Ling-Yun Wu
Computational Biology Program, Oregon Health & Science University, Portland, OR, USA
Canping Chen, Susan Liu, Shunyi Du & Zheng Xia
Department of Biomedical Engineering, Oregon Health & Science University, Portland, OR, USA
Canping Chen, Susan Liu, Shunyi Du & Zheng Xia
City of Hope National Medical Center, Duarte, CA, USA
Alexey V. Danilov & Xiwei Wu
Department of Oncology Biomarker Development, Genentech Inc, South San Francisco, CA, USA
Xiangnan Guan
Department of Cell, Developmental & Cancer Biology, Oregon Health & Science University, Portland, OR, USA
Mara H. Sherman & Lisa M. Coussens
Knight Cancer Institute, Oregon Health & Science University, Portland, OR, USA
Mara H. Sherman, Paul T. Spellman, Lisa M. Coussens, Andrew C. Adey & Zheng Xia
Cancer Biology & Genetics Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
Mara H. Sherman
Department of Molecular and Medical Genetics, Oregon Health & Science University, Portland, OR, USA
Paul T. Spellman & Andrew C. Adey
Division of Oncological Sciences, Knight Cancer Institute, Oregon Health & Science University, Portland, OR, USA
Gordon B. Mills

Contributions

Z.X. conceived the idea. T.R., L.-Y.W. and Z.X. implemented the method and performed the analyses. T.R., C.C., A.V.D., S.L., X.G., S.D., L.-Y.W. and Z.X. interpreted the results. X.W., M.H.S., A.C.A., P.T.S., L.M.C. and G.B.M. provided scientific insights on the applications. A.C.A. and G.B.M. contributed to the analytic strategies. L.-Y.W. and Z.X. supervised the study. T.R., L.-Y.W. and Z.X. wrote the manuscript with feedback from all other authors. All the authors read and approved the final manuscript.

Corresponding authors

Correspondence to Ling-Yun Wu or Zheng Xia.

Ethics declarations

Competing interests

A.V.D. has received consulting fees from Abbvie, AstraZeneca, Bayer Oncology, BeiGene, Bristol Myers Squibb, Genentech, Incyte, Lilly Oncology, MorphoSys, Nurix, Oncovalent, Pharmacyclics and TG Therapeutics and has ongoing research funding from Abbvie, AstraZeneca, Bayer Oncology, Bristol Myers Squibb, Cyclacel, MEI Pharma, Nurix and Takeda Oncology. X.G. is a Genentech employee and Roche shareholder. G.B.M. is SAB/Consultant for AstraZeneca, BlueDot, Chrysallis Biotechnology, Ellipses Pharma, ImmunoMET, Infinity, Ionis, Lilly, Medacorp, Nanostring, PDX Pharmaceuticals, Signalchem Lifesciences, Tarveda, Turbine and Zentalis Pharmaceuticals; stock/options/financial: Catena Pharmaceuticals, ImmunoMet, SignalChem, Tarveda and Turbine; licenced technology: HRD assay to Myriad Genetics, and DSP patents with Nanostring. L.M.C. provides consulting services for Cell Signaling Technologies, AbbVie, the Susan G Komen Foundation and Shasqi, received reagent and/or research support from Cell Signaling Technologies, Syndax Pharmaceuticals, ZelBio Inc., Hibercell Inc. and Acerta Pharma, and participates in advisory boards for Pharmacyclics, Syndax, Carisma, Verseau, CytomX, Kineta, Hibercell, Cell Signaling Technologies, Alkermes, Zymeworks, Genenta Sciences, Pio Therapeutics Pty Ltd, PDX Pharmaceuticals, the AstraZeneca Partner of Choice Network, the Lustgarten Foundation and the NIH/NCI-Frederick National Laboratory Advisory Committee. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Yun Li and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 A simple simulation consists of cells from two conditions and three cell types, each containing only two genes (X_1 and X_2).

a, Visualizing cells from two conditions colored by condition labels using the two genes. b, Standard clustering of the cells. Cell number in parentheses. c, Percentage of cell condition labels within each cluster. d, The identified phenotypic subpopulations from the clustering-based method. e, The learned prediction model from PENCIL with the orange line as the boundary with prediction scores h(x) = 0 to classify the two conditions. Cells colored by the condition labels as in a. f, The learned rejection model from PENCIL with the green curve as the boundary with confidence scores r(x) = 0 to reject cells. Cells colored by the condition labels as in a. g, PENCIL identified phenotypic subpopulations.
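The two boundaries in panels e and f suggest a simple decision rule: keep a cell when its confidence score is positive, then label it by the sign of its prediction score. A minimal sketch under that reading follows; the function name, the 0/1 label encoding and the example scores are assumptions, not taken from the PENCIL code.

```python
import numpy as np

def select_and_label(h, r):
    """Apply the two boundaries to per-cell scores.

    h: prediction scores; h = 0 is the classification boundary
    r: confidence scores; r = 0 is the rejection boundary
    Returns a boolean mask of kept cells and a 0/1 label for each kept cell.
    """
    h = np.asarray(h, dtype=float)
    r = np.asarray(r, dtype=float)
    keep = r > 0                          # cells with r(x) < 0 are rejected
    labels = (h[keep] > 0).astype(int)    # assumed encoding: h > 0 means condition 1
    return keep, labels

# Hypothetical scores for four cells.
keep, labels = select_and_label(h=[1.2, -0.7, 0.1, -2.0],
                                r=[0.9, 0.4, -0.3, -0.1])
```

If every confidence score is negative, the mask is all False and no phenotypic subpopulation is reported.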

Extended Data Fig. 2 A simple simulation includes cells from two conditions and three cell types, each containing only two genes (X_1 and X_2), but lacks enriched phenotypic subpopulations.

a, Visualizing cells from two conditions colored by condition labels using the two genes. b, Standard clustering of the cells. Cell number in parentheses. c, The equal percentages of cell condition labels within each cluster. d, The result of the clustering-based method showing no subpopulations associated with the phenotypes. e, The learned prediction model from PENCIL with the orange line as the boundary with prediction scores h(x) = 0 to classify the two conditions. Cells colored by the condition labels as in a. f, The rejection module in PENCIL with all confidence scores r(x) < 0 to reject all cells. g, PENCIL rejected all cells.

Extended Data Fig. 3 The simulation flowchart.

a, The matrix from a real scRNA-seq dataset. b, Selecting a submatrix with a subset of genes as indicated by the orange rectangle for the following clustering. c, UMAP visualization and standard clustering based on the submatrix from the previous step. d, Selecting two clusters from panel c as the ground truth subpopulations enriched in the phenotypes, respectively. e, Assigning cells with condition labels based on the designed conditions in panel d and the given mixing rate. f, The raw matrix with each cell assigned with a condition label as indicated on the top bar. g, The UMAP using the top 2000 MVGs colored by the condition labels of cells. h, The raw expression matrix and cell condition labels as the same inputs for all the methods.
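Step e of the flowchart, assigning condition labels from the designed clusters and a mixing rate, might look roughly like the sketch below. The legend does not define the mixing rate, so its meaning here is an assumption: each cell in an enriched cluster keeps a random label with probability `mixing_rate` and otherwise takes the designed condition, while all other cells get a uniformly random condition. The function and argument names are hypothetical.

```python
import numpy as np

def assign_condition_labels(cluster_ids, enriched, mixing_rate, rng):
    """Simulate per-cell condition labels.

    cluster_ids: cluster id of each cell
    enriched: dict mapping a ground-truth cluster id to its condition index
    mixing_rate: assumed fraction of enriched-cluster cells left with a random label
    rng: numpy random Generator
    """
    cluster_ids = np.asarray(cluster_ids)
    n_conditions = len(set(enriched.values()))
    # Start with a uniformly random condition for every cell (background).
    labels = rng.integers(0, n_conditions, size=cluster_ids.size)
    for cluster, condition in enriched.items():
        in_cluster = cluster_ids == cluster
        # With probability 1 - mixing_rate, overwrite with the designed condition.
        use_designed = rng.random(cluster_ids.size) >= mixing_rate
        labels[in_cluster & use_designed] = condition
    return labels

rng = np.random.default_rng(0)
cluster_ids = np.repeat([0, 1, 2], 100)   # three clusters of 100 cells
labels = assign_condition_labels(cluster_ids, {0: 0, 1: 1}, mixing_rate=0.0, rng=rng)
```

With a mixing rate of 0 the two designed clusters are pure, and larger rates blur them toward the background, which makes the ground-truth subpopulations harder to recover.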

Extended Data Fig. 4 PENCIL classification analysis of simulated datasets with two conditions.

a, The confidence scores output by PENCIL. b, The distribution of the selected and rejected cells over the simulated ground truth of the conditions. c, The Venn diagram showing the overlap between the PENCIL selected cells and the ground truth phenotypic cells for the two conditions, respectively. d, The F1, precision and recall scores comparing the performances of the four methods.
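The precision, recall and F1 scores in panel d compare a set of selected cells with the ground-truth phenotypic cells. A minimal sketch of that comparison, using hypothetical cell identifiers:

```python
def selection_scores(selected, truth):
    """Precision, recall and F1 of selected cells against ground-truth cells."""
    selected, truth = set(selected), set(truth)
    true_positives = len(selected & truth)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1

# Hypothetical example: 4 selected cells, 5 ground-truth cells, 3 shared.
precision, recall, f1 = selection_scores(
    selected=["c1", "c2", "c3", "c4"],
    truth=["c2", "c3", "c4", "c5", "c6"],
)
```

The intersection of the two sets is the overlap region of a Venn diagram like the one in panel c.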

Extended Data Fig. 5 Evaluating PENCIL on the simulated datasets with batch-effect.

a, UMAP based on the manually curated genes showing the cells of two conditions from two batches separated by the dashed line. b, UMAP based on the manually curated genes showing the cells of two conditions after batch corrections. c, UMAP based on the top 3000 MVGs showing all cells. d, PENCIL selected genes. e, UMAP based on the PENCIL selected genes showing the PENCIL selected cells. f, The Venn diagram showing the overlap between the ground truth phenotype-enriched subpopulations and the PENCIL selected cells. g, The box plots comparing the performances of PENCIL, Milo, DAseq and MELD in simulated batch effects datasets with mixing rates 0, 0.1, 0.2 and 0.3 (n = 50 simulations). In the box plots, the center line and the box bounds represent median value and upper and lower quartiles, respectively. Box whiskers indicate the largest and smallest values no more than 1.5 times the interquartile range from the quartiles.
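The box plot convention stated in panel g (whiskers at the most extreme values no more than 1.5 times the interquartile range from the quartiles) can be computed as in the sketch below. It assumes NumPy's default linear-interpolation quartiles; the plotting software used for the figure may define quartiles slightly differently, and the scores are made up.

```python
import numpy as np

def box_plot_stats(values):
    """Median, quartiles and whisker ends under the 1.5 x IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    # Whiskers end at the most extreme data points inside the fences.
    inside = values[(values >= q1 - 1.5 * iqr) & (values <= q3 + 1.5 * iqr)]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
    }

# Hypothetical F1 scores from nine simulations; 0.1 falls outside the lower fence.
stats = box_plot_stats([0.1, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.95])
```

Values outside the fences, such as 0.1 here, would be drawn as individual outlier points rather than extending the whisker.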

Extended Data Fig. 6 Evaluating PENCIL on the simulated datasets with three conditions.

a, The UMAP based on the pre-selected gene set colored by the cell condition labels generated from ground truth cell subsets with a mixing rate of 0.1. b, The ground truth phenotype-associated subpopulations visualized on the UMAP using the top 2000 MVGs. c, Cells with the same condition labels as the ones in the panel a visualized on the UMAP using the top 2000 MVGs. d, The UMAP based on the pre-selected genes colored by the PENCIL predicted confidence scores. e, The distribution of PENCIL selected cells over the ground truth cell conditions. f, The F1, precision and recall scores comparing the performances of the three methods on this simulated dataset with three conditions. g, The Venn diagrams depicting the overlap between ground truth cell subpopulations and cell subsets selected by the three methods, respectively.

Extended Data Fig. 7 Evaluating the four methods using the PENCIL selected genes as inputs.

a, The results of PENCIL, Milo, DAseq and MELD when inputting the genes selected by PENCIL. b, The Venn diagrams comparing the result of each method with the ground truth phenotypic cell subpopulations. c, The F1, precision, and recall scores comparing the performances of the four methods when inputting the genes selected by PENCIL. d-f, A simulation for the three conditions. d, The UMAP plots showing the results of PENCIL, Milo, and MELD when inputting the genes selected by PENCIL. e, The Venn diagrams comparing the result of each method with the ground truth phenotypic cell subpopulations. f, The F1, precision and recall scores comparing the performances of the three methods when inputting the genes selected by PENCIL in this simulated example with three conditions.

Extended Data Fig. 8 Evaluating the regression model of PENCIL in simulated datasets.

a, UMAP showing the cells of 5 clusters selected as ground truth subpopulations corresponding to main Fig. 3a. b, PENCIL predicted confidence scores corresponding to main Fig. 3c. c, The cells with simulated condition labels visualized on the UMAP using the top 2000 MVGs. d, PENCIL predicted confidence scores corresponding to main Fig. 3k. e-i, A simulated dataset for PENCIL regression analysis from the Feldman T-cell dataset. e, UMAP from a pre-selected gene set (800-1500th MVGs) to show cells with simulated ground truth phenotypic subpopulations of five time points. f, The five subpopulations are assigned to the five samples accordingly, and all remaining cells are evenly assigned to the five samples to simulate the sample labels. The UMAP is the same as the panel e colored by cell condition labels. g, Ground truth of phenotype-associated subpopulations in panel e visualized on the UMAP using the top 2000 MVGs. h, PENCIL predicted continuous time points for the selected cells. i, PENCIL selected genes. Genes within the dashed rectangle region were the gene-set to generate UMAPs in panels e, f and h. j-n, A simulated dataset for PENCIL regression analysis from the Sade-Feldman cohort dataset. j, UMAP from a pre-selected gene set (1500-2000th MVGs) to show cells with simulated ground truth subpopulations of four time points. k, The four subpopulations are assigned to the four samples accordingly and all remaining cells are evenly assigned to the four samples. The UMAP is the same as the panel j colored by simulated condition labels. l, Ground truth of phenotype-associated subpopulations in panel j visualized on the UMAP using top 2000 MVGs. m, PENCIL predicted continuous time points for the selected cells. n, PENCIL selected genes. Genes within the dashed rectangle region were the gene set to generate UMAPs in panels j, k and m.

Extended Data Fig. 9 PENCIL’s runtime and memory usages with varying numbers of genes and conditions.

a-c, For datasets with 10,000 cells and three conditions, the runtime, overall memory usage of CPU and GPU against the number of genes, respectively. d-f, For datasets with 10,000 cells and 2000 genes, the runtime, overall memory usage of CPU and GPU against the number of conditions, respectively. MiB, mebibyte.

Extended Data Fig. 10 A summary of PENCIL’s two modes.

a, The advantages of classification-based PENCIL. b, The regression mode of PENCIL formulates a new application to reveal a continuous dynamic process.

Supplementary information

Supplementary Information

Supplementary Notes 1–7, Figs. 1–3 and legends for Tables 1–4.

Reporting Summary

Supplementary Tables

Supplementary Table 1. The list of DEGs between PENCIL-predicted cells associated with immunotherapy outcomes.
Supplementary Table 2. The pathways related to CD8⁺ T-cells are enriched by the significantly downregulated genes in responders compared with non-responders for the cells selected by PENCIL.
Supplementary Table 3. The pathways related to CD8⁺ T-cells are enriched by the significantly upregulated genes in responders compared with non-responders for the cells selected by PENCIL.
Supplementary Table 4. The list of the genes whose expression levels significantly depend on the timepoints predicted by the regression-based PENCIL analysis on the MCL scRNA-seq dataset.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ren, T., Chen, C., Danilov, A.V. et al. Supervised learning of high-confidence phenotypic subpopulations from single-cell data. Nat Mach Intell 5, 528–541 (2023). https://doi.org/10.1038/s42256-023-00656-y

Received: 22 September 2022
Accepted: 06 April 2023
Published: 08 May 2023
Issue Date: May 2023
DOI: https://doi.org/10.1038/s42256-023-00656-y

This article is cited by

Immunosenescence and vaccine efficacy revealed by immunometabolic analysis of SARS-CoV-2-specific cells in multiple sclerosis patients
- Sara De Biasi
- Domenico Lo Tartaro
- Andrea Cossarizza
Nature Communications (2024)
Reusability report: Leveraging supervised learning to uncover phenotype-relevant biology from single-cell RNA sequencing data
- Yingying Cao
- Tian-Gen Chang
- Eytan Ruppin
Nature Machine Intelligence (2024)