An active learning approach for clustering single-cell RNA-seq data


Single-cell RNA sequencing (scRNA-seq) data have been widely used to profile cellular heterogeneity at high resolution. Clustering is a crucial step in scRNA-seq data analysis because it offers a chance to identify previously undiscovered cell types. Most methods for clustering scRNA-seq data use an unsupervised learning strategy. Because the clustering step is separated from the cell annotation and labeling step, it is not uncommon for such methods to produce a clustering with poor biological interpretability, a result that biologists generally find undesirable. To solve this problem, we propose an active learning (AL) framework for clustering scRNA-seq data. The AL model employs a learning algorithm that can actively query biologists for labels, and this manual labeling is expected to be applied to only a subset of cells. To develop an optimal active learning approach, we explored several key parameters of the AL model in experiments with four real scRNA-seq datasets. We demonstrate that the proposed AL model outperformed state-of-the-art unsupervised clustering methods with fewer than 1000 labeled cells. We therefore conclude that the AL model is a promising tool for clustering scRNA-seq data, achieving superior performance effectively and efficiently.
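The query loop described in the abstract can be illustrated with a minimal pool-based sketch. It is not the authors' implementation: the figure captions refer to SVM- and RF-based AL models, whereas this example swaps in a nearest-centroid learner so that it runs with the Python standard library alone. The margin-based query rule, the `oracle` callback standing in for the biologist, the toy two-gene data, and all parameter names (`n_init`, `batch`, `budget`) are assumptions made for illustration.

```python
import math
import random


def make_toy_cells(n=60, seed=1):
    """Hypothetical toy data: two well-separated 'cell types' over two genes."""
    rng = random.Random(seed)
    cells, truth = [], []
    for i in range(n):
        k = i % 2                      # alternate the two types
        center = 20.0 * k
        cells.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        truth.append("type_A" if k == 0 else "type_B")
    return cells, truth


def class_centroids(cells, labels):
    """Mean expression vector of the labeled cells, one per label."""
    sums, counts = {}, {}
    for x, lab in zip(cells, labels):
        acc = sums.setdefault(lab, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def margin(x, cents):
    """Distance gap between the two nearest centroids (small = uncertain)."""
    d = sorted(math.dist(x, c) for c in cents.values())
    return d[1] - d[0] if len(d) > 1 else 0.0


def predict(x, cents):
    """Assign a cell to the label of its nearest centroid."""
    return min(cents, key=lambda lab: math.dist(x, cents[lab]))


def active_learning(cells, oracle, n_init=4, batch=2, budget=12, seed=0):
    """Query the oracle for the most uncertain cells until the budget is spent."""
    rng = random.Random(seed)
    # Start from a small random set of cells labeled by the oracle.
    labeled = {i: oracle(i) for i in rng.sample(range(len(cells)), n_init)}
    while len(labeled) < budget:
        pool = [i for i in range(len(cells)) if i not in labeled]
        if not pool:
            break
        cents = class_centroids([cells[i] for i in labeled], list(labeled.values()))
        # Smallest margin first: these are the cells the learner is least sure of.
        pool.sort(key=lambda i: margin(cells[i], cents))
        for i in pool[: min(batch, budget - len(labeled))]:
            labeled[i] = oracle(i)
    cents = class_centroids([cells[i] for i in labeled], list(labeled.values()))
    # Final cluster assignment for every cell, labeled or not.
    return [predict(x, cents) for x in cells], labeled


cells, truth = make_toy_cells()
predicted, labeled = active_learning(cells, oracle=lambda i: truth[i])
accuracy = sum(p == t for p, t in zip(predicted, truth)) / len(truth)
print(f"labeled {len(labeled)} of {len(cells)} cells, accuracy {accuracy:.2f}")
```

In a real analysis the `oracle` would be a biologist annotating the queried cells, and the learner would presumably be a stronger classifier trained on a reduced-dimension expression matrix; only the query-then-retrain structure is meant to carry over.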

Fig. 1: Architecture and protocol of the active learning model.
Fig. 2: Clustering performance test of AL models on the different datasets.
Fig. 3: Clustering performance of the AL model and four popular unsupervised clustering methods.
Fig. 4: Low-dimensional representations of the real datasets with the predicted labels from different methods.
Fig. 5: Running time test of SVM and RF-based AL models.
Fig. 6: The label distribution of the training cells in the best and worst AL models.
Fig. 7: Data structure of the real datasets used in this study.

Data availability

The code and all datasets of this study are available on GitHub:



The research was partially supported by the National Center for Advancing Translational Sciences (NCATS), a component of the National Institutes of Health (NIH), under award number UL1TR003017.

Author information




XL, ZW, and SB designed the study and developed the methodology. ZW, SB, and NG reviewed and revised the paper. XL and HL performed data analysis and interpretation, and statistical analysis. ZW and SB provided technical and material support. All authors read and approved the final paper.

Corresponding author

Correspondence to Zhi Wei.

Ethics declarations

Competing interests

The authors declare no competing interests.

Cite this article

Lin, X., Liu, H., Wei, Z. et al. An active learning approach for clustering single-cell RNA-seq data. Lab Invest (2021).
