Iterative transfer learning with neural network for clustering and cell type classification in single-cell RNA-seq analysis

A preprint version of the article is available at bioRxiv.

Abstract

Clustering and cell type classification are important steps in single-cell RNA-seq (scRNA-seq) analysis. As more scRNA-seq data become available, supervised cell type classification methods that use well-annotated external source data are gaining popularity over unsupervised clustering algorithms. However, the performance of existing supervised methods depends heavily on source data quality, and they often have limited accuracy when classifying cell types that are missing from the source data. To overcome these limitations, we developed ItClust, a transfer learning algorithm that borrows ideas from supervised cell type classification algorithms but also leverages information in the target data to ensure sensitivity in classifying cells that are present only in the target data. Through extensive evaluations using data from different species and tissues generated with diverse scRNA-seq protocols, we show that ItClust considerably improves clustering and cell type classification accuracy over popular unsupervised clustering and supervised cell type classification algorithms.
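To make the transfer idea more concrete, the following is a minimal sketch, not the ItClust implementation. It assumes cells have already been embedded in a low-dimensional space, initialises one centroid per annotated source cell type, and computes the soft cluster assignments and sharpened target distribution used in deep embedded clustering (Xie et al., ref. 30). The function names, the toy two-dimensional embeddings and the use of fixed embeddings (no network fine-tuning) are all illustrative assumptions.

```python
import numpy as np

def centroids_from_labels(z_source, labels):
    """One centroid per annotated source cell type (mean of its embedded cells)."""
    labels = np.asarray(labels)
    types = sorted(set(labels.tolist()))
    centroids = np.stack([z_source[labels == t].mean(axis=0) for t in types])
    return types, centroids

def soft_assign(z, centroids, alpha=1.0):
    """Student's t kernel soft assignment of each cell to each centroid."""
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened auxiliary distribution that emphasises confident assignments."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_divergence(p, q):
    """KL(P || Q), the clustering loss that fine-tuning would minimise."""
    return float(np.sum(p * np.log(p / q)))

# Hypothetical embeddings: four labelled source cells, three unlabelled target cells.
z_source = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
source_labels = ["alpha", "alpha", "beta", "beta"]
z_target = np.array([[0.2, 0.4], [4.8, 5.6], [2.5, 2.0]])

types, centroids = centroids_from_labels(z_source, source_labels)
q = soft_assign(z_target, centroids)
p = target_distribution(q)
predicted = [types[i] for i in q.argmax(axis=1)]
max_prob = q.max(axis=1)  # largest soft-assignment probability per target cell
loss = kl_divergence(p, q)
print(predicted, max_prob.round(3), round(loss, 4))
```

In a full transfer-learning pipeline the embedding network and centroids would be updated on the target data by minimising this loss; here the embeddings are held fixed so that the quantities involved are easy to inspect.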

Fig. 1: Overview of the ItClust framework.
Fig. 2: Comparison of ItClust with unsupervised methods on human pancreatic islet datasets.
Fig. 3: Comparisons of ItClust with semi-supervised and supervised methods on human pancreatic islet datasets when source and target data are from the same species.
Fig. 4: Comparisons of ItClust with semi-supervised and supervised methods on mouse and human kidney datasets when source and target data are from different species.
Fig. 5: Comparisons of ItClust with semi-supervised and supervised methods on human pancreatic islet datasets to evaluate the impact of missing cell types in source data.
Fig. 6: Confidence scores in ItClust.

Data availability

We analysed multiple scRNA-seq datasets. Publicly available data were acquired using the accession numbers provided by the original publications: Baron et al. (ref. 19; GSE84133), Xin et al. (ref. 26; GSE81608), Grün et al. (ref. 17; GSE81076), Muraro et al. (ref. 18; GSE85241), Lawlor et al. (ref. 15; GSE86469), Segerstolpe et al. (ref. 16; E-MTAB-5061), Park et al. (ref. 24; GSE107585), Peng et al. (ref. 27; GSE118480), Paul et al. (ref. 33; GSE72857) and Tusi et al. (ref. 34; GSE89754). Details of the datasets analysed in this paper are described in Supplementary Table 1.
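As a convenience, accession numbers like those above can be turned into repository landing-page URLs. The sketch below covers a few of the listed datasets. The URL templates reflect the GEO and ArrayExpress (BioStudies) patterns as an assumption and may change over time; the helper name `landing_page` is hypothetical.

```python
# A subset of the accessions listed in the Data availability statement.
ACCESSIONS = {
    "Baron et al.": "GSE84133",
    "Xin et al.": "GSE81608",
    "Segerstolpe et al.": "E-MTAB-5061",
    "Park et al.": "GSE107585",
}

# Assumed URL patterns for the two repositories.
GEO_URL = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={acc}"
ARRAYEXPRESS_URL = "https://www.ebi.ac.uk/biostudies/arrayexpress/studies/{acc}"

def landing_page(accession):
    """Return the repository landing page for a GEO series or ArrayExpress accession."""
    if accession.startswith("GSE"):
        return GEO_URL.format(acc=accession)
    if accession.startswith("E-"):
        return ARRAYEXPRESS_URL.format(acc=accession)
    raise ValueError(f"Unrecognised accession format: {accession}")

for study, acc in ACCESSIONS.items():
    print(f"{study}: {landing_page(acc)}")
```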

Code availability

An open-source implementation of the ItClust algorithm can be downloaded from https://github.com/jianhuupenn/ItClust and https://doi.org/10.5281/zenodo.3976061 (ref. 35).

References

  1. Sandberg, R. Entering the era of single-cell transcriptomics in biology and medicine. Nat. Methods 11, 22–24 (2014).

  2. Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. 2008, P10008 (2008).

  3. Wang, B., Zhu, J., Pierson, E., Ramazzotti, D. & Batzoglou, S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat. Methods 14, 414 (2017).

  4. Kiselev, V. Y. et al. SC3: consensus clustering of single-cell RNA-seq data. Nat. Methods 14, 483–486 (2017).

  5. Huang, M. et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat. Methods 15, 539–542 (2018).

  6. Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Single-cell RNA-seq denoising using a deep count autoencoder. Nat. Commun. 10, 1–14 (2019).

  7. Kiselev, V. Y., Yiu, A. & Hemberg, M. scmap: projection of single-cell RNA-seq data across data sets. Nat. Methods 15, 359 (2018).

  8. Xu, C. et al. Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Preprint at https://doi.org/10.1101/532895 (2020).

  9. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).

  10. Wagner, F. & Yanai, I. Moana: a robust and scalable cell type classification framework for single-cell RNA-Seq data. Preprint at https://doi.org/10.1101/456129 (2018).

  11. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, e1821 (2019).

  12. Donahue, J. et al. DeCAF: a deep convolutional activation feature for generic visual recognition. In Proc. 31st International Conference on Machine Learning Vol. 32, 647–655 (JMLR, 2014).

  13. Li, X. et al. Deep learning enables accurate clustering with batch effect removal in single-cell RNA-seq analysis. Nat. Commun. 11, 2338 (2020).

  14. Wang, J. et al. Data denoising with transfer learning in single-cell transcriptomics. Nat. Methods 16, 875–878 (2019).

  15. Lawlor, N. et al. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes. Genome Res. 27, 208–222 (2017).

  16. Segerstolpe, Å. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metabol. 24, 593–607 (2016).

  17. Grün, D. et al. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell 19, 266–277 (2016).

  18. Muraro, M. J. et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst. 3, e383 (2016).

  19. Baron, M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 3, e344 (2016).

  20. Franzén, O., Gan, L.-M. & Björkegren, J. L. PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database 2019, baz046 (2019).

  21. Zhang, X. et al. CellMarker: a manually curated resource of cell markers in human and mouse. Nucl. Acids Res. 47, D721–D728 (2019).

  22. Abdelaal, T. et al. A comparison of automatic cell identification methods for single-cell RNA sequencing data. Genome Biol. 20, 194 (2019).

  23. Grapin-Botton, A. Ductal cells of the pancreas. Int. J. Biochem. Cell Biol. 37, 504–510 (2005).

  24. Park, J. et al. Single-cell transcriptomics of the mouse kidney reveals potential cellular targets of kidney disease. Science 360, 758–763 (2018).

  25. Young, M. D. et al. Single-cell transcriptomes from human kidneys reveal the cellular identity of renal tumors. Science 361, 594–599 (2018).

  26. Xin, Y. et al. RNA sequencing of single human islet cells reveals type 2 diabetes genes. Cell Metabol. 24, 608–615 (2016).

  27. Peng, Y.-R. et al. Molecular classification and comparative taxonomics of foveal and peripheral cells in primate retina. Cell 176, e1222 (2019).

  28. Mohammadi, S., Davila-Velderrain, J. & Kellis, M. Multi-resolution single-cell state characterization via joint archetypal/network analysis. Preprint at https://doi.org/10.1101/746339 (2019).

  29. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y. & Manzagol, P.-A. Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408 (2010).

  30. Xie, J., Girshick, R. & Farhadi, A. Unsupervised deep embedding for clustering analysis. In Proc. 33rd International Conference on Machine Learning Vol. 48, 478–487 (PMLR, 2016).

  31. Maaten, L. V. D. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).

  32. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).

  33. Paul, F. et al. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell 163, 1663–1677 (2015).

  34. Tusi, B. K. et al. Population snapshots predict early haematopoietic and erythroid hierarchies. Nature 555, 54–60 (2018).

  35. Hu, J. jianhuupenn/ItClust: ItClust V1.0.0 (Zenodo, 2020); https://doi.org/10.5281/zenodo.3976061.

Acknowledgements

This work was supported by the following grants: NIH R01GM108600, R01GM125301, R01HL113147, R01HL150359, R01EY030192 and R01EY031209 (to M.L.), and R01DK076077 (to K.S.).

Author information

Contributions

This study was conceived and led by M.L. The model and algorithm were designed by J.H., X.L., G.H. and M.L. J.H. implemented the ItClust software and led the data analysis with input from M.L., X.L., G.H., Y.L. and K.S. J.H. and M.L. wrote the paper with feedback from X.L., G.H., Y.L. and K.S.

Corresponding author

Correspondence to Mingyao Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 scVI’s latent space of Lawlor et al. data.

UMAP plot of scVI’s latent space when Baron human data were used as source data and Lawlor et al. data were used as target data. The plot indicates that scVI failed to remove batch effect between the source and target data, which led to low cell type annotation accuracy in the target data.

Extended Data Fig. 2 Dot plots for human kidney data.

Dot plots of known marker genes used for cell type identification in the human kidney data (data we generated together with data from Young et al.). The marker genes used to label the cell types are: SLC13A3 and SLC34A1 for PT (Proximal Tubule); CLDN16 and SLC12A for Loop of Henle; PTPRB and KDR for Endo_AVR_1 (Endothelial Ascending Vasa Recta); PTPRB and SLC14A1 for Endo_AVR_2; PTPRB, KDR, and SLC14A1 for Endo_DVR (Endothelial Descending Vasa Recta); KCNJ1 and SLC8A1 for Distal Tubules; SLC4A1 and CLCNKB for CD_IC_A; SLC26A4 and CLCNKB for CD_IC_B; GZMA and GZMB for NK_cells; CD3D, CD3E, and CD3G for T_cells; CD14, S100A8, and S100A9 for Macrophage_1; CD14 and FCER1A for Macrophage_2; CD79A and CD79B for B_cells.

Extended Data Fig. 3 Computing cost of ItClust.

Memory usage and CPU time for the kidney data analysis.

Extended Data Fig. 4 Analyzing Tusi et al. data using ItClust.

UMAP (a) and Sankey (b) plots of Tusi et al. data based on ItClust embedding and predicted cell types.

Extended Data Fig. 5 Classification accuracies for combined source data and read depth down sampling experiments.

(a) The classification accuracies of ItClust, Seurat 3.0, Moana, scmap, and scVI for the Segerstolpe human pancreatic islet data, using different source datasets as input. Source data 1 is the reduced Baron human pancreatic islet data as in Fig. 5(b) and source data 2 is the Xin human pancreatic islet data, which only include alpha, beta, gamma, and delta cells. (b) The classification accuracies of ItClust before and after fine-tuning, Seurat 3.0, Moana, scmap, and scVI for the macaque retina data across different down-sampling efficiencies. Cells from macaques 1, 2, and 3 were used as the source data, and cells from macaque 4 were used as the target data.
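One common way to simulate reduced read depth, shown here only as an illustrative assumption about how a down-sampling experiment of this kind could be set up, is binomial thinning: each observed count is kept with probability equal to the chosen efficiency. The function name `downsample_counts` and the toy count matrix are hypothetical, and the caption does not state that this exact procedure was used.

```python
import numpy as np

def downsample_counts(counts, efficiency, seed=0):
    """Binomially thin an integer count matrix; each read is kept with probability `efficiency`."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie in [0, 1]")
    rng = np.random.default_rng(seed)
    return rng.binomial(counts, efficiency)

# Toy cells-by-genes count matrix (hypothetical values).
counts = np.array([[10, 0, 3], [7, 25, 1]])
thinned = downsample_counts(counts, efficiency=0.5)
print(thinned)
```

Because every thinned count is drawn independently, the result never exceeds the original count, and an efficiency of 1 leaves the matrix unchanged.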

Extended Data Fig. 6 Sankey plots for Segerstolpe et al. data analysis.

The Sankey plots of ItClust, Seurat 3.0, Moana, scmap, and scVI cell type classification results for the Segerstolpe et al. dataset using the combined source data.

Supplementary information

Supplementary Tables 1–5 and Notes 1–4.

About this article

Cite this article

Hu, J., Li, X., Hu, G. et al. Iterative transfer learning with neural network for clustering and cell type classification in single-cell RNA-seq analysis. Nat Mach Intell 2, 607–618 (2020). https://doi.org/10.1038/s42256-020-00233-7

  • DOI: https://doi.org/10.1038/s42256-020-00233-7
