Significance analysis for clustering with single-cell RNA-sequencing data

Grabski, Isabella N.; Street, Kelly; Irizarry, Rafael A.

doi:10.1038/s41592-023-01933-9

Article
Published: 10 July 2023

Nature Methods volume 20, pages 1196–1202 (2023)

Abstract

Unsupervised clustering of single-cell RNA-sequencing data enables the identification of distinct cell populations. However, the most widely used clustering algorithms are heuristic and do not formally account for statistical uncertainty. We find that not addressing known sources of variability in a statistically rigorous manner can lead to overconfidence in the discovery of novel cell types. Here we extend a previous method, significance of hierarchical clustering, to propose a model-based hypothesis testing approach that incorporates significance analysis into the clustering algorithm and permits statistical evaluation of clusters as distinct cell populations. We also adapt this approach to permit statistical assessment on the clusters reported by any algorithm. Finally, we extend these approaches to account for batch structure. We benchmarked our approach against popular clustering workflows, demonstrating improved performance. To show practical utility, we applied our approach to the Human Lung Cell Atlas and an atlas of the mouse cerebellar cortex, identifying several cases of over-clustering and recapitulating experimentally validated cell type definitions.
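The idea of testing whether a proposed split reflects distinct populations can be illustrated with a small Monte Carlo sketch. This is not the authors' sc-SHC implementation: it assumes continuous data, swaps in a single Gaussian as the null model and a basic 2-means split, and uses a within-to-total sum-of-squares ratio as the test statistic. All function names (`two_means_labels`, `cluster_index`, `split_p_value`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def two_means_labels(x, n_iter=25):
    """Split the rows of x into two groups with a basic Lloyd's 2-means."""
    # Deterministic start: the two extreme points along the leading principal axis.
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[0]
    centers = x[[scores.argmin(), scores.argmax()]].astype(float)
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        # Assign every point to its nearest center, then update the centers.
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = x[labels == k].mean(axis=0)
    return labels


def cluster_index(x, labels):
    """Within-cluster over total sum of squares (smaller means a stronger split)."""
    total = ((x - x.mean(axis=0)) ** 2).sum()
    within = sum(
        ((x[labels == k] - x[labels == k].mean(axis=0)) ** 2).sum()
        for k in np.unique(labels)
    )
    return within / total


def split_p_value(x, n_sim=200, seed=0):
    """Monte Carlo p-value for H0: the rows of x come from one Gaussian population."""
    rng = np.random.default_rng(seed)
    observed = cluster_index(x, two_means_labels(x))
    # Assumed null model: a single Gaussian fitted to all of the data.
    mean = x.mean(axis=0)
    cov = np.atleast_2d(np.cov(x, rowvar=False))
    null_stats = np.empty(n_sim)
    for s in range(n_sim):
        sim = rng.multivariate_normal(mean, cov, size=len(x))
        # Re-cluster each simulated data set so the null reflects the clustering step.
        null_stats[s] = cluster_index(sim, two_means_labels(sim))
    # Add-one correction keeps the estimated p-value strictly positive.
    return (1 + np.sum(null_stats <= observed)) / (1 + n_sim)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    one_population = rng.normal(size=(100, 2))
    two_populations = np.vstack(
        [rng.normal(size=(50, 2)), rng.normal(size=(50, 2)) + [10.0, 0.0]]
    )
    print("one population:", split_p_value(one_population))
    print("two populations:", split_p_value(two_populations))
```

The key design choice in this sketch is that the clustering step is repeated on every simulated data set. A clustering algorithm will always find some split, even in homogeneous data, so the observed split has to be compared with the splits that the same procedure produces under the null rather than with a fixed threshold.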

**Fig. 1: Clustering results for applying Seurat’s implementation of the Louvain algorithm to simulated data representing one cell population.**

**Fig. 2: Schematic illustrating our approach to significance analysis for clustering.**

**Fig. 3: Additional benchmarks of our approach.**

**Fig. 4: Applying significance analysis to clusters reported by the Human Lung Cell Atlas.**

Data availability

The datasets used in this work are publicly available and can be found as follows. The 293T cells are available at https://www.10xgenomics.com/resources/datasets/293-t-cells-1-standard-1-1-0. The Human Lung Cell Atlas is available at https://hlca.ds.czbiohub.org/. The mouse cerebellum atlas is available at the Broad Institute Single Cell Portal with study ID SCP795.

Code availability

The software developed in this work is publicly available as an R package at https://github.com/igrabski/sc-SHC (ref. 26).

References

1. Waltman, L. & Van Eck, N. J. A smart local moving algorithm for large-scale modularity-based community detection. Eur. Phys. J. B 86, 1–14 (2013).
2. Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587 (2021).
3. Tang, M. et al. Evaluating single-cell cluster stability using the Jaccard similarity index. Bioinformatics 37, 2212–2214 (2021).
4. Peyvandipour, A., Shafi, A., Saberian, N. & Draghici, S. Identification of cell types from single cell data using stable clustering. Sci. Rep. 10, 1–12 (2020).
5. Patterson-Cross, R. B., Levine, A. J. & Menon, V. Selecting single cell clustering parameter values using subsampling-based robustness metrics. BMC Bioinform. 22, 1–13 (2021).
6. Zappia, L. & Oshlack, A. Clustering trees: a visualization for evaluating clusterings at multiple resolutions. Gigascience 7, giy083 (2018).
7. Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 273–282 (2019).
8. Zhang, J. M., Kamath, G. M. & Tse, D. N. Valid post-clustering differential analysis for single-cell RNA-seq. Cell Syst. 9, 383–392 (2019).
9. McShane, L. M. et al. Methods for assessing reproducibility of clustering patterns observed in analyses of microarray data. Bioinformatics 18, 1462–1469 (2002).
10. Liu, Y., Hayes, D. N., Nobel, A. & Marron, J. S. Statistical significance of clustering for high-dimension, low–sample size data. J. Am. Stat. Assoc. 103, 1281–1293 (2008).
11. Kimes, P. K., Liu, Y., Hayes, D. N. & Marron, J. S. Statistical significance for hierarchical clustering. Biometrics 73, 811–821 (2017).
12. Townes, F. W., Hicks, S. C., Aryee, M. J. & Irizarry, R. A. Feature selection and dimension reduction for single-cell RNA-seq based on a multinomial model. Genome Biol. 20, 1–16 (2019).
13. Grabski, I. N. & Irizarry, R. A. A probabilistic gene expression barcode for annotation of cell types from single-cell RNA-seq data. Biostatistics https://doi.org/10.1093/biostatistics/kxac021 (2022).
14. Ward Jr, J. H. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58, 236–244 (1963).
15. Murtagh, F. & Contreras, P. Algorithms for hierarchical clustering: an overview. Wiley Interdisc. Rev. Data Min. Knowl. Discov. 2, 86–97 (2012).
16. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 1–12 (2017).
17. Qiu, X. et al. Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods 14, 979–982 (2017).
18. Kiselev, V. Y. et al. SC3: consensus clustering of single-cell RNA-seq data. Nat. Methods 14, 483–486 (2017).
19. Santos, J. M. & Embrechts, M. in International Conference on Artificial Neural Networks (eds. Alippi, C. et al.) 175–184 (Springer, 2009).
20. Kozareva, V. et al. A transcriptomic atlas of mouse cerebellar cortex comprehensively defines cell types. Nature 598, 214–219 (2021).
21. Travaglini, K. J. et al. A molecular cell atlas of the human lung from single-cell RNA sequencing. Nature 587, 619–625 (2020).
22. Welch, J. D. et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887 (2019).
23. Meinshausen, N. Hierarchical testing of variable importance. Biometrika 95, 265–278 (2008).
24. Maechler, M. sfsmisc: Utilities from ‘Seminar fuer Statistik’ ETH Zurich. R package version 1.1-14. https://CRAN.R-project.org/package=sfsmisc (2022).
25. Lun, A. T. L., McCarthy, D. J. & Marioni, J. C. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res. 5, 2122 (2016).
26. Grabski, I. N. igrabski/sc-shc: v1.0.0. Zenodo https://doi.org/10.5281/zenodo.7834130 (2023).

Acknowledgements

Research supported by the National Science Foundation Graduate Research Fellowship under grant no. DGE1745303 (I.N.G.), and the National Institutes of Health under grant nos. R35GM131802 and R01HG005220 (R.A.I. and K.S.). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or National Institutes of Health.

Author information

Authors and Affiliations

Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
Isabella N. Grabski
Division of Biostatistics, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
Kelly Street
Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA
Rafael A. Irizarry

Contributions

I.N.G. and R.A.I. conceived the project. I.N.G., K.S. and R.A.I. developed the methods. I.N.G. implemented the methods and generated the figures. I.N.G. and R.A.I. wrote the paper.

Corresponding authors

Correspondence to Isabella N. Grabski, Kelly Street or Rafael A. Irizarry.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1 and 2.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Grabski, I.N., Street, K. & Irizarry, R.A. Significance analysis for clustering with single-cell RNA-sequencing data. Nat Methods 20, 1196–1202 (2023). https://doi.org/10.1038/s41592-023-01933-9

Received: 01 August 2022
Accepted: 01 June 2023
Published: 10 July 2023
Issue Date: August 2023
DOI: https://doi.org/10.1038/s41592-023-01933-9
