Abstract
Unsupervised clustering of single-cell RNA-sequencing data enables the identification of distinct cell populations. However, the most widely used clustering algorithms are heuristic and do not formally account for statistical uncertainty. We find that not addressing known sources of variability in a statistically rigorous manner can lead to overconfidence in the discovery of novel cell types. Here we extend a previous method, significance of hierarchical clustering, to propose a model-based hypothesis testing approach that incorporates significance analysis into the clustering algorithm and permits statistical evaluation of clusters as distinct cell populations. We also adapt this approach to permit statistical assessment on the clusters reported by any algorithm. Finally, we extend these approaches to account for batch structure. We benchmarked our approach against popular clustering workflows, demonstrating improved performance. To show practical utility, we applied our approach to the Human Lung Cell Atlas and an atlas of the mouse cerebellar cortex, identifying several cases of over-clustering and recapitulating experimentally validated cell type definitions.
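As a rough illustration of what it means to test whether a proposed split reflects distinct populations, the following Python sketch runs a Monte Carlo significance test on a two-cluster partition. It is not the sc-SHC implementation. As a stand-in, it uses a single Gaussian null and a 2-means cluster index, in the spirit of the earlier significance-of-clustering work listed in the References (Liu et al., 2008). The function names `two_means_cluster_index` and `split_p_value`, the simulated data and all parameter values are assumptions made for this example.

```python
import numpy as np


def two_means_cluster_index(x, rng, n_init=5, n_iter=50):
    """Smallest within-cluster / total sum of squares over a few 2-means runs."""
    total = ((x - x.mean(axis=0)) ** 2).sum()
    best = 1.0  # a single cluster corresponds to an index of 1
    for _ in range(n_init):
        centers = x[rng.choice(len(x), size=2, replace=False)]
        for _ in range(n_iter):
            dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            if labels.min() == labels.max():  # one cluster emptied; abandon this start
                break
            new_centers = np.array([x[labels == k].mean(axis=0) for k in (0, 1)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, dist.min(axis=1).sum() / total)
    return best


def split_p_value(x, n_sim=99, seed=0):
    """Monte Carlo p-value for 'x splits into two clusters' against one Gaussian population."""
    rng = np.random.default_rng(seed)
    observed = two_means_cluster_index(x, rng)
    # Null model (an assumption of this sketch): one Gaussian fitted to all of the data.
    mean, cov = x.mean(axis=0), np.cov(x, rowvar=False)
    null = np.array([
        two_means_cluster_index(rng.multivariate_normal(mean, cov, size=len(x)), rng)
        for _ in range(n_sim)
    ])
    # A smaller index means tighter clusters, so count null draws at least as extreme.
    return (1 + np.sum(null <= observed)) / (1 + n_sim)


# Simulated example data: 200 "cells" with 5 features each.
rng = np.random.default_rng(1)
one_population = rng.normal(size=(200, 5))
two_populations = one_population.copy()
two_populations[:100, 0] += 6.0  # shift half of the cells along one feature

p_one = split_p_value(one_population)
p_split = split_p_value(two_populations)
print(f"one population: p = {p_one:.2f}; two populations: p = {p_split:.2f}")
```

A small p-value indicates that the split is tighter than would be expected if all cells came from one population. A large p-value suggests that the two clusters could plausibly be merged, which is the kind of over-clustering the abstract refers to.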
Data availability
The datasets used in this work are publicly available and can be found as follows. The 293T cells are available at https://www.10xgenomics.com/resources/datasets/293-t-cells-1-standard-1-1-0. The Human Lung Cell Atlas is available at https://hlca.ds.czbiohub.org/. The mouse cerebellum atlas is available at the Broad Institute Single Cell Portal with study ID SCP795.
Code availability
The software developed in this work is publicly available as an R package at https://github.com/igrabski/sc-SHC (ref. 26).
References
Waltman, L. & van Eck, N. J. A smart local moving algorithm for large-scale modularity-based community detection. Eur. Phys. J. B 86, 1–14 (2013).
Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587 (2021).
Tang, M. et al. Evaluating single-cell cluster stability using the Jaccard similarity index. Bioinformatics 37, 2212–2214 (2021).
Peyvandipour, A., Shafi, A., Saberian, N. & Draghici, S. Identification of cell types from single cell data using stable clustering. Sci. Rep. 10, 1–12 (2020).
Patterson-Cross, R. B., Levine, A. J. & Menon, V. Selecting single cell clustering parameter values using subsampling-based robustness metrics. BMC Bioinform. 22, 1–13 (2021).
Zappia, L. & Oshlack, A. Clustering trees: a visualization for evaluating clusterings at multiple resolutions. Gigascience 7, giy083 (2018).
Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 273–282 (2019).
Zhang, J. M., Kamath, G. M. & Tse, D. N. Valid post-clustering differential analysis for single-cell RNA-seq. Cell Syst. 9, 383–392 (2019).
McShane, L. M. et al. Methods for assessing reproducibility of clustering patterns observed in analyses of microarray data. Bioinformatics 18, 1462–1469 (2002).
Liu, Y., Hayes, D. N., Nobel, A. & Marron, J. S. Statistical significance of clustering for high-dimension, low–sample size data. J. Am. Stat. Assoc. 103, 1281–1293 (2008).
Kimes, P. K., Liu, Y., Hayes, D. N. & Marron, J. S. Statistical significance for hierarchical clustering. Biometrics 73, 811–821 (2017).
Townes, F. W., Hicks, S. C., Aryee, M. J. & Irizarry, R. A. Feature selection and dimension reduction for single-cell RNA-seq based on a multinomial model. Genome Biol. 20, 1–16 (2019).
Grabski, I. N. & Irizarry, R. A. A probabilistic gene expression barcode for annotation of cell types from single-cell RNA-seq data. Biostatistics https://doi.org/10.1093/biostatistics/kxac021 (2022).
Ward Jr, J. H. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58, 236–244 (1963).
Murtagh, F. & Contreras, P. Algorithms for hierarchical clustering: an overview. Wiley Interdisc. Rev. Data Min. Knowl. Discov. 2, 86–97 (2012).
Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 1–12 (2017).
Qiu, X. et al. Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods 14, 979–982 (2017).
Kiselev, V. Y. et al. SC3: consensus clustering of single-cell RNA-seq data. Nat. Methods 14, 483–486 (2017).
Santos, J. M. & Embrechts, M. in International Conference on Artificial Neural Networks (eds. Alippi, C. et al.) 175–184 (Springer, 2009).
Kozareva, V. et al. A transcriptomic atlas of mouse cerebellar cortex comprehensively defines cell types. Nature 598, 214–219 (2021).
Travaglini, K. J. et al. A molecular cell atlas of the human lung from single-cell RNA sequencing. Nature 587, 619–625 (2020).
Welch, J. D. et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887 (2019).
Meinshausen, N. Hierarchical testing of variable importance. Biometrika 95, 265–278 (2008).
Maechler, M. sfsmisc: Utilities from ‘Seminar fuer Statistik’ ETH Zurich. R package version 1.1-14. https://CRAN.R-project.org/package=sfsmisc (2022).
Lun, A. T. L., McCarthy, D. J. & Marioni, J. C. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res. 5, 2122 (2016).
Grabski, I. N. igrabski/sc-shc: v1.0.0. Zenodo https://doi.org/10.5281/zenodo.7834130 (2023).
Acknowledgements
Research supported by the National Science Foundation Graduate Research Fellowship under grant no. DGE1745303 (I.N.G.), and the National Institutes of Health under grant nos. R35GM131802 and R01HG005220 (R.A.I. and K.S.). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or National Institutes of Health.
Author information
Contributions
I.N.G. and R.A.I. conceived the project. I.N.G., K.S. and R.A.I. developed the methods. I.N.G. implemented the methods and generated the figures. I.N.G. and R.A.I. wrote the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.
Supplementary information
Supplementary Information
Supplementary Figs. 1 and 2.
About this article
Cite this article
Grabski, I.N., Street, K. & Irizarry, R.A. Significance analysis for clustering with single-cell RNA-sequencing data. Nat Methods 20, 1196–1202 (2023). https://doi.org/10.1038/s41592-023-01933-9
DOI: https://doi.org/10.1038/s41592-023-01933-9