Significance analysis for clustering with single-cell RNA-sequencing data

Abstract

Unsupervised clustering of single-cell RNA-sequencing data enables the identification of distinct cell populations. However, the most widely used clustering algorithms are heuristic and do not formally account for statistical uncertainty. We find that not addressing known sources of variability in a statistically rigorous manner can lead to overconfidence in the discovery of novel cell types. Here we extend a previous method, significance of hierarchical clustering, to propose a model-based hypothesis testing approach that incorporates significance analysis into the clustering algorithm and permits statistical evaluation of clusters as distinct cell populations. We also adapt this approach to permit statistical assessment on the clusters reported by any algorithm. Finally, we extend these approaches to account for batch structure. We benchmarked our approach against popular clustering workflows, demonstrating improved performance. To show practical utility, we applied our approach to the Human Lung Cell Atlas and an atlas of the mouse cerebellar cortex, identifying several cases of over-clustering and recapitulating experimentally validated cell type definitions.
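The idea of testing whether a proposed split reflects distinct populations can be illustrated with a small parametric-bootstrap sketch. This is not the authors' implementation: it assumes a single multivariate Gaussian as the null model and uses the 2-means cluster index as the test statistic, in the spirit of the earlier significance-of-clustering work (refs. 10 and 11). The function names `two_means_cluster_index` and `split_p_value` are hypothetical and chosen only for this illustration.

```python
import numpy as np


def two_means_cluster_index(x, rng, n_init=5, n_iter=50):
    """2-means cluster index: within-cluster SS / total SS (lower = better separated)."""
    total_ss = np.sum((x - x.mean(axis=0)) ** 2)
    best = np.inf
    for _ in range(n_init):
        # Start Lloyd's algorithm from two distinct, randomly chosen observations.
        centers = x[rng.choice(len(x), size=2, replace=False)]
        for _ in range(n_iter):
            dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            if labels.min() == labels.max():  # one cluster emptied; stop this start
                break
            new_centers = np.array([x[labels == k].mean(axis=0) for k in range(2)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, dist.min(axis=1).sum())
    return best / total_ss


def split_p_value(x, n_null=200, seed=0):
    """Bootstrap p-value for H0: the rows of x come from one Gaussian population."""
    rng = np.random.default_rng(seed)
    observed = two_means_cluster_index(x, rng)
    # Assumed null model: a single Gaussian fitted to the observed data.
    mean = x.mean(axis=0)
    cov = np.atleast_2d(np.cov(x, rowvar=False))
    null_stats = np.array([
        two_means_cluster_index(rng.multivariate_normal(mean, cov, size=len(x)), rng)
        for _ in range(n_null)
    ])
    # A small p-value means the observed split is tighter than the null typically produces.
    return (1 + np.sum(null_stats <= observed)) / (1 + n_null)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    one_population = rng.normal(size=(200, 5))
    shift = np.array([6.0, 0.0, 0.0, 0.0, 0.0])
    two_populations = np.vstack([
        rng.normal(size=(100, 5)),
        rng.normal(size=(100, 5)) + shift,
    ])
    print("one population: p =", split_p_value(one_population))
    print("two populations: p =", split_p_value(two_populations))
```

Any clustering algorithm will split even a single homogeneous population into two groups, so the observed statistic is compared with what the same procedure produces on data simulated under the one-population null. The Gaussian null and the k-means statistic here are stand-ins chosen for brevity; a count-based model would be more appropriate for real single-cell data.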

Fig. 1: Clustering results for applying Seurat’s implementation of the Louvain algorithm to simulated data representing one cell population.
Fig. 2: Schematic illustrating our approach to significance analysis for clustering.
Fig. 3: Additional benchmarks of our approach.
Fig. 4: Applying significance analysis to clusters reported by the Human Lung Cell Atlas.

Data availability

The datasets used in this work are publicly available and can be found as follows. The 293T cells are available at https://www.10xgenomics.com/resources/datasets/293-t-cells-1-standard-1-1-0. The Human Lung Cell Atlas is available at https://hlca.ds.czbiohub.org/. The mouse cerebellum atlas is available at the Broad Institute Single Cell Portal with study ID SCP795.

Code availability

The software developed in this work is publicly available as an R package at https://github.com/igrabski/sc-SHC (ref. 26).

References

  1. Waltman, L. & Van Eck, N. J. A smart local moving algorithm for large-scale modularity-based community detection. Eur. Phys. J. B 86, 1–14 (2013).

  2. Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587 (2021).

  3. Tang, M. et al. Evaluating single-cell cluster stability using the Jaccard similarity index. Bioinformatics 37, 2212–2214 (2021).

  4. Peyvandipour, A., Shafi, A., Saberian, N. & Draghici, S. Identification of cell types from single cell data using stable clustering. Sci. Rep. 10, 1–12 (2020).

  5. Patterson-Cross, R. B., Levine, A. J. & Menon, V. Selecting single cell clustering parameter values using subsampling-based robustness metrics. BMC Bioinform. 22, 1–13 (2021).

  6. Zappia, L. & Oshlack, A. Clustering trees: a visualization for evaluating clusterings at multiple resolutions. Gigascience 7, giy083 (2018).

  7. Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 273–282 (2019).

  8. Zhang, J. M., Kamath, G. M. & Tse, D. N. Valid post-clustering differential analysis for single-cell RNA-seq. Cell Syst. 9, 383–392 (2019).

  9. McShane, L. M. et al. Methods for assessing reproducibility of clustering patterns observed in analyses of microarray data. Bioinformatics 18, 1462–1469 (2002).

  10. Liu, Y., Hayes, D. N., Nobel, A. & Marron, J. S. Statistical significance of clustering for high-dimension, low–sample size data. J. Am. Stat. Assoc. 103, 1281–1293 (2008).

  11. Kimes, P. K., Liu, Y., Hayes, D. N. & Marron, J. S. Statistical significance for hierarchical clustering. Biometrics 73, 811–821 (2017).

  12. Townes, F. W., Hicks, S. C., Aryee, M. J. & Irizarry, R. A. Feature selection and dimension reduction for single-cell RNA-seq based on a multinomial model. Genome Biol. 20, 1–16 (2019).

  13. Grabski, I. N. & Irizarry, R. A. A probabilistic gene expression barcode for annotation of cell types from single-cell RNA-seq data. Biostatistics https://doi.org/10.1093/biostatistics/kxac021 (2022).

  14. Ward Jr, J. H. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58, 236–244 (1963).

  15. Murtagh, F. & Contreras, P. Algorithms for hierarchical clustering: an overview. Wiley Interdisc. Rev. Data Min. Knowl. Discov. 2, 86–97 (2012).

  16. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 1–12 (2017).

  17. Qiu, X. et al. Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods 14, 979–982 (2017).

  18. Kiselev, V. Y. et al. SC3: consensus clustering of single-cell RNA-seq data. Nat. Methods 14, 483–486 (2017).

  19. Santos, J. M. & Embrechts, M. in International Conference on Artificial Neural Networks (eds. Alippi, C. et al.) 175–184 (Springer, 2009).

  20. Kozareva, V. et al. A transcriptomic atlas of mouse cerebellar cortex comprehensively defines cell types. Nature 598, 214–219 (2021).

  21. Travaglini, K. J. et al. A molecular cell atlas of the human lung from single-cell RNA sequencing. Nature 587, 619–625 (2020).

  22. Welch, J. D. et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887 (2019).

  23. Meinshausen, N. Hierarchical testing of variable importance. Biometrika 95, 265–278 (2008).

  24. Maechler, M. sfsmisc: Utilities from ‘Seminar fuer Statistik’ ETH Zurich. R package version 1.1-14. https://CRAN.R-project.org/package=sfsmisc (2022).

  25. Lun, A. T. L., McCarthy, D. J. & Marioni, J. C. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res. 5, 2122 (2016).

  26. Grabski, I. N. igrabski/sc-shc: v1.0.0. Zenodo https://doi.org/10.5281/zenodo.7834130 (2023).

Acknowledgements

Research supported by the National Science Foundation Graduate Research Fellowship under grant no. DGE1745303 (I.N.G.), and the National Institutes of Health under grant nos. R35GM131802 and R01HG005220 (R.A.I. and K.S.). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or National Institutes of Health.

Author information

Contributions

I.N.G. and R.A.I. conceived the project. I.N.G., K.S. and R.A.I. developed the methods. I.N.G. implemented the methods and generated the figures. I.N.G. and R.A.I. wrote the paper.

Corresponding authors

Correspondence to Isabella N. Grabski, Kelly Street or Rafael A. Irizarry.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1 and 2.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Grabski, I.N., Street, K. & Irizarry, R.A. Significance analysis for clustering with single-cell RNA-sequencing data. Nat Methods 20, 1196–1202 (2023). https://doi.org/10.1038/s41592-023-01933-9

  • DOI: https://doi.org/10.1038/s41592-023-01933-9
