
  • Brief Communication

An analytical framework for interpretable and generalizable single-cell data analysis

An Author Correction to this article was published on 14 February 2022

This article has been updated

Abstract

The scaling of single-cell data exploratory analysis with the rapidly growing diversity and quantity of single-cell omics datasets demands more interpretable and robust data representation that is generalizable across datasets. Here, we have developed a ‘linearly interpretable’ framework that combines the interpretability and transferability of linear methods with the representational power of non-linear methods. Within this framework we introduce a data representation and visualization method, GraphDR, and a structure discovery method, StructDR, that unifies cluster, trajectory and surface estimation and enables their confidence set inference.


Fig. 1: A linearly interpretable data representation method that captures the structure of single-cell data while preserving interpretability and transferability.
Fig. 2: Density-based generalized trajectory estimation and inference.


Data availability

The 339-dataset benchmark published by Saelens et al. (ref. 10) was downloaded from https://zenodo.org/record/1443566. The unnormalized performance scores were extracted from https://github.com/dynverse/dynbenchmark_results/blob/1ac55e6c54a950890208b1f7730092d39783dfd2/06-benchmark/benchmark_results_unnormalised.rds. The normalized scores were computed as in ref. 10, with the scaling factors kept at the same values as for the originally benchmarked methods. Other single-cell datasets analyzed in this paper were from refs. 6,9,13,14,18,23,24,25. The Scanpy package (ref. 8) was used for preprocessing steps when needed, as described previously (ref. 26). We created a Zenodo record, https://zenodo.org/record/3710980 (ref. 27), that contains all the input data used in this paper.

Code availability

All methods described in this paper are implemented in an open-source Python package, quasildr (https://github.com/jzthree/quasildr). A Code Ocean capsule of the package is provided (https://doi.org/10.24433/CO.9410876.v1; ref. 28).

Change history

References

  1. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).


  2. McInnes, L., Healy, J., Saul, N. & Großberger, L. UMAP: uniform manifold approximation and projection. J. Open Source Softw. 3(29), 861 (2018).


  3. Haghverdi, L., Buettner, F. & Theis, F. J. Diffusion maps for high-dimensional single-cell analysis of differentiation data. Bioinformatics 31, 2989–2998 (2015).


  4. Moon, K. R. et al. Visualizing structure and transitions in high-dimensional biological data. Nat. Biotechnol. 37, 1482–1492 (2019).


  5. Weinreb, C., Wolock, S. & Klein, A. M. SPRING: a kinetic interface for visualizing high dimensional single-cell expression data. Bioinformatics 34, 1246–1248 (2018).


  6. Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381–386 (2014).


  7. Bendall, S. C. et al. Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development. Cell 157, 714–725 (2014).


  8. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: Large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).


  9. Farrell, J. A. et al. Single-cell reconstruction of developmental trajectories during zebrafish embryogenesis. Science 360(6392), eaar3131 (2018).


  10. Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 37, 547–554 (2019).


  11. Hochgerner, H., Zeisel, A., Lönnerberg, P. & Linnarsson, S. Conserved properties of dentate gyrus neurogenesis across postnatal development revealed by single-cell RNA sequencing. Nat. Neurosci. 21, 290–299 (2018).


  12. Marques, S. et al. Oligodendrocyte heterogeneity in the mouse juvenile and adult central nervous system. Science 352, 1326–1329 (2016).


  13. Fincher, C. T., Wurtzel, O., de Hoog, T., Kravarik, K. M. & Reddien, P. W. Cell type transcriptome atlas for the planarian Schmidtea mediterranea. Science 360(6391), eaaq1736 (2018).


  14. Plass, M. et al. Cell type atlas and lineage tree of a whole complex animal by single-cell transcriptomics. Science 360(6391), eaaq1723 (2018).


  15. Genovese, C. R., Perone-Pacifico, M., Verdinelli, I. & Wasserman, L. Nonparametric ridge estimation. Ann. Stat. 42, 1511–1545 (2014).


  16. Ozertem, U. & Erdogmus, D. Locally defined principal curves and surfaces. J. Mach. Learn. Res. 12, 1249–1286 (2011).


  17. Chen, Y. C., Genovese, C. R. & Wasserman, L. Asymptotic theory for density ridges. Ann. Stat. 43, 1896–1928 (2015).


  18. Zeisel, A. et al. Molecular architecture of the mouse nervous system. Cell 174, 999–1014 (2018).


  19. Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).


  20. Malkov, Yu. A. & Yashunin, D. A. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans. Pattern Anal. Mach. Intell. 42, 824–836 (2020).


  21. Dong, W., Charikar, M. & Li, K. Efficient K-nearest neighbor graph construction for generic similarity measures. In WWW 2011: Proceedings of the 20th International Conference on World Wide Web 577–586 (https://doi.org/10.1145/1963405.1963487, 2011).

  22. Saragih, J. M., Lucey, S. & Cohn, J. F. Face alignment through subspace constrained mean-shifts. In 2009 IEEE 12th International Conference on Computer Vision 1034–1041 (https://doi.org/10.1109/ICCV.2009.5459377, 2009).

  23. Briggs, J. A. et al. The dynamics of gene expression in vertebrate embryogenesis at single-cell resolution. Science 360(6392), eaar5780 (2018).


  24. Nestorowa, S. et al. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood 128(8), e20–31 (2016).


  25. Paul, F. et al. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell 163, 1663–1677 (2015).


  26. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).


  27. Zhou, J. & Troyanskaya, O. An analytical framework for interpretable and generalizable single-cell data analysis (Dataset). Zenodo https://doi.org/10.5281/zenodo.3710980 (2020).

  28. Zhou, J. & Troyanskaya, O. An analytical framework for interpretable and generalizable single-cell data analysis. Code Ocean https://doi.org/10.24433/CO.9410876.v1 (2021).


Acknowledgements

The authors acknowledge all members of the Troyanskaya laboratory and Zhou laboratory for helpful discussions. This work was performed using the high-performance computing resources (supported by the Scientific Computing Core) at the Flatiron Institute and the BioHPC at UT Southwestern Medical Center. J.Z. is supported by the Cancer Prevention and Research Institute of Texas grant (RR190071) and the UT Southwestern Endowed Scholars program. O.G.T. is supported by National Institutes of Health grant nos. R01HG005998, U54HL117798 and R01GM071966, US Department of Health and Human Services grant no. HHSN272201000054C and Simons Foundation grant no. 395506. O.G.T. is a senior fellow of the Genetic Networks program of the Canadian Institute for Advanced Research.

Author information


Contributions

J.Z. conceived the framework, developed the computational methods, and performed the analyses. J.Z. and O.G.T. wrote the paper.

Corresponding authors

Correspondence to Jian Zhou or Olga G. Troyanskaya.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Methods thanks Yvan Saeys and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Lin Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Visualization of first two principal components in PCA, GraphDR, and tSNE visualizations.

We compared the PCA, GraphDR, and tSNE representations by the values of the first two principal components (PCs, shown by color) on a developing mouse hippocampus dataset (a,b) (Hochgerner et al. 2018; ref. 11) and a mature mouse brain dataset (c,d) (Zeisel et al. 2018; ref. 18). The top-weighted genes by absolute value for the first two PCs are also shown (b,d).

Extended Data Fig. 2 Dataset alignment with GraphDR further improves dataset comparison.

Comparison of applying GraphDR without (a, c) and with (b, d) graph-based dataset alignment on two hematopoietic datasets (Nestorowa et al. 2016, ref. 24, and Paul et al. 2015, ref. 25). The GraphDR visualizations are colored by cell types (a, b) and by datasets (c, d). The cell types are common myeloid progenitors (CMPs), granulocyte-monocyte progenitors (GMPs), lymphoid multipotent progenitors (LMPPs), long-term hematopoietic stem cells (LTHSC), megakaryocyte-erythrocyte progenitors (MEPs) and multipotent progenitors (MPPs). Specifically, GraphDR with graph-based dataset alignment constructs a joint graph that also connects the nearest neighbors between datasets (see batch design in Extended Data Fig. 3).

Extended Data Fig. 3 Experimental design encoding through graph construction.

Experimental design information can be encoded through graph construction in GraphDR. Each arrow indicates that nearest-neighbor connections are established between two groups, so that the two connected cells belong to different groups. A self-loop indicates nearest-neighbor connections between cells within a group. The basic design constructs a nearest-neighbor graph using all cells, which is suitable for single-batch experiments or experiments with minimal batch effects. The batch design addresses batch effects by introducing nearest-neighbor connections between all pairs of batches, in addition to within-batch nearest-neighbor connections. The time-series design extends the basic design by allowing connections only within the same time point and between adjacent time points. The batch + time-series design introduces nearest-neighbor connections between two batches in the same or adjacent time points.
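The design-dependent graph construction described in this caption can be illustrated with a minimal sketch. It is not the quasildr implementation: the function names `design_knn_edges` and `time_series_pairs`, the brute-force distance computation and the choice of k are all assumptions made for illustration. Each allowed (source group, target group) pair plays the role of one arrow or self-loop in the figure.

```python
import numpy as np

def time_series_pairs(timepoints):
    """Hypothetical helper: allow links within a time point and between adjacent ones."""
    tps = sorted(set(timepoints))
    pairs = [(t, t) for t in tps]                      # self-loops
    for a, b in zip(tps[:-1], tps[1:]):                # adjacent time points, both directions
        pairs += [(a, b), (b, a)]
    return pairs

def design_knn_edges(X, groups, allowed_pairs, k=3):
    """Union of k-nearest-neighbor edges, restricted to the allowed group pairs."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    edges = set()
    for src, dst in allowed_pairs:
        idx_src = np.flatnonzero(groups == src)
        idx_dst = np.flatnonzero(groups == dst)
        if len(idx_src) == 0 or len(idx_dst) == 0:
            continue
        # Brute-force squared Euclidean distances from source cells to target cells.
        d = ((X[idx_src][:, None, :] - X[idx_dst][None, :, :]) ** 2).sum(axis=-1)
        if src == dst:
            np.fill_diagonal(d, np.inf)                # a cell is not its own neighbor
        kk = min(k, len(idx_dst) - (1 if src == dst else 0))
        if kk <= 0:
            continue
        nearest = np.argsort(d, axis=1)[:, :kk]
        for row, i in enumerate(idx_src):
            for j in idx_dst[nearest[row]]:
                edges.add((int(min(i, j)), int(max(i, j))))  # store as undirected edge
    return sorted(edges)

# Toy example: 30 cells in 2D, three time points with 10 cells each.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 2))
time_toy = np.repeat([0, 1, 2], 10)
edges_toy = design_knn_edges(X_toy, time_toy, time_series_pairs(time_toy), k=2)
```

Under these assumptions, a batch design would pass every ordered pair of batches plus the self-loops as `allowed_pairs`, and the basic design would assign all cells to a single group.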

Extended Data Fig. 4 Visualization of zebrafish whole embryo single-cell developmental landscape with GraphDR.

Application of GraphDR to a single-cell dataset (Farrell et al. 2018; ref. 9) with a time-series design. a. Single-cell visualization by GraphDR, colored by developmental stages. b. Comparative visualization of developmental stages. This shows the ‘cross-section’ view by visualizing the second and third dimensions. c,d. Single-cell visualization by GraphDR, colored by cell origins.

Extended Data Fig. 5 Visualization of Xenopus tropicalis whole embryo single-cell developmental landscape with GraphDR.

This is an example of applying GraphDR to a single-cell dataset with a batch + time-series design (Briggs et al. 2018; ref. 23). a. Single-cell visualization by GraphDR, colored by developmental stages. b. Comparative visualization of developmental stages. This shows the ‘cross-section’ view by visualizing the second and third dimensions. c,d. Single-cell visualization by GraphDR, colored by cell origins.

Extended Data Fig. 6 Schematic overview of StructDR density ridge estimation procedures with the SCMS algorithm.

(a,b) StructDR starts by performing kernel density estimation with a Gaussian kernel on the input cells. (c) Based on the estimated density function and a selected density ridge dimensionality d (d = 1 in this example), the SCMS update can be derived for any position in the space from the gradient and Hessian of the log density function. For any data point or position of interest, iteratively applying the SCMS update projects that point or position onto the density ridges of the chosen dimensionality. (d) Optional step: construct a graph connecting points on the density ridges with one of two optional methods (Methods). The backbone of the graph can be specified based on a betweenness centrality threshold.
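Steps (a) to (c) can be sketched in a few lines of NumPy. This is an illustrative implementation of a subspace-constrained mean shift (SCMS) update for a Gaussian kernel density estimate, written from the description in this caption rather than taken from quasildr; the function names, the bandwidth `h`, the iteration settings and the toy data are assumptions.

```python
import numpy as np

def scms_step(x, data, h, ridge_dim=1):
    """One SCMS update of position x for a Gaussian KDE with bandwidth h."""
    n_dims = data.shape[1]
    diff = data - x                                        # x_i - x for every cell
    c = np.exp(-0.5 * np.sum(diff ** 2, axis=1) / h ** 2)  # Gaussian kernel values
    w = c / c.sum()                                        # normalized kernel weights
    u = diff / h ** 2
    grad_log = w @ u                                       # gradient of the log density
    # Hessian of the log density: (Hessian of p) / p - grad_log grad_log^T
    hess_log = (u * w[:, None]).T @ u - np.eye(n_dims) / h ** 2
    hess_log -= np.outer(grad_log, grad_log)
    _, evecs = np.linalg.eigh(hess_log)                    # eigenvalues in ascending order
    V = evecs[:, : n_dims - ridge_dim]                     # directions normal to the ridge
    mean_shift = h ** 2 * grad_log                         # equals weighted mean minus x
    return x + V @ (V.T @ mean_shift)                      # move only along normal directions

def scms_project(points, data, h, ridge_dim=1, n_iter=100, tol=1e-6):
    """Iterate the SCMS update for each point until it stops moving."""
    out = np.array(points, dtype=float)
    for i in range(len(out)):
        x = out[i]
        for _ in range(n_iter):
            x_new = scms_step(x, data, h, ridge_dim)
            converged = np.linalg.norm(x_new - x) < tol
            x = x_new
            if converged:
                break
        out[i] = x
    return out

# Toy data (assumed): cells scattered around a unit circle with radial noise.
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, 300)
radii = 1.0 + rng.normal(scale=0.1, size=300)
cells = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
ridge_points = scms_project(cells[:50], cells, h=0.3, ridge_dim=1)
```

With `ridge_dim=0` the projection matrix spans the whole space, so the update reduces to ordinary mean shift and points move to local density maxima. The optional graph construction in step (d) is not covered by this sketch.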

Extended Data Fig. 7 Overview of the unified framework of cluster, trajectory, and surface analysis with StructDR.

(a) StructDR uses the SCMS update for the estimation of clusters, trajectories, and surfaces, which can all be derived from the gradient and Hessian of the log density function. (b) Examples of projection paths by SCMS updates for zero-, one-, and two-dimensional density ridges. (c) Comparison of SCMS algorithms for 0-, 1-, 2-, or k-dimensional density ridges. The SCMS update can identify any k-dimensional density ridge by projecting a gradient-based update onto the subspace spanned by the (k + 1)th to last eigenvectors of the Hessian of the log density function.
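For reference, one common way to write the k-dimensional SCMS update described in panel (c) is given below. The notation is an assumption for illustration and may differ from the paper's Supplementary text: \hat{p} denotes a Gaussian kernel density estimate with bandwidth h over cells x_1, ..., x_n in D dimensions, K_h is the Gaussian kernel, and v_1(x), ..., v_D(x) are the eigenvectors of H(x) ordered by descending eigenvalue.

```latex
% Sketch under assumed notation; not copied from the paper.
\begin{align}
g(x) &= \nabla \log \hat{p}(x), \qquad
H(x) = \nabla^2 \log \hat{p}(x)
     = \frac{\nabla^2 \hat{p}(x)}{\hat{p}(x)} - g(x)\, g(x)^{\top} \\
m(x) &= \frac{\sum_{i} K_h(x - x_i)\, x_i}{\sum_{i} K_h(x - x_i)} - x
      = h^2\, g(x) \\
V_k(x) &= \big[\, v_{k+1}(x), \dots, v_D(x) \,\big] \\
x &\leftarrow x + V_k(x)\, V_k(x)^{\top}\, m(x)
\end{align}
```

Setting k = 0 keeps every eigenvector, so the update is plain mean shift and converges to density modes (clusters); k = 1 and k = 2 constrain the movement to directions normal to a curve (trajectory) or a surface, respectively.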

Extended Data Fig. 8 Performance score distributions on the 339-dataset benchmark shown by dataset type.

Per-dataset performance scores are computed as in Saelens et al. 2019 (ref. 10). The performance score distributions are shown with violin plots, separated into panels by dataset type. The performance of StructDR + GraphDR with two graph construction algorithms, MST and SimpleNNG, is shown along with the performance of the other algorithms benchmarked in Saelens et al. 2019 (ref. 10).

Extended Data Fig. 9 Example of trajectory identification with zero-, one-, and two-dimensional density ridges on a developmental hippocampus single-cell dataset.

The circle symbols indicate zero-dimensional density ridge positions (local maxima of density function). The red dots indicate one-dimensional density ridge positions (trajectory). The black dots indicate two-dimensional density ridge positions.

Extended Data Fig. 10 Simulation studies of confidence sets construction with nonparametric ridge estimation.

One hundred simulated datasets were generated. For each dataset, the confidence sets for each estimated trajectory were estimated with 20 bootstraps. The x-axis shows the expected coverage probabilities of the constructed confidence sets. The y-axis shows the observed proportion of cases in which the true trajectory position is covered by the confidence set.
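The logic of this coverage check can be illustrated with a deliberately simplified simulation. In the sketch below the sample mean of a 2D Gaussian stands in for a ridge position, so it is not the nonparametric ridge procedure used in the paper; the function names, sample size and noise model are assumptions. Only the counts of 100 datasets and 20 bootstraps follow the caption.

```python
import numpy as np

def bootstrap_radius(sample, estimator, alpha, n_boot, rng):
    """Estimate plus the (1 - alpha) quantile of bootstrap deviations from it."""
    estimate = estimator(sample)
    n = len(sample)
    deviations = np.empty(n_boot)
    for b in range(n_boot):
        resample = sample[rng.integers(0, n, size=n)]   # resample cells with replacement
        deviations[b] = np.linalg.norm(estimator(resample) - estimate)
    return estimate, np.quantile(deviations, 1.0 - alpha)

def observed_coverage(alpha, n_datasets=100, n_boot=20, n_cells=200, seed=0):
    """Fraction of simulated datasets whose confidence ball contains the truth."""
    rng = np.random.default_rng(seed)
    truth = np.zeros(2)                                  # known true position (assumed)
    hits = 0
    for _ in range(n_datasets):
        sample = rng.normal(loc=truth, scale=1.0, size=(n_cells, 2))
        estimate, radius = bootstrap_radius(
            sample, lambda s: s.mean(axis=0), alpha, n_boot, rng
        )
        hits += int(np.linalg.norm(estimate - truth) <= radius)
    return hits / n_datasets

# Expected coverage is 1 - alpha; compare it with the observed proportion.
coverage_90 = observed_coverage(alpha=0.10)
coverage_50 = observed_coverage(alpha=0.50)
```

Plotting observed coverage against 1 - alpha for several values of alpha gives the kind of calibration curve described in the caption; with only 20 bootstraps the quantile is coarse, so some deviation from the diagonal is expected.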

Supplementary information

Supplementary Information

Supplementary text: StructDR algorithm for the estimation of single-cell cluster, trajectory and surface structures based on SCMS. Supplementary Fig. 1: Graphical interface for interactive single-cell visualization and analysis. The elements of the interface include a method interface for different types of analyses: dimensionality reduction, clustering and trajectory analysis (left), a 3D interactive cell visualization interface (middle), and an interactive filter interface including cell selection and gene selection tools (right). All interfaces are updated upon receiving any input.

Reporting Summary


About this article


Cite this article

Zhou, J., Troyanskaya, O.G. An analytical framework for interpretable and generalizable single-cell data analysis. Nat Methods 18, 1317–1321 (2021). https://doi.org/10.1038/s41592-021-01286-1


  • DOI: https://doi.org/10.1038/s41592-021-01286-1
