Abstract
The scaling of single-cell data exploratory analysis with the rapidly growing diversity and quantity of single-cell omics datasets demands more interpretable and robust data representation that is generalizable across datasets. Here, we have developed a ‘linearly interpretable’ framework that combines the interpretability and transferability of linear methods with the representational power of non-linear methods. Within this framework we introduce a data representation and visualization method, GraphDR, and a structure discovery method, StructDR, that unifies cluster, trajectory and surface estimation and enables their confidence set inference.
Data availability
The 339-dataset benchmark published by Saelens et al. (ref. 10) was downloaded from https://zenodo.org/record/1443566. The unnormalized performance scores were extracted from https://github.com/dynverse/dynbenchmark_results/blob/1ac55e6c54a950890208b1f7730092d39783dfd2/06-benchmark/benchmark_results_unnormalised.rds. The normalized scores were computed as in ref. 10, with the scaling factors kept at the same values used for the originally benchmarked methods. Other single-cell datasets analyzed in this paper were from refs. 6,9,13,14,18,23,24,25. The Scanpy package (ref. 8) was used for preprocessing steps when needed, as described previously (ref. 26). We created a Zenodo record, https://zenodo.org/record/3710980 (ref. 27), that contains all the input data used in this paper.
Code availability
All methods described in this paper are implemented in an open-source Python package, quasildr (https://github.com/jzthree/quasildr). A Code Ocean capsule of the package is provided (https://doi.org/10.24433/CO.9410876.v1; ref. 28).
Change history
14 February 2022
A Correction to this paper has been published: https://doi.org/10.1038/s41592-022-01421-6
References
1. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
2. McInnes, L., Healy, J., Saul, N. & Großberger, L. UMAP: uniform manifold approximation and projection. J. Open Source Softw. 3(29), 861 (2018).
3. Haghverdi, L., Buettner, F. & Theis, F. J. Diffusion maps for high-dimensional single-cell analysis of differentiation data. Bioinformatics 31, 2989–2998 (2015).
4. Moon, K. R. et al. Visualizing structure and transitions in high-dimensional biological data. Nat. Biotechnol. 37, 1482–1492 (2019).
5. Weinreb, C., Wolock, S. & Klein, A. M. SPRING: a kinetic interface for visualizing high dimensional single-cell expression data. Bioinformatics 34, 1246–1248 (2018).
6. Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381–386 (2014).
7. Bendall, S. C. et al. Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development. Cell 157, 714–725 (2014).
8. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: Large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
9. Farrell, J. A. et al. Single-cell reconstruction of developmental trajectories during zebrafish embryogenesis. Science 360(6392), eaar3131 (2018).
10. Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 37, 547–554 (2019).
11. Hochgerner, H., Zeisel, A., Lönnerberg, P. & Linnarsson, S. Conserved properties of dentate gyrus neurogenesis across postnatal development revealed by single-cell RNA sequencing. Nat. Neurosci. 21, 290–299 (2018).
12. Marques, S. et al. Oligodendrocyte heterogeneity in the mouse juvenile and adult central nervous system. Science 352, 1326–1329 (2016).
13. Fincher, C. T., Wurtzel, O., de Hoog, T., Kravarik, K. M. & Reddien, P. W. Cell type transcriptome atlas for the planarian Schmidtea mediterranea. Science 360(6391), eaaq1736 (2018).
14. Plass, M. et al. Cell type atlas and lineage tree of a whole complex animal by single-cell transcriptomics. Science 360(6391), eaaq1723 (2018).
15. Genovese, C. R., Perone-Pacifico, M., Verdinelli, I. & Wasserman, L. Nonparametric ridge estimation. Ann. Stat. 42, 1511–1545 (2014).
16. Ozertem, U. & Erdogmus, D. Locally defined principal curves and surfaces. J. Mach. Learn. Res. 12, 1249–1286 (2011).
17. Chen, Y. C., Genovese, C. R. & Wasserman, L. Asymptotic theory for density ridges. Ann. Stat. 43, 1896–1928 (2015).
18. Zeisel, A. et al. Molecular architecture of the mouse nervous system. Cell 174, 999–1014 (2018).
19. Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
20. Malkov, Yu. A. & Yashunin, D. A. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans. Pattern Anal. Mach. Intell. 42, 824–836 (2020).
21. Dong, W., Charikar, M. & Li, K. Efficient K-nearest neighbor graph construction for generic similarity measures. In WWW 2011: Proceedings of the 20th International Conference on World Wide Web 577–586 (https://doi.org/10.1145/1963405.1963487, 2011).
22. Saragih, J. M., Lucey, S. & Cohn, J. F. Face alignment through subspace constrained mean-shifts. In 2009 IEEE 12th International Conference on Computer Vision 1034–1041 (https://doi.org/10.1109/ICCV.2009.5459377, 2009).
23. Briggs, J. A. et al. The dynamics of gene expression in vertebrate embryogenesis at single-cell resolution. Science 360(6392), eaar5780 (2018).
24. Nestorowa, S. et al. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood 128(8), e20–31 (2016).
25. Paul, F. et al. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell 163, 1663–1677 (2015).
26. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
27. Zhou, J. & Troyanskaya, O. An analytical framework for interpretable and generalizable single-cell data analysis (Dataset). Zenodo https://doi.org/10.5281/zenodo.3710980 (2020).
28. Zhou, J. & Troyanskaya, O. An analytical framework for interpretable and generalizable single-cell data analysis. Code Ocean https://doi.org/10.24433/CO.9410876.v1 (2021).
Acknowledgements
The authors acknowledge all members of the Troyanskaya laboratory and Zhou laboratory for helpful discussions. This work was performed using the high-performance computing resources (supported by the Scientific Computing Core) at the Flatiron Institute and the BioHPC at UT Southwestern Medical Center. J.Z. is supported by the Cancer Prevention and Research Institute of Texas grant (RR190071) and the UT Southwestern Endowed Scholars program. O.G.T. is supported by National Institutes of Health grant nos. R01HG005998, U54HL117798 and R01GM071966, US Department of Health and Human Services grant no. HHSN272201000054C and Simons Foundation grant no. 395506. O.G.T. is a senior fellow of the Genetic Networks program of the Canadian Institute for Advanced Research.
Author information
Contributions
J.Z. conceived the framework, developed the computational methods, and performed the analyses. J.Z. and O.G.T. wrote the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information: Nature Methods thanks Yvan Saeys and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Lin Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Visualization of first two principal components in PCA, GraphDR, and tSNE visualizations.
We compared the PCA, GraphDR, and tSNE representations by the values of the first two principal components (PCs, shown by color) on a developing mouse hippocampus dataset (a,b) (Hochgerner et al. 2018, ref. 11) and a mature mouse brain dataset (c,d) (Zeisel et al. 2018, ref. 18). The genes with the largest absolute weights in the first two PCs are also shown (b,d).
Extended Data Fig. 2 Dataset alignment with GraphDR further improves dataset comparison.
Comparison of applying GraphDR without (a,c) and with (b,d) graph-based dataset alignment on two hematopoietic datasets (Nestorowa et al. 2016, ref. 24, and Paul et al. 2015, ref. 25). The GraphDR visualizations are colored by cell type (a,b) and by dataset (c,d). The cell types are common myeloid progenitors (CMPs), granulocyte-monocyte progenitors (GMPs), lymphoid multipotent progenitors (LMPPs), long-term hematopoietic stem cells (LTHSC), megakaryocyte-erythrocyte progenitors (MEPs), and multipotent progenitors (MPPs). Specifically, GraphDR with graph-based dataset alignment constructs a joint graph that also connects the nearest neighbors between datasets (see batch design in Extended Data Fig. 3).
Extended Data Fig. 3 Experimental design encoding through graph construction.
Experimental design information can be encoded through graph construction in GraphDR. Each arrow indicates that nearest-neighbor connections are established between the two groups, so that the two connected cells belong to different groups. A self-loop indicates nearest-neighbor connections among cells within a group. The basic design constructs a nearest-neighbor graph using all cells, which is suitable for single-batch experiments or experiments with minimal batch effects. The batch design addresses batch effects by introducing nearest-neighbor connections between all pairs of batches, in addition to within-batch nearest-neighbor connections. The time-series design extends the basic design by only allowing connections between the same and adjacent time points. The batch + time-series design introduces nearest-neighbor connections between two batches in the same or adjacent time points.
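The four designs described above can be illustrated with a small sketch. This is not the quasildr implementation: the function names (`group_key`, `design_knn_edges`), the design labels, and the brute-force neighbor search are hypothetical and chosen for illustration only. The sketch assumes time points are consecutive integer indices and that the batch + time-series design also keeps within-batch connections, which the caption does not state explicitly.

```python
import numpy as np

def group_key(design, batch, time):
    """Map a cell to its group under a given design (labels are illustrative)."""
    return {
        "basic": (0, 0),                  # all cells form one group
        "batch": (batch, 0),              # one group per batch
        "time_series": (0, time),         # one group per time point
        "batch_time_series": (batch, time),
    }[design]

def design_knn_edges(X, batch, time, design, k=2):
    """Return directed k-nearest-neighbor edges (i, j) allowed by the design.

    Neighbors are searched separately for every allowed pair of groups, so
    cross-group edges exist even when cells from other groups are far away.
    Two groups are connected when their time indices differ by at most one.
    """
    X = np.asarray(X, dtype=float)
    keys = [group_key(design, b, t) for b, t in zip(batch, time)]
    members = {}
    for i, key in enumerate(keys):
        members.setdefault(key, []).append(i)
    edges = set()
    for key_a, src in members.items():
        for key_b, dst in members.items():
            if abs(key_a[1] - key_b[1]) > 1:
                continue  # non-adjacent time points are never connected
            dist = np.linalg.norm(X[src][:, None, :] - X[dst][None, :, :], axis=-1)
            if key_a == key_b:
                np.fill_diagonal(dist, np.inf)  # exclude self-matches
            nearest = np.argsort(dist, axis=1)[:, :k]
            for row, i in enumerate(src):
                for col in nearest[row]:
                    if np.isfinite(dist[row, col]):
                        edges.add((i, dst[col]))
    return edges
```

With two well-separated batches, the basic design yields only within-batch edges, whereas the batch design adds edges between the batches; this is the property that graph-based dataset alignment relies on.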
Extended Data Fig. 4 Visualization of zebrafish whole embryo single-cell developmental landscape with GraphDR.
Application of GraphDR to a single-cell dataset (Farrell et al. 2018, ref. 9) with a time-series design. a. Single-cell visualization by GraphDR, colored by developmental stages. b. Comparative visualization of developmental stages. This shows the ‘cross-section’ view by visualizing the second and third dimensions. c,d. Single-cell visualization by GraphDR, colored by cell origins.
Extended Data Fig. 5 Visualization of Xenopus tropicalis whole embryo single-cell developmental landscape with GraphDR.
This is an example of applying GraphDR to a single-cell dataset with a batch + time-series design (Briggs et al. 2018, ref. 23). a. Single-cell visualization by GraphDR, colored by developmental stages. b. Comparative visualization of developmental stages. This shows the ‘cross-section’ view by visualizing the second and third dimensions. c,d. Single-cell visualization by GraphDR, colored by cell origins.
Extended Data Fig. 6 Schematic overview of StructDR density ridge estimation procedures with the SCMS algorithm.
(a,b) StructDR starts by performing kernel density estimation with a Gaussian kernel on the input cells. (c) Based on the estimated density function and a selected density ridge dimensionality d (d = 1 in this example), the SCMS update can be derived for any position in the space from the gradient and Hessian of the log density function. For any data point or position of interest, iteratively applying the SCMS update projects that point or position onto the density ridges of the chosen dimensionality. (d) Optional step: construct a graph connecting points on the density ridges with one of two optional methods (Methods). The backbone of the graph can be specified based on a betweenness centrality threshold.
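For reference, the quantities named in this caption can be written out for an isotropic Gaussian kernel. The following is a standard formulation of subspace-constrained mean shift on the log density, given here as a sketch; the bandwidth symbol h, the weights w_i, and the step size h^2 (the ordinary mean-shift step) are notational assumptions, and the authors' exact formulation is in the Supplementary text.

```latex
% Cells x_1, ..., x_n in R^D, bandwidth h (notation assumed for illustration)
\hat p(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i), \qquad
K_h(u) = (2\pi h^2)^{-D/2} \exp\!\left(-\frac{\lVert u \rVert^2}{2h^2}\right)

w_i(x) = \frac{K_h(x - x_i)}{\sum_{j=1}^{n} K_h(x - x_j)}

% Gradient and Hessian of the log density
g(x) = \nabla \log \hat p(x) = \frac{1}{h^2} \sum_{i=1}^{n} w_i(x)\,(x_i - x)

H(x) = \nabla^2 \log \hat p(x)
     = \frac{1}{h^4} \sum_{i=1}^{n} w_i(x)\,(x_i - x)(x_i - x)^{\top}
       - \frac{1}{h^2} I - g(x)\,g(x)^{\top}

% Eigenvectors v_1, ..., v_D of H(x), ordered by decreasing eigenvalue
V_d(x) = \left[\, v_{d+1}(x), \dots, v_D(x) \,\right]

% SCMS update for a d-dimensional ridge
x \leftarrow x + V_d(x)\,V_d(x)^{\top}\, h^2\, g(x)
```

Setting d = 0 makes V_d V_d^T the identity, so the update reduces to ordinary mean shift toward a local density maximum.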
Extended Data Fig. 7 Overview of the unified framework of cluster, trajectory, and surface analysis with StructDR.
(a) StructDR uses the SCMS update for the estimation of clusters, trajectories, and surfaces, which can all be derived from the gradient and Hessian of the log density function. (b) Examples of projection paths by SCMS updates for zero-, one-, and two-dimensional density ridges. (c) Comparison of SCMS algorithms for 0-, 1-, 2-, or k-dimensional density ridges. The SCMS update can identify any k-dimensional density ridge by projecting a gradient-based update onto the subspace spanned by the (k + 1)th through last eigenvectors of the Hessian of the log density function.
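A minimal NumPy sketch of this unified update follows. It is not the quasildr API: the function names `scms_update` and `project_to_ridge`, the fixed isotropic bandwidth `h`, and the stopping rule are assumptions made for illustration. The ridge dimensionality `d` selects clusters (d = 0), trajectories (d = 1), or surfaces (d = 2).

```python
import numpy as np

def scms_update(x, data, h, d):
    """One subspace-constrained mean-shift step for a d-dimensional ridge."""
    diff = data - x                                   # (n, D) offsets to all cells
    log_w = -0.5 * np.sum(diff ** 2, axis=1) / h ** 2
    w = np.exp(log_w - log_w.max())                   # Gaussian kernel weights
    w /= w.sum()
    mean_shift = w @ diff                             # equals h^2 * grad log p(x)
    grad = mean_shift / h ** 2
    dim = x.shape[0]
    # Hessian of the log density under a Gaussian kernel density estimate
    hess = (diff.T * w) @ diff / h ** 4 - np.eye(dim) / h ** 2 - np.outer(grad, grad)
    _, vecs = np.linalg.eigh(hess)                    # eigenvalues in ascending order
    V = vecs[:, : dim - d]                            # the D - d smallest eigen-directions
    return x + V @ (V.T @ mean_shift)                 # projected mean-shift step

def project_to_ridge(x0, data, h, d, tol=1e-9, max_iter=500):
    """Iterate the SCMS update until the step length falls below tol."""
    x = np.asarray(x0, dtype=float)
    data = np.asarray(data, dtype=float)
    for _ in range(max_iter):
        x_new = scms_update(x, data, h, d)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```

Because `numpy.linalg.eigh` returns eigenvalues in ascending order, taking the first D - d eigenvectors is equivalent to taking the (d + 1)th through last eigenvectors when eigenvalues are sorted in decreasing order, as in the caption. For cells scattered symmetrically around a horizontal line, calling `project_to_ridge` with d = 1 moves a starting point vertically onto that line while leaving its position along the line essentially unchanged.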
Extended Data Fig. 8 Performance score distributions on the 339-dataset benchmark shown by dataset type.
Per-dataset performance scores are computed as in Saelens et al. 2019 (ref. 10). The performance score distributions are shown with violin plots, separated into panels by dataset type. The performance of StructDR + GraphDR with two graph construction algorithms, MST and SimpleNNG, is shown along with the performance of the other algorithms benchmarked in Saelens et al. 2019 (ref. 10).
Extended Data Fig. 9 Example of trajectory identification with zero-, one-, and two-dimensional density ridges on a developmental hippocampus single-cell dataset.
The circle symbols indicate zero-dimensional density ridge positions (local maxima of density function). The red dots indicate one-dimensional density ridge positions (trajectory). The black dots indicate two-dimensional density ridge positions.
Extended Data Fig. 10 Simulation studies of confidence sets construction with nonparametric ridge estimation.
One hundred simulated datasets were generated. For each dataset, the confidence set for each estimated trajectory was estimated with 20 bootstraps. The x-axis shows the expected coverage probabilities of the constructed confidence sets. The y-axis shows the observed proportion of cases in which the true trajectory position is covered by the confidence set.
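The logic of such a calibration check can be sketched with a deliberately simplified stand-in: the code below bootstraps a sample mean instead of a density ridge position, so it illustrates only the expected-versus-observed coverage comparison, not the authors' simulation. The function name `observed_coverage`, the standard normal data, and the sample size are assumptions; the counts of 100 datasets and 20 bootstraps follow the caption.

```python
import numpy as np

def observed_coverage(levels, n_datasets=100, n_cells=200, n_boot=20, seed=0):
    """Fraction of simulated datasets whose bootstrap confidence set covers the truth.

    Stand-in estimator: the sample mean of standard normal data (true value 0).
    The confidence set at a given level is the interval around the estimate whose
    radius is that quantile of the absolute bootstrap deviations.
    """
    rng = np.random.default_rng(seed)
    true_value = 0.0
    hits = np.zeros(len(levels))
    for _ in range(n_datasets):
        sample = rng.normal(true_value, 1.0, size=n_cells)
        estimate = sample.mean()
        # Resample cells with replacement and re-estimate
        boot = np.array([
            sample[rng.integers(0, n_cells, size=n_cells)].mean()
            for _ in range(n_boot)
        ])
        deviation = np.abs(boot - estimate)
        for j, level in enumerate(levels):
            radius = np.quantile(deviation, level)
            hits[j] += abs(estimate - true_value) <= radius
    return hits / n_datasets
```

Plotting the returned proportions against `levels` gives the same kind of expected-versus-observed curve as the figure; points near the diagonal indicate well-calibrated confidence sets.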
Supplementary information
Supplementary Information
Supplementary text: StructDR algorithm for the estimation of single-cell cluster, trajectory and surface structures based on SCMS. Supplementary Fig. 1: Graphical interface for interactive single-cell visualization and analysis. The elements of the interface include a method interface for different types of analyses: dimensionality reduction, clustering and trajectory analysis (left), a 3D interactive cell visualization interface (middle), and an interactive filter interface including cell selection and gene selection tools (right). All interfaces are updated upon receiving any input.
Cite this article
Zhou, J., Troyanskaya, O.G. An analytical framework for interpretable and generalizable single-cell data analysis. Nat Methods 18, 1317–1321 (2021). https://doi.org/10.1038/s41592-021-01286-1
DOI: https://doi.org/10.1038/s41592-021-01286-1