Visualizing structure and transitions in high-dimensional biological data

Abstract

The high-dimensional data created by high-throughput technologies require visualization tools that reveal data structure and patterns in an intuitive form. We present PHATE, a visualization method that captures both local and global nonlinear structure using an information-geometric distance between data points. We compare PHATE to other tools on a variety of artificial and biological datasets, and find that it consistently preserves a range of patterns in data, including continual progressions, branches and clusters, better than other tools. We define a manifold preservation metric, which we call denoised embedding manifold preservation (DEMaP), and show that PHATE produces lower-dimensional embeddings that are quantitatively better denoised as compared to existing visualization methods. An analysis of a newly generated single-cell RNA sequencing dataset on human germ-layer differentiation demonstrates how PHATE reveals unique biological insight into the main developmental branches, including identification of three previously undescribed subpopulations. We also show that PHATE is applicable to a wide variety of data types, including mass cytometry, single-cell RNA sequencing, Hi-C and gut microbiome data.

Fig. 1: Overview of PHATE and its ability to reveal structure in data.
Fig. 2
Fig. 3: Extracting branches and branchpoints from PHATE.
Fig. 4: PHATE most accurately represents manifold distances in a 2D embedding.
Fig. 5: Comparison of PHATE to other visualization methods on biological datasets.
Fig. 6: PHATE analysis of embryoid body scRNA-seq data with n = 16,825 cells.

Data availability

The embryoid body scRNA-seq and bulk RNA-seq datasets generated and analyzed during the current study are available from the Mendeley Data repository at https://doi.org/10.17632/v6n743h5ng.1. Supplementary Fig. 14a contains images of the raw single cells, while Supplementary Fig. 14f contains scatter plots showing the gating procedure for fluorescence-activated cell sorting populations for the bulk RNA-seq data.

Code availability

Python, R and Matlab implementations of PHATE are available on GitHub (https://github.com/KrishnaswamyLab/PHATE) for academic use.

Change history

  • 02 January 2020

    An amendment to this paper has been published and can be accessed via a link at the top of the paper.

References

  1. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
  2. Amir, E. D. et al. viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia. Nat. Biotechnol. 31, 545–552 (2013).
  3. Linderman, G. C., Rachh, M., Hoskins, J. G., Steinerberger, S. & Kluger, Y. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nat. Methods 16, 243–245 (2019).
  4. Tenenbaum, J. B., De Silva, V. & Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323 (2000).
  5. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2019).
  6. Roweis, S. T. & Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326 (2000).
  7. Cox, T. F. & Cox, M. A. A. Multidimensional Scaling 2nd edn (Chapman & Hall/CRC, 2001).
  8. De Silva, V. & Tenenbaum, J. B. Sparse Multidimensional Scaling Using Landmark Points (Stanford University, 2004).
  9. van Unen, V. et al. Visual analysis of mass cytometry data by hierarchical stochastic neighbour embedding reveals rare cell types. Nat. Commun. 8, 1740 (2017).
  10. Chen, L. & Buja, A. Local multidimensional scaling for nonlinear dimension reduction, graph drawing, and proximity analysis. J. Am. Stat. Assoc. 104, 209–219 (2009).
  11. Moon, T. K. & Stirling, W. C. Mathematical Methods and Algorithms for Signal Processing (Prentice Hall, 2000).
  12. Qiu, X. et al. Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods 14, 979–982 (2017).
  13. Coifman, R. R. & Lafon, S. Diffusion maps. Appl. Comput. Harmon. Anal. 21, 5–30 (2006).
  14. Haghverdi, L., Buettner, M., Wolf, F. A., Buettner, F. & Theis, F. J. Diffusion pseudotime robustly reconstructs lineage branching. Nat. Methods 13, 845–848 (2016).
  15. Darrow, E. M. et al. Deletion of DXZ4 on the human inactive X chromosome alters higher-order genome architecture. Proc. Natl Acad. Sci. USA 113, E4504–E4512 (2016).
  16. Cheng, X., Rachh, M. & Steinerberger, S. On the diffusion geometry of graph Laplacians and applications. Appl. Comput. Harmon. Anal. 46, 674–688 (2019).
  17. Paul, F. et al. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell 163, 1663–1677 (2015).
  18. Zunder, E. R., Lujan, E., Goltsev, Y., Wernig, M. & Nolan, G. P. A continuous molecular roadmap to iPSC reprogramming through progression analysis of single-cell mass cytometry. Cell Stem Cell 16, 323–337 (2015).
  19. Lui, K., Ding, G. W., Huang, R. & McCann, R. Dimensionality reduction has quantifiable imperfections: two geometric bounds. In Proc. 32nd International Conference on Neural Information Processing Systems (Eds. Bengio, S. et al.) 8453–8463 (Curran Associates, 2018).
  20. Tsai, F. S. A visualization metric for dimensionality reduction. Expert Syst. Appl. 39, 1747–1752 (2012).
  21. Bertini, E., Tatu, A. & Keim, D. Quality metrics in high-dimensional data visualization: an overview and systematization. IEEE Trans. Vis. Comput. Graph. 17, 2203–2212 (2011).
  22. van der Maaten, L., Postma, E. & van den Herik, J. Dimensionality reduction: a comparative review. J. Mach. Learn. Res. 10, 66–71 (2009).
  23. Vankadara, L. C. & von Luxburg, U. Measures of distortion for machine learning. In Proc. 32nd International Conference on Neural Information Processing Systems (Eds. Bengio, S. et al.) 4886–4895 (Curran Associates, 2018).
  24. Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 37, 547–554 (2019).
  25. Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).
  26. Rand, W. M. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66, 846–850 (1971).
  27. Shekhar, K. et al. Comprehensive classification of retinal bipolar neurons by single-cell transcriptomics. Cell 166, 1308–1323 (2016).
  28. Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142 (2015).
  29. Bendall, S. C. et al. Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development. Cell 157, 714–725 (2014).
  30. Setty, M. et al. Wishbone identifies bifurcating developmental trajectories from single-cell data. Nat. Biotechnol. 34, 637–645 (2016).
  31. Liiv, I. Seriation and matrix reordering methods: an historical overview. Stat. Anal. Data Min. 3, 70–91 (2010).
  32. Hahsler, M., Hornik, K. & Buchta, C. Getting things in order: an introduction to the R package seriation. J. Stat. Soft. 25, 1–34 (2008).
  33. Wolf, F. A. et al. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol. 20, 59 (2019).
  34. Krishnaswamy, S. et al. Conditional density-based analysis of T cell signaling in single-cell data. Science 346, 1250689 (2014).
  35. Polo, J. M. et al. A molecular roadmap of reprogramming somatic cells into iPS cells. Cell 151, 1617–1632 (2012).
  36. Martin, G. R. & Evans, M. J. Differentiation of clonal lines of teratocarcinoma cells: formation of embryoid bodies in vitro. Proc. Natl Acad. Sci. USA 72, 1441–1445 (1975).
  37. Bibel, M., Richter, J., Lacroix, E. & Barde, Y.-A. Generation of a defined and uniform population of CNS progenitors and neurons from mouse embryonic stem cells. Nat. Protocols 2, 1034–1043 (2007).
  38. Kang, S.-M. et al. Efficient induction of oligodendrocytes from human embryonic stem cells. Stem Cells 25, 419–424 (2007).
  39. Zhao, X., Liu, J. & Ahmad, I. Differentiation of embryonic stem cells to retinal cells in vitro. In Embryonic Stem Cell Protocols: Differentiation Models Vol. 2 (Ed. Turksen, K.) 401–416 (Humana Press, 2006).
  40. Liour, S. S. et al. Further characterization of embryonic stem cell-derived radial glial cells. Glia 53, 43–56 (2006).
  41. Nakano, T., Kodama, H. & Honjo, T. In vitro development of primitive and definitive erythrocytes from different precursors. Science 272, 722 (1996).
  42. Nishikawa, S.-I., Nishikawa, S., Hirashima, M., Matsuyoshi, N. & Kodama, H. Progressive lineage analysis by cell sorting and culture identifies FLK1+ VE-cadherin+ cells at a diverging point of endothelial and hemopoietic lineages. Development 125, 1747–1757 (1998).
  43. Wiles, M. V. & Keller, G. Multiple hematopoietic lineages develop from embryonic stem (ES) cells in culture. Development 111, 259–267 (1991).
  44. Potocnik, A. J., Nielsen, P. J. & Eichmann, K. In vitro generation of lymphoid precursors from embryonic stem cells. EMBO J. 13, 5274 (1994).
  45. Tsai, M. et al. In vivo immunological function of mast cells derived from embryonic stem cells: an approach for the rapid analysis of even embryonic lethal mutations in adult mice in vivo. Proc. Natl Acad. Sci. USA 97, 9186–9190 (2000).
  46. Fairchild, P. et al. Directed differentiation of dendritic cells from mouse embryonic stem cells. Curr. Biol. 10, 1515–1518 (2000).
  47. Yamashita, J. et al. Flk1-positive cells derived from embryonic stem cells serve as vascular progenitors. Nature 408, 92–96 (2000).
  48. Maltsev, V. A., Rohwedel, J., Hescheler, J. & Wobus, A. M. Embryonic stem cells differentiate in vitro into cardiomyocytes representing sinusnodal, atrial and ventricular cell types. Mech. Dev. 44, 41–50 (1993).
  49. Rohwedel, J. et al. Muscle cell differentiation of embryonic stem cells reflects myogenesis in vivo: developmentally regulated expression of myogenic determination genes and functional expression of ionic currents. Dev. Biol. 164, 87–101 (1994).
  50. Kania, G., Blyszczuk, P., Jochheim, A., Ott, M. & Wobus, A. M. Generation of glycogen- and albumin-producing hepatocyte-like cells from embryonic stem cells. Biol. Chem. 385, 943–953 (2004).
  51. Schroeder, I. S., Rolletschek, A., Blyszczuk, P., Kania, G. & Wobus, A. M. Differentiation of mouse embryonic stem cells to insulin-producing cells. Nat. Protocols 1, 495–507 (2006).
  52. Geijsen, N. et al. Derivation of embryonic germ cells and male gametes from embryonic stem cells. Nature 427, 148–154 (2004).
  53. Kehler, J., Hübner, K., Garrett, S. & Schöler, H. R. Generating oocytes and sperm from embryonic stem cells. Semin. Reprod. Med. 23, 222–233 (2005).
  54. Betancur, P., Bronner-Fraser, M. & Sauka-Spengler, T. Assembling neural crest regulatory circuits into a gene regulatory network. Annu. Rev. Cell Dev. Biol. 26, 581–603 (2010).
  55. Barembaum, M. & Bronner-Fraser, M. Early steps in neural crest specification. Semin. Cell Dev. Biol. 16, 642–646 (2005).
  56. Treleaven, K. & Frazzoli, E. An explicit formulation of the earth movers distance with continuous road map distances. Preprint at arXiv https://arxiv.org/abs/1309.7098 (2013).
  57. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293 (2009).
  58. Nadler, B., Lafon, S., Coifman, R. R. & Kevrekidis, I. Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators. In Proc. 18th International Conference on Neural Information Processing Systems (Eds. Weiss, Y. et al.) 955–962 (MIT Press, 2005).
  59. Nadler, B., Lafon, S., Coifman, R. R. & Kevrekidis, I. G. Diffusion maps, spectral clustering and reaction coordinates of dynamical systems. Appl. Comput. Harmon. Anal. 21, 113–127 (2006).
  60. Butterworth, S. On the theory of filter amplifiers. Wireless Engineer 7, 536–541 (1930).
  61. von Neumann, J. Mathematische Grundlagen der Quantenmechanik (Springer, 1932).
  62. Anand, K., Bianconi, G. & Severini, S. Shannon and von Neumann entropy of random networks with heterogeneous expected degree. Phys. Rev. E 83, 036109 (2011).
  63. Salicrú, M. & Pons, A. A. Sobre ciertas propiedades de la M-divergencia en análisis de datos. Qüestiió 9, 251–256 (1985).
  64. Salicrú, M., Sanchez, A., Conde, J. & Sanchez, P. Entropy measures associated with K and M divergences. Soochow J. Math. 21, 291–298 (1995).
  65. Wolf, G., Rotbart, A., David, G. & Averbuch, A. Coarse-grained localized diffusion. Appl. Comput. Harmon. Anal. 33, 388–400 (2012).
  66. Platt, J. Fastmap, metricmap, and landmark mds are all Nystrom algorithms. In Proc. 10th International Workshop on Artificial Intelligence and Statistics (Eds. Cowell, R. & Ghahramani, Z.) (AI/Stats, 2005).
  67. Yang, T., Liu, J., McMillan, L. & Wang, W. A fast approximation to multidimensional scaling. In Proc. IEEE Workshop on Computation Intensive Methods for Computer Vision (IEEE, 2006).
  68. Gigante, S. et al. Compressed diffusion. In The 13th International Conference on Sampling Theory and Applications (Bordeaux, France), sampta2019:267712 (2019).
  69. Costa, J. A. & Hero, A. O. III Determining intrinsic dimension and entropy of high-dimensional shape spaces. In Statistics and Analysis of Shapes (Eds Hamid, K. & Yezzi Jr, A) 231–252 (Birkhäuser, 2006).
  70. Carter, K. M., Raich, R. & Hero, A. O. III On local intrinsic dimension estimation and its applications. IEEE Trans. Signal Process. 58, 650–663 (2010).
  71. Levina, E. & Bickel, P. J. Maximum likelihood estimation of intrinsic dimension. In Proc. 18th International Conference on Neural Information Processing Systems (ed. Weiss, Y.) 777–784 (Curran Associates, 2005).
  72. David, G. & Averbuch, A. Hierarchical data organization, clustering and denoising via localized diffusion folders. Appl. Comput. Harmon. Anal. 33, 1–23 (2012).
  73. Rubner, Y., Tomasi, C. & Guibas, L. J. A metric for distributions with applications to image databases. In Proc. IEEE Sixth International Conference on Computer Vision 59–66 (IEEE, 1998).
  74. Bendall, S. C. et al. Single-cell mass cytometry of differential immune and drug responses across a human hematopoietic continuum. Science 332, 687–696 (2011).
  75. Zheng, G. X. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
  76. Grün, D., Kester, L. & Van Oudenaarden, A. Validation of noise models for single-cell transcriptomics. Nat. Methods 11, 637–640 (2014).
  77. Balasubramanian, M. & Schwartz, E. L. The isomap algorithm and topological stability. Science 295, 7–7 (2002).
  78. van Dijk, D. et al. Recovering gene interactions from single-cell data using data diffusion. Cell 174, 716–729 (2018).
  79. Vieth, B., Ziegenhain, C., Parekh, S., Enard, W. & Hellmann, I. powsimR: power analysis for bulk and single cell RNA-seq experiments. Bioinformatics 33, 3486–3488 (2017).
  80. Brennecke, P. et al. Accounting for technical noise in single-cell RNA-seq experiments. Nat. Methods 10, 1093 (2013).
  81. Hwang, B., Lee, J. H. & Bang, D. Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp. Mol. Med. 50, 96 (2018).
  82. Kim, J. K., Kolodziejczyk, A. A., Ilicic, T., Teichmann, S. A. & Marioni, J. C. Characterizing noise structure in single-cell RNA-seq distinguishes genuine from technical stochastic allelic expression. Nat. Commun. 6, 8687 (2015).

Acknowledgements

This research was supported in part by the Gruber Foundation (to S.G.); the Eunice Kennedy Shriver National Institute of Child Health and Human Development of the National Institutes of Health (NIH) (award number F31HD097958) (to D.B.B.); an Alfred P. Sloan Fellowship (grant FG-2016-6607); a DARPA Young Faculty Award (grant D16AP00117); National Science Foundation (NSF) grants 1620216, 1912906; an NSF CAREER award (grant 1845856) (to M.J.H.); NIH grant 1R01HG008383-01A1 (to R.R.C.); NIH grant R01GM107092 (to N.B.I.); IVADO (Institut de valorisation des données) (to G.W.); the Chan–Zuckerberg Initiative (grant 182702); NIH grant R01GM130847; the State of Connecticut (grant 16-RMB-YALE-07) (to S.K.); and NIH grant R01GM135929 (to M.J.H., G.W. and S.K.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.

Author information

K.R.M., S.K., G.W. and D.v.D. envisioned the project. K.R.M., D.v.D., S.G. and G.W. implemented the method. K.R.M., D.v.D., S.G., S.K. and N.B.I. performed the analyses. K.R.M., S.K., G.W., and N.B.I. wrote the paper. D.v.D., S.G. and D.B.B. assisted in writing. D.B.B., W.S.C. and K.Y. assisted in the analysis. K.R.M., G.W., M.J.H. and R.R.C. developed the mathematical foundations of the method. Z.W., A.v.d.E. and N.B.I. were responsible for data acquisition and processing.

Correspondence to Natalia B. Ivanova, Guy Wolf or Smita Krishnaswamy.

Ethics declarations

Competing interests

Smita Krishnaswamy serves on the scientific advisory board of AI Therapeutics.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Comparison of PHATE to DM on the artificial tree (n=1440 60-dimensional data points).

(A) PHATE applied to the artificial tree data. Only two PHATE coordinates are needed to separate all branches. (B) The first six diffusion map coordinates of the artificial tree data. At least five of these coordinates are necessary to separate all of the branches.

Supplementary Figure 2 Impact of potential distances and PHATE parameters on the resulting visualization.

(A) Comparison of Diffusion Maps (blue) and PHATE (orange) embeddings on data (black) from a half circle (left, n = 100 data points) and a full circle (right, n = 100 data points). Both the data and the embeddings have been centered about the mean and rescaled by the max Euclidean norm. For the full circle, both embeddings are identical (up to centering & scaling) to the original circle. However, for the half circle, the Diffusion Maps embedding (blue) suffers from instabilities that generate significantly higher densities near the two end points. The PHATE embedding (orange) does not exhibit these instabilities. (B) The α-decaying kernel \(K_{\alpha,\sigma}(x) = \exp\left(-\left(\frac{|x|}{\sigma}\right)^{\alpha}\right)\) as a function of x for different values of α and σ = 1 (left) and σ = 4 (right). As α increases, \(K_{\alpha,\sigma}(x)\) becomes more constant for \(x \in (-\sigma, \sigma)\) and the tails of the kernel become lighter (i.e., decay to zero more quickly) for \(x \notin (-\sigma, \sigma)\). (C) Demonstration of the effect of the scale t on the PHATE visualization for the artificial tree data (n = 1440 60-dimensional data points) colored by branch. The first column shows the VNE H(t) (see Eq. 5) of the diffusion affinities as a function of the time scale t. The other columns give the PHATE visualization with different values of t. The red dots in the first column indicate the values of t chosen for the plots. The red dot surrounded by a black box indicates the chosen value of t for the visualization in Figure 1B of the artificial tree data. Values of t that are too low can give noisy visualizations while very high values of t can result in a loss of information in the visualization. (D) Visualization of scRNA-seq data measured from mouse retinal bipolar neurons (Shekhar et al., Cell, vol. 166, no. 5, pp. 1308-1323, 2016), using different informational distances defined via the parameter γ. n = 27499 single cells.
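The kernel in panel (B) and the entropy curve in panel (C) can be illustrated with a short sketch. This is not the authors' implementation: the fixed bandwidth `sigma`, the toy half-circle data and the helper names are assumptions made for illustration, and because Eq. 5 is not reproduced on this page, the entropy below follows the usual von Neumann form (Shannon entropy of the normalized spectrum of the affinity matrix raised to the power t).

```python
import numpy as np


def alpha_decay_kernel(x, sigma=1.0, alpha=2.0):
    """K_{alpha,sigma}(x) = exp(-(|x| / sigma) ** alpha)."""
    return np.exp(-(np.abs(x) / sigma) ** alpha)


def von_neumann_entropy(affinity, t):
    """Shannon entropy of the normalized spectrum of a symmetric matrix at power t."""
    # For a symmetric matrix the singular values are the absolute eigenvalues.
    spectrum = np.abs(np.linalg.eigvalsh(affinity)) ** t
    prob = spectrum / spectrum.sum()
    prob = prob[prob > 0]  # treat 0 * log(0) as 0
    return float(-(prob * np.log(prob)).sum())


# Hypothetical toy data: 60 points on a slightly noisy half circle.
rng = np.random.default_rng(0)
theta = np.sort(rng.uniform(0.0, np.pi, size=60))
points = np.column_stack([np.cos(theta), np.sin(theta)])
points += rng.normal(scale=0.01, size=points.shape)

# Pairwise Euclidean distances, then kernel affinities with a fixed bandwidth
# (an assumption; an adaptive bandwidth is a common alternative).
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
kernel = alpha_decay_kernel(dists, sigma=0.3, alpha=2.0)

# Symmetric normalization D^{-1/2} K D^{-1/2}, which shares its eigenvalues
# with the row-stochastic diffusion operator D^{-1} K.
degree = kernel.sum(axis=1)
affinity = kernel / np.sqrt(np.outer(degree, degree))

# Entropy as a function of the diffusion time scale t.
entropies = {t: von_neumann_entropy(affinity, t) for t in (1, 5, 20, 100)}
```

Plotting `entropies` against t gives a decreasing curve of the kind shown in the first column of panel (C); a knee in that curve is one way to pick t.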

Supplementary Figure 3 Comparison of PHATE to various methods on multiple artificial and non-biological datasets.

Note that methods with strong structural assumptions on the data, such as t-SNE (clusters) and Monocle2 (tree) are expected to fail on the subset of datasets which do not fit their assumptions. See Supplementary Note 2 for discussion. See the figure for the respective sample sizes for each dataset.

Supplementary Figure 4 Visual and quantitative demonstrations of the robustness of PHATE to subsampling and the choice of parameters.

(A) The PHATE visualization for the iPSC mass cytometry dataset from Zunder et al. (Cell Stem Cell, vol. 16, no. 3, pp. 323-337, 2015) with varying subsample size N. The main branches present for N = 10000 are also visible for the other values of N, demonstrating that the PHATE embedding is robust to the size of the subsample. (B) The PHATE visualization of the same iPSC CyTOF dataset with varying scale parameter t with \(n = 50000\) cells. The embeddings for all t preserve the branching structure and the visualizations are very similar to each other, demonstrating that the embedding is robust to the choice of t. (C) Heatmap of the Spearman correlation coefficient between geodesic distances of the ground truth data and the Euclidean distances of the PHATE visualization applied to the paths dataset simulated using Splatter (Zappia et al., Genome Biology, vol. 18, no. 1, p. 174, 2017). The results are presented using different values for k, t, and α. The value of t selected using the kneepoint method in this case is 8. The number of simulated cells is n = 3000. (D) Heatmap of the Spearman correlation coefficient between geodesic distances of the ground truth data and the Euclidean distances of the PHATE visualization applied to the groups dataset simulated using Splatter. The results are presented using different values for k, t, and α. For both the groups and paths datasets, the results are very stable for \(\alpha \ge 10\). The value of t selected using the kneepoint method in this case is 8. The number of simulated cells is n = 3000.
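The correlation used in panels (C) and (D) can be sketched as follows. The caption does not say how geodesic distances were computed, so this sketch assumes shortest paths on a k-nearest-neighbor graph; the function names and the toy spiral are hypothetical and stand in for the Splatter simulations.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist, pdist, squareform
from scipy.stats import spearmanr


def knn_geodesic_distances(data, k=10):
    """Approximate geodesic distances by shortest paths on a k-NN graph."""
    dists = cdist(data, data)
    n = dists.shape[0]
    # Column 0 of the argsort is the point itself (assuming no duplicate points).
    neighbors = np.argsort(dists, axis=1)[:, 1 : k + 1]
    rows = np.repeat(np.arange(n), k)
    cols = neighbors.ravel()
    graph = csr_matrix((dists[rows, cols], (rows, cols)), shape=(n, n))
    # directed=False lets every edge be traversed in both directions.
    return shortest_path(graph, method="D", directed=False)


def geodesic_spearman(ground_truth, embedding, k=10):
    """Spearman correlation between ground-truth geodesic and embedding distances."""
    geodesic = knn_geodesic_distances(ground_truth, k=k)
    embedded = squareform(pdist(embedding))
    upper = np.triu_indices_from(geodesic, k=1)  # each unordered pair once
    rho, _ = spearmanr(geodesic[upper], embedded[upper])
    return rho


# Toy ground truth: a noiseless 2D spiral with a known arc-length coordinate.
t = np.linspace(0.0, 3.0 * np.pi, 200)
spiral = np.column_stack([t * np.cos(t), t * np.sin(t)])
steps = np.linalg.norm(np.diff(spiral, axis=0), axis=1)
arc_length = np.concatenate([[0.0], np.cumsum(steps)])

# "Unrolling" the spiral preserves geodesic order; projecting onto x does not.
rho_unrolled = geodesic_spearman(spiral, arc_length[:, None])
rho_projected = geodesic_spearman(spiral, spiral[:, :1])
```

A heatmap like the one in the figure would repeat `geodesic_spearman` over a grid of embedding parameters and record one coefficient per cell of the grid.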

Supplementary Figure 5 Visual and quantitative demonstrations of the reproducibility of PHATE compared to PCA, tSNE, and UMAP.

Reproducibility was computed on 4 different datasets (4 columns) that were generated using Splatter. The different runs had different random seeds and n = 2000 cells. (A) Boxplots show RMSE computed between 10 runs of each method. RMSE was computed between each unique pair of runs (thus 45 in total) after aligning the pair of embeddings with Procrustes. Thus, RMSE here quantifies how much embeddings change between runs, with lower RMSE signifying greater reproducibility. In the boxplots, the box limits indicate the lower and upper quartile values with a line at the median while the whiskers show the range of the data. (B) For each method (rows) and each dataset (columns) two example runs are shown (orange and blue points) to visually demonstrate the reproducibility. In line with the RMSE boxplots, PHATE and PCA show almost perfectly overlapping embeddings while tSNE and UMAP show significant variability between runs.
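The pairwise comparison in panel (A) can be sketched with `scipy.spatial.procrustes`. The exact normalization used for the published RMSE is not stated here, so the definition below (root of the Procrustes disparity divided by the number of points) is an assumption, as are the synthetic "runs".

```python
from itertools import combinations

import numpy as np
from scipy.spatial import procrustes


def pairwise_procrustes_rmse(embeddings):
    """RMSE between every unique pair of embeddings after Procrustes alignment."""
    scores = []
    for first, second in combinations(embeddings, 2):
        # procrustes centers and rescales both inputs, aligns the second to the
        # first, and returns the residual sum of squares as the disparity.
        _, _, disparity = procrustes(first, second)
        scores.append(np.sqrt(disparity / first.shape[0]))
    return np.array(scores)


def random_rotation(rng):
    """A random 2D rotation matrix."""
    angle = rng.uniform(0.0, 2.0 * np.pi)
    return np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle), np.cos(angle)]])


rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))

# Ten hypothetical runs of a stable method: same layout up to rotation and jitter.
stable_runs = [
    base @ random_rotation(rng) + rng.normal(scale=0.01, size=base.shape)
    for _ in range(10)
]
# Ten unrelated layouts, standing in for a method that changes between seeds.
unstable_runs = [rng.normal(size=base.shape) for _ in range(10)]

stable_rmse = pairwise_procrustes_rmse(stable_runs)      # 45 values
unstable_rmse = pairwise_procrustes_rmse(unstable_runs)  # 45 values
```

Ten runs give 10 × 9 / 2 = 45 unique pairs, matching the count in the caption; each array would feed one box of the boxplot.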

Supplementary Figure 6 Scalability tests of PHATE.

(A) Scalable PHATE embedding of iPSC CyTOF data (\(n = 220450\) cells) from Zunder et al. (Cell Stem Cell, vol. 16, no. 3, pp. 323-337, 2015) with a subset of the landmarks shown in red (200 out of 2000). (B) Robustness of PHATE to the number of landmarks chosen. PHATE on the EB data (\(n = 16825\) cells) computed using increasing numbers of landmarks (X-axis) was compared to exact PHATE, i.e., without landmarks. Comparison was done using Procrustes analysis (optimal linear transformation) and the sum of squared error (SSE, Y-axis) is shown. To ensure a stable embedding that accurately approximates exact PHATE, we choose 2000 landmarks as default. The inset shows the histogram of pairwise distances in the visualization computed using fast PHATE (2000 landmarks) on the EB data vs. the pairwise distances from exact PHATE. The correspondence and the Pearson correlation coefficient are very high. (C) PHATE and t-SNE embeddings of a mouse brain cell dataset from 10x Genomics with a large number of cells (\(n = 1{,}300{,}774\) cells). The PHATE embedding was calculated with 2000 landmarks and completed in three hours. A subset (10 of 60) of the clusters provided by 10x Genomics are shown in color, the rest in gray. t-SNE shatters the cluster structure, while PHATE retains clusters as contiguous groups of cells. (D) Runtime of PHATE, t-SNE and UMAP on increasingly large subsamples of the EB data. Runtime was averaged across four runs. (E) Runtime of 12 visualization methods shown in Supplementary Figs. 3 and 8 across 19 datasets and corresponding line of best fit for each method. Where a method ran out of memory or took longer than one hour, the runtime is not shown and linear fits are cut off accordingly.

Supplementary Figure 7 Annotated PHATE visualizations of CyTOF iPSC data (n = 50000 cells) from Zunder et al. (Cell Stem Cell, vol. 16, no. 3, pp. 323-337, 2015) and branch expression analysis.

(A) The primary branch point between the two major branches (reprogrammed and refractory) of the data is highlighted. (B) The PHATE visualization colored by Lin28 (a marker associated with the transition to pluripotency (Polo et al., Cell, vol. 151, no. 7, pp. 1617-1632, 2012)) and Ccasp3 (associated with cell apoptosis). Lin28 expression is limited to the reprogrammed branch while Ccasp3 is primarily expressed in the refractory branch, indicating that the failure to reprogram may initiate apoptosis in these cells. (C) Analysis of branches on the PHATE embedding for the same iPSC CyTOF data, (D) bone marrow scRNA-seq dataset (\(n = 2730\) cells) from Paul et al. (Cell, vol. 163, no. 7, pp. 1663-1677, 2015), and (E) newly generated embryoid body scRNA-seq data (\(n = 16825\) cells). (Left) The PHATE visualization with identified branches. (Middle) Expression level for each cell ordered by branch and ordering within the branch. Cell ordering is calculated using Wanderlust (Bendall et al., Cell, vol. 157, no. 3, pp. 714-725, 2014) starting on the left-most point of each branch. Expression levels are z-scored for each gene. A colorbar is given below the expression matrices that identifies each branch and (in the case of the bone marrow scRNA-seq data) cell type. (Right) DREMI scores (Krishnaswamy et al., Science, vol. 346, no. 6213, p. 1250689, 2014) between gene expression levels and cell order within each branch. MAGIC (van Dijk et al., Cell, vol. 174, no. 3, pp. 716-729, 2018) is applied first in (D) and (E) to impute missing values using the same kernel used for PHATE and smaller t. For branch analysis of the bone marrow data in (D), we used 3 PHATE dimensions to obtain clearer branch separation.

Supplementary Figure 8 Comparison of PHATE to various methods on multiple biological datasets.

Note that methods with strong structural assumptions on the data, such as t-SNE (clusters) and Monocle2 (tree) are expected to fail on the subset of datasets which do not fit their assumptions. See Supplementary Note 2 for discussion.

Supplementary Figure 9 PHATE preserves separations and cluster structure in addition to continuum structure.

To quantify the ability of PHATE to preserve cluster structure, we generated 30 random datasets with cluster structure using the Splatter package (Zappia et al., Genome Biology, vol. 18, no. 1, p. 174, 2017). Each dataset has \(n = 2000\) cells and between 7 and 14 clusters. We then computed the Adjusted Rand Index (Rand, Journal of the American Statistical Association, vol. 66, no. 336, pp. 846-850, 1971) (ARI, y-axis) between the ground truth clusters and clusters obtained by running k-means clustering on the embeddings. An ARI of 1 means perfect recovery of the clusters. We performed this analysis on Splatter data with increasing amounts of noise added during generation. For each noise level we compare clustering on the raw data, on 2D PCA, 2D t-SNE, 2D UMAP, and 2D PHATE. On average, PHATE preserves local cluster structure as well as or better than the other methods. In the boxplots, the box limits indicate the lower and upper quartile values with a line at the median while the whiskers show the range of the data.
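The scoring step described above can be sketched with scikit-learn. The caption does not state how k was chosen for k-means, so this sketch assumes k equals the number of ground-truth clusters; the Gaussian blobs are hypothetical stand-ins for the Splatter data and for the embeddings being scored.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score


def cluster_recovery_ari(embedding, true_labels, seed=0):
    """ARI between ground-truth labels and k-means clusters of an embedding."""
    # Assumption: k is set to the number of ground-truth clusters.
    n_clusters = len(np.unique(true_labels))
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    predicted = kmeans.fit_predict(embedding)
    return adjusted_rand_score(true_labels, predicted)


# Hypothetical 2D "embeddings": four clusters at low and at high noise.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
true_labels = np.repeat(np.arange(4), 100)
clean = centers[true_labels] + rng.normal(scale=0.5, size=(400, 2))
noisy = centers[true_labels] + rng.normal(scale=8.0, size=(400, 2))

ari_clean = cluster_recovery_ari(clean, true_labels)
ari_noisy = cluster_recovery_ari(noisy, true_labels)
```

In the setting of the figure, `embedding` would be the raw data or the 2D output of each method, and the scores would be collected per noise level across the 30 datasets.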

Supplementary Figure 10 PHATE reveals structure in a variety of high-dimensional datasets.

(A) A 3D PHATE visualization of the Frey Faces dataset (\(n = 1965\) images) used in Roweis and Saul (Science, vol. 290, no. 5500, pp. 2323-2326, 2000). Points are colored by time within the video. Multiple branches corresponding to different poses are clearly visible. (B) PCA and PHATE embeddings of microbiome data from the American Gut project (\(n = 9660\) human samples), colored by body site, with branches annotated by their dominant genera or phyla. (C) The PHATE embedding of the same data from the American Gut project colored by 2 genera (Bacteroides and Prevotella) and a phylum (Actinobacteria) of bacteria. (D) The PHATE embedding of only the fecal samples from the American Gut project (\(n = 8596\)) colored by various genera (Bacteroides and Prevotella) and phyla (Firmicutes, Verrucomicrobia, and Proteobacteria) of bacteria. Each PHATE branch is associated with one of these bacterial groups. (E) PCA and PHATE embeddings of SNP data from the Human Origins dataset (\(n = 2345\) present-day humans) showing genotyped present-day humans from 203 populations (Patterson et al., Genetics, vol. 192, no. 3, pp. 1065-1093, 2012) with the population legend in (F).

Supplementary Figure 11 PHATE reveals structure in a variety of connectivity datasets.

(A) 3D PHATE visualization of human Hi-C data (Darrow et al., Proceedings of the National Academy of Sciences, p. 201609643, 2016) using all 23 chromosomes at 50 kb resolution (\(n = 56702\) locations on the chromosomes), colored by chromosome. Each point corresponds to a genomic fragment. (B) PHATE visualizations of the same human Hi-C data in A for chromosome 1 at 10 kb resolution colored by chromosome location (\(n = 22128\) chromosome locations). (C) 2D PHATE visualization of the same human Hi-C data for chromosome 1 at 10 kb resolution, colored by selected chromatin modification markers from ChIP-seq data (\(n = 22128\) chromosome locations). (D) Force-directed layout and PHATE visualizations of Facebook network data with data points colored by their degree (number of connections). The subnetworks are taken from the friend networks of selected individuals within the entire network. In all cases, PHATE reveals more structure. For the entire network, \(n = 3927\) nodes. For subnetworks 1 and 2, \(n = 1034\) and 532 nodes, respectively.

Supplementary Figure 12 Additional analysis with PHATE on scRNA-seq data measured from mouse retinal bipolar neurons from Shekhar et al. (Cell, vol. 166, no. 5, pp. 1308-1323, 2016).

(A) i. Initial PHATE embedding (\(n = 27499\) cells). The rod bipolar cells cluster (cluster 1) is circled. ii. Subsequent PHATE embedding of cluster 1, colored by k-means clustering to show heterogeneity within rod bipolar cells (\(n = 10889\) cells). (B) Transcriptional characterization of subtypes of rod bipolar cells from cluster 1, using known bipolar cell markers.

Supplementary Figure 13 PHATE using reweighted distances to highlight specific biological processes or “views” of the data.

(A) PHATE embedding of the CyTOF iPSC data (\(n = 220450\)) from Zunder et al. (Cell Stem Cell, vol. 16, no. 3, pp. 323-337, 2015) using (i) unweighted distances, (ii) distances after upweighting cell cycle markers, (iii) distances after upweighting stem cell markers, and (iv) distances after upweighting mitotic markers. (B) PHATE embedding of the same dataset colored by different markers (columns). From top to bottom: (i) PHATE cell cycle “view”, (ii) PHATE stem cell “view”, (iii) PHATE mitotic “view”.
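The caption does not spell out how the distances are reweighted. As a minimal sketch of one plausible approach (an assumption, not necessarily the authors' procedure), the selected marker columns can be scaled by a factor before pairwise Euclidean distances are computed, which is equivalent to a weighted Euclidean distance with squared weights on those markers. The function names, the marker indices and the weight of 3.0 are hypothetical.

```python
import numpy as np

def upweight_markers(X, marker_idx, weight=3.0):
    """Return a copy of X (cells x markers) with selected columns scaled by `weight`.

    Assumption: "upweighting" is modeled as simple column scaling.
    """
    Xw = np.asarray(X, dtype=float).copy()
    Xw[:, marker_idx] *= weight
    return Xw

def pairwise_euclidean(X):
    """Dense pairwise Euclidean distances; suitable for small examples only."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy data: 6 cells measured on 4 markers (random values, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

cell_cycle_idx = [0, 1]  # hypothetical indices of cell cycle markers

D_plain = pairwise_euclidean(X)
D_view = pairwise_euclidean(upweight_markers(X, cell_cycle_idx, weight=3.0))
```

The reweighted distance matrix `D_view` could then be passed to any distance-based embedding in place of `D_plain`; with a weight above 1, differences in the chosen markers dominate the resulting geometry.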

Supplementary Figure 14 Further analysis of the EB scRNA-seq data.

(A) Inverted images of hESCs and EBs at each timepoint of data collection. Structures of different densities are clearly visible late in the time course (D15-D27) indicating the formation of distinct cell types. The experiments were repeated independently n = 3 times. (B) The PHATE embedding of the EB data (\(n = 16825\) cells) colored by expression levels of selected markers. (C) Heatmap showing gene expression level in each cell in four of the branches starting with ESC. The number of cells in each branch is \(n = 2294\), 9507, 5543, and 4938 for the EN, ME, NE, and NC branches, respectively. Cell ordering is determined using Wanderlust (Bendall et al., Cell, vol. 157, no. 3, pp. 714-725, 2014). Genes were selected either manually or by high DREMI scores (Krishnaswamy et al., Science, vol. 346, no. 6213, p. 1250689, 2014) between gene expression and cell ordering. (D) The PHATE embedding of the EB data (\(n = 16825\) cells) colored by CD49d expression level from the scRNA-seq data (top) and by Spearman correlation between the scRNA-seq transcription factor expression and the CD49d-sorted bulk RNA-seq transcription factor expression per cell (bottom, n = 1213 transcription factors). (E) Same as (D), with CD142 and CD82. The Spearman correlation coefficient is highest in branch vii, which is the branch with the highest CD142 and CD82 expression. Bottom right: Scatter plot of single cell expression levels (\(n = 16825\) cells) between CD82 and CD142. Color corresponds to the Spearman correlation between the scRNA-seq expression and the CD142+CD82+ sorted bulk RNA-seq expression (\(n = 15111\) genes). The branch with highest correlation corresponds to cells that are positive in both CD142 and CD82. (F) Scatter plots showing the gating procedure for FACS sorting cell populations of sub-branch iii (CD49d and CD63) and sub-branch vii (CD82 and CD142). The experiments were repeated independently n = 3 times.
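The per-cell Spearman correlation used for coloring in (D) and (E) can be illustrated with a short sketch. It assumes a cells x genes single-cell matrix and a single bulk expression vector over the same genes; the function names and toy values are hypothetical and this is not the authors' code. Ties, which are common in scRNA-seq because of zero counts, receive average ranks.

```python
import numpy as np

def average_ranks(v):
    """1-based ranks of a 1D array, with tied values given their average rank."""
    v = np.asarray(v, dtype=float)
    _, inverse, counts = np.unique(v, return_inverse=True, return_counts=True)
    ends = np.cumsum(counts)        # last sorted position (1-based) of each unique value
    starts = ends - counts          # number of values strictly smaller
    avg = (starts + ends + 1) / 2.0
    return avg[inverse]

def spearman_per_cell(sc, bulk):
    """Spearman correlation of each row of `sc` (cells x genes) with `bulk` (genes)."""
    bulk_c = average_ranks(bulk)
    bulk_c = bulk_c - bulk_c.mean()
    rho = np.full(sc.shape[0], np.nan)
    for i, cell in enumerate(np.asarray(sc, dtype=float)):
        r = average_ranks(cell)
        r = r - r.mean()
        denom = np.sqrt((r ** 2).sum() * (bulk_c ** 2).sum())
        if denom > 0:               # undefined when a cell has constant expression
            rho[i] = (r @ bulk_c) / denom
    return rho

# Toy example: 3 cells, 5 genes (values chosen for illustration only).
bulk = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sc = np.array([
    [2.0, 4.0, 6.0, 8.0, 10.0],   # same ordering as bulk
    [5.0, 4.0, 3.0, 2.0, 1.0],    # reversed ordering
    [0.0, 0.0, 0.0, 0.0, 0.0],    # constant, so the correlation is undefined
])
rho = spearman_per_cell(sc, bulk)
```

The resulting vector `rho` holds one value per cell and can be used directly as the color of each point in an embedding. For real data, `scipy.stats.spearmanr` computes the same quantity for a single pair of vectors.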

Supplementary information

Supplementary Materials

Supplementary Figs. 1–14, Supplementary Tables 1–5 and Supplementary Notes 1–4.

Reporting Summary

Supplementary Video 1

The mesoderm branch. Rotating video of 3D PHATE visualizations of the mesoderm branch of the EB scRNA-seq data colored by the geometric mean of selected genes at each stage of the lineage specification tree in Fig. 6b.

Supplementary Video 2

The mesoderm branch. Rotating video of 3D PHATE visualizations of the mesoderm branch of the EB scRNA-seq data colored by the geometric mean of selected genes at each stage of the lineage specification tree in Fig. 6b.

Supplementary Video 3

The neuroectoderm branches. Rotating video of 3D PHATE visualizations of the neuroectoderm branches of the EB scRNA-seq data colored by the geometric mean of selected genes at each stage of the lineage specification tree in Fig. 6b.

Supplementary Video 4

PHATE visualizing the Frey Face dataset. Video showing the PHATE visualization (left) for the Frey Face dataset used by Roweis and Saul (ref. 6) (right). PHATE reveals multiple branches in the data that correspond to different poses. Two of the branches are highlighted in this video. The corresponding point in the PHATE visualization is highlighted as the video progresses.

Supplementary Video 5

PHATE visualizing chromosome 1 in Hi-C data. Rotating 3D PHATE visualization of chromosome 1 in the Hi-C data from Darrow et al. (ref. 15) at a resolution of 10 kilobases. Multiple folds are clearly visible in the visualization.

Supplementary Video 6

PHATE visualizing all chromosomes in Hi-C data. Rotating 3D PHATE visualization of all chromosomes in the Hi-C data from Darrow et al. (ref. 15) at a resolution of 50 kilobases. The embedding resembles the fractal globule structure proposed in Lieberman-Aiden et al. (ref. 57).
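Supplementary Videos 1–3 color cells by the geometric mean of selected genes. A minimal sketch of that per-cell score follows, assuming a cells x genes matrix of non-negative expression values. The function name and toy values are hypothetical, and whether the authors applied a pseudocount or other transformation is not stated here.

```python
import numpy as np

def geometric_mean_score(X, gene_idx):
    """Per-cell geometric mean over the selected gene columns of X (cells x genes).

    A cell with zero expression in any selected gene gets a score of 0,
    which is the mathematical geometric mean (no pseudocount is added).
    """
    sub = np.asarray(X, dtype=float)[:, gene_idx]
    with np.errstate(divide="ignore"):   # log(0) = -inf is intended here
        return np.exp(np.log(sub).mean(axis=1))

# Toy example: 3 cells, 3 genes; the first two genes are the "selected" set.
X = np.array([
    [1.0, 4.0, 7.0],
    [2.0, 8.0, 0.0],
    [0.0, 5.0, 3.0],
])
scores = geometric_mean_score(X, [0, 1])
```

Because the geometric mean is high only when all selected genes are expressed together, it highlights cells that co-express a marker set rather than cells that are high in just one of the genes.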

Cite this article

Moon, K.R., van Dijk, D., Wang, Z. et al. Visualizing structure and transitions in high-dimensional biological data. Nat Biotechnol 37, 1482–1492 (2019). https://doi.org/10.1038/s41587-019-0336-3
