Initialization is critical for preserving global data structure in both t-SNE and UMAP

Kobak, Dmitry; Linderman, George C.

doi:10.1038/s41587-020-00809-z

Matters Arising
Published: 01 February 2021

Nature Biotechnology volume 39, pages 156–157 (2021)

The Original Article was published on 03 December 2018

arising from Becht, E. et al. Nature Biotechnology https://doi.org/10.1038/nbt.4314 (2019)

One of the most ubiquitous analysis tools in single-cell transcriptomics and cytometry is t-distributed stochastic neighbor embedding (t-SNE)¹, which is used to visualize individual cells as points on a two-dimensional scatterplot such that similar cells are positioned close together². A related algorithm, called uniform manifold approximation and projection (UMAP)³, has attracted substantial attention in the single-cell community⁴. In Nature Biotechnology, Becht et al.⁴ argued that UMAP is preferable to t-SNE because it better preserves the global structure of the data and is more consistent across runs. Here we show that this alleged superiority of UMAP can be entirely attributed to different choices of initialization in the implementations used by Becht et al.: the t-SNE implementations by default used random initialization, while the UMAP implementation used a technique called Laplacian eigenmaps (LE)⁵ to initialize the embedding. We show that UMAP with random initialization preserves global structure as poorly as t-SNE with random initialization, while t-SNE with informative initialization performs as well as UMAP with informative initialization. On the basis of these observations, we argue that there is currently no evidence that the UMAP algorithm per se has any advantage over t-SNE in terms of preserving global structure. We also contend that these algorithms should always use informative initialization by default.
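The contrast between random and informative initialization can be illustrated with a minimal sketch. It only builds the two kinds of starting layout and does not run t-SNE or UMAP. The helper names `pca_init` and `random_init`, the toy data and the rescaling to a standard deviation of 1e-4 are assumptions made for this illustration, not details taken from the article or from any particular implementation.

```python
import numpy as np


def pca_init(X, n_components=2, target_std=1e-4):
    """Informative start: leading principal components, rescaled to a small spread.

    The target_std value is an assumption chosen for this sketch.
    """
    Xc = X - X.mean(axis=0)                      # center each feature
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    Y = U[:, :n_components] * S[:n_components]   # PCA scores
    return Y / Y[:, 0].std() * target_std


def random_init(n_samples, n_components=2, target_std=1e-4, seed=0):
    """Uninformative start: small isotropic Gaussian noise."""
    rng = np.random.default_rng(seed)
    return rng.normal(scale=target_std, size=(n_samples, n_components))


# Toy data (hypothetical): two well-separated groups of 50 points in 10 dimensions.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(8, 1, (50, 10))])

Y_pca = pca_init(X)
Y_rand = random_init(len(X))

# Distance between the two group centroids along the first embedding axis.
gap_pca = abs(Y_pca[:50, 0].mean() - Y_pca[50:, 0].mean())
gap_rand = abs(Y_rand[:50, 0].mean() - Y_rand[50:, 0].mean())
print("PCA start separates the groups more than a random start:", gap_pca > gap_rand)
```

In this toy setting the PCA layout already places the two groups apart before any optimization, whereas the random layout carries no information about them. Either array could then be passed as the starting layout to an implementation that accepts a user-supplied initialization.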

**Fig. 1: t-SNE and UMAP with random and non-random initialization.**

Data availability

The data in this study were sourced from refs. 8–10.

Code availability

The R code extending the analysis of Becht et al. is available at https://github.com/linqiaozhi/DR_benchmark_initialization. The Python code used to produce Fig. 1 is available at https://github.com/dkobak/tsne-umap-init.

References

1. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
2. Kobak, D. & Berens, P. The art of using t-SNE for single-cell transcriptomics. Nat. Commun. 10, 5416 (2019).
3. McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at https://arxiv.org/abs/1802.03426 (2018).
4. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–40 (2019).
5. Belkin, M. & Niyogi, P. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems 585–591 (2002).
6. Coifman, R. R. & Lafon, S. Diffusion maps. Appl. Comput. Harmon. Anal. 21, 5–30 (2006).
7. Linderman, G. C., Rachh, M., Hoskins, J. G., Steinerberger, S. & Kluger, Y. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nat. Methods 16, 243–245 (2019).
8. Samusik, N., Good, Z., Spitzer, M. H., Davis, K. L. & Nolan, G. P. Automated mapping of phenotype space with single-cell data. Nat. Methods 13, 493–496 (2016).
9. Wong, M. T. et al. A high-dimensional atlas of human T cell diversity reveals tissue-specific trafficking and cytokine signatures. Immunity 45, 442–456 (2016).
10. Han, X. et al. Mapping the mouse cell atlas by Microwell-seq. Cell 172, 1091–1107 (2018).
11. Policar, P. G., Strazar, M. & Zupan, B. openTSNE: a modular Python library for t-SNE dimensionality reduction and embedding. Preprint at bioRxiv https://doi.org/10.1101/731877 (2019).
12. Böhm, J. N., Berens, P. & Kobak, D. A unifying perspective on neighbor embeddings along the attraction–repulsion spectrum. Preprint at https://arxiv.org/abs/2007.08902 (2020).

Acknowledgements

The authors thank P. Berens, S. Steinerberger and Y. Kluger for discussions and helpful comments. D.K. was supported by the Deutsche Forschungsgemeinschaft (BE5601/4-1 and the Cluster of Excellence ‘Machine Learning—New Perspectives for Science’, EXC 2064, project number 390727645), the Federal Ministry of Education and Research (FKZ 01GQ1601 and 01IS18039A) and the National Institute of Mental Health of the National Institutes of Health under award number U19MH114830. G.C.L. was supported by the National Human Genome Research Institute (F30HG010102) and National Institutes of Health MSTP training grant T32GM007205. A portion of the benchmarks were run on computational resources funded by the National Institutes of Health (R01GM131642, principal investigator: Y. Kluger). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information

Authors and Affiliations

Institute for Ophthalmic Research, University of Tübingen, Tübingen, Germany
Dmitry Kobak
Applied Mathematics Program, Yale University, New Haven, CT, USA
George C. Linderman

Contributions

The authors contributed equally.

Corresponding authors

Correspondence to Dmitry Kobak or George C. Linderman.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Preservation of pairwise distances in embeddings.

The exact analogue of Fig. 5 in the original publication by Becht et al.⁴ To quote the original caption: ‘Box plots represent distances across pairs of points in the embeddings, binned using 50 equal-width bins over the pairwise distances in the original space using 10,000 randomly selected points, leading to 49,995,000 pairs of pairwise distances. […] The value of the Pearson correlation coefficient computed over the pairs of pairwise distances is reported. For the box plots, the central bar represents the median, and the top and bottom boundary of the boxes represent the 75th and 25th percentiles, respectively. The whiskers represent 1.5 times the interquartile range above (or, respectively, below) the top (or, respectively, bottom) box boundary, truncated to the data range if applicable.’ We recomputed all embeddings (except for the UMAP with LE initialization of the Wong et al.⁹ dataset, which was loaded from an external source, as in the code accompanying the original publication). All algorithms were run with the same parameters as in the original publication (which were always the default parameters, apart from n_neighbors set to 30 in UMAP for the Han et al.¹⁰ dataset; we kept this value for both initializations). We used the same version of FIt-SNE as in the original publication, to make sure that all the default parameters stayed the same. The y-axis goes from zero to the maximum pairwise distance in all subplots.
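The Pearson correlation over pairs of pairwise distances described in this caption can be sketched as follows. This is a simplified illustration on toy data, not the benchmark code: the function names, the 200-point toy dataset and the two toy "embeddings" are assumptions, and the binning into 50 equal-width bins for the box plots is omitted.

```python
import math
import random
from itertools import combinations


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points, in a fixed order."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def distance_preservation(original, embedding):
    """Correlation between original-space and embedding-space pairwise distances."""
    return pearson(pairwise_distances(original), pairwise_distances(embedding))


# 10,000 points give 10,000 * 9,999 / 2 unordered pairs, the figure quoted in the caption.
n_pairs = math.comb(10_000, 2)

# Toy check (hypothetical data): 200 points in 5 dimensions.
random.seed(0)
high = [[random.gauss(0, 1) for _ in range(5)] for _ in range(200)]
faithful = [p[:2] for p in high]                      # keeps two original coordinates
scrambled = random.sample(faithful, k=len(faithful))  # same layout, wrong point labels

r_faithful = distance_preservation(high, faithful)
r_scrambled = distance_preservation(high, scrambled)
print(n_pairs, round(r_faithful, 3), round(r_scrambled, 3))
```

A layout that retains part of the original geometry scores clearly higher than one in which the same coordinates are assigned to the wrong points, which is the sense in which this statistic measures preservation of global structure.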

Extended Data Fig. 2 Reproducibility of large-scale structures in embeddings.

The exact analogue of Fig. 6 in the original publication⁴. To quote the original caption: ‘Bar plots represent the average unsigned Pearson correlation coefficient of the points’ coordinates in the embedding of subsamples versus in the embedding of the full dataset, thus measuring the correlation of coordinates in subsamples versus in the embedding of the full dataset, up to symmetries along the graph axes. Bar heights represent the average across three replicates and vertical bars the corresponding s.d.’
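The reproducibility score in this caption can be read as a per-axis absolute Pearson correlation, averaged over the embedding axes. The sketch below follows that reading on toy data; the function names and the synthetic "embeddings" are assumptions for illustration and are not taken from the benchmark code.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def coordinate_reproducibility(full_coords, sub_coords):
    """Average unsigned per-axis correlation between two embeddings of the same points.

    Taking the absolute value makes the score insensitive to mirror flips along an axis.
    """
    n_axes = len(full_coords[0])
    scores = []
    for axis in range(n_axes):
        a = [p[axis] for p in full_coords]
        b = [p[axis] for p in sub_coords]
        scores.append(abs(pearson(a, b)))
    return sum(scores) / n_axes


# Toy data (hypothetical): positions of 100 subsampled points in a "full" embedding.
random.seed(1)
full = [(random.gauss(0, 3), random.gauss(0, 1)) for _ in range(300)]
idx = random.sample(range(len(full)), 100)
full_sub = [full[i] for i in idx]

# A subsample embedding that is a mirrored, slightly noisy copy of the full one.
mirrored = [(-x + random.gauss(0, 0.05), y + random.gauss(0, 0.05)) for x, y in full_sub]
# A subsample embedding with the two axes swapped.
swapped = [(y, x) for x, y in full_sub]

score_mirrored = coordinate_reproducibility(full_sub, mirrored)
score_swapped = coordinate_reproducibility(full_sub, swapped)
print(round(score_mirrored, 3), round(score_swapped, 3))
```

Under this reading the score stays high for a mirrored layout but drops when the axes are exchanged, so it rewards embeddings whose large-scale arrangement is stable from run to run rather than merely similar up to rotation.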

Extended Data Fig. 3 Qualitative assessment of the reproducibility of embeddings using the Samusik et al.⁸ dataset.

The exact analogue of Supplementary Fig. 7a from the original publication⁴. To quote the original caption: ‘Embeddings of full datasets as well as subsamples of varying sizes replicated thrice for [four] dimensionality reduction methods. The color-code is generated using the embedding of the full dataset and propagated to the subsamples.’

Extended Data Fig. 4 Qualitative assessment of the reproducibility of embeddings using the Wong et al.⁹ dataset.

The exact analogue of Supplementary Fig. 7b from the original publication⁴. To quote the original caption: ‘Embeddings of full datasets as well as subsamples of varying sizes replicated thrice for [four] dimensionality reduction methods. The color-code is generated using the embedding of the full dataset and propagated to the subsamples.’

Extended Data Fig. 5 Qualitative assessment of the reproducibility of embeddings using the Han et al.¹⁰ dataset.

The exact analogue of Supplementary Fig. 7c from the original publication⁴. To quote the original caption: ‘Embeddings of full datasets as well as subsamples of varying sizes replicated thrice for [four] dimensionality reduction methods. The color-code is generated using the embedding of the full dataset and propagated to the subsamples.’

Extended Data Fig. 6 Annotated embeddings of the Samusik_01 dataset (sample size n=86,864).

Top row: UMAP with random initialization (left) and t-SNE with random initialization (right). Bottom row: UMAP with default initialization (left) and t-SNE with PCA initialization (right). The bottom-left and upper-right panels are analogues of Fig. 2a,b from the original publication⁴. Note that the T cells are not colocalized in the UMAP embedding with random initialization.

About this article

Cite this article

Kobak, D., Linderman, G.C. Initialization is critical for preserving global data structure in both t-SNE and UMAP. Nat Biotechnol 39, 156–157 (2021). https://doi.org/10.1038/s41587-020-00809-z

Received: 02 December 2019
Accepted: 23 December 2020
Published: 01 February 2021
Issue Date: February 2021
DOI: https://doi.org/10.1038/s41587-020-00809-z
