Initialization is critical for preserving global data structure in both t-SNE and UMAP

The Original Article was published on 03 December 2018

Fig. 1: t-SNE and UMAP with random and non-random initialization.

Data availability

The data in this study were sourced from refs. 8,9,10.

Code availability

The R code extending the analysis of Becht et al. is available at https://github.com/linqiaozhi/DR_benchmark_initialization. The Python code used to produce Fig. 1 is available at https://github.com/dkobak/tsne-umap-init.

References

  1. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).

  2. Kobak, D. & Berens, P. The art of using t-SNE for single-cell transcriptomics. Nat. Commun. 10, 5416 (2019).

  3. McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at https://arxiv.org/abs/1802.03426 (2018).

  4. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–40 (2019).

  5. Belkin, M. & Niyogi, P. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems 585–591 (2002).

  6. Coifman, R. R. & Lafon, S. Diffusion maps. Appl. Comput. Harmon. Anal. 21, 5–30 (2006).

  7. Linderman, G. C., Rachh, M., Hoskins, J. G., Steinerberger, S. & Kluger, Y. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nat. Methods 16, 243–245 (2019).

  8. Samusik, N., Good, Z., Spitzer, M. H., Davis, K. L. & Nolan, G. P. Automated mapping of phenotype space with single-cell data. Nat. Methods 13, 493–496 (2016).

  9. Wong, M. T. et al. A high-dimensional atlas of human T cell diversity reveals tissue-specific trafficking and cytokine signatures. Immunity 45, 442–456 (2016).

  10. Han, X. et al. Mapping the mouse cell atlas by Microwell-seq. Cell 172, 1091–1107 (2018).

  11. Policar, P. G., Strazar, M. & Zupan, B. openTSNE: a modular Python library for t-SNE dimensionality reduction and embedding. Preprint at bioRxiv https://doi.org/10.1101/731877 (2019).

  12. Böhm, J. N., Berens, P. & Kobak, D. A unifying perspective on neighbor embeddings along the attraction–repulsion spectrum. Preprint at https://arxiv.org/abs/2007.08902 (2020).

Acknowledgements

The authors thank P. Berens, S. Steinerberger and Y. Kluger for discussions and helpful comments. D.K. was supported by the Deutsche Forschungsgemeinschaft (BE5601/4-1 and the Cluster of Excellence ‘Machine Learning—New Perspectives for Science’, EXC 2064, project number 390727645), the Federal Ministry of Education and Research (FKZ 01GQ1601 and 01IS18039A) and the National Institute of Mental Health of the National Institutes of Health under award number U19MH114830. G.C.L. was supported by the National Human Genome Research Institute (F30HG010102) and National Institutes of Health MSTP training grant T32GM007205. A portion of the benchmarks were run on computational resources funded by the National Institutes of Health (R01GM131642, principal investigator: Y. Kluger). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information

Contributions

The authors contributed equally.

Corresponding authors

Correspondence to Dmitry Kobak or George C. Linderman.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Preservation of pairwise distances in embeddings.

The exact analogue of Fig. 5 in the original publication by Becht et al.4 To quote the original caption: ‘Box plots represent distances across pairs of points in the embeddings, binned using 50 equal-width bins over the pairwise distances in the original space using 10,000 randomly selected points, leading to 49,995,000 pairs of pairwise distances. […] The value of the Pearson correlation coefficient computed over the pairs of pairwise distances is reported. For the box plots, the central bar represents the median, and the top and bottom boundary of the boxes represent the 75th and 25th percentiles, respectively. The whiskers represent 1.5 times the interquartile range above (or, respectively, below) the top (or, respectively, bottom) box boundary, truncated to the data range if applicable.’ We recomputed all embeddings (except for the UMAP with LE initialization of the Wong et al.9 dataset, which was loaded from an external source, as in the code accompanying the original publication). All algorithms were run with the same parameters as in the original publication (which were always the default parameters, apart from n_neighbors set to 30 in UMAP for the Han et al.10 dataset; we kept this value for both initializations). We used the same version of FIt-SNE as in the original publication, to make sure that all the default parameters stayed the same. The y-axis ranges from zero to the maximum pairwise distance in all subplots.
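
The distance-preservation measure quoted in this caption can be illustrated with a minimal, standard-library Python sketch. It is not the authors' code (their R and Python code is linked under Code availability). The function names, the `seed` argument and the choice to return per-bin lists instead of drawing box plots are assumptions made here for illustration. With 10,000 sampled points the pure-Python loops below would be slow, so a real analysis would need a vectorized implementation.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def pairwise_distances(points):
    """Euclidean distances over all unordered pairs, in a fixed order."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def distance_preservation(original, embedding, n_sample=10_000, n_bins=50, seed=0):
    """Return (Pearson r, per-bin lists of embedding distances).

    `original` and `embedding` hold the same points, in the same order,
    in the high-dimensional space and in the 2D embedding respectively.
    """
    rng = random.Random(seed)  # hypothetical seed, for repeatable subsampling
    n = len(original)
    idx = rng.sample(range(n), min(n_sample, n))
    d_orig = pairwise_distances([original[i] for i in idx])
    d_emb = pairwise_distances([embedding[i] for i in idx])

    # Equal-width bins over the distances in the original space.
    lo, hi = min(d_orig), max(d_orig)
    width = ((hi - lo) / n_bins) or 1.0  # guard against identical distances
    bins = [[] for _ in range(n_bins)]
    for a, b in zip(d_orig, d_emb):
        k = min(int((a - lo) / width), n_bins - 1)  # the maximum goes in the last bin
        bins[k].append(b)
    return pearson(d_orig, d_emb), bins
```

Each list in `bins` holds the embedding distances that a box in the plot would summarize, and the returned correlation is computed over all sampled pairs at once.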

Extended Data Fig. 2 Reproducibility of large-scale structures in embeddings.

The exact analogue of Fig. 6 in the original publication4. To quote the original caption: ‘Bar plots represent the average unsigned Pearson correlation coefficient of the points’ coordinates in the embedding of subsamples versus in the embedding of the full dataset, thus measuring the correlation of coordinates in subsamples versus in the embedding of the full dataset, up to symmetries along the graph axes. Bar heights represent the average across three replicates and vertical bars the corresponding s.d.’
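
The reproducibility score described in this caption can be sketched in a few lines of Python. This is a hedged illustration, not the original implementation. It assumes that coordinates are compared axis by axis, that taking the absolute value is what handles reflections along an axis, and that the function and argument names are hypothetical. Rotations or axis swaps between embeddings are not corrected for in this sketch.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def coordinate_reproducibility(full_embedding, sub_embedding, sub_indices):
    """Average unsigned Pearson r between matching embedding coordinates.

    full_embedding: coordinates of every point in the full-dataset embedding.
    sub_embedding:  coordinates of the subsample, embedded on its own.
    sub_indices:    position of each subsampled point in the full dataset,
                    in the same order as the rows of sub_embedding.
    """
    full_rows = [full_embedding[i] for i in sub_indices]
    n_dims = len(full_rows[0])
    correlations = []
    for d in range(n_dims):
        full_axis = [row[d] for row in full_rows]
        sub_axis = [row[d] for row in sub_embedding]
        # abs() makes the score invariant to a flip along this axis.
        correlations.append(abs(pearson(full_axis, sub_axis)))
    return sum(correlations) / n_dims
```

Averaging this score over replicate subsamples of a given size would give one bar height, with the standard deviation across replicates giving the error bar.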

Extended Data Fig. 3 Qualitative assessment of the reproducibility of embeddings using the Samusik et al.8 dataset.

The exact analogue of Supplementary Fig. 7a from the original publication4. To quote the original caption: ‘Embeddings of full datasets as well as subsamples of varying sizes replicated thrice for [four] dimensionality reduction methods. The color-code is generated using the embedding of the full dataset and propagated to the subsamples.’

Extended Data Fig. 4 Qualitative assessment of the reproducibility of embeddings using the Wong et al.9 dataset.

The exact analogue of Supplementary Fig. 7b from the original publication4. To quote the original caption: ‘Embeddings of full datasets as well as subsamples of varying sizes replicated thrice for [four] dimensionality reduction methods. The color-code is generated using the embedding of the full dataset and propagated to the subsamples.’

Extended Data Fig. 5 Qualitative assessment of the reproducibility of embeddings using the Han et al.10 dataset.

The exact analogue of Supplementary Fig. 7c from the original publication4. To quote the original caption: ‘Embeddings of full datasets as well as subsamples of varying sizes replicated thrice for [four] dimensionality reduction methods. The color-code is generated using the embedding of the full dataset and propagated to the subsamples.’

Extended Data Fig. 6 Annotated embeddings of the Samusik_01 dataset (sample size n=86,864).

Top row: UMAP with random initialization (left) and t-SNE with random initialization (right). Bottom row: UMAP with default initialization (left) and t-SNE with PCA initialization (right). The bottom-left and top-right panels are analogues of Fig. 2a,b from the original publication4. Note that the T cells are not colocalized in the UMAP embedding with random initialization.

About this article

Cite this article

Kobak, D., Linderman, G.C. Initialization is critical for preserving global data structure in both t-SNE and UMAP. Nat Biotechnol 39, 156–157 (2021). https://doi.org/10.1038/s41587-020-00809-z
