Dimensionality reduction for visualizing single-cell data using UMAP

Abstract

Advances in single-cell technologies have enabled high-resolution dissection of tissue composition. Several tools for dimensionality reduction are available to analyze the large number of parameters generated in single-cell studies. Recently, a nonlinear dimensionality-reduction technique, uniform manifold approximation and projection (UMAP), was developed for the analysis of any type of high-dimensional data. Here we apply it to biological data, using three well-characterized mass cytometry and single-cell RNA sequencing datasets. Comparing the performance of UMAP with five other tools, we find that UMAP provides the fastest run times, highest reproducibility and the most meaningful organization of cell clusters. The work highlights the use of UMAP for improved visualization and interpretation of single-cell data.

Figure 1: UMAP embeds local and large-scale structure of the data.
Figure 2: UMAP embeddings of bone marrow and blood samples recapitulate hematopoiesis.
Figure 3: Run times of five dimensionality reduction methods for inputs of varying sizes.
Figure 4: Analysis of local data structure in embeddings produced by each algorithm.
Figure 5: Preservation of pairwise distances in embeddings.
Figure 6: Reproducibility of large-scale structures in embeddings.

Acknowledgements

We thank members of the Singapore Immunology Network and notably members of the E.W.N. laboratory. We thank S. Li, Y. Simoni, M. Chng, Y. Cheng, J.W. Lim and M. Fehlings for their insightful feedback. This study was funded by A-STAR/SIgN core funding and A-STAR/SIgN immunomonitoring platform funding.

Author information

Contributions

E.B., L.M., J.H., C.-A.D., I.W.H.K. and E.W.N. analyzed data. L.G.N., F.G. and E.W.N. helped supervise the project. L.M. and J.H. developed UMAP. All authors participated in writing and revising the manuscript.

Corresponding author

Correspondence to Evan W. Newell.

Ethics declarations

Competing interests

E.W.N. is a board director and shareholder of immunoSCAPE Pte. Ltd., which is an immune profiling service provider.

Integrated supplementary information

Supplementary Figure 1 Phenograph clustering identifies cell clusters in the Wong dataset

a) Phenotypic characterization of the Phenograph clusters. Each cluster medoid is shown after column-wise Z-score transformation. b) Location of each Phenograph cluster on the UMAP (left), t-SNE (middle) and 2D PCA (right) embeddings. For clarity, only twelve clusters are shown per plot.
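The column-wise Z-score transformation mentioned in panel a can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `zscore_columns`, the toy medoid values and the use of the population standard deviation are all assumptions made here.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Column-wise Z-score: center and scale each marker (column) across clusters (rows)."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    # Assumption: population SD; a sample SD (statistics.stdev) would also be reasonable.
    sds = [pstdev(c) for c in columns]
    return [
        # Constant columns (SD of zero) are mapped to 0.0 to avoid dividing by zero.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, sds)]
        for row in rows
    ]


# Hypothetical medoids: 3 clusters x 2 markers (values invented for illustration).
medoids = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = zscore_columns(medoids)
```

After scaling, every marker has mean zero across clusters, so markers with very different expression ranges can share one color scale.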

Supplementary Figure 2 Annotation of tissues of origin on UMAP, t-SNE and PCA plots

Scatterplots of embeddings of the Wong dataset using UMAP (top), t-SNE (middle) and 2D PCA (bottom), color-coded by tissue of origin.

Supplementary Figure 3 Identification of unlabeled erythrocytes in the Samusik_01 dataset

Expression of Ter119 (a marker for mature erythrocytes) color-coded on the UMAP embedding of the Samusik_01 dataset.

Supplementary Figure 4 Surface densities of events in UMAP and t-SNE embeddings

Heatmap of event density on a 300 × 300 square grid laid over the UMAP or t-SNE projection of the Samusik_01 dataset. The number of events in each bin is color-coded.
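The binning behind such a density heatmap can be sketched as below. This is only an illustration under stated assumptions: the function name `density_grid`, the choice to span the grid over the bounding box of the embedding and the toy coordinates are not taken from the article.

```python
def density_grid(points, bins=300):
    """Count events per cell of a bins x bins grid spanning the embedding's bounding box."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, y_min = min(xs), min(ys)
    # Fall back to a span of 1.0 if all points share a coordinate.
    x_span = (max(xs) - x_min) or 1.0
    y_span = (max(ys) - y_min) or 1.0
    grid = [[0] * bins for _ in range(bins)]
    for x, y in points:
        # Points on the upper edge are clamped into the last bin.
        col = min(int((x - x_min) / x_span * bins), bins - 1)
        row = min(int((y - y_min) / y_span * bins), bins - 1)
        grid[row][col] += 1
    return grid


# Hypothetical 2D embedding coordinates (invented for illustration).
embedding = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
grid = density_grid(embedding, bins=300)
```

Each entry of `grid` is the number of events falling in that bin, which is the quantity the heatmap color-codes.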

Supplementary Figure 5 Pre-filtering of the Han dataset

Top: UMAP projection of the full Han dataset annotated by AUC scores for various cell lineages (red: high score, blue: low score). Bottom: full Han dataset colored by sample type, sample ID and pre-filtering status.

Supplementary Figure 6 Side-by-side comparison of each dimensionality reduction method across all datasets annotated by cell types

Scatterplots of six dimensionality-reduction methods and six datasets. Cell populations are annotated using manual gating (Samusik dataset), manually labeled Phenograph clusters (Wong dataset) or sample of origin (Han_400k dataset).

Supplementary Figure 7 Qualitative assessment of the reproducibility of embeddings

Embeddings of full datasets and of subsamples of varying sizes, each replicated three times, for five dimensionality reduction methods. The color code is generated from the embedding of the full dataset and propagated to the subsamples. Datasets shown are the (a) Samusik_all, (b) Wong and (c) Han_400k datasets.
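One way to picture the color propagation described here is the sketch below. It is an assumption-laden illustration rather than the authors' procedure: the position-to-RGB mapping, the function name `colors_from_embedding` and the toy coordinates and indices are all hypothetical.

```python
def colors_from_embedding(points):
    """Assign each event an RGB triple in [0, 1] from its position in a 2D embedding."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, y_min = min(xs), min(ys)
    x_span = (max(xs) - x_min) or 1.0
    y_span = (max(ys) - y_min) or 1.0
    # Assumed mapping: x drives red, y drives green, blue is held constant.
    return [((x - x_min) / x_span, (y - y_min) / y_span, 0.5) for x, y in points]


# Hypothetical embedding of the full dataset (one coordinate pair per event).
full_embedding = [(0.0, 0.0), (2.0, 4.0), (1.0, 2.0), (2.0, 0.0)]
full_colors = colors_from_embedding(full_embedding)

# A subsample is identified by event indices into the full dataset, so each
# event keeps the color it received from the full-dataset embedding.
subsample_ids = [3, 1]
subsample_colors = [full_colors[i] for i in subsample_ids]
```

Because colors follow events rather than positions, a subsample embedding that preserves the large-scale layout of the full embedding shows a similar color gradient.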

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–7 (PDF 1451 kb)

Life Sciences Reporting Summary (PDF 130 kb)

Supplementary Table 1

Description of the datasets (XLSX 5 kb)

Supplementary Table 2

Algorithms benchmarked (XLSX 5 kb)

About this article

Cite this article

Becht, E., McInnes, L., Healy, J. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat Biotechnol 37, 38–44 (2019). https://doi.org/10.1038/nbt.4314
