Assessing single-cell transcriptomic variability through density-preserving data visualization

Abstract

Nonlinear data visualization methods, such as t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP), summarize the complex transcriptomic landscape of single cells in two or three dimensions, but they neglect the local density of data points in the original space, often resulting in misleading visualizations where densely populated subsets of cells are given more visual space than warranted by their transcriptional diversity in the dataset. Here we present den-SNE and densMAP, which are density-preserving visualization tools based on t-SNE and UMAP, respectively, and demonstrate their ability to accurately incorporate information about transcriptomic variability into the visual interpretation of single-cell RNA sequencing data. Applied to recently published datasets, our methods reveal significant changes in transcriptomic variability in a range of biological processes, including heterogeneity in transcriptomic variability of immune cells in blood and tumor, human immune cell specialization and the developmental trajectory of Caenorhabditis elegans. Our methods are readily applicable to visualizing high-dimensional data in other scientific domains.

Fig. 1: Overview of density-preserving data visualization.
Fig. 2: Density-preserving visualization more accurately captures the true underlying shape of synthetic datasets than existing tools.
Fig. 3: Density-preserving visualization reveals heterogeneity in transcriptomic variability of immune cells in blood and tumor.
Fig. 4: Density-preserving visualization of PBMCs reveals monocyte and DC subsets that differ in transcriptomic variability.
Fig. 5: Density-preserving visualization of C. elegans development reveals temporal dynamics of transcriptomic variability in different developmental lineages.

Data availability

The lung cancer (ref. 7) and C. elegans (ref. 9) datasets are available from the Gene Expression Omnibus (GEO) database with accession numbers GSE127465 and GSE126954, respectively. The PBMC dataset (ref. 8) is available from 10x Genomics at https://support.10xgenomics.com/single-cell-gene-expression/datasets. For our validation datasets, the secondary lung cancer dataset (ref. 17) is available from GEO (GSE99254), and the PBMC2 (ref. 22) and PBMC3 (ref. 23) datasets can be accessed through the Broad Institute’s Single Cell Portal (https://singlecell.broadinstitute.org/) with dataset IDs SCP43 and SCP345, respectively. Data access applications for the UK Biobank data can be submitted at https://www.ukbiobank.ac.uk/. The MNIST dataset is available at http://yann.lecun.com/exdb/mnist/. We also provide our preprocessed data for the main datasets (lung cancer, PBMC and C. elegans) at http://densvis.csail.mit.edu/datasets.

Code availability

We provide the software for den-SNE and densMAP in the densVis package, available at http://densvis.csail.mit.edu/ and https://github.com/hhcho/densvis. Our densMAP implementation is also available as part of the Python umap package (https://github.com/lmcinnes/umap).
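As a minimal sketch of how the densMAP mode of the Python umap package might be invoked, assuming the umap-learn distribution (version 0.5 or later) is installed: the `densmap`, `dens_lambda` and `output_dens` arguments are, to our understanding, part of that package's `UMAP` constructor, whereas the wrapper name `densmap_embed` and its default values are hypothetical and introduced only for illustration.

```python
def densmap_embed(data, n_components=2, dens_lambda=2.0, random_state=0):
    """Embed `data` (n_samples x n_features) with densMAP via umap-learn.

    Hypothetical helper for illustration; not part of densVis or umap.
    """
    # Imported lazily so the sketch can be loaded without umap-learn installed.
    import umap

    reducer = umap.UMAP(
        n_components=n_components,
        densmap=True,             # switch on the density-preserving objective
        dens_lambda=dens_lambda,  # relative weight of the density term
        output_dens=True,         # also return local radius estimates
        random_state=random_state,
    )
    # With output_dens=True, fit_transform is expected to return a 3-tuple:
    # (embedding, local radii in the input space, local radii in the embedding).
    embedding, radii_original, radii_embedding = reducer.fit_transform(data)
    return embedding, radii_original, radii_embedding
```

Calling `densmap_embed(X)` on a cells-by-features array `X` would return the two-dimensional coordinates together with the two sets of local radius estimates, which can then be compared to check how well density was preserved.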

References

1. Hie, B. et al. Computational methods for single-cell RNA sequencing. Ann. Rev. Biomed. Data Sci. 3, 339–364 (2020).
2. Chen, G., Ning, B. & Shi, T. Single-cell RNA-seq technologies and related computational data analysis. Front. Genet. 10, 317 (2019).
3. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
4. McInnes, L. & Healy, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at https://arxiv.org/abs/1802.03426 (2018).
5. Amir, E.-aD. et al. viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia. Nat. Biotechnol. 31, 545–552 (2013).
6. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38 (2019).
7. Zilionis, R. et al. Single-cell transcriptomics of human and mouse lung cancers reveals conserved myeloid populations across individuals and species. Immunity 50, 1317–1334 (2019).
8. Zheng, G. X. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
9. Packer, J. S. et al. A lineage-resolved molecular atlas of C. elegans embryogenesis at single-cell resolution. Science 365, eaax1971 (2019).
10. Healey, C. G. & Enns, J. T. Large datasets at a glance: combining textures and colors in scientific visualization. IEEE Trans. Vis. Comput. Graph. 5, 145–167 (1999).
11. Pearson, K. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, 559–572 (1901).
12. Cox, T. & Cox, M. Multidimensional Scaling, Second Edition (Chapman & Hall/CRC, 2001).
13. Tenenbaum, J. B., De Silva, V. & Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323 (2000).
14. Whiteside, T. L. & Parmiani, G. Tumor-infiltrating lymphocytes: their phenotype, functions and clinical use. Cancer Immunol. Immunother. 39, 15–21 (1994).
15. Bignon, A. et al. DUSP4-mediated accelerated T-cell senescence in idiopathic CD4 lymphopenia. Blood 125, 2507–2518 (2015).
16. Agenes, F., Bosco, N., Mascarell, L., Fritah, S. & Ceredig, R. Differential expression of regulator of G-protein signalling transcripts and in vivo migration of CD4+ naive and regulatory T cells. Immunology 115, 179–188 (2005).
17. Guo, X. et al. Global characterization of T cells in non-small-cell lung cancer by single-cell sequencing. Nat. Med. 24, 978–985 (2018).
18. Xiong, X., Zhao, Y., He, H. & Sun, Y. Ribosomal protein S27-like and S27 interplay with p53–MDM2 axis as a target, a substrate and a regulator. Oncogene 30, 1798–1811 (2011).
19. Palucka, K. A., Taquet, N., Sanchez-Chapuis, F. & Gluckman, J. C. Dendritic cells as the terminal stage of monocyte differentiation. J. Immunol. 160, 4587–4595 (1998).
20. Stansfield, B. K. & Ingram, D. A. Clinical significance of monocyte heterogeneity. Clin. Transl. Med. 4, 5 (2015).
21. Wells, C. A. et al. Alternate transcription of the Toll-like receptor signaling cascade. Genome Biol. 7, R10 (2006).
22. Villani, A.-C. et al. Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors. Science 356, eaah4573 (2017).
23. Slyper, M., Waldman, J., Dionne, D. & Li, B. Study: ICA: blood mononuclear cells (2 donors, 2 sites). https://singlecell.broadinstitute.org/single_cell/study/SCP345/ica-blood-mononuclear-cells-2-donors-2-sites.
24. Guilliams, M. et al. Dendritic cells, monocytes and macrophages: a unified nomenclature based on ontogeny. Nat. Rev. Immunol. 14, 571–578 (2014).
25. Hutchison, L. A. D., Berger, B. & Kohane, I. S. Meta-analysis of Caenorhabditis elegans single-cell developmental data reveals multi-frequency oscillation in gene activation. Bioinformatics 36, 4047–4057 (2019).
26. Freytag, V. et al. Genome-wide temporal expression profiling in Caenorhabditis elegans identifies a core gene set related to long-term memory. J. Neurosci. 37, 6661–6672 (2017).
27. Minkina, O. & Hunter, C. P. Intergenerational transmission of gene regulatory information in Caenorhabditis elegans. Trends Genet. 34, 54–64 (2018).
28. Maiden, M. C. J. Multilocus sequence typing of bacteria. Ann. Rev. Microbiol. 60, 561–588 (2006).
29. Sudlow, C. et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
30. Nicol, T. L. Detecting racial bias in algorithms and machine learning. J. Inf. Commun. Ethics Soc. 16, 252–260 (2018).
31. Diaz-Papkovich, A., Anderson-Trocmé, L., Ben-Eghan, C. & Gravel, S. UMAP reveals cryptic population structure and phenotype heterogeneity in large genomic cohorts. PLoS Genet. 15, 1–24 (2019).
32. Linderman, G. C., Rachh, M., Hoskins, J. G., Steinerberger, S. & Kluger, Y. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nat. Methods 16, 243–245 (2019).
33. Cho, H., Berger, B. & Peng, J. Generalizable and scalable visualization of single-cell data using neural networks. Cell Syst. 7, 185–191 (2018).
34. Linderman, G. C., Rachh, M., Hoskins, J. G., Steinerberger, S. & Kluger, Y. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nat. Methods 16, 243–245 (2019).
35. Eades, P. A heuristic for graph drawing. Congressus Numerantium 42, 149–160 (1984).
36. Harel, D. & Koren, Y. A fast multi-scale method for drawing large graphs. In International Symposium on Graph Drawing 183–196 (Springer, 2000).
37. Jansen, C. et al. Building gene regulatory networks from scatac-seq and scrna-seq using linked self organizing maps. PLoS Comput. Biol. 15, e1006555 (2019).
38. Dai, H. & Guan, Y. The nubeam reference-free approach to analyze metagenomic sequencing reads. Genome Res. 30, 1364–1375 (2020).
39. Eling, N., Richard, A. C., Richardson, S., Marioni, J. C. & Vallejos, C. A. Correcting the mean-variance dependency for differential variability testing using single-cell RNA sequencing data. Cell Syst. 7, 284–294 (2018).
40. Castex, G. M. Frames of reference: the effects of ethnocentric map projections on professional practice. Social Work 38, 685–693 (1993).
41. Haemer, K. W. Area bias in map presentation. Am. Stat. 3, 19 (1949).
42. Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 273–282 (2019).
43. Schiebinger, G. et al. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell 176, 928–943 (2019).
44. Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous single-cell transcriptomes using scanorama. Nat. Biotechnol. 37, 685–691 (2019).
45. Gelman, A. et al. Bayesian Data Analysis (CRC Press, 2013).
46. Hotelling, H. Relations between two sets of variates. Biometrika 28, 321–377 (1936).
47. Andrew, G., Arora, R., Bilmes, J. & Livescu, K. Deep canonical correlation analysis. In International Conference on Machine Learning, vol. 28, 1247–1255 (2013).
48. Kobak, D., Linderman, G., Steinerberger, S., Kluger, Y. & Berens, P. Heavy-tailed kernels reveal a finer cluster structure in t-SNE visualisations. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases 124–139 (Springer, 2019).
49. Healey, C. G. & Enns, J. T. Building perceptual textures to visualize multidimensional datasets. In Proceedings Visualization ’98 (Cat. No.98CB36276), 111–118 (IEEE, 1998).
50. Mann, H. B. & Whitney, D. R. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Statist. 18, 50–60 (1947).

Acknowledgements

This work is supported, in part, by NIH U01 CA250554 (to B.B.). H.C. is partially supported by Eric and Wendy Schmidt through the Schmidt Fellows Program at the Broad Institute. The authors thank B. Hie, B. DeMeo, E. Zhong and J. Peters for helpful discussions. Our visualization of genotype data was conducted using the UK Biobank Resource under application number 46341 in keeping with the informed consent given by its participants. BioRender.com was used to generate Fig. 4d.

Author information

Contributions

All authors conceived the method, evaluated results and wrote the manuscript. A.N. and H.C. implemented the software and conducted the experiments. B.B. and H.C. guided the research.

Corresponding authors

Correspondence to Bonnie Berger or Hyunghoon Cho.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Density-preserving methods more accurately visualize diversity of small subpopulations in UKBB data.

We visualize the genotype profiles of 97,676 UKBB participants (a 20% subsample of the dataset) using a, densMAP, b, UMAP, c, den-SNE and d, t-SNE. For each, in the left plot, points corresponding to white people are colored by five computationally identified subpopulations (Methods); in the middle plot, non-white people are colored according to their ethnicity; the right plot shows the correlation of local radius between the original dataset and the embedding, with points colored by ethnicity and R² reported. We show the analogous scatter plots using neighborhood count to measure local density in the visualization in Supplementary Figure 18. As 94% of the people in the UKBB dataset self-identified as white, the UMAP and t-SNE plots give overwhelming visual space to this group, hiding the genetic variability of the other ethnic groups. The density-preserving plots, however, clearly expand the clusters of non-white people as well as certain white subpopulations, more accurately conveying their genetic diversity.
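The local-radius correlation reported in these panels can be illustrated with a small sketch. It rests on a simplifying assumption that is ours, not the paper's: the local radius of a point is taken as the unweighted mean squared distance to its k nearest neighbors, whereas den-SNE and densMAP define it with neighbor weights from the underlying algorithm (Methods). The function names and the choice of k are hypothetical.

```python
import math
import random


def local_radii(points, k=5):
    """Mean squared Euclidean distance from each point to its k nearest neighbors.

    Simplified, unweighted stand-in for the local radius (an assumption).
    """
    radii = []
    for i, p in enumerate(points):
        sq_dists = sorted(
            sum((a - b) ** 2 for a, b in zip(p, q))
            for j, q in enumerate(points)
            if j != i
        )
        radii.append(sum(sq_dists[:k]) / k)
    return radii


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


def radius_agreement(original, embedding, k=5):
    """R^2 between log local radii in the original space and in the embedding."""
    log_orig = [math.log(r) for r in local_radii(original, k)]
    log_emb = [math.log(r) for r in local_radii(embedding, k)]
    return r_squared(log_orig, log_emb)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data: one tight and one diffuse Gaussian cluster in 5 dimensions.
    tight = [[rng.gauss(0.0, 0.1) for _ in range(5)] for _ in range(40)]
    diffuse = [[rng.gauss(10.0, 2.0) for _ in range(5)] for _ in range(40)]
    data = tight + diffuse
    # Stand-in "embedding": keep only the first two coordinates.
    embedding = [row[:2] for row in data]
    print(round(radius_agreement(data, embedding), 3))
```

A value near 1 indicates that points that are sparse in the original space remain sparse in the embedding. The brute-force neighbor search is quadratic in the number of points and is meant only for small illustrations.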

Extended Data Fig. 2 Density-preserving visualization of MNIST handwritten digit image dataset reveals the relative homogeneity of the digit 1.

We visualize the MNIST handwritten digits with a, den-SNE and t-SNE and b, densMAP and UMAP, with points colored by digit. Note that the size of the cluster corresponding to the digit 1 shrinks under both density-preserving algorithms. Plots on the right show the correlation of the local radii between the original dataset and the embedding in each algorithm, with points colored by digit and the R² score reported. The higher R² for the density-preserving methods illustrates that the digit 1 indeed has higher density than the other digits. We show the analogous scatter plots using neighborhood count to measure local density in Supplementary Figure 19.

Extended Data Fig. 3 den-SNE and densMAP are nearly as efficient as t-SNE and UMAP in runtime and memory.

We compare a, den-SNE and t-SNE and b, densMAP and UMAP with respect to runtime and peak memory usage on all the datasets analyzed in this study. For these tests, we exclude the time taken to compute the local radii of the final embedding, which is used only for evaluation and does not affect the embedding. Left plots show running time in seconds at different data sizes (achieved by subsampling the datasets; Methods); middle plots show the ratio of the density-preserving algorithm’s runtime to that of the original method; right plots show peak memory usage over different data sizes. Although density-preserving methods take longer, the overhead is small (around 30% additional runtime for den-SNE and 20% additional runtime for densMAP, both on our largest dataset). Both densMAP and UMAP obtain fast runtimes for large datasets, taking less than about 30 minutes for all our datasets. Peak memory usage is the same between t-SNE and den-SNE, and differs by a small constant between UMAP and densMAP.
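A comparison of this kind could be set up along the following lines. This is a sketch under stated assumptions, not the benchmarking code used in the study: `profile` and `compare` are hypothetical names, and `tracemalloc` only sees allocations made through Python's allocator, so memory used inside native extensions would need a process-level measurement instead.

```python
import random
import time
import tracemalloc


def profile(method, data):
    """Return (seconds, peak_bytes) for one call of method(data)."""
    tracemalloc.start()
    start = time.perf_counter()
    method(data)
    seconds = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return seconds, peak


def compare(baseline, variant, data, sizes, seed=0):
    """Profile two methods on random subsamples of `data` at each size."""
    rng = random.Random(seed)
    rows = []
    for n in sizes:
        subsample = rng.sample(data, n)  # subsample without replacement
        base_s, base_mem = profile(baseline, subsample)
        var_s, var_mem = profile(variant, subsample)
        rows.append({
            "n": n,
            "baseline_seconds": base_s,
            "variant_seconds": var_s,
            # Ratio of the variant's runtime to the baseline's runtime.
            "runtime_ratio": var_s / base_s if base_s > 0 else float("nan"),
            "baseline_peak_bytes": base_mem,
            "variant_peak_bytes": var_mem,
        })
    return rows
```

Here `baseline` and `variant` would be callables wrapping, for example, a UMAP run and a densMAP run on the same subsample; a runtime ratio of 1.2 corresponds to 20% additional runtime.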

Supplementary information

Supplementary Information

Supplementary Figs. 1–20, Supplementary Tables 1–11 and Supplementary Notes 1–4

Reporting Summary

Cite this article

Narayan, A., Berger, B. & Cho, H. Assessing single-cell transcriptomic variability through density-preserving data visualization. Nat Biotechnol (2021). https://doi.org/10.1038/s41587-020-00801-7
