Assessing single-cell transcriptomic variability through density-preserving data visualization

Narayan, Ashwin; Berger, Bonnie; Cho, Hyunghoon

doi:10.1038/s41587-020-00801-7

Article
Published: 18 January 2021

Nature Biotechnology volume 39, pages 765–774 (2021)

Abstract

Nonlinear data visualization methods, such as t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP), summarize the complex transcriptomic landscape of single cells in two dimensions or three dimensions, but they neglect the local density of data points in the original space, often resulting in misleading visualizations where densely populated subsets of cells are given more visual space than warranted by their transcriptional diversity in the dataset. Here we present den-SNE and densMAP, which are density-preserving visualization tools based on t-SNE and UMAP, respectively, and demonstrate their ability to accurately incorporate information about transcriptomic variability into the visual interpretation of single-cell RNA sequencing data. Applied to recently published datasets, our methods reveal significant changes in transcriptomic variability in a range of biological processes, including heterogeneity in transcriptomic variability of immune cells in blood and tumor, human immune cell specialization and the developmental trajectory of Caenorhabditis elegans. Our methods are readily applicable to visualizing high-dimensional data in other scientific domains.

**Fig. 1: Overview of density-preserving data visualization.**

**Fig. 2: Density-preserving visualization more accurately captures the true underlying shape of synthetic datasets than existing tools.**

**Fig. 3: Density-preserving visualization reveals heterogeneity in transcriptomic variability of immune cells in blood and tumor.**

**Fig. 4: Density-preserving visualization of PBMCs reveals monocyte and DC subsets that differ in transcriptomic variability.**

**Fig. 5: Density-preserving visualization of *C. elegans* development reveals temporal dynamics of transcriptomic variability in different developmental lineages.**

Data availability

The lung cancer⁷ and C. elegans⁹ datasets are available from the Gene Expression Omnibus (GEO) database with accession numbers GSE127465 and GSE126954, respectively. The PBMC dataset⁸ is available from 10x Genomics at https://support.10xgenomics.com/single-cell-gene-expression/datasets. For our validation datasets, the secondary lung cancer dataset¹⁷ is available from GEO (GSE99254), and the PBMC2 (ref. 22) and PBMC3 (ref. 23) datasets can be accessed through the Broad Institute’s Single Cell Portal (https://singlecell.broadinstitute.org/) with dataset IDs SCP43 and SCP345, respectively. Data access applications for the UK Biobank data can be submitted at https://www.ukbiobank.ac.uk/. The MNIST dataset is available at http://yann.lecun.com/exdb/mnist/. We also provide our preprocessed data for the main datasets (lung cancer, PBMC and C. elegans) at http://densvis.csail.mit.edu/datasets.

Code availability

We provide the software for den-SNE and densMAP in the densVis package available at http://densvis.csail.mit.edu/ and https://github.com/hhcho/densvis. Our densMAP implementation is also available as part of the Python umap package (https://github.com/lmcinnes/umap).

References

1. Hie, B. et al. Computational methods for single-cell RNA sequencing. Ann. Rev. Biomed. Data Sci. 3, 339–364 (2020).
2. Chen, G., Ning, B. & Shi, T. Single-cell RNA-seq technologies and related computational data analysis. Front. Genet. 10, 317 (2019).
3. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
4. McInnes, L. & Healy, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at https://arxiv.org/abs/1802.03426 (2018).
5. Amir, E.-aD. et al. viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia. Nat. Biotechnol. 31, 545–552 (2013).
6. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38 (2019).
7. Zilionis, R. et al. Single-cell transcriptomics of human and mouse lung cancers reveals conserved myeloid populations across individuals and species. Immunity 50, 1317–1334 (2019).
8. Zheng, G. X. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
9. Packer, J. S. et al. A lineage-resolved molecular atlas of C. elegans embryogenesis at single-cell resolution. Science 365, eaax1971 (2019).
10. Healey, C. G. & Enns, J. T. Large datasets at a glance: combining textures and colors in scientific visualization. IEEE Trans. Vis. Comput. Graph. 5, 145–167 (1999).
11. Pearson, K. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, 559–572 (1901).
12. Cox, T. & Cox, M. Multidimensional Scaling, Second Edition (Chapman & Hall/CRC, 2001).
13. Tenenbaum, J. B., De Silva, V. & Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323 (2000).
14. Whiteside, T. L. & Parmiani, G. Tumor-infiltrating lymphocytes: their phenotype, functions and clinical use. Cancer Immunol. Immunother. 39, 15–21 (1994).
15. Bignon, A. et al. DUSP4-mediated accelerated T-cell senescence in idiopathic CD4 lymphopenia. Blood 125, 2507–2518 (2015).
16. Agenes, F., Bosco, N., Mascarell, L., Fritah, S. & Ceredig, R. Differential expression of regulator of G-protein signalling transcripts and in vivo migration of CD4⁺ naive and regulatory T cells. Immunology 115, 179–188 (2005).
17. Guo, X. et al. Global characterization of T cells in non-small-cell lung cancer by single-cell sequencing. Nat. Med. 24, 978–985 (2018).
18. Xiong, X., Zhao, Y., He, H. & Sun, Y. Ribosomal protein S27-like and S27 interplay with p53–MDM2 axis as a target, a substrate and a regulator. Oncogene 30, 1798–1811 (2011).
19. Palucka, K. A., Taquet, N., Sanchez-Chapuis, F. & Gluckman, J. C. Dendritic cells as the terminal stage of monocyte differentiation. J. Immunol. 160, 4587–4595 (1998).
20. Stansfield, B. K. & Ingram, D. A. Clinical significance of monocyte heterogeneity. Clin. Transl. Med. 4, 5 (2015).
21. Wells, C. A. et al. Alternate transcription of the Toll-like receptor signaling cascade. Genome Biol. 7, R10 (2006).
22. Villani, A.-C. et al. Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors. Science 356, eaah4573 (2017).
23. Slyper, M., Waldman, J., Dionne, D. & Li, B. Study: ICA: blood mononuclear cells (2 donors, 2 sites). https://singlecell.broadinstitute.org/single_cell/study/SCP345/ica-blood-mononuclear-cells-2-donors-2-sites.
24. Guilliams, M. et al. Dendritic cells, monocytes and macrophages: a unified nomenclature based on ontogeny. Nat. Rev. Immunol. 14, 571–578 (2014).
25. Hutchison, L. A. D., Berger, B. & Kohane, I. S. Meta-analysis of Caenorhabditis elegans single-cell developmental data reveals multi-frequency oscillation in gene activation. Bioinformatics 36, 4047–4057 (2019).
26. Freytag, V. et al. Genome-wide temporal expression profiling in Caenorhabditis elegans identifies a core gene set related to long-term memory. J. Neurosci. 37, 6661–6672 (2017).
27. Minkina, O. & Hunter, C. P. Intergenerational transmission of gene regulatory information in Caenorhabditis elegans. Trends Genet. 34, 54–64 (2018).
28. Maiden, M. C. J. Multilocus sequence typing of bacteria. Ann. Rev. Microbiol. 60, 561–588 (2006).
29. Sudlow, C. et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
30. Nicol, T. L. Detecting racial bias in algorithms and machine learning. J. Inf. Commun. Ethics Soc. 16, 252–260 (2018).
31. Diaz-Papkovich, A., Anderson-Trocmé, L., Ben-Eghan, C. & Gravel, S. UMAP reveals cryptic population structure and phenotype heterogeneity in large genomic cohorts. PLoS Genet. 15, 1–24 (2019).
32. Linderman, G. C., Rachh, M., Hoskins, J. G., Steinerberger, S. & Kluger, Y. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nat. Methods 16, 243–245 (2019).
33. Cho, H., Berger, B. & Peng, J. Generalizable and scalable visualization of single-cell data using neural networks. Cell Syst. 7, 185–191 (2018).
34. Linderman, G. C., Rachh, M., Hoskins, J. G., Steinerberger, S. & Kluger, Y. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nat. Methods 16, 243–245 (2019).
35. Eades, P. A heuristic for graph drawing. Congressus Numerantium 42, 149–160 (1984).
36. Harel, D. & Koren, Y. A fast multi-scale method for drawing large graphs. In International Symposium on Graph Drawing 183–196 (Springer, 2000).
37. Jansen, C. et al. Building gene regulatory networks from scatac-seq and scrna-seq using linked self organizing maps. PLoS Comput. Biol. 15, e1006555 (2019).
38. Dai, H. & Guan, Y. The nubeam reference-free approach to analyze metagenomic sequencing reads. Genome Res. 30, 1364–1375 (2020).
39. Eling, N., Richard, A. C., Richardson, S., Marioni, J. C. & Vallejos, C. A. Correcting the mean-variance dependency for differential variability testing using single-cell RNA sequencing data. Cell Syst. 7, 284–294 (2018).
40. Castex, G. M. Frames of reference: the effects of ethnocentric map projections on professional practice. Social Work 38, 685–693 (1993).
41. Haemer, K. W. Area bias in map presentation. Am. Stat. 3, 19 (1949).
42. Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 273–282 (2019).
43. Schiebinger, G. et al. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell 176, 928–943 (2019).
44. Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous single-cell transcriptomes using scanorama. Nat. Biotechnol. 37, 685–691 (2019).
45. Gelman, A. et al. Bayesian Data Analysis (CRC Press, 2013).
46. Hotelling, H. Relations between two sets of variates. Biometrika 28, 321–377 (1936).
47. Andrew, G., Arora, R., Bilmes, J. & Livescu, K. Deep canonical correlation analysis. In International Conference on Machine Learning, vol. 28, 1247–1255 (2013).
48. Kobak, D., Linderman, G., Steinerberger, S., Kluger, Y. & Berens, P. Heavy-tailed kernels reveal a finer cluster structure in t-SNE visualisations. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases 124–139 (Springer, 2019).
49. Healey, C. G. & Enns, J. T. Building perceptual textures to visualize multidimensional datasets. In Proceedings Visualization ’98 (Cat. No.98CB36276), 111–118 (IEEE, 1998).
50. Mann, H. B. & Whitney, D. R. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Statist. 18, 50–60 (1947).

Acknowledgements

This work is supported in part by NIH U01 CA250554 (to B.B.). H.C. is partially supported by Eric and Wendy Schmidt through the Schmidt Fellows Program at the Broad Institute. The authors thank B. Hie, B. DeMeo, E. Zhong and J. Peters for helpful discussions. Our visualization of genotype data was conducted using the UK Biobank Resource under application number 46341 in keeping with the informed consent given by its participants. BioRender.com was used to generate Fig. 4d.

Author information

Authors and Affiliations

Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA
Ashwin Narayan & Bonnie Berger
Broad Institute of MIT and Harvard, Cambridge, MA, USA
Ashwin Narayan, Bonnie Berger & Hyunghoon Cho
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
Ashwin Narayan, Bonnie Berger & Hyunghoon Cho

Contributions

All authors conceived the method, evaluated results and wrote the manuscript. A.N. and H.C. implemented the software and conducted the experiments. B.B. and H.C. guided the research.

Corresponding authors

Correspondence to Bonnie Berger or Hyunghoon Cho.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Density-preserving methods more accurately visualize diversity of small subpopulations in UKBB data.

We visualize the genotype profiles of 97,676 UKBB participants (a 20% subsample of the dataset) using a, densMAP, b, UMAP, c, den-SNE and d, t-SNE. For each, in the left plot, points corresponding to white people are colored by five computationally identified subpopulations (Methods); in the middle plot, non-white people are colored according to their ethnicity; the right plot shows the correlation of local radius between the original dataset and the embedding, with points colored by ethnicity and R² reported. We show the analogous scatter plots using neighborhood count to measure local density in the visualization in Supplementary Figure 18. As 94% of the people in the UKBB dataset self-identified as white, the UMAP and t-SNE plots give overwhelming visual space to this group, hiding the genetic variability of the other ethnic groups. The density-preserving plots, however, clearly expand the clusters of non-white people as well as certain white subpopulations, more accurately conveying their genetic diversity.

Extended Data Fig. 2 Density-preserving visualization of MNIST handwritten digit image dataset reveals the relative homogeneity of the digit 1.

We visualize the MNIST handwritten digits with a, den-SNE and t-SNE and b, densMAP and UMAP, with points colored by digit. Note that the size of the cluster corresponding to the digit 1 shrinks under both density-preserving algorithms. Plots on the right show the correlation of the local radii between the original dataset and the embedding in each algorithm, with points colored by digit and the R² score reported. The higher R² for the density-preserving methods illustrates that the digit 1 indeed has higher density than the other digits. We show the analogous scatter plots using neighborhood count to measure local density in Supplementary Figure 19.
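
The extended data captions above score each embedding by how well local radii in the embedding track local radii in the original data, reported as R². The following is a minimal sketch of that kind of evaluation, not the authors' implementation. It assumes a simplified local radius (the unweighted mean squared distance to the `k` nearest neighbors) and compares log radii; the function names `local_radius` and `radius_r2`, the default `k=15` and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np


def local_radius(points, k=15):
    """Mean squared distance from each point to its k nearest neighbors.

    Simplifying assumption: the k neighbors are weighted uniformly. Requires
    more than k points.
    """
    points = np.asarray(points, dtype=float)
    # Pairwise squared Euclidean distances (O(n^2) memory, fine for a sketch).
    sq_norms = np.sum(points ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives from round-off
    np.fill_diagonal(sq_dists, np.inf)       # a point is not its own neighbor
    # The first k columns after partitioning hold the k smallest distances.
    nearest = np.partition(sq_dists, k - 1, axis=1)[:, :k]
    return nearest.mean(axis=1)


def radius_r2(original, embedding, k=15):
    """Squared Pearson correlation of log local radii, original vs. embedding.

    Assumes no duplicate points, so every radius is strictly positive.
    """
    log_r_orig = np.log(local_radius(original, k))
    log_r_emb = np.log(local_radius(embedding, k))
    return np.corrcoef(log_r_orig, log_r_emb)[0, 1] ** 2


# Toy data: one tight and one diffuse cluster in two dimensions.
rng = np.random.default_rng(0)
tight = rng.normal(0.0, 0.1, size=(100, 2))
loose = rng.normal(5.0, 1.0, size=(100, 2))
data = np.vstack([tight, loose])

# A uniformly rescaled copy keeps relative densities intact.
r2_scaled = radius_r2(data, 2.0 * data)

# Inflating only the tight cluster mimics an embedding that gives dense
# groups more visual space than their variability warrants.
r2_equalized = radius_r2(data, np.vstack([10.0 * tight, loose]))

print(f"uniform rescaling: R^2 = {r2_scaled:.3f}")
print(f"equalized cluster sizes: R^2 = {r2_equalized:.3f}")
```

In this toy setting the uniformly rescaled copy keeps R² at essentially 1, because every log radius shifts by the same constant, while inflating only the tight cluster lowers it. For large datasets the dense pairwise distance matrix would need to be replaced by a nearest-neighbor search.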

Extended Data Fig. 3 den-SNE and densMAP are nearly as efficient as t-SNE and UMAP in runtime and memory.

We compare a, den-SNE and t-SNE and b, densMAP and UMAP with respect to runtime and peak memory usage on all the datasets analyzed in this study. For these tests, we exclude the time taken to compute the local radii of the final embedding, which is used only for evaluation and does not affect the embedding. Left plots show running time in seconds at different data sizes (achieved by subsampling the datasets; Methods); middle plots show the ratio of the density-preserving algorithm’s runtime to that of the original method; right plots show peak memory usage over different data sizes. Although density-preserving methods take longer, the overhead is small (around 30% additional runtime for den-SNE and 20% additional runtime for densMAP, both on our largest dataset). Both densMAP and UMAP obtain fast runtimes for large datasets, taking less than about 30 minutes for all our datasets. Peak memory usage is the same between t-SNE and den-SNE, and differs by a small constant between UMAP and densMAP.

Supplementary information

Supplementary Information

Supplementary Figs. 1–20, Supplementary Tables 1–11 and Supplementary Notes 1–4

Reporting Summary

About this article

Cite this article

Narayan, A., Berger, B. & Cho, H. Assessing single-cell transcriptomic variability through density-preserving data visualization. Nat Biotechnol 39, 765–774 (2021). https://doi.org/10.1038/s41587-020-00801-7

Received: 12 May 2020
Accepted: 14 December 2020
Published: 18 January 2021
Issue Date: June 2021
DOI: https://doi.org/10.1038/s41587-020-00801-7
