Brief Communication

Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data

Nature Methods volume 16, pages 243–245 (2019)

Abstract

t-distributed stochastic neighbor embedding (t-SNE) is widely used for visualizing single-cell RNA-sequencing (scRNA-seq) data, but it scales poorly to large datasets. We dramatically accelerate t-SNE, obviating the need for data downsampling, and hence allowing visualization of rare cell populations. Furthermore, we implement a heatmap-style visualization for scRNA-seq based on one-dimensional t-SNE for simultaneously visualizing the expression patterns of thousands of genes. Software is available at https://github.com/KlugerLab/FIt-SNE and https://github.com/KlugerLab/t-SNE-Heatmaps.

Code availability

FIt-SNE is available at https://github.com/KlugerLab/FIt-SNE. The code for all experiments is available on request and will be publicly available at https://github.com/KlugerLab/FIt-SNE-paper on publication.

Data availability

The dataset of 1.3 million mouse brain cells and FACS-purified PBMCs of Zheng et al.23 can be downloaded from the 10X Genomics website (https://support.10xgenomics.com/single-cell-gene-expression/datasets/). Two other public scRNA-seq datasets from NCBI Gene Expression Omnibus (GEO) were used: Hrvatin et al. (GSE102827) and Shekhar et al. (GSE81905).

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Svensson, V., Vento-Tormo, R. & Teichmann, S. A. Nat. Protoc. 13, 599–604 (2018).

  2. 10X Genomics. Transcriptional profiling of 1.3 million brain cells with the chromium single cell 3′ solution. SequMed BioTechnology http://www.sequmed.com/Private/Files/20170726/6363668905396462451645665.pdf (2017).

  3. Tasic, B. et al. Nature 563, 72–78 (2018).

  4. van der Maaten, L. J. Mach. Learn. Res. 15, 3221–3245 (2014).

  5. Yianilos, P. N. in Proc. Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA ’93) 311–321 (Society for Industrial and Applied Mathematics, 1993).

  6. Bernhardsson, E. Annoy: approximate nearest neighbors in C++/Python optimized for memory usage and loading/saving to disk. GitHub https://github.com/spotify/annoy (2017).

  7. Linderman, G. C. & Steinerberger, S. arXiv Preprint at http://arXiv.org/1706.02582 (2017).

  8. Belkina, A. C. et al. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/10/24/451690 (2018).

  9. Kobak, D. & Berens, P. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/10/25/453449 (2018).

  10. Cheng, Y., Wong, M. T., van der Maaten, L. & Newell, E. W. J. Immunol. 196, 924–932 (2015).

  11. Galili, T., O’Callaghan, A., Sidi, J. & Sievert, C. Bioinformatics 34, 1600–1602 (2017).

  12. Shekhar, K. et al. Cell 166, 1308–1323 (2016).

  13. van der Maaten, L. & Hinton, G. J. Mach. Learn. Res. 9, 2579–2605 (2008).

  14. Barnes, J. & Hut, P. Nature 324, 446–449 (1986).

  15. Dahlquist, G. & Björck, Å. Numerical Methods in Scientific Computing Vol. 1 (Society for Industrial and Applied Mathematics, Philadelphia, 2008).

  16. Trefethen, L. N. Approximation Theory and Approximation Practice (SIAM, Philadelphia, 2013).

  17. Abramowitz, M. & Stegun, I. A. in Handbook of Mathematical Functions: With Formulas, Graphs and Mathematical Tables (Dover Publications, Mineola, 1965).

  18. Trefethen, L. N. & Weideman, J. A. C. J. Approx. Theory 65, 247–260 (1991).

  19. Halko, N., Martinsson, P.-G., Shkolnisky, Y. & Tygert, M. SIAM J. Sci. Comput. 33, 2580–2594 (2011).

  20. Halko, N., Martinsson, P.-G. & Tropp, J. A. SIAM Rev. 53, 217–288 (2011).

  21. Witten, R. & Candes, E. Algorithmica 72, 264–281 (2015).

  22. Li, H. et al. ACM Trans. Math. Software 43, 28 (2017).

  23. Zheng, G. X. Y. et al. Nat. Commun. 8, 14049 (2017).

  24. Wolf, F. A., Angerer, P. & Theis, F. J. Genome Biol. 19, 15 (2018).

  25. Hrvatin, S. et al. Nat. Neurosci. 21, 120–129 (2018).

  26. Erichson, N. B., Voronin, S., Brunton, S. L. & Kutz, J. N. arXiv Preprint at http://arXiv.org/1608.02148 (2016).

Acknowledgements

The authors thank V. Rokhlin, D. Kobak, M. Tygert and J. Zhao for many useful discussions. The authors also thank J. Spilden and I. Taylor for help with testing FIt-SNE on their CyTOF and scRNA-seq datasets.

G.C.L. was supported in part by NIH grants F30HG010102, 1R01HG008383-01A1 and US NIH MSTP Training Grant T32GM007205. M.R. was supported in part by AFOSR grant no. FA9550-16-10175 and NIH grant no. 1R01HG008383-01A1. S.S. was supported in part by the NSF (DMS-1763179) and the Alfred P. Sloan Foundation. Y.K. was supported in part by NIH grant no. 1R01HG008383-01A1.

Author information

Affiliations

  1. Applied Mathematics Program, Yale University, New Haven, CT, USA

    • George C. Linderman
    • Manas Rachh
    • Jeremy G. Hoskins
    • Yuval Kluger
  2. Department of Mathematics, Yale University, New Haven, CT, USA

    • Stefan Steinerberger
  3. Department of Pathology, Yale University School of Medicine, New Haven, CT, USA

    • Yuval Kluger

Contributions

G.C.L., M.R., J.G.H., S.S. and Y.K. conceived and designed the project. G.C.L. implemented the method. All authors wrote and edited the manuscript.

Competing interests

The authors declare no competing interests.

Corresponding author

Correspondence to Yuval Kluger.

Integrated supplementary information

  1. Supplementary Figure 1 Accuracy of approximation to the repulsive term.

    Accuracy of computing Frep,i using FFT-accelerated Interpolation-based (FI) t-SNE as compared to the Barnes-Hut (BH) t-SNE implementation over 1000 iterations.

  2. Supplementary Figure 2 The populations identified in Fig. 1 are apparent by embedding using exact nearest neighbors (VP trees) and approximate nearest neighbors (ANN).

    1.3 million mouse brain cells are embedded using FIt-SNE with ANN and VP trees; a random subset of 100,000 of the embedded cells is shown.

  3. Supplementary Figure 3 FIt-SNE of 1.3 million mouse brain cells using exact nearest neighbors (VP trees) vs. FIt-SNE of same cells using approximate nearest neighbors (ANNOY).

    1.3 million mouse brain cells are embedded using t-SNE with ANN and VP trees; a random subset of 100,000 of the embedded cells is shown, colored by Louvain clustering in the original high-dimensional space. The 1N error is computed as the proportion of cells for which the nearest neighbor in the embedding is a member of the same cluster.

  4. Supplementary Figure 4 FIt-SNE of purified peripheral blood mononuclear cell (PBMC) populations using exact nearest neighbors (VP trees) vs. FIt-SNE of same cells using approximate nearest neighbors (ANNOY).

    64,664 purified PBMCs of Zheng et al.23 are embedded using t-SNE with ANN and VP trees. The 1N error is computed as the proportion of cells for which the nearest neighbor in the embedding is a member of the same population.

  5. Supplementary Figure 5 FIt-SNE of mouse cortical cells using exact nearest neighbors (VP trees) vs. FIt-SNE of same cells using approximate nearest neighbors (ANNOY).

    48,266 cells from Hrvatin et al.25 are embedded using t-SNE with ANN and VP trees and labeled with the subtypes from that paper. The 1N error is computed as the proportion of cells for which the nearest neighbor in the embedding is a member of the same subtype.

  6. Supplementary Figure 6 The importance of early exaggeration when embedding large datasets.

    1.3 million mouse brain cells are embedded using the default early exaggeration setting of 250 (left) and using a setting of 2,000 (right). Cells are colored by Louvain clustering in the original high-dimensional space (independent of the t-SNE). Many clusters are broken up when the number of early exaggeration iterations is insufficient, as in the six highlighted clusters (bottom).

  7. Supplementary Figure 7 t-SNE heatmap of retinal bipolar cells from Shekhar et al.12.

    Genes presented are the 25 genes most associated with each marker gene and cluster metagene (denoted by blue). The heatmap is interactive, allowing users to zoom into a region of interest (see Supplementary Fig. 8).

  8. Supplementary Figure 8 t-SNE heatmap of retinal bipolar cells from Shekhar et al.12, zoomed into region of interest.

    Zooming into a section of the t-SNE heatmap in Supplementary Fig. 7.

  9. Supplementary Figure 9 Standard heatmap of retinal bipolar cells from Shekhar et al.12.

    Using the same genes (rows) in the same ordering as Fig. 2e and Supplementary Fig. 7, cells were clustered using hierarchical clustering (columns), for comparison to Fig. 2e.

  10. Supplementary Figure 10 An illustration of the algorithm.

    Both the intervals on the left are (z0, z0 + R), and both the intervals on the right are (y0, y0 + R). In the lower intervals, the white squares denote the locations zj and yi, and in the upper intervals the white circles indicate the locations of the equispaced nodes \(\tilde{Z}_{i}\) and \(\tilde{Z}_{j}\). The arrows illustrate how a point zj communicates with a point yi.
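The communication pattern in the caption of Supplementary Figure 10 can be illustrated with a small one-dimensional sketch. This is not the FIt-SNE implementation: the kernel 1/(1 + (y − z)²), the choice of p = 5 nodes that include the interval endpoints, and all function names are assumptions made for illustration, and the node-to-node step is done by a direct double loop rather than with an FFT. Source points z_j pass their weights to equispaced nodes by Lagrange interpolation, the two sets of nodes interact, and the result is interpolated back to the target points y_i.

```python
import random


def lagrange_weights(x, nodes):
    """Value at x of each Lagrange basis polynomial defined on `nodes`."""
    weights = []
    for l, xl in enumerate(nodes):
        w = 1.0
        for m, xm in enumerate(nodes):
            if m != l:
                w *= (x - xm) / (xl - xm)
        weights.append(w)
    return weights


def cauchy_kernel(y, z):
    """Assumed interaction kernel (illustrative choice)."""
    return 1.0 / (1.0 + (y - z) ** 2)


def direct_sum(targets, sources, charges):
    """Reference computation: every target interacts with every source."""
    return [
        sum(cauchy_kernel(y, z) * q for z, q in zip(sources, charges))
        for y in targets
    ]


def interpolated_sum(targets, sources, charges, y0, z0, R, p=5):
    """Approximate direct_sum via p equispaced nodes on each interval."""
    z_nodes = [z0 + R * k / (p - 1) for k in range(p)]
    y_nodes = [y0 + R * k / (p - 1) for k in range(p)]

    # Step 1: spread each source weight onto the source-interval nodes.
    node_charges = [0.0] * p
    for z, q in zip(sources, charges):
        for m, w in enumerate(lagrange_weights(z, z_nodes)):
            node_charges[m] += w * q

    # Step 2: node-to-node interaction (p x p, independent of point count).
    node_values = [
        sum(cauchy_kernel(yn, zn) * c for zn, c in zip(z_nodes, node_charges))
        for yn in y_nodes
    ]

    # Step 3: interpolate from the target-interval nodes to each target.
    return [
        sum(w * v for w, v in zip(lagrange_weights(y, y_nodes), node_values))
        for y in targets
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    sources = [rng.uniform(0.0, 1.0) for _ in range(200)]  # z_j in (z0, z0 + R)
    targets = [rng.uniform(3.0, 4.0) for _ in range(200)]  # y_i in (y0, y0 + R)
    charges = [1.0] * len(sources)
    exact = direct_sum(targets, sources, charges)
    approx = interpolated_sum(targets, sources, charges, y0=3.0, z0=0.0, R=1.0)
    worst = max(abs(a - e) / abs(e) for a, e in zip(approx, exact))
    print("largest relative error:", worst)
```

In this sketch the cost of steps 1 and 3 grows linearly with the number of points, while step 2 depends only on the number of nodes, which is what makes the interpolation route attractive for large point sets.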
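The nearest-neighbor measure defined in the captions of Supplementary Figures 3–5 can be written down in a few lines. The following is a minimal sketch, not the authors' code: the function names are hypothetical, Euclidean distance in the embedding is assumed, and the brute-force search is only practical for small inputs. It computes the quantity exactly as the captions word it, namely the proportion of cells whose nearest neighbor in the embedding carries the same label.

```python
def nearest_neighbor_index(points, i):
    """Index of the point closest to points[i] (squared Euclidean distance)."""
    best_j, best_d = None, float("inf")
    for j, other in enumerate(points):
        if j == i:
            continue
        d = sum((a - b) ** 2 for a, b in zip(points[i], other))
        if d < best_d:
            best_j, best_d = j, d
    return best_j


def same_label_nn_proportion(embedding, labels):
    """Proportion of points whose nearest embedded neighbor shares their label."""
    if len(embedding) != len(labels):
        raise ValueError("embedding and labels must have the same length")
    if len(embedding) < 2:
        raise ValueError("need at least two points")
    hits = sum(
        labels[nearest_neighbor_index(embedding, i)] == labels[i]
        for i in range(len(embedding))
    )
    return hits / len(embedding)


if __name__ == "__main__":
    # Toy 2D embedding: the last point sits next to the "a" group but is labeled "b".
    embedding = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.2, 0.1)]
    labels = ["a", "a", "b", "b", "b"]
    print(same_label_nn_proportion(embedding, labels))  # 0.8 for this toy input
```

For datasets of the size discussed here, the brute-force neighbor search would need to be replaced by a spatial index or an approximate nearest-neighbor library.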

Supplementary information

  1. Supplementary Text and Figures

    Supplementary Figures 1–10 and Supplementary Tables 1–3

  2. Reporting Summary

About this article

DOI

https://doi.org/10.1038/s41592-018-0308-4