Exploring single-cell data with deep multitasking neural networks

Amodio, Matthew; van Dijk, David; Srinivasan, Krishnan; Chen, William S.; Mohsen, Hussein; Moon, Kevin R.; Campbell, Allison; Zhao, Yujiao; Wang, Xiaomei; Venkataswamy, Manjunatha; Desai, Anita; Ravi, V.; Kumar, Priti; Montgomery, Ruth; Wolf, Guy; Krishnaswamy, Smita

doi:10.1038/s41592-019-0576-7

Article
Published: 07 October 2019

Matthew Amodio¹
David van Dijk¹,²
Krishnan Srinivasan¹
William S. Chen³
Hussein Mohsen⁴ (ORCID: 0000-0002-6263-8865)
Kevin R. Moon⁵
Allison Campbell³
Yujiao Zhao⁶
Xiaomei Wang⁶
Manjunatha Venkataswamy⁷
Anita Desai⁷
V. Ravi⁷
Priti Kumar⁸
Ruth Montgomery⁶ (ORCID: 0000-0002-8661-4454)
Guy Wolf⁹,¹⁰ (ORCID: 0000-0002-6740-059X)
Smita Krishnaswamy¹,² (ORCID: 0000-0001-5823-1985)

Nature Methods volume 16, pages 1139–1145 (2019)

Abstract

It is currently challenging to analyze single-cell data consisting of many cells and samples, and to address variations arising from batch effects and different sample preparations. For this purpose, we present SAUCIE, a deep neural network that combines parallelization and scalability offered by neural networks, with the deep representation of data that can be learned by them to perform many single-cell data analysis tasks. Our regularizations (penalties) render features learned in hidden layers of the neural network interpretable. On large, multi-patient datasets, SAUCIE’s various hidden layers contain denoised and batch-corrected data, a low-dimensional visualization and unsupervised clustering, as well as other information that can be used to explore the data. We analyze a 180-sample dataset consisting of 11 million T cells from dengue patients in India, measured with mass cytometry. SAUCIE can batch correct and identify cluster-based signatures of acute dengue infection and create a patient manifold, stratifying immune response to dengue.

**Fig. 1: The pipeline for analyzing single-cell data in large cohorts with SAUCIE.**

**Fig. 2: Regularizations and architecture choices in SAUCIE.**

**Fig. 3: A comparison of the different analysis tasks performed by SAUCIE against other methods.**

**Fig. 4: Demonstration of SAUCIE’s batch correction abilities.**

**Fig. 5: SAUCIE produces patient manifolds from single-cell cluster signatures.**

**Fig. 6: SAUCIE identifies and characterizes cellular clusters, whose proportions can be used to compare patients.**

Data availability

The dengue dataset is available at Cytobank under accession number 82023.

Code availability

SAUCIE is written in Python using the TensorFlow library for deep learning. The source code is available at https://github.com/KrishnaswamyLab/SAUCIE/.

References

Tan, J. et al. Unsupervised extraction of stable expression signatures from public compendia with an ensemble of neural networks. Cell Syst. 5, 63–71 (2017).
Way, G. P. & Greene, C. S. Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. In Pacific Symposium on Biocomputing 2018 Vol. 23 (PSB, 2018).
Wang, W., Huang, Y., Wang, Y. & Wang, L. Generalized autoencoder: a neural network framework for dimensionality reduction. In CVPR Workshops (eds Betke, M. & Davis, J.) 496–503 (IEEE, 2014).
Tan, J. et al. Unsupervised feature construction and knowledge extraction from genome-wide assays of breast cancer with denoising autoencoders. In Pacific Symposium on Biocomputing 2015. Vol. 20 (PSB, 2015).
Tan, J., Hammond, J. H., Hogan, D. A. & Greene, C. S. Adage-based integration of publicly available pseudomonas aeruginosa gene expression data with denoising autoencoders illuminates microbe-host interactions. MSystems 1, e00025-15 (2016).
Chen, H., Shen, J., Wang, L. & Song, J. Leveraging stacked denoising autoencoder in prediction of pathogen-host protein-protein interactions. In Proc. 2017 IEEE International Congress on Big Data (BigData Congress) 368–375 (IEEE, 2017).
Chen, L., Cai, C., Chen, V. & Lu, X. Learning a hierarchical representation of the yeast transcriptomic machinery using an autoencoder model. BMC Bioinforma. 17, S9 (2016).
Coifman, R. R. & Lafon, S. Diffusion maps. Appl. Comput. Harmon. Anal. 21, 5–30 (2006).
Courbariaux, M., Bengio, Y. & David, J.-P. Binaryconnect: training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems 28 (NIPS 2015) (eds Cortez, C. et al.) 3123–3131 (JMLR, 2015).
Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R. & Bengio, Y. Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or −1. Preprint at https://arxiv.org/abs/1602.02830 (2016).
Tang, W., Hua, G. & Wang, L. How to train a compact binary neural network with high accuracy? In Thirty-First AAAI Conference on Artificial Intelligence (eds Singh, S. & Markovitch, S.) 2625–2631 (ACM, 2017).
Shaham, U. et al. Removal of batch effects using distribution-matching residual networks. Bioinformatics 33, 2539–2546 (2017).
Setty, M. et al. Wishbone identifies bifurcating developmental trajectories from single-cell data. Nat. Biotechnol. 34, 637 (2016).
Chevrier, S. et al. An immune atlas of clear cell renal cell carcinoma. Cell 169, 736–749 (2017).
Azizi, E. et al. Single-cell map of diverse immune phenotypes in the breast tumor microenvironment. Cell 174, 1293–1308 (2018).
Zunder, E. R., Lujan, E., Goltsev, Y., Wernig, M. & Nolan, G. P. A continuous molecular roadmap to ipsc reprogramming through progression analysis of single-cell mass cytometry. Cell Stem Cell 16, 323–337 (2015).
Shekhar, K. et al. Comprehensive classification of retinal bipolar neurons by single-cell transcriptomics. Cell 166, 1308–1323 (2016).
Paul, F. et al. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell 163, 1663–1677 (2015).
Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142 (2015).
Single Cell Gene Expression Datasets (10x Genomics, 2017); https://support.10xgenomics.com/single-cell-gene-expression/datasets
Levine, J. H. et al. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell 162, 184–197 (2015).
Lopez, R., Regier, J., Cole, M., Jordan, M. & Yosef, N. A deep generative model for single-cell RNA sequencing with application to detecting differentially expressed genes. Preprint at https://arxiv.org/abs/1710.05086 (2017).
Haghverdi, L., Lun, A. T., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421 (2018).
Butler, A. et al. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
Qiu, X. et al. Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods 14, 979–982 (2017).
McInnes, L., Healy, J. & Melville, J. Umap: uniform manifold approximation and projection for dimension reduction. Preprint at https://arxiv.org/abs/1802.03426 (2018).
van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Moon, K. R. et al. PHATE: a dimensionality reduction method for visualizing trajectory structures in high-dimensional biological data. Preprint at https://doi.org/10.1101/120378 (2017).
Van Dijk, D. et al. Recovering gene interactions from single-cell data using data diffusion. Cell 174, 716–729 (2018).
Li, W. V. & Li, J. J. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nat. Commun. 9, 997 (2018).
Regev, A. et al. Science forum: the human cell atlas. eLife 6, e27041 (2017).
Panda, A. et al. Age-associated decrease in tlr function in primary human dendritic cells predicts influenza vaccine response. J. Immunol. 184, 2518–2527 (2010).
Tsai, C.-Y. et al. Type I IFNs and IL-18 regulate the antiviral response of primary human γδ-T cells against dendritic cells infected with dengue virus. J. Immunol. 194, 3890–3900 (2015).
Garcillán, B. et al. GD-T lymphocytes in the diagnosis of human T cell receptor immunodeficiencies. Front. Immunol. 6, 20 (2015).
Chien, Y.-H., Meyer, C. & Bonneville, M. γδ-T cells: first line of defense and beyond. Annu. Rev. Immunol. 32, 121–155 (2014).
Cimini, E. et al. Human Zika infection induces a reduction of IFN-γ producing CD4 T-cells and a parallel expansion of effector Vδ2 T-cells. Sci. Rep. 7, 6313 (2017).
Moon, K. R. et al. Manifold learning-based methods for analyzing single-cell RNA-sequencing data. Curr. Opin. Syst. Biol. 7, 36–46 (2017).
Montufar, G. F., Pascanu, R., Cho, K. & Bengio, Y. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems Conference 2014 2924–2932 (JMLR, 2014).
Anand, K., Bianconi, G. & Severini, S. Shannon and von Neumann entropy of random networks with heterogeneous expected degree. Phys. Rev. E 83, 036109 (2011).
Rellahan, B. L., Bluestone, J. A., Houlden, B. A., Cotterman, M. M. & Matis, L. A. Junctional sequences influence the specificity of gamma/delta T cell receptors. J. Exp. Med. 173, 503–506 (1991).
Büttner, M. et al. A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16, 43–49 (2019).
Luo, B. & Hancock, E. R. Iterative procrustes alignment with the EM algorithm. Image Vis. Comput. 20, 377–396 (2002).
Lui, K., Ding, G. W., Huang, R. & McCann, R. Dimensionality reduction has quantifiable imperfections: two geometric bounds. In Advances in Neural Information Processing Systems 31 (NIPS 2018) (eds Bengio, S. et al.) 8461–8471 (JMLR, 2018).
Tsherniak, A. et al. Defining a cancer dependency map. Cell 170, 564–576 (2017).
Yao, Y. et al. The natural killer cell response to West Nile virus in young and old individuals with or without a prior history of infection. PLoS ONE 12, e0172625 (2017).
Yao, Y. et al. CyTOF supports efficient detection of immune cell subsets from small samples. J. Immunol. Methods 415, 1–5 (2014).

Acknowledgements

This research was supported in part by the Indo-U.S. Vaccine Action Program, the National Institute of Allergy and Infectious Diseases of the NIH (award no. AI089992 to R.R.M.); IVADO (L’institut de valorisation des données; to G.W.); and the Chan–Zuckerberg Initiative (grant no. 182702 to S.K.).

Author information

These authors contributed equally: Matthew Amodio, David van Dijk, Krishnan Srinivasan, Guy Wolf, Smita Krishnaswamy.

Authors and Affiliations

Department of Computer Science, Yale University, New Haven, CT, USA
Matthew Amodio, David van Dijk, Krishnan Srinivasan & Smita Krishnaswamy
Department of Genetics, Yale University, New Haven, CT, USA
David van Dijk & Smita Krishnaswamy
School of Medicine, Yale University, New Haven, CT, USA
William S. Chen & Allison Campbell
Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
Hussein Mohsen
Department of Mathematics and Statistics, Utah State University, Logan, UT, USA
Kevin R. Moon
Department of Rheumatology, Yale University, New Haven, CT, USA
Yujiao Zhao, Xiaomei Wang & Ruth Montgomery
Department of Neurovirology, NIMHANS, Bangalore, India
Manjunatha Venkataswamy, Anita Desai & V. Ravi
Department of Microbial Pathogenesis, Yale University, New Haven, CT, USA
Priti Kumar
Department of Mathematics and Statistics, Université de Montréal, Montréal, Quebec, Canada
Guy Wolf
Mila – Quebec Artificial Intelligence Institute, Montréal, Quebec, Canada
Guy Wolf

Contributions

M.A., S.K., G.W. and D.v.D. envisioned the project. M.A., K.S. and D.v.D. implemented the model and performed the analyses. M.A., S.K., G.W. and D.v.D. wrote the paper. K.S., W.S.C., H.M., A.C. and K.R.M. provided assistance in writing and analysis. Y.Z., X.W., M.V., A.D., V.R., P.K. and R.M. were responsible for data acquisition and processing.

Corresponding author

Correspondence to Smita Krishnaswamy.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Nicole Rusk was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Comparison of runtimes on an increasing number of points.

The number of points is represented on the horizontal axis and the time in seconds the method took to complete is on the vertical axis. If a method ran out of resources and could not complete a run for a certain number of points, that is demarcated with an ‘x’ and no further time points were attempted for that method.
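
The benchmarking protocol in this caption can be sketched as a small timing harness. This is only an illustration and not the authors' benchmark code: `benchmark`, `make_data`, and `toy_method` are hypothetical names, and treating `MemoryError` as the signal that a method "ran out of resources" is an assumption.

```python
import time


def benchmark(method, make_data, sizes):
    """Time `method` on datasets of increasing size.

    Returns a list of (n_points, seconds) pairs. If a run raises MemoryError,
    that size is recorded with None (the 'x' in the figure) and no larger
    sizes are attempted.
    """
    results = []
    for n in sorted(sizes):
        data = make_data(n)
        start = time.perf_counter()
        try:
            method(data)
        except MemoryError:
            results.append((n, None))  # mark the failure and stop
            break
        results.append((n, time.perf_counter() - start))
    return results


# Hypothetical stand-in method that "runs out of resources" above 1,000 points.
def toy_method(points):
    if len(points) > 1000:
        raise MemoryError("simulated resource exhaustion")
    return sorted(points)


timings = benchmark(
    toy_method,
    lambda n: list(range(n, 0, -1)),
    [100, 1000, 10000, 100000],
)
for n, seconds in timings:
    print(n, "x" if seconds is None else f"{seconds:.6f} s")
```

Stopping at the first failure mirrors the rule in the caption that no further time points are attempted once a method cannot complete a run.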

Supplementary Figure 2 A comparison of the SAUCIE clustering to other clustering methods on artificial and real data.

Rows show the different datasets. Along with the first artificial dataset, there are two CyTOF datasets and three scRNA-seq datasets, with samples of size 2000, 27499, 50000, 3005, 2730, and 2000 for the GMM, Shekhar et al., Chevrier et al., Zeisel et al., Paul et al., and Setty et al. datasets, respectively. Columns show the different clustering methods. From left to right: true “ground truth” labels, SAUCIE, k-means, Phenograph, scVI. In (b) and (c), we add the scores for the modularity and silhouette heuristics from Supplementary Table 1, respectively.
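
For readers unfamiliar with the silhouette heuristic mentioned above, the following is a minimal, dependency-free sketch of the standard silhouette coefficient. It assumes Euclidean distance and the textbook definition; the exact variant and implementation used for Supplementary Table 1 are not stated here, and `silhouette_score` is a name chosen for illustration.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient using Euclidean distance.

    For each point, a = mean distance to the other members of its own cluster,
    b = smallest mean distance to the members of any other cluster, and the
    coefficient is (b - a) / max(a, b). Points in singleton clusters score 0.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # contributes 0 by convention
        # The point's distance to itself is 0, so divide by len(own) - 1.
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels follow the geometry
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across it
print(good, bad)
```

Values near 1 indicate compact, well-separated clusters, while negative values indicate points that sit closer to another cluster than to their own.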

Supplementary Figure 3 A comparison of batch correction with SAUCIE to other methods.

Comparisons include an artificial dataset, two technical replicates from the dengue CyTOF data, non-technical replicates on scRNA-seq batches from mouse cortex, and then public data from Chevrier et al., Azizi et al., and Setty et al., with samples of size 2000, 41721, 8530, 9998, 4376, 2000, and 24741, respectively. Rows show the different datasets. Columns show the different batch correction methods. From left to right: the original data prior to batch correction, SAUCIE, mutual nearest neighbors (MNN), canonical correlation analysis (CCA). In (b) and (c), we add graphs of the mixing score and shape-preserving score results from Supplementary Table 2 for quantitative evaluation, respectively.

Supplementary Figure 4 A comparison of the SAUCIE visualization to other methods on a number of artificial and real datasets.

The columns show the different methods. From left to right: SAUCIE, PCA, Monocle2, Diffusion Maps, UMAP, tSNE, PHATE. The rows show the different datasets. From top to bottom: artificially generated trees with varying amounts of noise, random tree generated with diffusion limited aggregation (DLA), intersecting half circles, Gaussian mixture model, scRNA-seq hematopoiesis from Paul et al., CyTOF T cell development from Setty et al., CyTOF iPSC from Zunder et al., scRNA-seq retinal bipolar cells from Shekhar et al., scRNA-seq mouse cortex from Zeisel et al., with samples of size 1440, 1440, 1440, 2000, 1500, 2000, 55000, 2730, 250170, 220450, 27499, and 3005, respectively. In (b), we add a graph of the precision-recall metric results from Supplementary Table 3 for quantitative evaluation.

Supplementary Figure 5 A comparison of imputation methods including SAUCIE.

Several gene-gene associations are shown from the 10x mouse cortex dataset subset (4142 cells). From left to right: The original (sparse) data, data after imputation with SAUCIE, MAGIC, scImpute, and nearest neighbor completion.

Supplementary Figure 6 A comparison of imputation with SAUCIE to other methods on the simulated dropout experiment.

Increasing amounts of dropout are along the horizontal axis from left to right, and the accuracy of each method as measured by R² is along the vertical axis. The time each method took to complete is in the legend in seconds.
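
The evaluation loop behind this figure can be sketched as follows. This is a hedged illustration rather than the authors' protocol: zeroing entries independently at a fixed rate is an assumed dropout model, R² is taken to be the coefficient of determination over all matrix entries, the toy matrix is synthetic, and `simulate_dropout` and `r_squared` are hypothetical helper names. No imputation is performed; the loop scores the unimputed data as a baseline, and the output of any imputation method would be passed as `estimate` instead.

```python
import random


def simulate_dropout(matrix, rate, rng):
    """Return a copy of `matrix` with each entry set to 0 with probability `rate`."""
    return [[0.0 if rng.random() < rate else value for value in row] for row in matrix]


def r_squared(truth, estimate):
    """Coefficient of determination over all matrix entries: 1 - SS_res / SS_tot."""
    t = [v for row in truth for v in row]
    e = [v for row in estimate for v in row]
    mean_t = sum(t) / len(t)
    ss_res = sum((a - b) ** 2 for a, b in zip(t, e))
    ss_tot = sum((a - mean_t) ** 2 for a in t)
    return 1.0 - ss_res / ss_tot


rng = random.Random(0)
# Synthetic "expression" matrix (an assumption): 200 cells x 5 genes.
truth = [[rng.uniform(1.0, 10.0) for _ in range(5)] for _ in range(200)]

for rate in (0.0, 0.2, 0.5, 0.8):
    observed = simulate_dropout(truth, rate, rng)
    score = r_squared(truth, observed)
    print(f"dropout={rate:.1f}  R^2 of unimputed data={score:.3f}")
```

An imputation method is doing useful work at a given dropout rate when its R² against the original data exceeds this unimputed baseline.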

Supplementary Figure 7 SAUCIE batch correction removes technical artifacts from CyTOF instrument.

Four select marker abundances with samples grouped by day they were run on the cytometry instrument, with each day having fourteen distinct samples in the group. For each marker, the fourteen samples before batch correction are shown to the left of the same fourteen samples after batch correction.

Supplementary Figure 8 SAUCIE batch correction preserves relative values in samples from dengue data.

Histograms of marker expression (top: IL-6, bottom: CD86) of samples run together on the cytometry instrument on day two, separated by sample. The twelve samples were of size 50988, 41212, 29337, 177804, 177492, 154054, 95476, 82782, 82194, 182486, 137240, and 113506, respectively. The values for each sample and marker are shown before SAUCIE batch correction (left) and after SAUCIE batch correction (right). The box plots depict the minimum and maximum (whiskers), median (center line), and 25th and 75th percentiles (box limits) of marker values for all cells within the specified sample.
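
The five statistics that each box plot encodes can be computed with the Python standard library, as in the sketch below. The marker values are made up for illustration, `box_plot_summary` is a hypothetical helper, and the percentile interpolation rule (`method="inclusive"`) is an assumption, since the caption does not say which convention was used.

```python
import statistics


def box_plot_summary(values):
    """Return the statistics drawn in the box plots: min, Q1, median, Q3, max."""
    # Assumption: 'inclusive' treats the data as the full population of cells.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q25": q1,
        "median": median,
        "q75": q3,
        "max": max(values),
    }


# Hypothetical marker values for the cells of one sample.
marker_values = [0.2, 0.5, 1.1, 1.4, 2.0, 2.3, 3.8, 4.1, 7.5]
summary = box_plot_summary(marker_values)
print(summary)
```

Comparing these summaries for the same sample before and after batch correction is one way to check that relative values are preserved.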

Supplementary Figure 9 Controlling the granularity of clustering with SAUCIE hyperparameters.

The granularity of the clustering is measured by the total number of clusters found. Each line represents a fixed value of λ_d as λ_c increases from left to right.

Supplementary Figure 10 An illustration of the metaclustering process on the dengue dataset.

Metaclustering results from N=10 samples from the dengue dataset. Top left: cluster centroids embedded by tSNE and colored by metacluster, sized according to the number of cells in each cluster. Top right: cluster centroids colored by sample, also sized according to the number of cells in each cluster. Bottom left: a cell-level heatmap of expression grouped by metacluster. Bottom right: the composition of each metacluster by sample.

Supplementary Figure 11 An illustration of the SAUCIE pipeline on the dengue dataset.

Left: cell-level heatmap of expression grouped by cluster. Top right: cluster centroids embedded by tSNE, sized according to the number of cells in each cluster. Bottom right: the composition of each cluster by sample.
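
The "composition of each cluster by sample" panel amounts to a normalized contingency table between cluster assignments and sample labels. The sketch below shows one way to compute it from per-cell annotations; the input lists are hypothetical and `cluster_composition` is an illustrative helper, not part of the SAUCIE codebase.

```python
from collections import Counter, defaultdict


def cluster_composition(sample_ids, cluster_ids):
    """For each cluster, return the fraction of its cells that come from each sample."""
    counts = defaultdict(Counter)
    for sample, cluster in zip(sample_ids, cluster_ids):
        counts[cluster][sample] += 1
    return {
        cluster: {
            sample: n / sum(by_sample.values())
            for sample, n in by_sample.items()
        }
        for cluster, by_sample in counts.items()
    }


# Hypothetical per-cell annotations: the sample each cell came from and the
# cluster it was assigned to.
sample_ids = ["A", "A", "A", "B", "B", "B", "B", "C"]
cluster_ids = [0, 0, 1, 0, 1, 1, 1, 1]

composition = cluster_composition(sample_ids, cluster_ids)
for cluster, fractions in sorted(composition.items()):
    print(cluster, fractions)
```

Swapping the roles of the two arguments gives the per-sample cluster proportions instead, which is the kind of signature used to compare patients.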

About this article

Cite this article

Amodio, M., van Dijk, D., Srinivasan, K. et al. Exploring single-cell data with deep multitasking neural networks. Nat Methods 16, 1139–1145 (2019). https://doi.org/10.1038/s41592-019-0576-7

Received: 24 August 2018
Accepted: 19 August 2019
Published: 07 October 2019
Issue Date: November 2019
DOI: https://doi.org/10.1038/s41592-019-0576-7
