Analyzing single-cell data that span many cells and samples remains challenging, as does accounting for variation arising from batch effects and differences in sample preparation. To address this, we present SAUCIE, a deep neural network that combines the parallelization and scalability of neural networks with the deep data representations they learn, and uses them to perform many single-cell data analysis tasks. Our regularizations (penalties) render the features learned in the hidden layers of the network interpretable. On large, multi-patient datasets, SAUCIE’s hidden layers contain denoised and batch-corrected data, a low-dimensional visualization and an unsupervised clustering, as well as other information that can be used to explore the data. We analyze a 180-sample dataset consisting of 11 million T cells from dengue patients in India, measured with mass cytometry. SAUCIE can batch-correct the data, identify cluster-based signatures of acute dengue infection and create a patient manifold that stratifies immune response to dengue.
The dengue dataset is available at Cytobank under accession number 82023.
SAUCIE is written in Python using the TensorFlow library for deep learning. The source code is available at https://github.com/KrishnaswamyLab/SAUCIE/.
Tan, J. et al. Unsupervised extraction of stable expression signatures from public compendia with an ensemble of neural networks. Cell Syst. 5, 63–71 (2017).
Way, G. P. & Greene, C. S. Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. In Pacific Symposium on Biocomputing 2018 Vol. 23 (PSB, 2018).
Wang, W., Huang, Y., Wang, Y. & Wang, L. Generalized autoencoder: a neural network framework for dimensionality reduction. In CVPR Workshops (eds Betke, M. & Davis, J.) 496–503 (IEEE, 2014).
Tan, J. et al. Unsupervised feature construction and knowledge extraction from genome-wide assays of breast cancer with denoising autoencoders. In Pacific Symposium on Biocomputing 2015. Vol. 20 (PSB, 2015).
Tan, J., Hammond, J. H., Hogan, D. A. & Greene, C. S. ADAGE-based integration of publicly available Pseudomonas aeruginosa gene expression data with denoising autoencoders illuminates microbe-host interactions. mSystems 1, e00025-15 (2016).
Chen, H., Shen, J., Wang, L. & Song, J. Leveraging stacked denoising autoencoder in prediction of pathogen-host protein-protein interactions. In Proc. 2017 IEEE International Congress on Big Data (BigData Congress) 368–375 (IEEE, 2017).
Chen, L., Cai, C., Chen, V. & Lu, X. Learning a hierarchical representation of the yeast transcriptomic machinery using an autoencoder model. BMC Bioinforma. 17, S9 (2016).
Coifman, R. R. & Lafon, S. Diffusion maps. Appl. Comput. Harmon. Anal. 21, 5–30 (2006).
Courbariaux, M., Bengio, Y. & David, J.-P. BinaryConnect: training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems 28 (NIPS 2015) (eds Cortes, C. et al.) 3123–3131 (JMLR, 2015).
Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R. & Bengio, Y. Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or −1. Preprint at https://arxiv.org/abs/1602.02830 (2016).
Tang, W., Hua, G. & Wang, L. How to train a compact binary neural network with high accuracy? In Thirty-First AAAI Conference on Artificial Intelligence (eds Singh, S. & Markovitch, S.) 2625–2631 (ACM, 2017).
Shaham, U. et al. Removal of batch effects using distribution-matching residual networks. Bioinformatics 33, 2539–2546 (2017).
Setty, M. et al. Wishbone identifies bifurcating developmental trajectories from single-cell data. Nat. Biotechnol. 34, 637 (2016).
Chevrier, S. et al. An immune atlas of clear cell renal cell carcinoma. Cell 169, 736–749 (2017).
Azizi, E. et al. Single-cell map of diverse immune phenotypes in the breast tumor microenvironment. Cell 174, 1293–1308 (2018).
Zunder, E. R., Lujan, E., Goltsev, Y., Wernig, M. & Nolan, G. P. A continuous molecular roadmap to iPSC reprogramming through progression analysis of single-cell mass cytometry. Cell Stem Cell 16, 323–337 (2015).
Shekhar, K. et al. Comprehensive classification of retinal bipolar neurons by single-cell transcriptomics. Cell 166, 1308–1323 (2016).
Paul, F. et al. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell 163, 1663–1677 (2015).
Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142 (2015).
Single Cell Gene Expression Datasets (10x Genomics, 2017); https://support.10xgenomics.com/single-cell-gene-expression/datasets
Levine, J. H. et al. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell 162, 184–197 (2015).
Lopez, R., Regier, J., Cole, M., Jordan, M. & Yosef, N. A deep generative model for single-cell RNA sequencing with application to detecting differentially expressed genes. Preprint at https://arxiv.org/abs/1710.05086 (2017).
Haghverdi, L., Lun, A. T., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421 (2018).
Butler, A. et al. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
Qiu, X. et al. Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods 14, 979–982 (2017).
McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at https://arxiv.org/abs/1802.03426 (2018).
van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Moon, K. R. et al. PHATE: a dimensionality reduction method for visualizing trajectory structures in high-dimensional biological data. Preprint at https://doi.org/10.1101/120378 (2017).
Van Dijk, D. et al. Recovering gene interactions from single-cell data using data diffusion. Cell 174, 716–729 (2018).
Li, W. V. & Li, J. J. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nat. Commun. 9, 997 (2018).
Regev, A. et al. Science forum: the human cell atlas. eLife 6, e27041 (2017).
Panda, A. et al. Age-associated decrease in TLR function in primary human dendritic cells predicts influenza vaccine response. J. Immunol. 184, 2518–2527 (2010).
Tsai, C.-Y. et al. Type I IFNs and IL-18 regulate the antiviral response of primary human γδ-T cells against dendritic cells infected with dengue virus. J. Immunol. 194, 3890–3900 (2015).
Garcillán, B. et al. GD-T lymphocytes in the diagnosis of human T cell receptor immunodeficiencies. Front. Immunol. 6, 20 (2015).
Chien, Y.-H., Meyer, C. & Bonneville, M. γδ-T cells: first line of defense and beyond. Annu. Rev. Immunol. 32, 121–155 (2014).
Cimini, E. et al. Human Zika infection induces a reduction of IFN-γ producing CD4 T-cells and a parallel expansion of effector Vδ2 T-cells. Sci. Rep. 7, 6313 (2017).
Moon, K. R. et al. Manifold learning-based methods for analyzing single-cell RNA-sequencing data. Curr. Opin. Syst. Biol. 7, 36–46 (2017).
Montufar, G. F., Pascanu, R., Cho, K. & Bengio, Y. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems Conference 2014 2924–2932 (JMLR, 2014).
Anand, K., Bianconi, G. & Severini, S. Shannon and von Neumann entropy of random networks with heterogeneous expected degree. Phys. Rev. E 83, 036109 (2011).
Rellahan, B. L., Bluestone, J. A., Houlden, B. A., Cotterman, M. M. & Matis, L. A. Junctional sequences influence the specificity of gamma/delta T cell receptors. J. Exp. Med. 173, 503–506 (1991).
Büttner, M. et al. A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16, 43–49 (2019).
Luo, B. & Hancock, E. R. Iterative procrustes alignment with the EM algorithm. Image Vis. Comput. 20, 377–396 (2002).
Lui, K., Ding, G. W., Huang, R. & McCann, R. Dimensionality reduction has quantifiable imperfections: two geometric bounds. In Advances in Neural Information Processing Systems 31 (NIPS 2018) (eds Bengio, S. et al.) 8461–8471 (JMLR, 2018).
Tsherniak, A. et al. Defining a cancer dependency map. Cell 170, 564–576 (2017).
Yao, Y. et al. The natural killer cell response to West Nile virus in young and old individuals with or without a prior history of infection. PLoS ONE 12, e0172625 (2017).
Yao, Y. et al. CyTOF supports efficient detection of immune cell subsets from small samples. J. Immunol. Methods 415, 1–5 (2014).
This research was supported in part by: the Indo-U.S. Vaccine Action Program, the National Institute of Allergy and Infectious Diseases of the NIH (Award no. AI089992 to R.R.M.); IVADO (L’institut de valorisation des données to G.W.) and the Chan–Zuckerberg Initiative (grant no. 182702 to S.K.).
The authors declare no competing interests.
Peer review information: Nicole Rusk was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Integrated supplementary information
The number of points is represented on the horizontal axis and the time in seconds the method took to complete is on the vertical axis. If a method ran out of resources and could not complete a run for a certain number of points, that is demarcated with an ‘x’ and no further time points were attempted for that method.
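The timing protocol in this caption can be sketched as a small harness. This is an illustrative sketch only, not the authors' benchmarking code: `benchmark`, `toy_method` and the choice of `MemoryError` as the "ran out of resources" signal are assumptions made for the example.

```python
import time


def benchmark(method, sizes, make_data):
    """Time `method` on inputs of increasing size.

    Returns (size, seconds) pairs. A run that exhausts resources is
    recorded as (size, "x") and no larger sizes are attempted.
    """
    results = []
    for n in sizes:
        data = make_data(n)
        start = time.perf_counter()
        try:
            method(data)
        except MemoryError:  # assumed failure signal for this sketch
            results.append((n, "x"))
            break
        results.append((n, time.perf_counter() - start))
    return results


def toy_method(data):
    # Hypothetical stand-in that pretends to run out of memory on large inputs.
    if len(data) > 1000:
        raise MemoryError
    return sorted(data)


timings = benchmark(
    toy_method,
    [100, 1000, 10000, 100000],
    lambda n: list(range(n, 0, -1)),
)
```

Here `timings` holds two timed runs followed by a single `"x"` entry at 10,000 points, and the 100,000-point run is never attempted.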
Supplementary Figure 2 A comparison of the SAUCIE clustering to other clustering methods on artificial and real data.
Rows show the different datasets. Along with the first artificial dataset, there are two CyTOF datasets and three scRNA-seq datasets, with samples of size 2000, 27499, 50000, 3005, 2730, and 2000 for the GMM, Shekhar et al., Chevrier et al., Zeisel et al., Paul et al., and Setty et al. datasets, respectively. Columns show the different clustering methods. From left to right: true “ground truth” labels, SAUCIE, k-means, Phenograph, scVI. In (b) and (c), we add the scores for the modularity and silhouette heuristics from Supplementary Table 1, respectively.
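For readers unfamiliar with the silhouette heuristic mentioned in this caption, the following is a minimal sketch of the standard mean silhouette coefficient. It assumes Euclidean distance and scores singleton clusters as 0; whether the authors used exactly these conventions is not stated here, and `silhouette_score` is an illustrative helper, not part of SAUCIE.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Mean distance from point i to the members of a cluster, excluding i.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(mean_dist(i, m) for other, m in clusters.items() if other != lab)
        if max(a, b) == 0:
            continue  # all relevant points coincide; score this point as 0
        total += (b - a) / max(a, b)
    return total / len(points)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # well-separated clustering
bad = silhouette_score(points, [0, 1, 0, 1])   # clusters cut across the gap
```

Values near 1 indicate compact, well-separated clusters; negative values indicate that points sit closer to another cluster than to their own.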
Comparisons include an artificial dataset, two technical replicates from the dengue CyTOF data, non-technical replicates on scRNA-seq batches from mouse cortex, and public data from Chevrier et al., Azizi et al., and Setty et al., with samples of size 2000, 41721, 8530, 9998, 4376, 2000, and 24741, respectively. Rows show the different datasets. Columns show the different batch correction methods. From left to right: the original data prior to batch correction, SAUCIE, mutual nearest neighbors (MNN), canonical correlation analysis (CCA). In (b) and (c), we add graphs of the mixing score and shape-preserving score results from Supplementary Table 2, respectively, for quantitative evaluation.
Supplementary Figure 4 A comparison of the SAUCIE visualization to other methods on a number of artificial and real datasets.
The columns show the different methods. From left to right: SAUCIE, PCA, Monocle2, Diffusion Maps, UMAP, tSNE, PHATE. The rows show the different datasets. From top to bottom: artificially generated trees with varying amounts of noise, a random tree generated with diffusion-limited aggregation (DLA), intersecting half circles, a Gaussian mixture model, scRNA-seq hematopoiesis from Paul et al., CyTOF T cell development from Setty et al., CyTOF iPSC from Zunder et al., scRNA-seq retinal bipolar cells from Shekhar et al., and scRNA-seq mouse cortex from Zeisel et al., with samples of size 1440, 1440, 1440, 2000, 1500, 2000, 55000, 2730, 250170, 220450, 27499, and 3005, respectively. In (b), we add a graph of the precision-recall metric results from Supplementary Table 3 for quantitative evaluation.
Several gene-gene associations are shown from the 10x mouse cortex dataset subset (4142 cells). From left to right: The original (sparse) data, data after imputation with SAUCIE, MAGIC, scImpute, and nearest neighbor completion.
Supplementary Figure 6 A comparison of imputation with SAUCIE to other methods on the simulated dropout experiment.
Increasing amounts of dropout are shown along the horizontal axis from left to right, and the accuracy of each method, as measured by R², is shown along the vertical axis. The legend gives the time, in seconds, that each method took to complete.
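A simulated dropout experiment of this kind can be sketched as follows. This is a hedged illustration, not the authors' protocol: it assumes dropout means independently zeroing each entry with a fixed probability, and it assumes R² is the coefficient of determination computed over all matrix entries (the caption does not define it). The names `simulate_dropout` and `r_squared` are hypothetical.

```python
import random


def simulate_dropout(matrix, rate, seed=0):
    """Zero each entry independently with probability `rate`."""
    rng = random.Random(seed)
    return [[0.0 if rng.random() < rate else v for v in row] for row in matrix]


def r_squared(truth, estimate):
    """Coefficient of determination over all entries of two equal-shape matrices."""
    t = [v for row in truth for v in row]
    e = [v for row in estimate for v in row]
    mean_t = sum(t) / len(t)
    ss_res = sum((a - b) ** 2 for a, b in zip(t, e))
    ss_tot = sum((a - mean_t) ** 2 for a in t)
    if ss_tot == 0:
        raise ValueError("ground truth has zero variance")
    return 1.0 - ss_res / ss_tot


# Toy ground truth; an imputation method would be applied to `corrupted`
# before scoring. Scoring the corrupted data directly gives a no-imputation baseline.
truth = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
baseline = {}
for rate in (0.0, 0.5, 1.0):
    corrupted = simulate_dropout(truth, rate)
    baseline[rate] = r_squared(truth, corrupted)
```

An imputation method is then judged by how much it raises R² above this baseline at each dropout rate.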
Four select marker abundances with samples grouped by day they were run on the cytometry instrument, with each day having fourteen distinct samples in the group. For each marker, the fourteen samples before batch correction are shown to the left of the same fourteen samples after batch correction.
Supplementary Figure 8 SAUCIE batch correction preserves relative values in samples from dengue data.
Histograms of marker expression (top: IL-6, bottom: CD86) of samples run together on the cytometry instrument on day two, separated by sample. The twelve samples were of size 50988, 41212, 29337, 177804, 177492, 154054, 95476, 82782, 82194, 182486, 137240, and 113506, respectively. The values for each sample and marker are shown before SAUCIE batch correction (left) and after SAUCIE batch correction (right). The box plots depict the minimum and maximum (whiskers), median (center line), 25th and 75th percentiles (box limits) of marker values for all cells within the specified sample.
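The box-plot elements named in this caption correspond to a five-number summary per sample. The sketch below shows one way to compute it; the linear-interpolation percentile rule is an assumption, since the caption does not say which percentile convention was used, and `box_stats` is an illustrative helper.

```python
def box_stats(values):
    """Five-number summary: min, 25th percentile, median, 75th percentile, max."""
    xs = sorted(values)
    if not xs:
        raise ValueError("need at least one value")

    def percentile(p):
        # Linear interpolation between the two closest ranks (assumed convention).
        pos = (len(xs) - 1) * p
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

    return {
        "min": xs[0],                # lower whisker
        "q25": percentile(0.25),     # lower box limit
        "median": percentile(0.5),   # center line
        "q75": percentile(0.75),     # upper box limit
        "max": xs[-1],               # upper whisker
    }


summary = box_stats([5, 1, 4, 2, 3])
```

Computing this summary for each sample before and after batch correction gives the quantities needed to check whether relative values across samples are preserved.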
The granularity of the clustering, as measured by the total number of clusters found. Each line represents a fixed value of λd as λc increases from left to right.
Metaclustering results from N=10 samples from the dengue dataset. Top left: cluster centroids embedded by tSNE and colored by metacluster, sized according to the number of cells in each cluster. Top right: cluster centroids colored by sample, also sized according to the number of cells in each cluster. Bottom left: a cell-level heatmap of expression grouped by metacluster. Bottom right: the composition of each metacluster by sample.
Left: cell-level heatmap of expression grouped by cluster. Top right: cluster centroids embedded by tSNE, sized according to the number of cells in each cluster. Bottom right: the composition of each cluster by sample.