In single-cell RNA sequencing (scRNA-seq) studies, only a small fraction of the transcripts present in each cell are sequenced. This leads to unreliable quantification of genes with low or moderate expression, which hinders downstream analysis. To address this challenge, we developed SAVER (single-cell analysis via expression recovery), an expression recovery method for unique molecule index (UMI)-based scRNA-seq data that borrows information across genes and cells to provide accurate expression estimates for all genes.
Svensson, V. et al. Nat. Methods 14, 381–387 (2017).
Kharchenko, P. V., Silberstein, L. & Scadden, D. T. Nat. Methods 11, 740–742 (2014).
Finak, G. et al. Genome Biol. 16, 278 (2015).
Pierson, E. & Yau, C. Genome Biol. 16, 241 (2015).
van Dijk, D. et al. bioRxiv preprint at https://www.biorxiv.org/content/early/2017/02/25/111591 (2017).
Li, W. V. & Li, J. J. Nat. Commun. 9, 997 (2018).
Wills, Q. F. et al. Nat. Biotechnol. 31, 748–752 (2013).
Raj, A., Peskin, C. S., Tranchina, D., Vargas, D. Y. & Tyagi, S. PLoS Biol. 4, e309 (2006).
Shaffer, S. M. et al. Nature 546, 431–435 (2017).
Torre, E. et al. Cell Syst. 6, 171–179 (2018).
Jiang, L., Chen, H., Pinello, L. & Yuan, G. C. Genome Biol. 17, 144 (2016).
Baron, M. et al. Cell Syst. 3, 346–360 (2016).
Chen, R., Wu, X., Jiang, L. & Zhang, Y. Cell Rep. 18, 3227–3241 (2017).
La Manno, G. et al. Cell 167, 566–580 (2016).
Zeisel, A. et al. Science 347, 1138–1142 (2015).
Herdin, M., Czink, N., Ozcelik, H. & Bonek, E. Proc. 2005 IEEE 61st Vehicular Technology Conference 1, 136–140 (IEEE: Piscataway, NJ, 2005).
Korthauer, K. D. et al. Genome Biol. 17, 222 (2016).
Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Nat. Biotechnol. 33, 495–502 (2015).
Van Der Maaten, L. J. P. & Hinton, G. E. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Hrvatin, S. et al. Nat. Neurosci. 21, 120–129 (2018).
Satija, R. et al. Seurat: guided clustering tutorial. Satija Lab http://satijalab.org/seurat/pbmc3k_tutorial.html (2018).
Lun, A. T. L., McCarthy, D. J. & Marioni, J. C. Workflow package: simpleSingleCell. Bioconductor https://bioconductor.org/help/workflows/simpleSingleCell/ (2016).
Kiselev, V. et al. Analysis of single cell RNA-seq data. Hemberg Lab https://hemberg-lab.github.io/scRNA.seq.course/index.html (2018).
Wang, J. et al. bioRxiv preprint at https://www.biorxiv.org/content/early/2017/12/01/227033 (2017).
Wagner, F., Yan, Y. & Yanai, I. bioRxiv preprint at https://www.biorxiv.org/content/early/2017/11/21/217737 (2017).
Lun, A. T., Bach, K. & Marioni, J. C. Genome Biol. 17, 75 (2016).
Vallejos, C. A., Marioni, J. C. & Richardson, S. PLOS Comput. Biol. 11, e1004333 (2015).
Bacher, R. et al. Nat. Methods 14, 584–586 (2017).
Friedman, J., Hastie, T. & Tibshirani, R. J. Stat. Softw. 33, 1–22 (2010).
Padovan-Merhar, O. et al. Mol. Cell 58, 339–352 (2015).
Rubin, D. B. Multiple Imputation for Nonresponse in Surveys (John Wiley, Hoboken, NJ, 1987).
This work was supported by the NIH (grant R01HG006137 to N.R.Z. and M.H.; grant R01GM125301 to N.R.Z., M.L., and M.H.; R21 HD085201 to J.I.M. and H.D.; NIH New Innovator Award DP2 OD008514 to A.R.; R33 EB019767, P30 CA016520, and 4DN U01 HL129998 to A.R. and E.T.; F30 AI114475 to S.S.; R01GM108600 and R01HL113147 to M.L.; DP2MH107055 to R.B.), the NSF (Graduate Fellowship DGE-1321851 to M.H.), the Wharton Dean’s Fund (to J.W.), the NCI (NIH/NCI PSOC award U54 CA193417 to A.R. and E.T.), the NSF (CAREER award 1350601 to A.R. and E.T.), the NIH Center for Photogenomics (RM1 HG007743 to A.R. and E.T.), a Penn Epigenetics Program Pilot award (A.R. and E.T.), the Charles E. Kauffman Foundation (KA2016-85223 to A.R. and E.T.), the Tara Miller Melanoma Foundation (to A.R. and E.T.), the Searle Scholars Program (15-SSP-102 to R.B.), the March of Dimes Foundation (1-FY-15-344 to R.B.), a Linda Pechenik Montague Investigator award (R.B.), and the Charles E. Kauffman Foundation (KA2016-85223 to R.B.). This work used the Extreme Science and Engineering Discovery Environment (XSEDE) (NSF OCI-1053575).
A.R. receives consulting income and A.R. and S.S. receive royalties related to Stellaris RNA FISH probes.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Integrated supplementary information
Comparison of the 15 FISH genes with all genes in the Drop-seq data in terms of (a) mean expression, (b) percent of cells with non-zero expression, and (c) Fano factor, a measure of dispersion. Number of cells for each gene can be found in Supplementary Table 2.
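The three per-gene summaries named in this caption can be computed along the following lines. This is a minimal sketch: the genes-by-cells layout, the use of the sample variance, and the function name gene_summary are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def gene_summary(counts):
    """Per-gene mean, percent of non-zero cells, and Fano factor.

    counts: 2-D array with genes in rows and cells in columns (assumed layout).
    """
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=1)
    pct_nonzero = 100.0 * (counts > 0).mean(axis=1)
    var = counts.var(axis=1, ddof=1)  # sample variance across cells (assumption)
    # Fano factor = variance / mean; left as NaN for genes with zero mean.
    fano = np.divide(var, mean, out=np.full_like(mean, np.nan), where=mean > 0)
    return mean, pct_nonzero, fano

# Toy matrix: 2 genes x 4 cells (made-up numbers, not real data).
toy = np.array([[0, 2, 4, 2],
                [0, 0, 0, 8]])
mean, pct_nonzero, fano = gene_summary(toy)
```

Both toy genes have the same mean, but the second is expressed in fewer cells and has a much larger Fano factor, which is the kind of contrast panels (a) to (c) are meant to expose.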
Comparison of distributions of expression across cells between FISH, observed Drop-seq, and SAVER recovered expression for 13 genes. Densities were calculated using the density function in R with a Gaussian smoothing kernel. The bandwidth for each gene was set to the default bandwidth selected for the FISH density. The distribution of the SAVER recovered expression matches the FISH distribution more closely than the observed Drop-seq distribution does. Drop-seq and SAVER distributions were calculated on n = 8,498 cells. Number of cells for FISH distributions can be found in Supplementary Table 2.
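The shared-bandwidth idea can be illustrated in Python as follows. This is a sketch under stated assumptions: nrd0_bandwidth reproduces the rule of thumb behind R's default bandwidth (bw.nrd0), gaussian_kde evaluates the kernel sum directly rather than using the binned approximation that R's density function applies, and the samples are synthetic stand-ins rather than FISH or Drop-seq data.

```python
import numpy as np

def nrd0_bandwidth(x):
    """Rule-of-thumb bandwidth: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    x = np.asarray(x, dtype=float)
    sd = x.std(ddof=1)
    q75, q25 = np.percentile(x, [75, 25])
    spread = min(sd, (q75 - q25) / 1.34)
    if spread == 0:
        spread = sd or abs(x[0]) or 1.0  # same fallbacks as bw.nrd0
    return 0.9 * spread * x.size ** (-0.2)

def gaussian_kde(x, grid, bw):
    """Gaussian kernel density estimate of sample x evaluated on grid."""
    x = np.asarray(x, dtype=float)
    z = (grid[:, None] - x[None, :]) / bw
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (x.size * bw * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
fish = rng.gamma(2.0, 1.0, size=500)    # synthetic stand-in for FISH values
other = rng.gamma(2.0, 1.2, size=300)   # synthetic stand-in for another assay

bw = nrd0_bandwidth(fish)               # bandwidth chosen from the FISH sample
grid = np.linspace(-5.0, 25.0, 512)
dens_fish = gaussian_kde(fish, grid, bw)
dens_other = gaussian_kde(other, grid, bw)  # same bandwidth reused
```

Reusing one bandwidth keeps differences between the curves from being an artifact of different amounts of smoothing.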
RNA FISH validation of MAGIC and scImpute results for 15 genes. (a) Comparison of Gini coefficient for each gene between FISH and MAGIC (left) and between FISH and scImpute recovered values (right) for n = 15 genes. (b) Comparison of Drop-seq, MAGIC, scImpute, and SAVER Kolmogorov-Smirnov (KS) distance to FISH distributions for the 15 genes. (c) Kernel density estimates of cross-cell expression distribution of LMNA (upper) and CCNA2 (lower) for MAGIC and scImpute. (d) Comparison of pair-wise gene correlations computed from Drop-seq, SAVER, MAGIC, and scImpute with those computed from FISH counts. (e) Scatterplots of expression levels between BABAM1 and LMNA. Pearson correlations were calculated across n = 17,095 cells for FISH and n = 8,498 cells for MAGIC and scImpute.
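For readers unfamiliar with the two summary statistics in panels (a) and (b), a minimal Python sketch follows. The function names are hypothetical, and any rescaling needed to put FISH and sequencing values on a common scale before computing the KS distance is assumed to have been done beforehand; the caption does not describe that step.

```python
import numpy as np

def gini(x):
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    total = x.sum()
    if total == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float(((2 * ranks - n - 1) * x).sum() / (n * total))

def ks_distance(a, b):
    """Two-sample KS distance: largest gap between the empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    pooled = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, pooled, side="right") / a.size
    cdf_b = np.searchsorted(b, pooled, side="right") / b.size
    return float(np.abs(cdf_a - cdf_b).max())

g_even = gini([1, 1, 1, 1])                    # every cell expresses equally
g_rare = gini([0, 0, 0, 1])                    # expression confined to one cell
d_shift = ks_distance([1, 2, 3, 4], [3, 4, 5, 6])
```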
Gene-to-gene correlations of n = 1,000 genes from a null dataset with no real gene relationships (Supplementary Note 1). Box plots show the median (center line), interquartile range (hinges), and 1.5 times the interquartile range (whiskers). Violin plot extremes represent the minimum and maximum values.
Highly expressed genes and cells are selected from a real scRNA-seq dataset to create a reference dataset that serves as the ground truth. Down-sampling the reference to simulate efficiency loss yields an observed dataset. Expression recovery algorithms are applied to the observed dataset, and performance is measured by calculating the gene-wise and cell-wise correlations with the reference dataset.
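The evaluation scheme described in this caption can be sketched as below. Several things here are assumptions made for illustration: the reference is synthetic, down-sampling is done by binomial thinning at a single fixed efficiency (the caption does not specify the sampling model), and the helper names downsample and rowwise_corr are hypothetical.

```python
import numpy as np

def downsample(reference, efficiency, rng):
    """Binomial thinning: keep each reference molecule with prob. efficiency."""
    return rng.binomial(np.asarray(reference, dtype=np.int64), efficiency)

def rowwise_corr(a, b):
    """Pearson correlation between matching rows of a and b."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    denom = np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))
    # Rows with zero variance have no defined correlation and stay NaN.
    return np.divide((a * b).sum(axis=1), denom,
                     out=np.full(a.shape[0], np.nan), where=denom > 0)

rng = np.random.default_rng(1)
# Synthetic reference: 50 genes x 200 cells with gene-specific mean expression.
gene_means = rng.gamma(2.0, 5.0, size=(50, 1))
reference = rng.poisson(lam=gene_means, size=(50, 200))

observed = downsample(reference, efficiency=0.1, rng=rng)
gene_corr = rowwise_corr(observed, reference)       # one value per gene
cell_corr = rowwise_corr(observed.T, reference.T)   # one value per cell
```

A recovery method would be run on observed, and the same two correlation vectors would then be recomputed with its output in place of observed.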
Density plots of (a) gene-wise and cell-wise correlations with the reference and (b) % change in correlation compared to the observed data for SAVER, MAGIC, and scImpute.
Evaluation of SAVER against missing data imputation algorithms for the down-sampled data. The algorithms are: SAVER, k-nearest neighbors (KNN) imputation, singular value decomposition (SVD) imputation, and random forest (RF) imputation. (a) Gene-wise and cell-wise correlations for each method. Number of genes and cells can be found in Supplementary Table 3. Box plots show the median (center line), interquartile range (hinges), and 1.5 times the interquartile range (whiskers); outlier data beyond this range are not shown. (b) Comparison of gene-to-gene and cell-to-cell correlation matrices in terms of correlation matrix distance (CMD) from the reference.
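The correlation matrix distance in panel (b) follows the definition of Herdin et al. cited in the reference list: one minus the trace of the product of the two matrices, divided by the product of their Frobenius norms. A short Python sketch, with a hypothetical function name and toy matrices:

```python
import numpy as np

def correlation_matrix_distance(r1, r2):
    """CMD = 1 - tr(R1 R2) / (||R1||_F * ||R2||_F); 0 means equal up to scale."""
    r1 = np.asarray(r1, dtype=float)
    r2 = np.asarray(r2, dtype=float)
    norms = np.linalg.norm(r1, "fro") * np.linalg.norm(r2, "fro")
    return float(1.0 - np.trace(r1 @ r2) / norms)

# Toy 2 x 2 correlation matrices (made-up values).
identity = np.eye(2)
correlated = np.array([[1.0, 0.5],
                       [0.5, 1.0]])
cmd_same = correlation_matrix_distance(correlated, correlated)
cmd_diff = correlation_matrix_distance(correlated, identity)
```

In practice the inputs would be gene-to-gene or cell-to-cell correlation matrices, for example from np.corrcoef applied to the reference and to a recovered dataset.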
Supplementary Figure 8: t-SNE and clustering for the datasets from Baron et al. (ref. 12), Chen et al. (ref. 13), and La Manno et al. (ref. 14).
Cell clustering and t-SNE visualization of the Baron (n = 1,076 cells), Chen (n = 7,712 cells), and La Manno (n = 947 cells) reference, observed, and recovered down-sampled datasets. The colors represent the cell types identified by Seurat in the reference dataset. The Jaccard index measuring similarity between the observed/recovered clustering and reference clustering is displayed in the bottom right.
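One common way to compute a Jaccard index between two clusterings is by pair counting: the number of cell pairs grouped together in both clusterings, divided by the number grouped together in at least one. The caption does not say which variant was used, so the sketch below is an assumption, and pair_jaccard is a hypothetical name.

```python
import numpy as np

def pair_jaccard(labels_a, labels_b):
    """Pair-counting Jaccard index between two clusterings of the same cells."""
    _, a = np.unique(np.asarray(labels_a).ravel(), return_inverse=True)
    _, b = np.unique(np.asarray(labels_b).ravel(), return_inverse=True)
    # Contingency table: cells shared by cluster i of A and cluster j of B.
    table = np.zeros((a.max() + 1, b.max() + 1), dtype=np.int64)
    np.add.at(table, (a, b), 1)

    def pairs(m):
        return m * (m - 1) // 2  # number of unordered pairs among m cells

    both = pairs(table).sum()                 # together in A and in B
    same_a = pairs(table.sum(axis=1)).sum()   # together in A
    same_b = pairs(table.sum(axis=0)).sum()   # together in B
    union = same_a + same_b - both
    return float(both / union) if union else 1.0

j_identical = pair_jaccard([0, 0, 1, 1], ["x", "x", "y", "y"])  # labels renamed
j_partial = pair_jaccard([0, 0, 0, 1], [0, 0, 1, 1])
j_crossed = pair_jaccard([0, 0, 1, 1], [0, 1, 0, 1])
```

Because it counts pairs, the index does not depend on how clusters are named, only on which cells are grouped together.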
Cell clustering and t-SNE visualization of the (a) Baron (n = 1,076 cells), (b) Chen (n = 7,712 cells), (c) La Manno (n = 947 cells), and (d) Zeisel (n = 1,799 cells) observed and recovered down-sampled datasets. The number of principal components (PCs) used in the t-SNE visualization and clustering is varied from 5 PCs to 25 PCs. The number of PCs chosen by the jackStraw method is denoted by the bold outline. The colors represent the cell types identified by Seurat in the reference dataset using the number of PCs chosen by jackStraw. The Jaccard index measuring similarity between observed/recovered clustering and reference clustering is displayed in the bottom right.
Poisson LASSO regression cross-validation plots from glmnet and correlation with Zeisel reference plots for five genes from the 5% efficiency dataset. The x-axis represents the size of the shrinkage penalty in the LASSO regression. The dotted vertical line represents the model with the lowest cross-validation error. The horizontal line is the observed correlation with the reference, and the black points represent the correlation of the SAVER estimate with the reference at each value of the shrinkage penalty. The red points in the cross-validation plot represent the mean cross-validation error as measured by Poisson deviance, and the error bars represent ± 1 standard deviation. SAVER correlation with the reference is approximately maximized when using the model with the lowest cross-validation error.
The SAVER estimate is a weighted average of the normalized observed expression and the predicted expression. The weight is dependent on the predictability of the gene and the cell-specific efficiency. Four scenarios are shown: Predictable (low ϕg) versus unpredictable (high ϕg) gene, in a high or low efficiency experiment. In each of the scatter plots, each point is a gene, and for each gene, the vertical lines connect the normalized observed expression with the gene’s SAVER recovered value, which always lies between the normalized observed expression and the prediction (the 45 degree line).
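The weighted-average behaviour described in this caption can be reproduced with a gamma-Poisson posterior mean. The parametrization below is an assumption made for illustration (counts Poisson with mean size_factor times the true expression; a gamma prior with mean mu and variance phi times mu squared); the caption itself only states that the weight depends on ϕg and on the cell-specific efficiency.

```python
def shrinkage_estimate(y, size_factor, mu, phi):
    """Posterior mean of expression under an assumed gamma-Poisson model.

    y: observed count; size_factor: cell-specific efficiency/size factor;
    mu: predicted expression (prior mean); phi: gene dispersion (prior
    variance = phi * mu**2, an assumed parametrization).
    """
    alpha = 1.0 / phi          # gamma shape
    beta = 1.0 / (phi * mu)    # gamma rate, so that alpha / beta == mu
    weight_obs = size_factor / (size_factor + beta)
    normalized_obs = y / size_factor
    # Algebraically equal to (y + alpha) / (size_factor + beta).
    estimate = weight_obs * normalized_obs + (1.0 - weight_obs) * mu
    return estimate, weight_obs

# Unpredictable gene, high efficiency: the estimate stays near the observation.
est_hi, w_hi = shrinkage_estimate(y=10, size_factor=1.0, mu=4.0, phi=0.5)
# Predictable gene (low phi): the estimate is pulled toward the prediction.
est_pred, w_pred = shrinkage_estimate(y=10, size_factor=1.0, mu=4.0, phi=0.01)
# Same gene as the first case, lower efficiency: less weight on the observation.
est_lo, w_lo = shrinkage_estimate(y=1, size_factor=0.1, mu=4.0, phi=0.5)
```

Because the weight lies strictly between 0 and 1, the estimate always falls between the normalized observed expression and the prediction, matching the vertical segments in the scatter plots.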
Density plots of (a) library size across cells and (b) proportion of nonzero cells across genes for the original datasets used in the down-sampling experiment. The reference datasets were constructed by filtering such that roughly 50–60% of cells with the largest library size and 10–20% of genes with the highest proportion of nonzero cells were selected. The exact cutoffs were chosen so that the mean expression and percentage of zeros in the down-sampled datasets at the given efficiencies matched those of the original datasets as closely as possible (Supplementary Table 3).
About this article
Cite this article
Huang, M., Wang, J., Torre, E. et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat Methods 15, 539–542 (2018). https://doi.org/10.1038/s41592-018-0033-z