SAVER: gene expression recovery for single-cell RNA sequencing

Huang, Mo; Wang, Jingshu; Torre, Eduardo; Dueck, Hannah; Shaffer, Sydney; Bonasio, Roberto; Murray, John I.; Raj, Arjun; Li, Mingyao; Zhang, Nancy R.

doi:10.1038/s41592-018-0033-z

Brief Communication
Published: 25 June 2018

Mo Huang¹,
Jingshu Wang¹,
Eduardo Torre²,³,
Hannah Dueck⁴,
Sydney Shaffer³,
Roberto Bonasio⁵,
John I. Murray⁴,
Arjun Raj³,⁴,
Mingyao Li⁶ &
Nancy R. Zhang¹

Nature Methods volume 15, pages 539–542 (2018)

Abstract

In single-cell RNA sequencing (scRNA-seq) studies, only a small fraction of the transcripts present in each cell are sequenced. This leads to unreliable quantification of genes with low or moderate expression, which hinders downstream analysis. To address this challenge, we developed SAVER (single-cell analysis via expression recovery), an expression recovery method for unique molecule index (UMI)-based scRNA-seq data that borrows information across genes and cells to provide accurate expression estimates for all genes.

**Fig. 1: RNA FISH validation of SAVER results on Drop-seq data.**

**Fig. 2: Evaluation of SAVER by downsampling and cell clustering.**

References

1. Svensson, V. et al. Nat. Methods 14, 381–387 (2017).
2. Kharchenko, P. V., Silberstein, L. & Scadden, D. T. Nat. Methods 11, 740–742 (2014).
3. Finak, G. et al. Genome Biol. 16, 278 (2015).
4. Pierson, E. & Yau, C. Genome Biol. 16, 241 (2015).
5. van Dijk, D. et al. Preprint at bioRxiv https://www.biorxiv.org/content/early/2017/02/25/111591 (2017).
6. Li, W. V. & Li, J. J. Nat. Commun. 9, 997 (2018).
7. Wills, Q. F. et al. Nat. Biotechnol. 31, 748–752 (2013).
8. Raj, A., Peskin, C. S., Tranchina, D., Vargas, D. Y. & Tyagi, S. PLoS Biol. 4, e309 (2006).
9. Shaffer, S. M. et al. Nature 546, 431–435 (2017).
10. Torre, E. et al. Cell Syst. 6, 171–179 (2018).
11. Jiang, L., Chen, H., Pinello, L. & Yuan, G. C. Genome Biol. 17, 144 (2016).
12. Baron, M. et al. Cell Syst. 3, 346–360 (2016).
13. Chen, R., Wu, X., Jiang, L. & Zhang, Y. Cell Rep. 18, 3227–3241 (2017).
14. La Manno, G. et al. Cell 167, 566–580 (2016).
15. Zeisel, A. et al. Science 347, 1138–1142 (2015).
16. Herdin, M., Czink, N., Ozcelik, H. & Bonek, E. Proc. 2005 IEEE 61st Vehicular Technology Conference 1, 136–140 (IEEE: Piscataway, NJ, 2005).
17. Korthauer, K. D. et al. Genome Biol. 17, 222 (2016).
18. Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Nat. Biotechnol. 33, 495–502 (2015).
19. Van Der Maaten, L. J. P. & Hinton, G. E. J. Mach. Learn. Res. 9, 2579–2605 (2008).
20. Hrvatin, S. et al. Nat. Neurosci. 21, 120–129 (2018).
21. Satija, R. et al. Seurat: guided clustering tutorial. Satija Lab http://satijalab.org/seurat/pbmc3k_tutorial.html (2018).
22. Lun, A. T. L., McCarthy, D. J. & Marioni, J. C. Workflow package: simpleSingleCell. Bioconductor https://bioconductor.org/help/workflows/simpleSingleCell/ (2016).
23. Kiselev, V. et al. Analysis of single cell RNA-seq data. Hemberg Lab https://hemberg-lab.github.io/scRNA.seq.course/index.html (2018).
24. Wang, J. et al. Preprint at bioRxiv https://www.biorxiv.org/content/early/2017/12/01/227033 (2017).
25. Wagner, F., Yan, Y. & Yanai, I. Preprint at bioRxiv https://www.biorxiv.org/content/early/2017/11/21/217737 (2017).
26. Lun, A. T., Bach, K. & Marioni, J. C. Genome Biol. 17, 75 (2016).
27. Vallejos, C. A., Marioni, J. C. & Richardson, S. PLOS Comput. Biol. 11, e1004333 (2015).
28. Bacher, R. et al. Nat. Methods 14, 584–586 (2017).
29. Friedman, J., Hastie, T. & Tibshirani, R. J. Stat. Softw. 33, 1–22 (2010).
30. Padovan-Merhar, O. et al. Mol. Cell 58, 339–352 (2015).
31. Rubin, D. B. Multiple Imputation for Nonresponse in Surveys (John Wiley, Hoboken, NJ, 1987).

Acknowledgements

This work was supported by the NIH (grant R01HG006137 to N.R.Z. and M.H.; grant R01GM125301 to N.R.Z., M.L., and M.H.; R21 HD085201 to J.I.M. and H.D.; NIH New Innovator Award DP2 OD008514 to A.R.; R33 EB019767, P30 CA016520, and 4DN U01 HL129998 to A.R. and E.T.; F30 AI114475 to S.S.; R01GM108600 and R01HL113147 to M.L.; DP2MH107055 to R.B.), the NSF (Graduate Fellowship DGE-1321851 to M.H.), the Wharton Dean’s Fund (to J.W.), the NCI (NIH/NCI PSOC award U54 CA193417 to A.R. and E.T.), the NSF (CAREER award 1350601 to A.R. and E.T.), the NIH Center for Photogenomics (RM1 HG007743 to A.R. and E.T.), a Penn Epigenetics Program Pilot award (A.R. and E.T.), the Charles E. Kauffman Foundation (KA2016-85223 to A.R. and E.T.), the Tara Miller Melanoma Foundation (to A.R. and E.T.), the Searle Scholars Program (15-SSP-102 to R.B.), the March of Dimes Foundation (1-FY-15-344 to R.B.), a Linda Pechenik Montague Investigator award (R.B.), and the Charles E. Kauffman Foundation (KA2016-85223 to R.B.). This work used the Extreme Science and Engineering Discovery Environment (XSEDE) (NSF OCI-1053575).

Author information

Authors and Affiliations

Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA, USA
Mo Huang, Jingshu Wang & Nancy R. Zhang
Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Eduardo Torre
Department of Bioengineering, University of Pennsylvania, Philadelphia, PA, USA
Eduardo Torre, Sydney Shaffer & Arjun Raj
Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Hannah Dueck, John I. Murray & Arjun Raj
Department of Cell and Developmental Biology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Roberto Bonasio
Department of Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Mingyao Li

Contributions

N.R.Z. conceived and led this work. M.H., N.R.Z., and M.L. designed the model and estimation algorithm, implemented the SAVER software, designed the in silico experiments, and led the data analysis. J.W. validated the Poisson noise model in ERCC data. E.T., H.D., S.S., R.B., J.I.M., and A.R. performed the RNA FISH and Drop-seq experiments for the melanoma cell line. M.H. and N.R.Z. wrote the paper with feedback from J.W. and M.L.

Corresponding author

Correspondence to Nancy R. Zhang.

Ethics declarations

Competing Interests

A.R. receives consulting income and A.R. and S.S. receive royalties related to Stellaris RNA FISH probes.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Comparison of FISH genes against all genes.

Comparison of the 15 FISH genes with all genes in the Drop-seq data in terms of (a) mean expression, (b) percent of cells with non-zero expression, and (c) Fano factor, a measure of dispersion. Number of cells for each gene can be found in Supplementary Table 2.
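The three per-gene summaries compared in this figure can be computed directly from a count matrix. The following is a minimal sketch, assuming a genes × cells layout and the sample variance; the function name `gene_summary` and the toy matrix are hypothetical and are not taken from the paper.

```python
import numpy as np

def gene_summary(counts):
    """Per-gene mean, percent of non-zero cells, and Fano factor.

    `counts` is assumed to be a genes x cells matrix of UMI counts.
    """
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=1)
    pct_nonzero = 100.0 * (counts > 0).mean(axis=1)
    var = counts.var(axis=1, ddof=1)
    # Fano factor = variance / mean; left as NaN for genes with zero mean.
    fano = np.divide(var, mean, out=np.full_like(mean, np.nan), where=mean > 0)
    return mean, pct_nonzero, fano

# Toy matrix: 3 genes x 4 cells (illustrative values only).
toy = np.array([[0, 0, 4, 0],
                [2, 2, 2, 2],
                [0, 0, 0, 0]])
mean, pct_nonzero, fano = gene_summary(toy)
```

A Fano factor above 1 indicates more cell-to-cell variability than a Poisson distribution with the same mean would produce.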

Supplementary Figure 2 FISH gene densities for all genes.

Comparison of distributions of expression across cells between FISH, observed Drop-seq, and SAVER recovered expression for 13 genes. Densities were calculated using the density function in R with a Gaussian smoothing kernel. The bandwidth for each gene was set to be the default bandwidth selected for the FISH density. The distribution of the SAVER recovered expression matches more closely to the FISH distribution than the observed Drop-seq. Drop-seq and SAVER distributions were calculated on n = 8,498 cells. Number of cells for FISH distributions can be found in Supplementary Table 2.
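The shared-bandwidth comparison described above can be sketched in Python as follows. This is an illustration rather than the authors' code: it assumes that "default bandwidth" refers to R's `bw.nrd0` rule of thumb, evaluates the kernel sum exactly instead of using the binned approximation of R's `density`, and uses synthetic values in place of real FISH and recovered expression.

```python
import numpy as np

def nrd0_bandwidth(x):
    """Rule-of-thumb bandwidth following R's bw.nrd0 (the density() default)."""
    x = np.asarray(x, dtype=float)
    sd = x.std(ddof=1)
    q25, q75 = np.percentile(x, [25, 75])
    spread = min(sd, (q75 - q25) / 1.34)
    if spread == 0:
        # Same fallbacks as bw.nrd0 for degenerate samples.
        spread = sd or abs(x[0]) or 1.0
    return 0.9 * spread * x.size ** (-0.2)

def gaussian_kde(x, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of x at each grid point."""
    x = np.asarray(x, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - x[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (x.size * bandwidth)

# Synthetic stand-ins for one gene (not data from the paper).
rng = np.random.default_rng(0)
fish = rng.gamma(shape=2.0, scale=5.0, size=500)
recovered = rng.gamma(shape=2.0, scale=5.0, size=800)

bandwidth = nrd0_bandwidth(fish)  # selected from the FISH values only
grid = np.linspace(-40.0, 150.0, 2001)
fish_density = gaussian_kde(fish, grid, bandwidth)
recovered_density = gaussian_kde(recovered, grid, bandwidth)
```

Fixing the bandwidth to the FISH-derived value means that differences between the curves reflect the data rather than different amounts of smoothing.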

Supplementary Figure 3 FISH analysis with MAGIC and scImpute.

RNA FISH validation of MAGIC and scImpute results for 15 genes. (a) Comparison of Gini coefficient for each gene between FISH and MAGIC (left) and between FISH and scImpute recovered values (right) for n = 15 genes. (b) Comparison of Drop-seq, MAGIC, scImpute, and SAVER Kolmogorov-Smirnov (KS) distance to FISH distributions for the 15 genes. (c) Kernel density estimates of cross-cell expression distribution of LMNA (upper) and CCNA2 (lower) for MAGIC and scImpute. (d) Comparison of pair-wise gene correlations computed from Drop-seq, SAVER, MAGIC, and scImpute with those computed from FISH counts. (e) Scatterplots of expression levels between BABAM1 and LMNA. Pearson correlations were calculated across n = 17,095 cells for FISH and n = 8,498 cells for MAGIC and scImpute.
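The two distributional summaries used in panels (a) and (b) have standard definitions, sketched below. The function names are hypothetical, and any normalization applied to the FISH and sequencing values before comparison is not described here and is therefore omitted.

```python
import numpy as np

def gini(x):
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    total = x.sum()
    if total == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float(((2 * ranks - n - 1) @ x) / (n * total))

def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov distance: max gap between empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    pooled = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, pooled, side="right") / a.size
    cdf_b = np.searchsorted(b, pooled, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

# Toy values: one cell holds all of the expression, so Gini is (n - 1) / n.
g = gini([0, 0, 0, 4])
d = ks_distance([1, 2, 3, 4], [3, 4, 5, 6])
```

A recovery method that preserves the shape of the true distribution should give a Gini coefficient close to the FISH value and a small KS distance.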

Supplementary Figure 4 Evaluation of methods on a null dataset.

Gene-to-gene correlations of n = 1,000 genes from a null dataset with no real gene relationships (Supplementary Note 1). Box plots show the median (center line), interquartile range (hinges), and 1.5 times the interquartile range (whiskers). Violin plot extremes represent the minimum and maximum values.

Supplementary Figure 5 Schematic of downsampling experiment.

Highly expressed genes and cells are selected from a real scRNA-seq dataset to create a reference dataset that serves as the ground truth. Down-sampling, which simulates efficiency loss, produces an observed dataset. Expression recovery algorithms are applied to the observed dataset, and performance is measured by calculating the gene-wise and cell-wise correlations with the reference dataset.
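The evaluation loop in this schematic can be sketched as follows. The down-sampling mechanism is an assumption: binomial thinning with a single fixed efficiency is used here for simplicity, and the sampling scheme used in the paper may differ. The reference matrix is synthetic, and all function names are hypothetical.

```python
import numpy as np

def downsample(reference, efficiency, rng):
    """Binomial thinning: keep each reference molecule with probability `efficiency`."""
    return rng.binomial(np.asarray(reference), efficiency)

def rowwise_correlation(a, b):
    """Pearson correlation between matching rows of a and b (NaN for constant rows)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    num = (a * b).sum(axis=1)
    den = np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))
    return np.divide(num, den, out=np.full(num.shape, np.nan), where=den > 0)

# Synthetic reference: 50 genes x 40 cells with gene-specific means.
rng = np.random.default_rng(1)
gene_means = 1.0 + rng.gamma(shape=2.0, scale=10.0, size=(50, 1))
reference = rng.poisson(gene_means, size=(50, 40))
observed = downsample(reference, efficiency=0.1, rng=rng)

# Gene-wise: correlate each gene across cells; cell-wise: each cell across genes.
gene_wise = rowwise_correlation(observed, reference)
cell_wise = rowwise_correlation(observed.T, reference.T)
```

To score a recovery method, `observed` would be replaced by that method's output on the observed matrix; the observed correlations then serve as the baseline.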

Supplementary Figure 6 Density plots from downsampling experiment.

Density plots of (a) gene-wise and cell-wise correlations with the reference and (b) % change in correlation compared to the observed data for SAVER, MAGIC, and scImpute.

Supplementary Figure 7 Evaluation of missing-data-imputation methods.

Evaluation of SAVER against missing data imputation algorithms for the down-sampled data. The algorithms are: SAVER, k-nearest neighbors (KNN) imputation, singular value decomposition (SVD) imputation, and random forest (RF) imputation. (a) Gene-wise and cell-wise correlations for each method. Number of genes and cells can be found in Supplementary Table 3. Box plots show the median (center line), interquartile range (hinges), and 1.5 times the interquartile range (whiskers); outlier data beyond this range are not shown. (b) Comparison of gene-to-gene and cell-to-cell correlation matrices in terms of correlation matrix distance (CMD) from the reference.
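The correlation matrix distance used in panel (b) follows the definition of Herdin et al.: one minus the trace inner product of the two matrices divided by the product of their Frobenius norms. A minimal sketch, with a hypothetical function name and toy matrices:

```python
import numpy as np

def correlation_matrix_distance(r1, r2):
    """CMD = 1 - tr(R1 R2) / (||R1||_F * ||R2||_F); 0 means identical up to scale."""
    r1 = np.asarray(r1, dtype=float)
    r2 = np.asarray(r2, dtype=float)
    inner = np.trace(r1 @ r2)
    return 1.0 - inner / (np.linalg.norm(r1, "fro") * np.linalg.norm(r2, "fro"))

# Toy example: uncorrelated genes versus perfectly correlated genes.
identity = np.eye(2)
all_ones = np.ones((2, 2))
cmd_same = correlation_matrix_distance(identity, identity)
cmd_diff = correlation_matrix_distance(identity, all_ones)
```

For a genes × cells matrix `x`, `np.corrcoef(x)` gives the gene-to-gene correlation matrix and `np.corrcoef(x.T)` the cell-to-cell matrix, either of which can be compared against the corresponding reference matrix.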

Supplementary Figure 8 t-SNE and clustering for the datasets from Baron et al.¹², Chen et al.¹³, and La Manno et al.¹⁴.

Cell clustering and t-SNE visualization of the Baron (n = 1,076 cells), Chen (n = 7,712 cells), and La Manno (n = 947 cells) reference, observed, and recovered down-sampled datasets. The colors represent the cell types identified by Seurat in the reference dataset. The Jaccard index measuring similarity between the observed/recovered clustering and reference clustering is displayed in the bottom right.
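The legend does not state which form of the Jaccard index is used to compare clusterings. One common choice, assumed here, is the pair-counting version: the number of cell pairs grouped together in both clusterings divided by the number of pairs grouped together in at least one. The function name is hypothetical.

```python
from collections import Counter
from math import comb

def pair_jaccard(labels_a, labels_b):
    """Pair-counting Jaccard index between two clusterings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same cells")
    # Pairs of cells that share a cluster in both, in A, and in B respectively.
    both = sum(comb(n, 2) for n in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(n, 2) for n in Counter(labels_a).values())
    in_b = sum(comb(n, 2) for n in Counter(labels_b).values())
    union = in_a + in_b - both
    return both / union if union else 1.0

# The index ignores cluster names: relabeled but identical partitions score 1.
same = pair_jaccard([0, 0, 1, 1], ["x", "x", "y", "y"])
partial = pair_jaccard([0, 0, 0, 1], [0, 0, 1, 1])
```

Because the index is computed over pairs of cells, it does not require matching cluster labels between the reference and the observed or recovered clustering.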

Supplementary Figure 9 Effect of the number of principal components on t-SNE and clustering.

Cell clustering and t-SNE visualization of the (a) Baron (n = 1,076 cells), (b) Chen (n = 7,712 cells), (c) La Manno (n = 947 cells), and (d) Zeisel (n = 1,799 cells) observed and recovered down-sampled datasets. The number of principal components (PCs) used in the t-SNE visualization and clustering is varied from 5 PCs to 25 PCs. The number of PCs chosen by the jackStraw method is denoted by the bold outline. The colors represent the cell types identified by Seurat in the reference dataset using the number of PCs chosen by jackStraw. The Jaccard index measuring similarity between observed/recovered clustering and reference clustering is displayed in the bottom right.

Supplementary Figure 10 Glmnet cross-validation curves and correlation with reference.

Poisson Lasso regression cross-validation plots from Glmnet and correlation with Zeisel reference plots for five genes from the 5% efficiency dataset. The x-axis represents the size of the shrinkage penalty in the LASSO regression. The dotted vertical line represents the model with the lowest cross-validation error. The horizontal line is the observed correlation with the reference and the black points represent the correlation of the SAVER estimate with the reference at each value of the shrinkage penalty. The red points in the cross-validation plot represent the mean cross-validation error as measured by Poisson deviance and the error bars represent ± 1 standard deviation. SAVER correlation with the reference is approximately maximized when using the model with the lowest cross-validation error.
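The error measure and the selection rule in this figure can be illustrated without refitting the regression. The sketch below computes the mean Poisson deviance of held-out counts given predicted means and picks the penalty with the lowest mean error across folds. It does not reproduce Glmnet itself; the function names and the error values are hypothetical.

```python
import numpy as np

def mean_poisson_deviance(y, mu):
    """Mean Poisson deviance of counts y under predicted means mu (mu > 0)."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # The y * log(y / mu) term is defined as 0 when y == 0.
    ratio = np.where(y > 0, y / mu, 1.0)
    return float(np.mean(2.0 * (y * np.log(ratio) - (y - mu))))

def select_penalty(penalties, fold_errors):
    """Return the penalty with the lowest mean error across folds."""
    fold_errors = np.asarray(fold_errors, dtype=float)  # folds x penalties
    mean_error = fold_errors.mean(axis=0)
    return penalties[int(np.argmin(mean_error))], mean_error

# Made-up cross-validation errors for three penalty values and two folds.
penalties = [1.0, 0.1, 0.01]
fold_errors = [[3.0, 1.0, 2.0],
               [3.2, 1.2, 2.1]]
best_penalty, mean_error = select_penalty(penalties, fold_errors)
```

In the figure, this selected penalty corresponds to the dotted vertical line.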

Supplementary Figure 11 Effect of predictability and efficiency on SAVER.

The SAVER estimate is a weighted average of the normalized observed expression and the predicted expression. The weight is dependent on the predictability of the gene and the cell-specific efficiency. Four scenarios are shown: Predictable (low ϕ_g) versus unpredictable (high ϕ_g) gene, in a high or low efficiency experiment. In each of the scatter plots, each point is a gene, and for each gene, the vertical lines connect the normalized observed expression with the gene’s SAVER recovered value, which always lies between the normalized observed expression and the prediction (the 45 degree line).
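The behaviour in the four scenarios can be reproduced with a small Poisson-Gamma shrinkage sketch. The parameterization is an assumption made for illustration and is not stated in the legend: the observed count is taken as Poisson with mean `size_factor` × true expression, and the true expression has a Gamma prior with mean equal to the prediction and variance `dispersion` × prediction². Under these assumptions the posterior mean is a weighted average whose weight on the observed value shrinks as the dispersion or the size factor decreases.

```python
def shrinkage_estimate(y, size_factor, prediction, dispersion):
    """Posterior mean under the assumed Poisson-Gamma model, plus the weight on the data."""
    alpha = 1.0 / dispersion        # Gamma shape (assumed constant coefficient of variation)
    beta = alpha / prediction       # Gamma rate, so the prior mean equals the prediction
    weight = size_factor / (size_factor + beta)
    estimate = weight * (y / size_factor) + (1.0 - weight) * prediction
    return estimate, weight

# Same observed count and prediction under three illustrative settings.
est_base, w_base = shrinkage_estimate(y=2, size_factor=1.0, prediction=4.0, dispersion=0.5)
est_pred, w_pred = shrinkage_estimate(y=2, size_factor=1.0, prediction=4.0, dispersion=0.1)
est_low, w_low = shrinkage_estimate(y=1, size_factor=0.5, prediction=4.0, dispersion=0.5)
```

The estimate equals (y + alpha) / (size_factor + beta), so it always falls between the normalized observed value `y / size_factor` and the prediction, consistent with the vertical lines in the scatter plots.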

Supplementary Figure 12 Downsampling dataset cutoffs.

Density plots of (a) library size across cells and (b) proportion of nonzero cells across genes for the original datasets used in the down-sampling experiment. The reference datasets were constructed by filtering such that roughly 50-60% of cells with the largest library size and 10-20% of genes with the highest proportion of nonzero cells were selected. The exact cutoffs were determined by trying to match mean expression and percentage zero between the original and the down-sampled datasets at the given efficiencies (Supplementary Table 3).
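The filtering step can be sketched as two quantile-style cuts on a genes × cells count matrix. The fractions below are placeholders within the stated ranges, not the exact cutoffs from Supplementary Table 3. Computing the gene-level proportions on the unfiltered matrix is an assumption, since the order of the two filters is not specified here, and the function name is hypothetical.

```python
import numpy as np

def build_reference(counts, cell_fraction=0.5, gene_fraction=0.1):
    """Keep the cells with the largest library size and the most widely detected genes."""
    counts = np.asarray(counts)
    n_genes, n_cells = counts.shape

    library_size = counts.sum(axis=0)
    keep_n_cells = max(1, int(round(cell_fraction * n_cells)))
    keep_cells = np.sort(np.argsort(-library_size, kind="stable")[:keep_n_cells])

    nonzero_prop = (counts > 0).mean(axis=1)
    keep_n_genes = max(1, int(round(gene_fraction * n_genes)))
    keep_genes = np.sort(np.argsort(-nonzero_prop, kind="stable")[:keep_n_genes])

    return counts[np.ix_(keep_genes, keep_cells)], keep_genes, keep_cells

# Toy matrix: 4 genes x 4 cells (illustrative values only).
toy = np.array([[5, 2, 3, 1],
                [0, 0, 1, 0],
                [2, 0, 4, 0],
                [1, 0, 2, 1]])
reference, genes, cells = build_reference(toy, cell_fraction=0.5, gene_fraction=0.5)
```

In practice the two fractions would be tuned, as the legend describes, until the down-sampled data match the original data in mean expression and percentage of zeros.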

About this article

Cite this article

Huang, M., Wang, J., Torre, E. et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat Methods 15, 539–542 (2018). https://doi.org/10.1038/s41592-018-0033-z

Received: 25 September 2017
Accepted: 30 April 2018
Published: 25 June 2018
Issue Date: July 2018
DOI: https://doi.org/10.1038/s41592-018-0033-z
