SAVER: gene expression recovery for single-cell RNA sequencing

Abstract

In single-cell RNA sequencing (scRNA-seq) studies, only a small fraction of the transcripts present in each cell are sequenced. This leads to unreliable quantification of genes with low or moderate expression, which hinders downstream analysis. To address this challenge, we developed SAVER (single-cell analysis via expression recovery), an expression recovery method for unique molecule index (UMI)-based scRNA-seq data that borrows information across genes and cells to provide accurate expression estimates for all genes.

Fig. 1: RNA FISH validation of SAVER results on Drop-seq data.
Fig. 2: Evaluation of SAVER by downsampling and cell clustering.

References

  1. Svensson, V. et al. Nat. Methods 14, 381–387 (2017).
  2. Kharchenko, P. V., Silberstein, L. & Scadden, D. T. Nat. Methods 11, 740–742 (2014).
  3. Finak, G. et al. Genome Biol. 16, 278 (2015).
  4. Pierson, E. & Yau, C. Genome Biol. 16, 241 (2015).
  5. van Dijk, D. et al. bioRxiv preprint at https://www.biorxiv.org/content/early/2017/02/25/111591 (2017).
  6. Li, W. V. & Li, J. J. Nat. Commun. 9, 997 (2018).
  7. Wills, Q. F. et al. Nat. Biotechnol. 31, 748–752 (2013).
  8. Raj, A., Peskin, C. S., Tranchina, D., Vargas, D. Y. & Tyagi, S. PLoS Biol. 4, e309 (2006).
  9. Shaffer, S. M. et al. Nature 546, 431–435 (2017).
  10. Torre, E. et al. Cell Syst. 6, 171–179 (2018).
  11. Jiang, L., Chen, H., Pinello, L. & Yuan, G. C. Genome Biol. 17, 144 (2016).
  12. Baron, M. et al. Cell Syst. 3, 346–360 (2016).
  13. Chen, R., Wu, X., Jiang, L. & Zhang, Y. Cell Rep. 18, 3227–3241 (2017).
  14. La Manno, G. et al. Cell 167, 566–580 (2016).
  15. Zeisel, A. et al. Science 347, 1138–1142 (2015).
  16. Herdin, M., Czink, N., Ozcelik, H. & Bonek, E. Proc. 2005 IEEE 61st Vehicular Technology Conference 1, 136–140 (IEEE: Piscataway, NJ, 2005).
  17. Korthauer, K. D. et al. Genome Biol. 17, 222 (2016).
  18. Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Nat. Biotechnol. 33, 495–502 (2015).
  19. Van Der Maaten, L. J. P. & Hinton, G. E. J. Mach. Learn. Res. 9, 2579–2605 (2008).
  20. Hrvatin, S. et al. Nat. Neurosci. 21, 120–129 (2018).
  21. Satija, R. et al. Seurat: guided clustering tutorial. Satija Lab http://satijalab.org/seurat/pbmc3k_tutorial.html (2018).
  22. Lun, A. T. L., McCarthy, D. J. & Marioni, J. C. Workflow package: simpleSingleCell. Bioconductor https://bioconductor.org/help/workflows/simpleSingleCell/ (2016).
  23. Kiselev, V. et al. Analysis of single cell RNA-seq data. Hemberg Lab https://hemberg-lab.github.io/scRNA.seq.course/index.html (2018).
  24. Wang, J. et al. bioRxiv preprint at https://www.biorxiv.org/content/early/2017/12/01/227033 (2017).
  25. Wagner, F., Yan, Y. & Yanai, I. bioRxiv preprint at https://www.biorxiv.org/content/early/2017/11/21/217737 (2017).
  26. Lun, A. T., Bach, K. & Marioni, J. C. Genome Biol. 17, 75 (2016).
  27. Vallejos, C. A., Marioni, J. C. & Richardson, S. PLOS Comput. Biol. 11, e1004333 (2015).
  28. Bacher, R. et al. Nat. Methods 14, 584–586 (2017).
  29. Friedman, J., Hastie, T. & Tibshirani, R. J. Stat. Softw. 33, 1–22 (2010).
  30. Padovan-Merhar, O. et al. Mol. Cell 58, 339–352 (2015).
  31. Rubin, D. B. Multiple Imputation for Nonresponse in Surveys (John Wiley, Hoboken, NJ, 1987).

Acknowledgements

This work was supported by the NIH (grant R01HG006137 to N.R.Z. and M.H.; grant R01GM125301 to N.R.Z., M.L., and M.H.; R21 HD085201 to J.I.M. and H.D.; NIH New Innovator Award DP2 OD008514 to A.R.; R33 EB019767, P30 CA016520, and 4DN U01 HL129998 to A.R. and E.T.; F30 AI114475 to S.S.; R01GM108600 and R01HL113147 to M.L.; DP2MH107055 to R.B.), the NSF (Graduate Fellowship DGE-1321851 to M.H.), the Wharton Dean’s Fund (to J.W.), the NCI (NIH/NCI PSOC award U54 CA193417 to A.R. and E.T.), the NSF (CAREER award 1350601 to A.R. and E.T.), the NIH Center for Photogenomics (RM1 HG007743 to A.R. and E.T.), a Penn Epigenetics Program Pilot award (A.R. and E.T.), the Charles E. Kauffman Foundation (KA2016-85223 to A.R. and E.T.), the Tara Miller Melanoma Foundation (to A.R. and E.T.), the Searle Scholars Program (15-SSP-102 to R.B.), the March of Dimes Foundation (1-FY-15-344 to R.B.), a Linda Pechenik Montague Investigator award (R.B.), and the Charles E. Kauffman Foundation (KA2016-85223 to R.B.). This work used the Extreme Science and Engineering Discovery Environment (XSEDE) (NSF OCI-1053575).

Author information

Contributions

N.R.Z. conceived and led this work. M.H., N.R.Z., and M.L. designed the model and estimation algorithm, implemented the SAVER software, designed the in silico experiments, and led the data analysis. J.W. validated the Poisson noise model in ERCC data. E.T., H.D., S.S., R.B., J.I.M., and A.R. performed the RNA FISH and Drop-seq experiments for the melanoma cell line. M.H. and N.R.Z. wrote the paper with feedback from J.W. and M.L.

Corresponding author

Correspondence to Nancy R. Zhang.

Ethics declarations

Competing Interests

A.R. receives consulting income and A.R. and S.S. receive royalties related to Stellaris RNA FISH probes.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Comparison of FISH genes against all genes.

Comparison of the 15 FISH genes with all genes in the Drop-seq data in terms of (a) mean expression, (b) percent of cells with non-zero expression, and (c) Fano factor, a measure of dispersion. Number of cells for each gene can be found in Supplementary Table 2.
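The three per-gene summaries named in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy counts are hypothetical, and the use of the population variance (rather than the sample variance) in the Fano factor is an assumption.

```python
from statistics import mean, pvariance

def fano_factor(counts):
    """Variance-to-mean ratio of one gene's counts across cells.

    Assumption: population variance; the caption does not say which
    variance estimator was used.
    """
    m = mean(counts)
    if m == 0:
        return float("nan")  # undefined for a gene with no counts
    return pvariance(counts) / m

def percent_nonzero(counts):
    """Percent of cells in which the gene has a non-zero count."""
    return 100.0 * sum(1 for c in counts if c > 0) / len(counts)

# Hypothetical UMI counts for one gene across eight cells.
gene_counts = [0, 0, 1, 0, 3, 0, 0, 4]

print(mean(gene_counts))             # mean expression: 1
print(percent_nonzero(gene_counts))  # 37.5
print(fano_factor(gene_counts))      # 2.25
```

A Fano factor of 1 is what a Poisson-distributed count would give, so values above 1 indicate overdispersion relative to Poisson noise.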

Supplementary Figure 2 FISH gene densities for all genes.

Comparison of distributions of expression across cells between FISH, observed Drop-seq, and SAVER recovered expression for 13 genes. Densities were calculated using the density function in R with a Gaussian smoothing kernel. The bandwidth for each gene was set to be the default bandwidth selected for the FISH density. The distribution of the SAVER recovered expression matches more closely to the FISH distribution than the observed Drop-seq. Drop-seq and SAVER distributions were calculated on n = 8,498 cells. Number of cells for FISH distributions can be found in Supplementary Table 2.

Supplementary Figure 3 FISH analysis with MAGIC and scImpute.

RNA FISH validation of MAGIC and scImpute results for 15 genes. (a) Comparison of Gini coefficient for each gene between FISH and MAGIC (left) and between FISH and scImpute recovered values (right) for n = 15 genes. (b) Comparison of Drop-seq, MAGIC, scImpute, and SAVER Kolmogorov-Smirnov (KS) distance to FISH distributions for the 15 genes. (c) Kernel density estimates of cross-cell expression distribution of LMNA (upper) and CCNA2 (lower) for MAGIC and scImpute. (d) Comparison of pair-wise gene correlations computed from Drop-seq, SAVER, MAGIC, and scImpute with those computed from FISH counts. (e) Scatterplots of expression levels between BABAM1 and LMNA. Pearson correlations were calculated across n = 17,095 cells for FISH and n = 8,498 cells for MAGIC and scImpute.
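For readers unfamiliar with the two distribution summaries used in panels (a) and (b), the sketch below computes a Gini coefficient and a two-sample KS distance in plain Python. It is illustrative only: the function names and toy values are hypothetical, and any normalization the authors applied before comparing FISH with sequencing-based values is not reproduced here.

```python
from bisect import bisect_right

def gini_coefficient(values):
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if total == 0:
        return 0.0  # convention chosen here for an all-zero gene
    # Standard rank-based formula on the sorted values (ranks start at 1).
    weighted = sum((2 * rank - n - 1) * x for rank, x in enumerate(xs, start=1))
    return weighted / (n * total)

def ks_distance(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov distance: largest gap between empirical CDFs."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    distance = 0.0
    for x in set(a_sorted) | set(b_sorted):
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        distance = max(distance, abs(cdf_a - cdf_b))
    return distance

# Hypothetical expression values for one gene in four cells.
print(gini_coefficient([0, 0, 0, 10]))   # 0.75: expression concentrated in one cell
print(gini_coefficient([1, 1, 1, 1]))    # 0.0: identical expression in every cell
print(ks_distance([0, 0], [1, 1]))       # 1.0: non-overlapping distributions
```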

Supplementary Figure 4 Evaluation of methods on a null dataset.

Gene-to-gene correlations of n = 1,000 genes from a null dataset with no real gene relationships (Supplementary Note 1). Box plots show the median (center line), interquartile range (hinges), and 1.5 times the interquartile range (whiskers). Violin plot extremes represent the minimum and maximum values.

Supplementary Figure 5 Schematic of downsampling experiment.

Highly expressed genes and cells are selected from a real scRNA-seq dataset to create a reference dataset that serves as the ground truth. Down-sampling, which simulates efficiency loss, produces an observed dataset. Expression recovery algorithms are applied to the observed dataset, and performance is measured by calculating the gene-wise and cell-wise correlations with the reference dataset.
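The evaluation loop in this schematic can be sketched in a few lines. This is a toy illustration under stated assumptions: down-sampling is done here by binomial thinning with a single fixed efficiency, which may differ from the sampling scheme the authors used; the matrix, the function names, and the 10% efficiency are hypothetical; and raw counts are compared without the normalization a real analysis would apply.

```python
import math
import random

def downsample(reference, efficiency, rng):
    """Keep each molecule independently with probability `efficiency`
    (binomial thinning; an assumed stand-in for efficiency loss)."""
    return [[sum(1 for _ in range(count) if rng.random() < efficiency)
             for count in row] for row in reference]

def pearson(x, y):
    """Pearson correlation; NaN if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def gene_wise_correlations(reference, estimate):
    """One correlation per gene (rows are genes, columns are cells)."""
    return [pearson(r, e) for r, e in zip(reference, estimate)]

def cell_wise_correlations(reference, estimate):
    """One correlation per cell, computed across genes."""
    return [pearson(r, e) for r, e in zip(zip(*reference), zip(*estimate))]

# Hypothetical reference: 3 genes x 4 cells.
reference = [[10, 20, 30, 40], [5, 0, 8, 2], [100, 50, 75, 25]]
observed = downsample(reference, 0.1, random.Random(0))

# A recovery method would be applied to `observed` here; its output would then
# be scored against `reference` in the same way as the observed data.
print(gene_wise_correlations(reference, observed))
print(cell_wise_correlations(reference, observed))
```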

Supplementary Figure 6 Density plots from downsampling experiment.

Density plots of (a) gene-wise and cell-wise correlations with the reference and (b) % change in correlation compared to the observed data for SAVER, MAGIC, and scImpute.

Supplementary Figure 7 Evaluation of missing-data-imputation methods.

Evaluation of SAVER against missing data imputation algorithms for the down-sampled data. The algorithms are: SAVER, k-nearest neighbors (KNN) imputation, singular value decomposition (SVD) imputation, and random forest (RF) imputation. (a) Gene-wise and cell-wise correlations for each method. Number of genes and cells can be found in Supplementary Table 3. Box plots show the median (center line), interquartile range (hinges), and 1.5 times the interquartile range (whiskers); outlier data beyond this range are not shown. (b) Comparison of gene-to-gene and cell-to-cell correlation matrices in terms of correlation matrix distance (CMD) from the reference.
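The correlation matrix distance used in panel (b) is, in its usual definition, one minus the normalized trace inner product of the two matrices: CMD(R1, R2) = 1 − tr(R1 R2) / (‖R1‖F ‖R2‖F). A minimal sketch of that formula follows; the function name and the toy matrices are hypothetical.

```python
import math

def correlation_matrix_distance(r1, r2):
    """CMD = 1 - tr(R1 R2) / (||R1||_F * ||R2||_F) for two square matrices.

    0 means the matrices are equal up to a scale factor; larger values
    mean the correlation structures differ more.
    """
    n = len(r1)
    trace = sum(r1[i][j] * r2[j][i] for i in range(n) for j in range(n))
    norm1 = math.sqrt(sum(v * v for row in r1 for v in row))
    norm2 = math.sqrt(sum(v * v for row in r2 for v in row))
    return 1.0 - trace / (norm1 * norm2)

# Hypothetical 2 x 2 gene-to-gene correlation matrices.
reference_corr = [[1.0, 0.5], [0.5, 1.0]]
uncorrelated = [[1.0, 0.0], [0.0, 1.0]]

print(correlation_matrix_distance(reference_corr, reference_corr))  # 0 up to rounding
print(correlation_matrix_distance(reference_corr, uncorrelated))    # about 0.106
```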

Supplementary Figure 8 t-SNE and clustering for the datasets from Baron et al. (ref. 12), Chen et al. (ref. 13), and La Manno et al. (ref. 14).

Cell clustering and t-SNE visualization of the Baron (n = 1,076 cells), Chen (n = 7,712 cells), and La Manno (n = 947 cells) reference, observed, and recovered down-sampled datasets. The colors represent the cell types identified by Seurat in the reference dataset. The Jaccard index measuring similarity between the observed/recovered clustering and reference clustering is displayed in the bottom right.
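One common way to compare two clusterings with a Jaccard index is to count pairs of cells: the number of pairs placed together in both clusterings, divided by the number placed together in at least one. The caption does not state which variant was used, so the pair-counting form below is an assumption, and the function name and labels are hypothetical.

```python
from itertools import combinations

def jaccard_index(labels_a, labels_b):
    """Pair-counting Jaccard index between two clusterings of the same cells."""
    both = either = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        both += same_a and same_b    # pair co-clustered in both
        either += same_a or same_b   # pair co-clustered in at least one
    return both / either if either else 1.0

# Hypothetical cluster labels for four cells.
reference_labels = [0, 0, 1, 1]
print(jaccard_index(reference_labels, [1, 1, 0, 0]))  # 1.0: same partition, renamed labels
print(jaccard_index(reference_labels, [0, 1, 0, 1]))  # 0.0: no pair agrees
print(jaccard_index([0, 0, 0, 1], [0, 0, 1, 1]))      # 0.25
```

Because it works on pairs of cells, this form does not depend on how cluster labels are named, which matters when the reference and recovered data are clustered independently.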

Supplementary Figure 9 Effect of the number of principal components on t-SNE and clustering.

Cell clustering and t-SNE visualization of the (a) Baron (n = 1,076 cells), (b) Chen (n = 7,712), (c) La Manno (n = 947 cells), and (d) Zeisel (n = 1,799 cells) observed and recovered down-sampled datasets. The number of principal components (PCs) used in the t-SNE visualization and clustering is varied from 5 PCs to 25 PCs. The number of PCs chosen by the jackStraw method is denoted by the bold outline. The colors represent the cell types identified by Seurat in the reference dataset using the number of PCs chosen by jackStraw. The Jaccard index measuring similarity between observed/recovered clustering and reference clustering is displayed in the bottom right.

Supplementary Figure 10 Glmnet cross-validation curves and correlation with reference.

Poisson LASSO regression cross-validation plots from glmnet, paired with plots of correlation with the Zeisel reference, for five genes from the 5% efficiency dataset. The x-axis represents the size of the shrinkage penalty in the LASSO regression. The dotted vertical line represents the model with the lowest cross-validation error. The horizontal line is the observed correlation with the reference, and the black points represent the correlation of the SAVER estimate with the reference at each value of the shrinkage penalty. The red points in the cross-validation plot represent the mean cross-validation error as measured by Poisson deviance, and the error bars represent ± 1 standard deviation. SAVER correlation with the reference is approximately maximized when using the model with the lowest cross-validation error.
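The error measure on the cross-validation curves can be illustrated with a short sketch of the mean Poisson deviance and of picking the penalty with the lowest mean error. This is not glmnet's code: the function names, penalty values, and error values are hypothetical, and the model fitting itself is omitted.

```python
import math

def mean_poisson_deviance(observed, predicted):
    """Mean of 2 * [y * log(y / mu) - (y - mu)], with y * log(y / mu) = 0 when y = 0."""
    total = 0.0
    for y, mu in zip(observed, predicted):
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += 2.0 * (log_term - (y - mu))
    return total / len(observed)

def best_penalty(penalties, cv_errors):
    """Return the penalty whose mean cross-validation error is lowest."""
    return min(zip(cv_errors, penalties))[1]

# Hypothetical held-out counts and predicted means for one gene.
print(mean_poisson_deviance([0, 2, 1], [0.5, 1.5, 1.0]))

# Hypothetical cross-validation curve: one mean error per penalty value.
penalties = [0.5, 0.1, 0.02]
cv_errors = [1.40, 1.12, 1.25]
print(best_penalty(penalties, cv_errors))  # 0.1
```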

Supplementary Figure 11 Effect of predictability and efficiency on SAVER.

The SAVER estimate is a weighted average of the normalized observed expression and the predicted expression. The weight is dependent on the predictability of the gene and the cell-specific efficiency. Four scenarios are shown: Predictable (low ϕg) versus unpredictable (high ϕg) gene, in a high or low efficiency experiment. In each of the scatter plots, each point is a gene, and for each gene, the vertical lines connect the normalized observed expression with the gene’s SAVER recovered value, which always lies between the normalized observed expression and the prediction (the 45 degree line).
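The weighted-average behavior described here is what a Gamma-Poisson model gives. The sketch below assumes a count Y ~ Poisson(s·λ) with size factor s, and a Gamma prior on λ with mean equal to the prediction μ and squared coefficient of variation ϕ (shape 1/ϕ, rate 1/(ϕμ)). That parameterization, the function name, and the numbers are assumptions made for illustration, not taken from the caption.

```python
def posterior_mean_estimate(count, size_factor, prediction, dispersion):
    """Posterior mean of expression under an assumed Gamma-Poisson model.

    Returns (estimate, weight), where `weight` is the share given to the
    normalized observed expression and 1 - weight goes to the prediction.
    """
    rate = 1.0 / (dispersion * prediction)        # Gamma rate; shape is 1 / dispersion
    weight = size_factor / (size_factor + rate)   # grows with efficiency and dispersion
    observed = count / size_factor                # normalized observed expression
    estimate = weight * observed + (1.0 - weight) * prediction
    return estimate, weight

# Hypothetical gene: 3 observed molecules, prediction of 1, size factor of 1.
for phi in (0.1, 1.0, 10.0):
    estimate, weight = posterior_mean_estimate(3, 1.0, 1.0, phi)
    print(phi, round(weight, 3), round(estimate, 3))
```

With a low ϕ (a predictable gene) the estimate stays close to the prediction, and with a high ϕ or a larger size factor it moves toward the normalized observed value, which matches the four scenarios in the figure. The estimate equals the conjugate posterior mean (Y + shape) / (s + rate), so it always lies between the two quantities being averaged.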

Supplementary Figure 12 Downsampling dataset cutoffs.

Density plots of (a) library size across cells and (b) proportion of nonzero cells across genes for the original datasets used in the down-sampling experiment. The reference datasets were constructed by filtering such that roughly 50-60% of cells with the largest library size and 10-20% of genes with the highest proportion of nonzero cells were selected. The exact cutoffs were determined by trying to match mean expression and percentage zero between the original and the down-sampled datasets at the given efficiencies (Supplementary Table 3).

Supplementary information

Supplementary Text and Figures

Supplementary Figs. 1–12, Supplementary Tables 1–3, and Supplementary Notes 1–4

Reporting Summary

Supplementary Software

SAVER v1.0.0

Source Data, Figure 1

Source Data, Figure 2

About this article

Cite this article

Huang, M., Wang, J., Torre, E. et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat Methods 15, 539–542 (2018). https://doi.org/10.1038/s41592-018-0033-z
