Brief Communication

SAVER: gene expression recovery for single-cell RNA sequencing

Nature Methods volume 15, pages 539–542 (2018)

Abstract

In single-cell RNA sequencing (scRNA-seq) studies, only a small fraction of the transcripts present in each cell are sequenced. This leads to unreliable quantification of genes with low or moderate expression, which hinders downstream analysis. To address this challenge, we developed SAVER (single-cell analysis via expression recovery), an expression recovery method for unique molecule index (UMI)-based scRNA-seq data that borrows information across genes and cells to provide accurate expression estimates for all genes.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Svensson, V. et al. Nat. Methods 14, 381–387 (2017).
  2. Kharchenko, P. V., Silberstein, L. & Scadden, D. T. Nat. Methods 11, 740–742 (2014).
  3. Finak, G. et al. Genome Biol. 16, 278 (2015).
  4. Pierson, E. & Yau, C. Genome Biol. 16, 241 (2015).
  5. van Dijk, D. et al. bioRxiv preprint at https://www.biorxiv.org/content/early/2017/02/25/111591 (2017).
  6. Li, W. V. & Li, J. J. Nat. Commun. 9, 997 (2018).
  7. Wills, Q. F. et al. Nat. Biotechnol. 31, 748–752 (2013).
  8. Raj, A., Peskin, C. S., Tranchina, D., Vargas, D. Y. & Tyagi, S. PLoS Biol. 4, e309 (2006).
  9. Shaffer, S. M. et al. Nature 546, 431–435 (2017).
  10. Torre, E. et al. Cell Syst. 6, 171–179 (2018).
  11. Jiang, L., Chen, H., Pinello, L. & Yuan, G. C. Genome Biol. 17, 144 (2016).
  12. Baron, M. et al. Cell Syst. 3, 346–360 (2016).
  13. Chen, R., Wu, X., Jiang, L. & Zhang, Y. Cell Rep. 18, 3227–3241 (2017).
  14. La Manno, G. et al. Cell 167, 566–580 (2016).
  15. Zeisel, A. et al. Science 347, 1138–1142 (2015).
  16. Herdin, M., Czink, N., Ozcelik, H. & Bonek, E. Proc. 2005 IEEE 61st Vehicular Technology Conference 1, 136–140 (IEEE: Piscataway, NJ, 2005).
  17. Korthauer, K. D. et al. Genome Biol. 17, 222 (2016).
  18. Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Nat. Biotechnol. 33, 495–502 (2015).
  19. Van Der Maaten, L. J. P. & Hinton, G. E. J. Mach. Learn. Res. 9, 2579–2605 (2008).
  20. Hrvatin, S. et al. Nat. Neurosci. 21, 120–129 (2018).
  21. Satija, R. et al. Seurat: guided clustering tutorial. Satija Lab http://satijalab.org/seurat/pbmc3k_tutorial.html (2018).
  22. Lun, A. T. L., McCarthy, D. J. & Marioni, J. C. Workflow package: simpleSingleCell. Bioconductor https://bioconductor.org/help/workflows/simpleSingleCell/ (2016).
  23. Kiselev, V. et al. Analysis of single cell RNA-seq data. Hemberg Lab https://hemberg-lab.github.io/scRNA.seq.course/index.html (2018).
  24. Wang, J. et al. bioRxiv preprint at https://www.biorxiv.org/content/early/2017/12/01/227033 (2017).
  25. Wagner, F., Yan, Y. & Yanai, I. bioRxiv preprint at https://www.biorxiv.org/content/early/2017/11/21/217737 (2017).
  26. Lun, A. T., Bach, K. & Marioni, J. C. Genome Biol. 17, 75 (2016).
  27. Vallejos, C. A., Marioni, J. C. & Richardson, S. PLOS Comput. Biol. 11, e1004333 (2015).
  28. Bacher, R. et al. Nat. Methods 14, 584–586 (2017).
  29. Friedman, J., Hastie, T. & Tibshirani, R. J. Stat. Softw. 33, 1–22 (2010).
  30. Padovan-Merhar, O. et al. Mol. Cell 58, 339–352 (2015).
  31. Rubin, D. B. Multiple Imputation for Nonresponse in Surveys (John Wiley, Hoboken, NJ, 1987).

Acknowledgements

This work was supported by the NIH (grant R01HG006137 to N.R.Z. and M.H.; grant R01GM125301 to N.R.Z., M.L., and M.H.; R21 HD085201 to J.I.M. and H.D.; NIH New Innovator Award DP2 OD008514 to A.R.; R33 EB019767, P30 CA016520, and 4DN U01 HL129998 to A.R. and E.T.; F30 AI114475 to S.S.; R01GM108600 and R01HL113147 to M.L.; DP2MH107055 to R.B.), the NSF (Graduate Fellowship DGE-1321851 to M.H.), the Wharton Dean’s Fund (to J.W.), the NCI (NIH/NCI PSOC award U54 CA193417 to A.R. and E.T.), the NSF (CAREER award 1350601 to A.R. and E.T.), the NIH Center for Photogenomics (RM1 HG007743 to A.R. and E.T.), a Penn Epigenetics Program Pilot award (A.R. and E.T.), the Charles E. Kauffman Foundation (KA2016-85223 to A.R. and E.T.), the Tara Miller Melanoma Foundation (to A.R. and E.T.), the Searle Scholars Program (15-SSP-102 to R.B.), the March of Dimes Foundation (1-FY-15-344 to R.B.), a Linda Pechenik Montague Investigator award (R.B.), and the Charles E. Kauffman Foundation (KA2016-85223 to R.B.). This work used the Extreme Science and Engineering Discovery Environment (XSEDE) (NSF OCI-1053575).

Author information

Affiliations

  1. Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA, USA

    Mo Huang, Jingshu Wang & Nancy R. Zhang

  2. Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA

    Eduardo Torre

  3. Department of Bioengineering, University of Pennsylvania, Philadelphia, PA, USA

    Eduardo Torre, Sydney Shaffer & Arjun Raj

  4. Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA

    Hannah Dueck, John I. Murray & Arjun Raj

  5. Department of Cell and Developmental Biology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA

    Roberto Bonasio

  6. Department of Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA

    Mingyao Li

Contributions

N.R.Z. conceived and led this work. M.H., N.R.Z., and M.L. designed the model and estimation algorithm, implemented the SAVER software, designed the in silico experiments, and led the data analysis. J.W. validated the Poisson noise model in ERCC data. E.T., H.D., S.S., R.B., J.I.M., and A.R. performed the RNA FISH and Drop-seq experiments for the melanoma cell line. M.H. and N.R.Z. wrote the paper with feedback from J.W. and M.L.

Competing Interests

A.R. receives consulting income and A.R. and S.S. receive royalties related to Stellaris RNA FISH probes.

Corresponding author

Correspondence to Nancy R. Zhang.

Integrated supplementary information

  1. Supplementary Figure 1 Comparison of FISH genes against all genes.

    Comparison of the 15 FISH genes with all genes in the Drop-seq data in terms of (a) mean expression, (b) percent of cells with non-zero expression, and (c) Fano factor, a measure of dispersion. Number of cells for each gene can be found in Supplementary Table 2.

  2. Supplementary Figure 2 FISH gene densities for all genes.

    Comparison of distributions of expression across cells between FISH, observed Drop-seq, and SAVER recovered expression for 13 genes. Densities were calculated using the density function in R with a Gaussian smoothing kernel. The bandwidth for each gene was set to be the default bandwidth selected for the FISH density. The distribution of the SAVER recovered expression matches more closely to the FISH distribution than the observed Drop-seq. Drop-seq and SAVER distributions were calculated on n = 8,498 cells. Number of cells for FISH distributions can be found in Supplementary Table 2.

  3. Supplementary Figure 3 FISH analysis with MAGIC and scImpute.

    RNA FISH validation of MAGIC and scImpute results for 15 genes. (a) Comparison of Gini coefficient for each gene between FISH and MAGIC (left) and between FISH and scImpute recovered values (right) for n = 15 genes. (b) Comparison of Drop-seq, MAGIC, scImpute, and SAVER Kolmogorov-Smirnov (KS) distance to FISH distributions for the 15 genes. (c) Kernel density estimates of cross-cell expression distribution of LMNA (upper) and CCNA2 (lower) for MAGIC and scImpute. (d) Comparison of pair-wise gene correlations computed from Drop-seq, SAVER, MAGIC, and scImpute with those computed from FISH counts. (e) Scatterplots of expression levels between BABAM1 and LMNA. Pearson correlations were calculated across n = 17,095 cells for FISH and n = 8,498 cells for MAGIC and scImpute.

  4. Supplementary Figure 4 Evaluation of methods on a null dataset.

    Gene-to-gene correlations of n = 1,000 genes from a null dataset with no real gene relationships (Supplementary Note 1). Box plots show the median (center line), interquartile range (hinges), and 1.5 times the interquartile range (whiskers). Violin plot extremes represent the minimum and maximum values.

  5. Supplementary Figure 5 Schematic of downsampling experiment.

    Highly expressed genes and cells are selected from a real scRNA-seq dataset to create a reference dataset that serves as the ground truth. Down-sampling, which simulates efficiency loss, produces an observed dataset. Expression recovery algorithms are applied to the observed dataset, and performance is measured by calculating the gene-wise and cell-wise correlations with the reference dataset.

  6. Supplementary Figure 6 Density plots from downsampling experiment.

    Density plots of (a) gene-wise and cell-wise correlations with the reference and (b) % change in correlation compared to the observed data for SAVER, MAGIC, and scImpute.

  7. Supplementary Figure 7 Evaluation of missing-data-imputation methods.

    Evaluation of SAVER against missing data imputation algorithms for the down-sampled data. The algorithms are: SAVER, k-nearest neighbors (KNN) imputation, singular value decomposition (SVD) imputation, and random forest (RF) imputation. (a) Gene-wise and cell-wise correlations for each method. Number of genes and cells can be found in Supplementary Table 3. Box plots show the median (center line), interquartile range (hinges), and 1.5 times the interquartile range (whiskers); outlier data beyond this range are not shown. (b) Comparison of gene-to-gene and cell-to-cell correlation matrices in terms of correlation matrix distance (CMD) from the reference.

  8. Supplementary Figure 8 t-SNE and clustering for the datasets from Baron et al.12, Chen et al.13, and La Manno et al.14.

    Cell clustering and t-SNE visualization of the Baron (n = 1,076 cells), Chen (n = 7,712 cells), and La Manno (n = 947 cells) reference, observed, and recovered down-sampled datasets. The colors represent the cell types identified by Seurat in the reference dataset. The Jaccard index measuring similarity between the observed/recovered clustering and reference clustering is displayed in the bottom right.

  9. Supplementary Figure 9 Effect of the number of principal components on t-SNE and clustering.

    Cell clustering and t-SNE visualization of the (a) Baron (n = 1,076 cells), (b) Chen (n = 7,712 cells), (c) La Manno (n = 947 cells), and (d) Zeisel (n = 1,799 cells) observed and recovered down-sampled datasets. The number of principal components (PCs) used in the t-SNE visualization and clustering is varied from 5 PCs to 25 PCs. The number of PCs chosen by the jackStraw method is denoted by the bold outline. The colors represent the cell types identified by Seurat in the reference dataset using the number of PCs chosen by jackStraw. The Jaccard index measuring similarity between observed/recovered clustering and reference clustering is displayed in the bottom right.

  10. Supplementary Figure 10 Glmnet cross-validation curves and correlation with reference.

    Poisson LASSO regression cross-validation plots from Glmnet and plots of correlation with the Zeisel reference for five genes from the 5% efficiency dataset. The x-axis represents the size of the shrinkage penalty in the LASSO regression. The dotted vertical line represents the model with the lowest cross-validation error. The horizontal line is the observed correlation with the reference, and the black points represent the correlation of the SAVER estimate with the reference at each value of the shrinkage penalty. The red points in the cross-validation plot represent the mean cross-validation error as measured by Poisson deviance, and the error bars represent ± 1 standard deviation. SAVER correlation with the reference is approximately maximized when using the model with the lowest cross-validation error.

  11. Supplementary Figure 11 Effect of predictability and efficiency on SAVER.

    The SAVER estimate is a weighted average of the normalized observed expression and the predicted expression. The weight is dependent on the predictability of the gene and the cell-specific efficiency. Four scenarios are shown: Predictable (low ϕg) versus unpredictable (high ϕg) gene, in a high or low efficiency experiment. In each of the scatter plots, each point is a gene, and for each gene, the vertical lines connect the normalized observed expression with the gene’s SAVER recovered value, which always lies between the normalized observed expression and the prediction (the 45 degree line).

  12. Supplementary Figure 12 Downsampling dataset cutoffs.

    Density plots of (a) library size across cells and (b) proportion of nonzero cells across genes for the original datasets used in the down-sampling experiment. The reference datasets were constructed by filtering such that roughly 50-60% of cells with the largest library size and 10-20% of genes with the highest proportion of nonzero cells were selected. The exact cutoffs were determined by trying to match mean expression and percentage zero between the original and the down-sampled datasets at the given efficiencies (Supplementary Table 3).
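
The down-sampling evaluation described in Supplementary Figures 5 and 7 can be illustrated with a minimal sketch. It is not the authors' code: it assumes binomial thinning at a single fixed efficiency as the down-sampling step, uses a hypothetical four-gene toy reference, and takes the correlation matrix distance (CMD) in the form given by Herdin et al. (ref. 16), CMD(R1, R2) = 1 - tr(R1 R2) / (||R1||_F ||R2||_F). The function names (`downsample`, `pearson`, `correlation_matrix`, `cmd`, `toy_reference`) are illustrative only.

```python
import math
import random


def downsample(reference, efficiency, rng):
    """Binomial thinning (an assumption): keep each molecule with probability `efficiency`."""
    return [[sum(rng.random() < efficiency for _ in range(count)) for count in row]
            for row in reference]


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(rows):
    """Row-by-row correlation matrix (gene-to-gene when rows are genes)."""
    n = len(rows)
    return [[1.0 if i == j else pearson(rows[i], rows[j]) for j in range(n)]
            for i in range(n)]


def cmd(r1, r2):
    """Correlation matrix distance: 1 - tr(R1 R2) / (||R1||_F * ||R2||_F)."""
    n = len(r1)
    trace = sum(r1[i][k] * r2[k][i] for i in range(n) for k in range(n))
    norm1 = math.sqrt(sum(v * v for row in r1 for v in row))
    norm2 = math.sqrt(sum(v * v for row in r2 for v in row))
    return 1.0 - trace / (norm1 * norm2)


def toy_reference(n_cells, rng):
    """Hypothetical reference: genes 0-1 are high in the first half of cells, genes 2-3 in the second."""
    half = n_cells // 2
    rows = []
    for gene in range(4):
        high_first = gene < 2
        row = []
        for cell in range(n_cells):
            base = 60 if (cell < half) == high_first else 20
            row.append(base + rng.randint(-5, 5))
        rows.append(row)
    return rows


if __name__ == "__main__":
    rng = random.Random(0)
    reference = toy_reference(200, rng)           # genes x cells, treated as ground truth
    observed = downsample(reference, 0.05, rng)   # 5% efficiency
    gene_wise = [pearson(r, o) for r, o in zip(reference, observed)]
    distance = cmd(correlation_matrix(reference), correlation_matrix(observed))
    print("gene-wise correlation with reference:", [round(c, 3) for c in gene_wise])
    print("CMD of gene-to-gene correlations vs reference:", round(distance, 3))
```

A recovery method would be scored the same way by replacing `observed` with its recovered matrix; a CMD closer to 0 indicates gene-to-gene correlations closer to those of the reference.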

Supplementary information

  1. Supplementary Text and Figures

    Supplementary Figs. 1–12, Supplementary Tables 1–3, and Supplementary Notes 1–4

  2. Reporting Summary

  3. Supplementary Software

    SAVER v1.0.0

  4. Source Data, Figure 1

  5. Source Data, Figure 2

About this article

DOI

https://doi.org/10.1038/s41592-018-0033-z