Brief Communication

Scalable analysis of cell-type composition from single-cell transcriptomics using deep recurrent learning

Nature Methods volume 16, pages 311–314 (2019)

Abstract

Recent advances in large-scale single-cell RNA-seq enable fine-grained characterization of phenotypically distinct cellular states in heterogeneous tissues. We present scScope, a scalable deep-learning-based approach that can accurately and rapidly identify cell-type composition from millions of noisy single-cell gene-expression profiles.


Code availability

scScope can be obtained as an installable Python package, via ‘pip install scScope’, and is available under the Apache license. All software, instructions and software updates will be maintained at https://github.com/AltschulerWu-Lab/scScope.

Data availability

Data for this paper are available in its text and Supplementary Information files or from the websites or database accessions referenced in the paper.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Gawad, C., Koh, W. & Quake, S. R. Nat. Rev. Genet. 17, 175–188 (2016).
  2. Saliba, A.-E., Westermann, A. J., Gorski, S. A. & Vogel, J. Nucleic Acids Res. 42, 8845–8860 (2014).
  3. Shalek, A. K. et al. Nature 510, 363–369 (2014).
  4. Macosko, E. Z. et al. Cell 161, 1202–1214 (2015).
  5. Zheng, G. X. Y. et al. Nat. Commun. 8, 14049 (2017).
  6. Han, X. et al. Cell 172, 1091–1107 (2018).
  7. Pierson, E. & Yau, C. Genome Biol. 16, 241 (2015).
  8. Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J.-P. Nat. Commun. 9, 284 (2018).
  9. Wang, B., Zhu, J., Pierson, E., Ramazzotti, D. & Batzoglou, S. Nat. Methods 14, 414–416 (2017).
  10. Cleary, B., Le, C., Cheung, A., Lander, E. S. & Regev, A. Cell 171, 1424–1436 (2017).
  11. van Dijk, D. et al. Cell 174, 716–729 (2018).
  12. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Nat. Biotechnol. 36, 411–420 (2018).
  13. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y. & Manzagol, P.-A. J. Mach. Learn. Res. 11, 3371–3408 (2010).
  14. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Nat. Methods 15, 1053–1058 (2018).
  15. Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Nat. Commun. 10, 390 (2019).
  16. Zappia, L., Phipson, B. & Oshlack, A. Genome Biol. 18, 174 (2017).
  17. Jeon, C. J., Strettoi, E. & Masland, R. H. J. Neurosci. 18, 8936–8946 (1998).
  18. Rosenberg, A. B. et al. Science 360, 176–182 (2018).
  19. Tasic, B. et al. Nat. Neurosci. 19, 335–346 (2016).
  20. Levine, J. H. et al. Cell 162, 184–197 (2015).
  21. Franke, L. et al. Am. J. Hum. Genet. 78, 1011–1025 (2006).
  22. Hubert, L. & Arabie, P. J. Classif. 2, 193–218 (1985).
  23. Rand, W. M. J. Am. Stat. Assoc. 66, 846–850 (1971).
  24. Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Nat. Biotechnol. 36, 421–427 (2018).
  25. Kharchenko, P. V., Silberstein, L. & Scadden, D. T. Nat. Methods 11, 740–742 (2014).
  26. Stoeckius, M. et al. Nat. Methods 14, 865–868 (2017).
  27. Haber, A. L. et al. Nature 551, 333–339 (2017).


Acknowledgements

We thank J. Chang, S. Rajaram, L. Sanman and S. Shen for their helpful comments. We gratefully acknowledge the support of NIH grant numbers R01 EY028205 and GM112690, NSF PHY-1545915 and SU2C/MSKCC 2015-003 to S.J.A., NCI-NIH no. RO1 CA185404 and CA184984 to L.F.W., the Institute of Computational Health Sciences (ICHS) at UCSF to S.J.A. and L.F.W., and Project of NSFC (no. 61327902), Project of Beijing Municipal Science & Technology Commission (no. Z181100003118014) to Q.D. and F.B.

Author information

Author notes

  1. These authors contributed equally: Yue Deng, Feng Bao.

Affiliations

  1. Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, CA, USA

    • Yue Deng
    • , Lani F. Wu
    •  & Steven J. Altschuler
  2. Department of Automation, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing, China

    • Feng Bao
    •  & Qionghai Dai

Authors

  1. Yue Deng
  2. Feng Bao
  3. Qionghai Dai
  4. Lani F. Wu
  5. Steven J. Altschuler

Contributions

Y.D., F.B. and Q.D. developed the deep-learning algorithms. Y.D. and F.B. conducted experimental analysis on both simulated and biological datasets. The manuscript was written by Y.D., F.B., L.F.W. and S.J.A. All authors read and approved the manuscript.

Competing interests

The authors declare no competing interests.

Corresponding authors

Correspondence to Lani F. Wu or Steven J. Altschuler.

Integrated supplementary information

  1. Supplementary Figure 1 Training of scScope on multiple GPUs to enable fast learning and low memory cost.

    scScope offers the option to train its deep model on multiple GPUs, which can dramatically reduce runtime. In this mode, scScope replicates its network structure on each GPU and aggregates all network parameters on the CPU. These parameters include the connections and biases of all encoder, decoder and imputation layers of scScope. In a round of batch training, each GPU grabs the current network parameters from the CPU to use for its own replicate of the scScope network. Then, for gradient calculation, the GPU processes a randomly chosen batch of m (= 64 or 512) single-cell expression profiles from a total of n single-cell profiles. We apply a conventional gradient-calculation framework for neural networks, which iteratively performs feed-forward and back-propagation steps. In the feed-forward (FF) step, a GPU passes its batch of m single-cell samples through its locally stored scScope network and accumulates the losses for this batch. In the back-propagation (BP) step, batch-dependent gradient information for network parameters in different layers is calculated by sequentially propagating the accumulated loss from the last network layer back to the first. This BP operation is performed with the gradient-calculation functions provided by deep-learning packages (in our case, TensorFlow). We apply this process independently across all k GPUs in parallel to obtain gradient information from a total of k×m samples. The gradient information from the k GPUs is averaged by the CPU, i.e., \(G^{(j)} = (G_1^{(j)} + \ldots + G_k^{(j)})/k\), where \(G_i^{(j)}\) is the gradient calculated by the ith GPU in the jth round of optimization. Finally, we apply adaptive moment estimation (ADAM) with default TensorFlow parameters to update the network parameters stored on the CPU. Iterations were terminated when either the objective function showed little change (i.e., < 0.1%) or the number of iterations reached a maximal epoch (e.g., 100).
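The averaging and update scheme above can be sketched in NumPy. This is a minimal illustration, not the scScope implementation: `average_gradients` and `adam_step` are hypothetical names, and the ADAM constants shown are TensorFlow's documented defaults.

```python
import numpy as np

def average_gradients(grads):
    """Average per-GPU gradients: G = (G_1 + ... + G_k) / k."""
    return sum(grads) / len(grads)

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One ADAM update (TensorFlow default hyperparameters).

    m, v are the first/second moment estimates; t is the 1-based
    iteration counter used for bias correction.
    """
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy example: k = 4 replica gradients for a 3-parameter model.
rng = np.random.default_rng(0)
theta = np.zeros(3)
m, v = np.zeros(3), np.zeros(3)
grads = [rng.normal(size=3) for _ in range(4)]

g = average_gradients(grads)              # CPU-side averaging
theta, m, v = adam_step(theta, g, m, v, t=1)  # CPU-side parameter update
```

In a real multi-GPU run, each element of `grads` would come from one GPU's back-propagation pass over its own batch of m cells; only the averaging and the ADAM update happen on the CPU.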

  2. Supplementary Figure 2 Comparison of imputation accuracy on 2,000 single-cell datasets generated by Splatter.

    Methods were compared across sparsity levels (controlled by the dropout rate; Supplementary Table 3). Accuracy was quantified with a fractional measure of imputation error. For each simulated condition, 10 random replicates (n = 10) were simulated; mean values and s.d. (error bars) are reported.

  3. Supplementary Figure 3 Performance analysis of scScope under different recurrent iterations (T).

    This figure shows results of analysis using both simulated data (top panel) and biological data (bottom). Clustering accuracy (left), imputation error (middle) and run time (right) are reported for each dataset.

  4. Supplementary Figure 4 Comparison of clustering accuracy on datasets of 1 million single cells generated with the simulation strategy used in SIMLR.

    Methods were compared for varying fractions of dropout genes (n = 10 replicates). PCA, AE, DCA, scVI and scScope directly analyzed the whole dataset, while the other methods analyzed down-sampled subsets of 20,000 cells. Median (center line), interquartile range (box) and minimum–maximum data range (whiskers) are shown in the box plots.
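The clustering-accuracy metric cited for these comparisons appears, from refs. 22 and 23, to be the adjusted Rand index (ARI); that attribution is an inference, not stated in this excerpt. A minimal NumPy sketch of ARI computed from the contingency table:

```python
import numpy as np

def comb2(x):
    """Number of unordered pairs, C(x, 2), elementwise."""
    return x * (x - 1) / 2.0

def adjusted_rand_index(labels_true, labels_pred):
    """ARI of two flat label arrays (Hubert & Arabie, 1985).

    Note: undefined (0/0) in the degenerate case where both
    labelings put everything in one cluster.
    """
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    classes, y = np.unique(labels_true, return_inverse=True)
    clusters, z = np.unique(labels_pred, return_inverse=True)
    # Contingency table n_ij: cells shared by class i and cluster j.
    n = np.zeros((classes.size, clusters.size))
    np.add.at(n, (y, z), 1)
    sum_comb = comb2(n).sum()
    sum_a = comb2(n.sum(axis=1)).sum()  # row marginals
    sum_b = comb2(n.sum(axis=0)).sum()  # column marginals
    total = comb2(labels_true.size)
    expected = sum_a * sum_b / total
    max_index = (sum_a + sum_b) / 2.0
    return (sum_comb - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # → 1.0
```

ARI is invariant to label permutation (as the example shows) and is 0 in expectation for random labelings, which makes it suitable for comparing clusterings against simulated ground truth.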

  5. Supplementary Figure 5 Analysis of minor subpopulation discovery using the retina scRNA-seq dataset.

    We compared: (a) the accuracy of cell-type identification (cluster numbers derived from each method are indicated in parentheses), (b) the significance of cell-type markers identified, and the correlation of minor cell-type proportions (c) and overall cell-type proportions (d) identified by computation or microscopy.

  6. Supplementary Figure 6

    Overview of scalable clustering approach for grouping large-scale data.

  7. Supplementary Figure 7 Analysis of intestinal scRNA-seq dataset.

    (a) Changes in cell-type composition of the mouse intestinal epithelium under different infection conditions, visualized via t-SNE plots (n = 9,842 cells in total). TA, transit amplifying; EEC, enteroendocrine; EP, enterocyte progenitor; E, enterocyte. scScope identified four subtypes of enterocyte cells. (b) Identification of mature vs. immature and distal vs. proximal enterocyte subpopulations. Shown are expression levels of E-distal and E-proximal gene markers (average UMI count) in the four enterocyte subtypes predicted by scScope and, for comparison, all other clusters (non-E). (c) Discovery of differential expression of the gene Saa1 in distal and proximal enterocytes after Salmonella and H. polygyrus infections. Violin plots show the distributions of logTPM for E-distal/E-proximal cells in each infection condition, for a total of n = 1,129 enterocyte cells.

  8. Supplementary Figure 8 Statistical test of marker gene-expression levels for enterocytes.

    Statistical differences (P values from two-sided t-tests and rank-sum tests) in marker gene-expression levels between immature vs. mature, non-enterocyte (non-E) vs. immature, and non-E vs. mature cells for the E-distal and E-proximal clusters in the mouse epithelial cell dataset. For E-distal, E-immature n = 385 and E-mature n = 457; for E-proximal, E-immature n = 1,039 and E-mature n = 845; for non-E cells, n = 7,116.
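The pair of tests above is standard in SciPy; a sketch of the comparison for one marker gene, using purely illustrative simulated values rather than the study's data (the choice of Welch's unequal-variance t-test is also an assumption, as the legend says only "t-test"):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Illustrative expression values (logTPM-like) for one marker gene,
# with group sizes matching the E-distal cluster in the legend.
immature = rng.normal(loc=1.0, scale=0.5, size=385)  # E-immature cells
mature = rng.normal(loc=2.0, scale=0.5, size=457)    # E-mature cells

# Two-sided Welch t-test and Wilcoxon rank-sum test.
t_stat, t_p = stats.ttest_ind(immature, mature, equal_var=False)
u_stat, u_p = stats.ranksums(immature, mature)
print(f"t-test P = {t_p:.3g}, rank-sum P = {u_p:.3g}")
```

Reporting both a parametric and a rank-based P value, as the legend does, guards against conclusions that depend on distributional assumptions about the expression values.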

  9. Supplementary Figure 9 Cell fraction comparison in a dataset of 1.3 million mouse brain cells.

    Fractions of three major cell types (glutamatergic neurons, GABAergic neurons and non-neurons) identified by scScope, compared with the neuron fractions reported by previous SPLiT-seq research.

  10. Supplementary Figure 10

    scScope’s recurrent network shown unfolded to three steps.

  11. Supplementary Figure 11 Time costs of running PhenoGraph on simulated datasets of different scales.

    For data sizes from 5,000 to 50,000, reported time costs are averages over 100 random repeats; for data sizes of 100,000 or 200,000, averages over 10 repeats.

  12. Supplementary Figure 12 Performance of the compared methods under different normalization/scaling schemes on three biological datasets.

    Where applicable, the built-in default normalization of each package was bypassed and replaced by the tested normalization method for these comparisons.

  13. Supplementary Figure 13 Evaluation of performance for varying latent dimensions using the CBMC dataset with 8,000 profiles.

    The latent dimension was varied over 10, 30, 50 and 100.

  14. Supplementary Figure 14 Comparison with existing autoencoder methods.

    We compared scScope with alternative denoising autoencoder (AE) methods from the machine-learning field. Two main strategies in the literature for enhancing the ability of AEs to smooth noisy data are to add random noise ("added noise smoothing," ANS) or to drop values ("dropped value smoothing," DVS) from the input (references [1,2] and [3,4] below, respectively), and then to train the AE to reconstruct the original input signal. We tested three smoothing strategies on both simulated (Splatter) and real biological (CBMC) data: 1) adding noise ("ANS-AE"): for the jth gene, noise was drawn from \(N(0,0.3\sigma _j^2)\), where σj is the standard deviation computed from all nonzero values of the jth gene across all samples in the dataset; 2) dropping values ("DVS-AE"): 10% of the nonzero input vector entries were set to zero; and 3) both adding noise and dropping values ("ANS+DVS-AE") as above. As reference, scScope was trained using only the original input, either as a standard autoencoder (T = 1) or using its built-in recurrent structure (T = 2) to smooth noise. In terms of learning parameters for training, all hyperparameter choices are listed in Supplementary Table 2 and are the same parameters used throughout all comparisons of methods shown in the paper (unless stated otherwise). In terms of neural network configurations, for ANS-AE, DVS-AE and ANS+DVS-AE we used the same network configurations as in the original papers. References: [1] Teixeira, V., Camacho, R. & Ferreira, P. G. Learning influential genes on cancer gene-expression data with stacked denoising autoencoders. In 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 1201–1205 (IEEE, 2017). [2] Tan, J., Hammond, J. H., Hogan, D. A. & Greene, C. S. ADAGE-based integration of publicly available Pseudomonas aeruginosa gene-expression data with denoising autoencoders illuminates microbe–host interactions. mSystems 1, e00025-15 (2016). [3] Beaulieu-Jones, B. K. & Moore, J. H. Missing data imputation in the electronic health record using deeply learned autoencoders. In Pacific Symposium on Biocomputing 2017 207–218 (2017). [4] Xie, R., Wen, J., Quitadamo, A., Cheng, J. & Shi, X. A deep auto-encoder model for gene-expression prediction. BMC Genomics 18, 845 (2017).
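The two corruption schemes can be sketched as follows; the σ scaling (variance 0.3σj²) and the 10% drop rate follow the legend, while the function names and toy matrix are illustrative, not the code used in the comparison:

```python
import numpy as np

def add_noise_smoothing(X, rng, scale=0.3):
    """ANS: add N(0, scale * sigma_j^2) noise to each gene (column) j,
    where sigma_j is the s.d. of that gene's nonzero values."""
    noisy = X.astype(float).copy()
    for j in range(X.shape[1]):
        nz = X[:, j][X[:, j] > 0]
        sigma = nz.std() if nz.size > 1 else 0.0
        # Variance scale*sigma^2 means std sqrt(scale)*sigma.
        noisy[:, j] += rng.normal(0.0, np.sqrt(scale) * sigma, size=X.shape[0])
    return noisy

def dropped_value_smoothing(X, rng, frac=0.10):
    """DVS: set a random 10% of the nonzero entries to zero."""
    out = X.astype(float).copy()
    rows, cols = np.nonzero(out)
    k = int(frac * rows.size)
    pick = rng.choice(rows.size, size=k, replace=False)
    out[rows[pick], cols[pick]] = 0.0
    return out

# Toy count matrix: 100 cells x 20 genes.
rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(100, 20)).astype(float)
X_ans = add_noise_smoothing(X, rng)      # corrupted input for ANS-AE
X_dvs = dropped_value_smoothing(X, rng)  # corrupted input for DVS-AE
```

In both schemes the AE is then trained to map the corrupted matrix back to the original `X`; ANS+DVS-AE would apply both corruptions in sequence.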

  15. Supplementary Figure 15 Results of varying scScope training-related hyper-parameters on a simulated (generated as in Fig. 1c) and a biological dataset.

    (a) Learning rate, (b) batch size, (c) number of training epochs and (d) epochs per check were each analyzed for scScope on each tested dataset.

Supplementary information

  1. Supplementary Information

    Supplementary Figures 1–15 and Supplementary Tables 1–11

  2. Reporting Summary

About this article

DOI

https://doi.org/10.1038/s41592-019-0353-7