
Brief Communication

Scalable analysis of cell-type composition from single-cell transcriptomics using deep recurrent learning

Abstract

Recent advances in large-scale single-cell RNA-seq enable fine-grained characterization of phenotypically distinct cellular states in heterogeneous tissues. We present scScope, a scalable deep-learning-based approach that can accurately and rapidly identify cell-type composition from millions of noisy single-cell gene-expression profiles.


Fig. 1: Overview of scScope architecture and performance on simulated datasets.
Fig. 2: Evaluation of methods on experimental scRNA-seq datasets.
Fig. 3: Application of scScope to explore biology in the 1.3-million-cell mouse brain dataset.


Code availability

scScope can be obtained as an installable Python package, via ‘pip install scScope’, and is available under the Apache license. All software, instructions and software updates will be maintained at https://github.com/AltschulerWu-Lab/scScope.
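
For orientation, a minimal usage sketch follows. The module name matches the pip package, but the function names and arguments (`train`, `predict`, `latent_dim`) are assumptions following a generic train/predict pattern, not verified calls; the repository README documents the actual API.

```python
# Hedged sketch only -- consult the GitHub README for the real API.
# `train`/`predict` and their arguments are assumed, not verified calls.
import numpy as np
import scscope  # installed via `pip install scScope`

expr = np.load('gene_expression.npy')  # hypothetical cells x genes UMI matrix

model = scscope.train(expr, latent_dim=50, T=2, batch_size=64)  # assumed signature
latent, imputed, _ = scscope.predict(expr, model)               # assumed signature
```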

Data availability

Data for this paper are available in its text and Supplementary Information files or from the websites or database accessions referenced in the paper.


Acknowledgements

We thank J. Chang, S. Rajaram, L. Sanman and S. Shen for their helpful comments. We gratefully acknowledge the support of NIH grant numbers R01 EY028205 and GM112690, NSF PHY-1545915 and SU2C/MSKCC 2015-003 to S.J.A., NCI-NIH no. RO1 CA185404 and CA184984 to L.F.W., the Institute of Computational Health Sciences (ICHS) at UCSF to S.J.A. and L.F.W., and Project of NSFC (no. 61327902), Project of Beijing Municipal Science & Technology Commission (no. Z181100003118014) to Q.D. and F.B.

Author information

Authors and Affiliations

Authors

Contributions

Y.D., F.B. and Q.D. developed the deep-learning algorithms. Y.D. and F.B. conducted experimental analysis on both simulated and biological datasets. The manuscript was written by Y.D., F.B., L.F.W. and S.J.A. All authors read and approved the manuscript.

Corresponding authors

Correspondence to Lani F. Wu or Steven J. Altschuler.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Training of scScope on multiple GPUs to enable fast learning and low memory cost.

scScope offers the option to train its deep model on multiple GPUs, which can dramatically reduce runtime. In this mode, scScope replicates its network structure on each GPU and aggregates all network parameters on the CPU. These parameters comprise the connection weights and biases of all encoder, decoder and imputation layers of scScope. In each round of batch training, a GPU fetches the current network parameters from the CPU for its own replicate of the scScope network. Then, for gradient calculation, the GPU processes a randomly chosen batch of m (= 64 or 512) single-cell expression profiles from the total of n single-cell profiles. We apply the conventional gradient-calculation framework for neural networks, which iteratively performs feed-forward and back-propagation steps. In the feed-forward (FF) step, a GPU passes its batch of m single-cell samples through its locally stored scScope network and accumulates the losses for this batch. In the back-propagation (BP) step, batch-dependent gradient information for the network parameters of each layer is calculated by sequentially propagating the accumulated loss from the last network layer back to the first. This BP operation uses the gradient-calculation functions provided by deep-learning packages (in our case, TensorFlow). We apply this process independently and in parallel across all k GPUs to obtain gradient information from a total of k × m samples. The gradient information from the k GPUs is averaged by the CPU, i.e., \(G^{(j)} = (G_1^{(j)} + \ldots + G_k^{(j)})/k\), where \(G_i^{(j)}\) is the gradient calculated by the ith GPU in the jth optimization iteration. Finally, we apply adaptive moment estimation (ADAM) with default TensorFlow parameters to update the network parameters stored on the CPU. Iterations terminate when either the objective function shows little change (i.e., <0.1%) or the number of iterations reaches a maximum epoch (e.g., 100).
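
As a concrete illustration of the averaging scheme, here is a minimal NumPy sketch (not scScope's TensorFlow implementation): `loss_grad` is a hypothetical stand-in for one GPU's combined FF/BP pass, and the ADAM step uses the usual default hyperparameters.

```python
import numpy as np

def adam_update(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One ADAM step with the standard default hyperparameters."""
    m = b1 * m + (1 - b1) * grad        # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad**2     # second-moment (variance) estimate
    m_hat = m / (1 - b1**t)             # bias corrections for step t >= 1
    v_hat = v / (1 - b2**t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def train_step(theta, batches, loss_grad, m, v, t):
    """Synchronous step: average the k per-GPU gradients,
    G = (G_1 + ... + G_k)/k, then update the CPU-held parameters.
    `loss_grad(theta, batch)` stands in for one GPU's FF/BP pass."""
    grads = [loss_grad(theta, batch) for batch in batches]  # one batch per GPU
    g_avg = sum(grads) / len(grads)
    return adam_update(theta, g_avg, m, v, t)
```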

Supplementary Figure 2 Comparison of imputation accuracy on 2,000 single-cell datasets generated by Splatter.

Methods were compared at varying sparsity levels (controlled by the dropout rate; Supplementary Table 3). Accuracy is measured as the fractional imputation error. For each simulated condition, 10 random replicates (n = 10) were generated; mean values and s.d. (error bars) are reported.
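
The exact form of this fractional error measure is defined in the paper's Methods; the sketch below encodes one plausible reading — mean relative absolute error over the entries zeroed by simulated dropout — and should be treated as an assumption, not the paper's definition.

```python
import numpy as np

def fractional_imputation_error(truth, imputed, dropout_mask):
    """Assumed metric: mean relative absolute error on dropped-out entries.
    truth, imputed: (cells, genes) arrays; dropout_mask: True where values
    were zeroed by simulated dropout."""
    t = truth[dropout_mask]
    i = imputed[dropout_mask]
    return np.mean(np.abs(i - t) / np.maximum(t, 1e-8))
```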

Supplementary Figure 3 Performance analysis of scScope under different recurrent iterations (T).

Results are shown for both simulated data (top) and biological data (bottom). Clustering accuracy (left), imputation error (middle) and run time (right) are reported for each dataset.

Supplementary Figure 4 Comparison of clustering accuracy on simulated datasets of 1 million single cells generated by the simulation strategy used in SIMLR.

Methods were compared at varying fractions of dropout genes (n = 10 replicates). PCA, AE, DCA, scVI and scScope directly analyzed the whole dataset, while the other methods analyzed down-sampled subsets of 20,000 cells. Median (center line), interquartile range (box) and minimum–maximum data range (whiskers) are shown in the box plots.

Supplementary Figure 5 Analysis of minor subpopulation discovery using the retina scRNA-seq dataset.

We compared (a) the accuracy of cell-type identification (cluster numbers derived from each method are indicated in parentheses), (b) the significance of the cell-type markers identified, and the correlation of (c) minor cell-type proportions and (d) overall cell-type proportions identified by computation versus microscopy.

Supplementary Figure 6 Overview of scalable clustering approach for grouping large-scale data.

Supplementary Figure 7 Analysis of intestinal scRNA-seq dataset.

(a) Changes in cell-type composition of the mouse intestinal epithelium under different infection conditions, visualized via t-SNE plots (n = 9,842 cells in total). TA, transit amplifying; EEC, enteroendocrine; EP, enterocyte progenitor; E, enterocyte. scScope identified four subtypes of enterocyte cells. (b) Identification of mature vs. immature and distal vs. proximal enterocyte subpopulations. Shown are expression levels of E-distal and E-proximal gene markers (average UMI count) in the four enterocyte subtypes predicted by scScope and, for comparison, all other clusters (non-E). (c) Discovery of differential expression of the gene Saa1 in distal and proximal enterocytes after Salmonella and H. polygyrus infections. Violin plots show the distributions of logTPM for E-distal/E-proximal cells in each infection condition (n = 1,129 enterocyte cells in total).

Supplementary Figure 8 Statistical test of marker gene-expression levels for enterocytes.

Statistical significance (P values from two-sided t-tests and rank-sum tests) of differences in marker gene-expression levels between immature vs. mature, non-enterocyte (non-E) vs. immature, and non-E vs. mature cells for the E-distal and E-proximal clusters in the mouse epithelial cell dataset. For E-distal: E-immature n = 385, E-mature n = 457; for E-proximal: E-immature n = 1,039, E-mature n = 845; for non-E cells: n = 7,116.
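
Both tests named above are standard; a self-contained SciPy sketch follows (the arrays are placeholder data, not the paper's measurements).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
expr_a = rng.lognormal(size=385)  # placeholder: e.g., E-immature cells
expr_b = rng.lognormal(size=457)  # placeholder: e.g., E-mature cells

t_stat, t_p = stats.ttest_ind(expr_a, expr_b)  # two-sided t-test by default
w_stat, w_p = stats.ranksums(expr_a, expr_b)   # Wilcoxon rank-sum test
print(f"t-test P = {t_p:.3g}; rank-sum P = {w_p:.3g}")
```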

Supplementary Figure 9 Cell fraction comparison in a dataset of 1.3 million mouse brain cells.

Fractions of three major cell types (glutamatergic neurons, GABAergic neurons and non-neurons) identified by scScope, compared with the neuron fractions reported by previous SPLiT-seq research.

Supplementary Figure 10 scScope’s recurrent network shown unfolded to three steps.
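
To make "unfolded" concrete: the sketch below shows one way a T-step recurrent autoencoder of this kind can be unrolled, with `encode`, `decode` and `impute` as placeholder callables rather than scScope's actual layers; the zero-filling detail is an assumption made for illustration.

```python
import numpy as np

def unrolled_forward(x, encode, decode, impute, T=3):
    """Unroll a recurrent autoencoder for T steps (illustrative only).
    At each step, the zero entries of the raw input (candidate dropouts)
    are replaced by the imputer's current estimate before re-encoding."""
    zero_mask = (x == 0)
    x_t = x
    for _ in range(T):
        h = encode(x_t)                              # latent representation
        x_hat = decode(h)                            # reconstruction
        x_t = np.where(zero_mask, impute(x_hat), x)  # fill zeros only
    return h, x_hat
```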

Supplementary Figure 11 Time costs of running PhenoGraph on simulated datasets of different scales.

For data sizes from 5,000 to 50,000, reported time costs are averages over 100 random repeats. For data sizes of 100,000 or 200,000, averages over 10 repeats are reported.

Supplementary Figure 12 Performance of compared methods under different normalization/scaling schemes on three biological datasets.

Where applicable, the built-in default normalization methods of the different packages were bypassed and replaced by the tested normalization methods for these comparisons.

Supplementary Figure 13 Evaluation of performance for varying latent dimensions using the CBMC dataset with 8,000 profiles.

The latent dimension was varied over 10, 30, 50 and 100.

Supplementary Figure 14 Comparison with existing autoencoder methods.

We compared scScope with alternative denoising autoencoder (AE) methods from the machine-learning literature. Two main strategies for enhancing the ability of AEs to smooth noisy data are to add random noise ("added noise smoothing," ANS) or to drop values ("dropped value smoothing," DVS) from the input (references [1, 2] and [3, 4] below, respectively), and then to train the AE to reconstruct the original input signal. We tested three smoothing strategies on both simulated (Splatter) and real biological (CBMC) data: (1) adding noise ("ANS-AE"): for the jth gene, noise was drawn from \(N(0, 0.3\sigma_j^2)\), where \(\sigma_j\) is the standard deviation computed from all nonzero values of the jth gene across all samples in the dataset; (2) dropping values ("DVS-AE"): 10% of the nonzero input vector entries were set to zero; and (3) both adding noise and dropping values ("ANS+DVS-AE") as above. As a reference, scScope was trained using only the original input, either as a standard autoencoder (T = 1) or using its built-in recurrent structure (T = 2) to smooth noise. All training hyperparameter choices are listed in Supplementary Table 2 and are the same parameters used throughout all method comparisons in the paper (unless stated otherwise). For the ANS-AE, DVS-AE and ANS+DVS-AE networks, we used the same configurations as in the original papers.

References: [1] Teixeira, V., Camacho, R. & Ferreira, P. G. Learning influential genes on cancer gene-expression data with stacked denoising autoencoders. In 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 1201–1205 (IEEE, 2017). [2] Tan, J., Hammond, J. H., Hogan, D. A. & Greene, C. S. ADAGE-based integration of publicly available Pseudomonas aeruginosa gene-expression data with denoising autoencoders illuminates microbe–host interactions. mSystems 1, e00025-15 (2016). [3] Beaulieu-Jones, B. K. & Moore, J. H. Missing data imputation in the electronic health record using deeply learned autoencoders. In Pacific Symposium on Biocomputing 2017 207–218 (2017). [4] Xie, R., Wen, J., Quitadamo, A., Cheng, J. & Shi, X. A deep auto-encoder model for gene-expression prediction. BMC Genomics 18, 845 (2017).
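
The two corruption schemes are simple to state in code. Below is a minimal sketch under the legend's parameters, applied to a (cells × genes) matrix; the function names are ours, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def ans_corrupt(x):
    """ANS: add per-gene Gaussian noise ~ N(0, 0.3 * sigma_j^2), where
    sigma_j is the s.d. of gene j's nonzero values across all cells."""
    sigma = np.array([x[x[:, j] > 0, j].std() if (x[:, j] > 0).any() else 0.0
                      for j in range(x.shape[1])])
    noise = rng.normal(0.0, np.sqrt(0.3) * sigma, size=x.shape)
    return x + noise

def dvs_corrupt(x, frac=0.10):
    """DVS: set 10% of the nonzero entries to zero."""
    out = x.copy()
    nz_rows, nz_cols = np.nonzero(out)
    drop = rng.choice(len(nz_rows), size=int(frac * len(nz_rows)), replace=False)
    out[nz_rows[drop], nz_cols[drop]] = 0.0
    return out
```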

Supplementary Figure 15 Results of varying scScope training-related hyperparameters on a simulated dataset (generated as in Fig. 1c) and a biological dataset.

(a) Learning rate, (b) batch size, (c) number of training epochs and (d) epochs per check were analyzed for scScope on each tested dataset.

Supplementary information

Supplementary Information

Supplementary Figures 1–15 and Supplementary Tables 1–11

Reporting Summary


About this article


Cite this article

Deng, Y., Bao, F., Dai, Q. et al. Scalable analysis of cell-type composition from single-cell transcriptomics using deep recurrent learning. Nat Methods 16, 311–314 (2019). https://doi.org/10.1038/s41592-019-0353-7

