Scalable analysis of cell-type composition from single-cell transcriptomics using deep recurrent learning

Abstract

Recent advances in large-scale single-cell RNA-seq enable fine-grained characterization of phenotypically distinct cellular states in heterogeneous tissues. We present scScope, a scalable deep-learning-based approach that can accurately and rapidly identify cell-type composition from millions of noisy single-cell gene-expression profiles.


Fig. 1: Overview of scScope architecture and performance on simulated datasets.
Fig. 2: Evaluation of methods on experimental scRNA-seq datasets.
Fig. 3: Application of scScope to explore biology in 1.3 million mouse brain dataset.

Code availability

scScope can be obtained as an installable Python package, via ‘pip install scScope’, and is available under the Apache license. All software, instructions and software updates will be maintained at https://github.com/AltschulerWu-Lab/scScope.

Data availability

Data for this paper are available in its text and Supplementary Information files or from the websites or database accessions referenced in the paper.


Acknowledgements

We thank J. Chang, S. Rajaram, L. Sanman and S. Shen for their helpful comments. We gratefully acknowledge the support of NIH grant numbers R01 EY028205 and GM112690, NSF PHY-1545915 and SU2C/MSKCC 2015-003 to S.J.A., NCI-NIH no. RO1 CA185404 and CA184984 to L.F.W., the Institute of Computational Health Sciences (ICHS) at UCSF to S.J.A. and L.F.W., and Project of NSFC (no. 61327902), Project of Beijing Municipal Science & Technology Commission (no. Z181100003118014) to Q.D. and F.B.

Author information


Contributions

Y.D., F.B. and Q.D. developed the deep-learning algorithms. Y.D. and F.B. conducted experimental analysis on both simulated and biological datasets. The manuscript was written by Y.D., F.B., L.F.W. and S.J.A. All authors read and approved the manuscript.

Corresponding authors

Correspondence to Lani F. Wu or Steven J. Altschuler.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Training of scScope on multiple GPUs to enable fast learning and low memory cost.

scScope offers the option to train its deep model on multiple GPUs, which can dramatically reduce runtime. In this mode, scScope replicates its network structure on each GPU and aggregates all network parameters on the CPU. These parameters include the connections and biases of all encoder, decoder and imputation layers of scScope. In a round of batch training, each GPU fetches the current network parameters from the CPU for its own replicate of the scScope network. Then, for gradient calculation, the GPU processes a randomly chosen batch of m (= 64 or 512) single-cell expression profiles from a total of n single-cell profiles. We apply a conventional gradient-calculation framework for neural networks, which iteratively performs feed-forward and back-propagation steps. In the feed-forward (FF) step, a GPU passes its batch of m single-cell samples through its locally stored scScope network and accumulates the losses for this batch. In the back-propagation (BP) step, batch-dependent gradient information for network parameters on different layers is calculated by sequentially propagating the accumulated loss from the last network layer back to the first. This BP operation is performed by using gradient-calculation functions wrapped in deep-learning packages (in our case TensorFlow). We apply this process independently across all k GPUs in a parallelized manner to obtain gradient information from a total of k × m samples. The gradient information from the k GPUs is averaged by the CPU, i.e. \(G^{(j)} = (G_1^{(j)} + \ldots + G_k^{(j)})/k\), where \(G_i^{(j)}\) is the gradient calculated by the ith GPU in the jth optimization iteration. Finally, we apply adaptive moment estimation (ADAM) with default TensorFlow parameters to update the network parameters stored on the CPU. Iterations were terminated when either the objective function showed little change (i.e. < 0.1%) or the number of iterations reached a maximum epoch (e.g. 100).
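The CPU-side averaging step can be sketched in a few lines (a minimal illustration with plain NumPy arrays standing in for per-GPU gradients; the function name is ours, not part of the scScope API):

```python
import numpy as np

def average_replica_gradients(per_gpu_grads):
    """Average gradients computed independently on k GPU replicas.

    per_gpu_grads: list of length k; each element is the list of
    per-parameter gradient arrays from one replica.
    Returns one averaged gradient per parameter, i.e.
    G^(j) = (G_1^(j) + ... + G_k^(j)) / k.
    """
    k = len(per_gpu_grads)
    n_params = len(per_gpu_grads[0])
    return [sum(g[p] for g in per_gpu_grads) / k for p in range(n_params)]

# Toy example: k = 3 replicas, each holding gradients for 2 parameters.
grads = [
    [np.array([1.0, 2.0]), np.array([0.0])],
    [np.array([3.0, 4.0]), np.array([3.0])],
    [np.array([5.0, 6.0]), np.array([6.0])],
]
avg = average_replica_gradients(grads)
print(avg[0])  # [3. 4.]
print(avg[1])  # [3.]
```

The averaged gradients would then be passed to the ADAM update of the CPU-resident parameters.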

Supplementary Figure 2 Comparison of imputation accuracy on datasets of 2,000 single cells generated by Splatter.

Methods were compared at varying sparsity levels (controlled by dropout rate; Supplementary Table 3). Accuracy is measured by the fractional measure of imputation error. For each simulated condition, 10 random replicates (n = 10) were simulated; means and s.d. (error bars) are reported.
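The fractional error can be illustrated with a short sketch. Here we assume, purely for illustration, that it is the mean relative error over the dropped-out entries; the exact definition used in the paper may differ:

```python
import numpy as np

def fractional_imputation_error(truth, imputed, dropout_mask):
    """Assumed metric: mean relative error on dropped-out entries.

    truth, imputed: (cells x genes) expression matrices.
    dropout_mask: boolean matrix, True where a value was zeroed out.
    """
    t = truth[dropout_mask]
    i = imputed[dropout_mask]
    # Guard against division by zero on (rare) true-zero entries.
    return np.mean(np.abs(i - t) / np.maximum(t, 1e-8))

truth = np.array([[4.0, 0.0], [2.0, 8.0]])
imputed = np.array([[3.0, 0.0], [2.0, 6.0]])
mask = np.array([[True, False], [False, True]])
print(fractional_imputation_error(truth, imputed, mask))  # 0.25
```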

Supplementary Figure 3 Performance analysis of scScope under different recurrent iterations (T).

Results are shown for both simulated data (top) and biological data (bottom). Clustering accuracy (left), imputation error (middle) and run time (right) are reported for each dataset.

Supplementary Figure 4 Comparison of clustering accuracy on a dataset of 1 million single cells generated by the simulation strategy used in SIMLR.

Methods were compared at varying fractions of dropout genes (n = 10 replicates). PCA, AE, DCA, scVI and scScope directly analyzed the whole dataset, while the other methods analyzed down-sampled subsets of 20,000 cells. Median (center line), interquartile range (box) and minimum–maximum data range (whiskers) are shown in the box plots.
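Clustering accuracy against known labels is commonly quantified by the adjusted Rand index; a minimal self-contained implementation (illustrative, not the paper's evaluation code):

```python
import numpy as np
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from the contingency table;
    1.0 means perfect agreement (up to label permutation), ~0 means chance."""
    classes = sorted(set(labels_true))
    clusters = sorted(set(labels_pred))
    n = len(labels_true)
    # Contingency counts n_ij: cells with true class i and predicted cluster j.
    table = np.zeros((len(classes), len(clusters)), dtype=int)
    for t, p in zip(labels_true, labels_pred):
        table[classes.index(t), clusters.index(p)] += 1
    sum_ij = sum(comb(int(nij), 2) for nij in table.ravel())
    sum_a = sum(comb(int(a), 2) for a in table.sum(axis=1))
    sum_b = sum(comb(int(b), 2) for b in table.sum(axis=0))
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

# Label permutation does not matter: this is still perfect agreement.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```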

Supplementary Figure 5 Analysis of minor subpopulation discovery using the retina scRNA-seq dataset.

We compared (a) the accuracy of cell-type identification (cluster numbers derived from each method are given in parentheses), (b) the significance of the cell-type markers identified, and the correlation between computationally and microscopically determined proportions of (c) minor cell types and (d) all cell types.

Supplementary Figure 6 Overview of scalable clustering approach for grouping large-scale data.

Supplementary Figure 7 Analysis of intestinal scRNA-seq dataset.

(a) Changes in cell-type composition of the mouse intestinal epithelium under different infection conditions, visualized via t-SNE plots (n = 9,842 cells in total). TA, transit amplifying; EEC, enteroendocrine; EP, enterocyte progenitor; E, enterocyte. scScope identified four subtypes of enterocytes. (b) Identification of mature vs. immature and distal vs. proximal enterocyte subpopulations. Shown are expression levels of E-distal and E-proximal gene markers (average UMI count) for the four enterocyte subtypes predicted by scScope and, for comparison, all other clusters (non-E). (c) Discovery of differential expression of the gene Saa1 in distal and proximal enterocytes after Salmonella and H. polygyrus infections. Violin plots show the distributions of logTPM for E-distal/E-proximal cells in each infection condition (n = 1,129 enterocytes in total).

Supplementary Figure 8 Statistical test of marker gene-expression levels for enterocytes.

Statistical differences (P values from two-sided t-test and rank-sum test) in marker gene-expression levels between immature vs. mature, non-enterocyte (non-E) vs. immature, and non-E vs. mature cells for the E-distal and E-proximal clusters in the mouse epithelial cell dataset. For E-distal, E-immature n = 385 and E-mature n = 457; for E-proximal, E-immature n = 1,039 and E-mature n = 845; for non-E cells, n = 7,116.

Supplementary Figure 9 Cell fraction comparison in a dataset of 1.3 million mouse brain cells.

Fractions of three major cell types (glutamatergic neurons, GABAergic neurons and non-neurons) identified by scScope, compared with neuron fractions reported by previous SPLiT-seq research.

Supplementary Figure 10 scScope’s recurrent network shown unfolded to three steps.
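The unfolded recurrence can be sketched as a loop in which the previous reconstruction imputes the zero entries of the raw input before the next encode–decode pass. This is a toy sketch with random linear encoder/decoder weights, not the trained scScope model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, latent_dim = 20, 4
W_enc = rng.normal(size=(n_genes, latent_dim)) * 0.1  # stand-in encoder weights
W_dec = rng.normal(size=(latent_dim, n_genes)) * 0.1  # stand-in decoder weights

def recurrent_pass(x, T=3):
    """Unfold the autoencoder T steps: at each step, zero entries of the
    raw input are filled with the previous reconstruction (imputation),
    then the filled vector is re-encoded and re-decoded."""
    zero_mask = (x == 0)
    recon = np.zeros_like(x)
    for _ in range(T):
        filled = np.where(zero_mask, recon, x)   # impute dropouts only
        latent = np.maximum(filled @ W_enc, 0)   # ReLU encoder
        recon = np.maximum(latent @ W_dec, 0)    # ReLU decoder
    return recon

x = rng.poisson(1.0, size=n_genes).astype(float)  # toy count profile
recon = recurrent_pass(x, T=3)
print(recon.shape)  # (20,)
```

Nonzero measurements are never overwritten; only dropout positions receive imputed values at each unrolled step.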

Supplementary Figure 11 Time costs of running PhenoGraph on simulated datasets of different scales.

For data sizes from 5,000 to 50,000, reported time costs are averages over 100 random repeats. For data sizes of 100,000 and 200,000, averages over 10 repeats are reported.

Supplementary Figure 12 Performance of compared methods under different normalization/scaling methods on three biological datasets.

Where applicable, the built-in default normalization in each package was bypassed and replaced by the tested normalization method to perform these comparisons.

Supplementary Figure 13 Evaluation of performance for varying latent dimensions using the CBMC dataset with 8,000 profiles.

The latent dimension was varied over 10, 30, 50 and 100.

Supplementary Figure 14 Comparison with existing autoencoder methods.

We compared scScope with alternative denoising autoencoder (AE) methods from the machine-learning field. Two main strategies in the literature for enhancing the ability of AEs to smooth noisy data are to add random noise (“added-noise smoothing,” ANS) or to drop values (“dropped-value smoothing,” DVS) from the input (references [1–2] and [3–4] below, respectively), and then to train the AE to reconstruct the original input signal. We tested three smoothing strategies on both simulated (Splatter) and real biological (CBMC) data: 1) adding noise (“ANS-AE”): for the jth gene, noise was drawn from \(N(0, 0.3\sigma_j^2)\), where \(\sigma_j\) is the standard deviation computed from all nonzero values of the jth gene across all samples in the dataset; 2) dropping values (“DVS-AE”): 10% of the nonzero input vector entries were set to zero; and 3) both adding noise and dropping values (“ANS+DVS-AE”) as above. As a reference, scScope was trained using only the original input, either as a standard autoencoder (T = 1) or using its built-in recurrent structure (T = 2) to smooth noise. In terms of learning parameters for training, all hyperparameter choices are listed in Supplementary Table 2 and are the same parameters used throughout all method comparisons in the paper (unless stated otherwise). In terms of neural-network configurations, for ANS-AE, DVS-AE and ANS+DVS-AE we used the same network configurations as in the original papers. References: [1] Teixeira, V., Camacho, R. & Ferreira, P. G. Learning influential genes on cancer gene expression data with stacked denoising autoencoders. In 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 1201–1205 (IEEE, 2017). [2] Tan, J., Hammond, J. H., Hogan, D. A. & Greene, C. S. ADAGE-based integration of publicly available Pseudomonas aeruginosa gene expression data with denoising autoencoders illuminates microbe–host interactions. mSystems 1, e00025-15 (2016). [3] Beaulieu-Jones, B. K. & Moore, J. H. Missing data imputation in the electronic health record using deeply learned autoencoders. In Pacific Symposium on Biocomputing 2017 207–218 (2017). [4] Xie, R., Wen, J., Quitadamo, A., Cheng, J. & Shi, X. A deep auto-encoder model for gene expression prediction. BMC Genomics 18, 845 (2017).
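The two corruption schemes can be sketched as input-corruption functions (a minimal NumPy illustration; the function names are ours, and the ANS noise scale follows the \(N(0, 0.3\sigma_j^2)\) rule described above):

```python
import numpy as np

rng = np.random.default_rng(1)

def corrupt_ans(x):
    """Added-noise smoothing: add Gaussian noise N(0, 0.3*sigma_j^2) per gene,
    with sigma_j estimated from the nonzero values of gene j."""
    x = np.asarray(x, dtype=float)
    sigma = np.array([x[x[:, j] > 0, j].std() if (x[:, j] > 0).any() else 0.0
                      for j in range(x.shape[1])])
    noise = rng.normal(0.0, np.sqrt(0.3) * sigma, size=x.shape)
    return x + noise

def corrupt_dvs(x, frac=0.1):
    """Dropped-value smoothing: set a random fraction of nonzero entries to zero."""
    x = np.asarray(x, dtype=float).copy()
    nz = np.argwhere(x > 0)
    n_drop = int(round(frac * len(nz)))
    for i, j in nz[rng.choice(len(nz), size=n_drop, replace=False)]:
        x[i, j] = 0.0
    return x

x = rng.poisson(2.0, size=(50, 10)).astype(float)  # toy (cells x genes) counts
x_ans = corrupt_ans(x)
x_dvs = corrupt_dvs(x, frac=0.1)
print(x_dvs.shape == x.shape)
print((x > 0).sum() - (x_dvs > 0).sum())  # number of entries dropped
```

In either scheme the AE is then trained to map the corrupted input back to the uncorrupted original.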

Supplementary Figure 15 Results of varying scScope training-related hyper-parameters on a simulated (generated as in Fig. 1c) and a biological dataset.

(a) Learning rate, (b) batch size, (c) training epochs and (d) epochs per check were analyzed for scScope on each tested dataset.

Supplementary information

Supplementary Information

Supplementary Figures 1–15 and Supplementary Tables 1–11

Reporting Summary


About this article


Cite this article

Deng, Y., Bao, F., Dai, Q. et al. Scalable analysis of cell-type composition from single-cell transcriptomics using deep recurrent learning. Nat Methods 16, 311–314 (2019). https://doi.org/10.1038/s41592-019-0353-7

