scGen predicts single-cell perturbation responses

Lotfollahi, Mohammad; Wolf, F. Alexander; Theis, Fabian J.

doi:10.1038/s41592-019-0494-8

Article
Published: 29 July 2019

Nature Methods volume 16, pages 715–721 (2019)

Abstract

Accurately modeling cellular response to perturbations is a central goal of computational biology. While such modeling has been based on statistical, mechanistic and machine learning models in specific settings, no generalization of predictions to phenomena absent from training data (out-of-sample) has yet been demonstrated. Here, we present scGen (https://github.com/theislab/scgen), a model combining variational autoencoders and latent space vector arithmetics for high-dimensional single-cell gene expression data. We show that scGen accurately models perturbation and infection response of cells across cell types, studies and species. In particular, we demonstrate that scGen learns cell-type and species-specific responses implying that it captures features that distinguish responding from non-responding genes and cells. With the upcoming availability of large-scale atlases of organs in a healthy state, we envision scGen to become a tool for experimental design through in silico screening of perturbation response in the context of disease and drug treatment.

**Fig. 1: scGen, a method to predict single-cell perturbation response.**

**Fig. 2: scGen accurately predicts single-cell perturbation response out-of-sample.**

**Fig. 3: scGen models infection response in two data sets of intestinal epithelial cells.**

**Fig. 4: scGen accurately predicts single-cell perturbation across different studies.**

**Fig. 5: scGen predicts perturbation response across different species.**

Data availability

All of the data sets analyzed in this manuscript are public and published in other papers. We have referenced them in the manuscript and they are downloadable at https://github.com/theislab/scgen-reproducibility.

Code availability

The software is available at https://github.com/theislab/scgen. The code to reproduce the results of the paper is also available at https://github.com/theislab/scgen-reproducibility.

References

1. Stubbington, M. J. T., Rozenblatt-Rosen, O., Regev, A. & Teichmann, S. A. Single-cell transcriptomics to explore the immune system in health and disease. Science 358, 58–63 (2017).
2. Angerer, P. et al. Single cells make big data: New challenges and opportunities in transcriptomics. Curr. Opin. Syst. Biol. 4, 85–91 (2017).
3. Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 36, 89–94 (2018).
4. Haber, A. L. et al. A single-cell survey of the small intestinal epithelium. Nature 551, 333–339 (2017).
5. Hagai, T. et al. Gene expression variability across cells and species shapes innate immunity. Nature 563, 197–202 (2018).
6. Dixit, A. et al. Perturb-Seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell 167, 1853–1866 (2016).
7. Adamson, B. et al. A multiplexed single-cell CRISPR screening platform enables systematic dissection of the unfolded protein response. Cell 167, 1867–1882 (2016).
8. Datlinger, P. et al. Pooled CRISPR screening with single-cell transcriptome readout. Nat. Methods 14, 297–301 (2017).
9. Kharchenko, P. V., Silberstein, L. & Scadden, D. T. Bayesian approach to single-cell differential expression analysis. Nat. Methods 11, 740–742 (2014).
10. Vallejos, C. A., Marioni, J. C. & Richardson, S. BASiCS: Bayesian analysis of single-cell sequencing data. PLoS Comput. Biol. 11, e1004333 (2015).
11. Fröhlich, F. et al. Efficient parameter estimation enables the prediction of drug response using a mechanistic pan-cancer pathway model. Cell Syst. 7, 567–579.e6 (2018).
12. Choi, K., Hellerstein, J., Wiley, S. & Sauro, H. M. Inferring reaction networks using perturbation data. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/351767v1 (2018).
13. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
14. Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Single-cell RNA-seq denoising using a deep count autoencoder. Nat. Commun. 10, 390 (2019).
15. Ding, J., Condon, A. & Shah, S. P. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat. Commun. 9, 2002 (2018).
16. Ghahramani, A., Watt, F. M. & Luscombe, N. M. Generative adversarial networks simulate gene expression and predict perturbations in single cells. Preprint at bioRxiv https://doi.org/10.1101/262501 (2018).
17. Marouf, M. et al. Realistic in silico generation and augmentation of single cell RNA-seq data using generative adversarial neural networks. Preprint at bioRxiv https://doi.org/10.1101/390153 (2018).
18. Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 37, 547–554 (2019).
19. Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at arXiv https://arxiv.org/abs/1312.6114 (2013).
20. Sohn, K., Lee, H. & Yan, X. in Advances in Neural Information Processing Systems 28 (eds Cortes, C. et al.) 3483–3491 (Curran Associates, Inc., 2015).
21. Abadi, M. et al. Tensorflow: a system for large-scale machine learning. Preprint at https://arxiv.org/abs/1605.08695v2 (2016).
22. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
23. Bendall, S. C. et al. Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development. Cell 157, 714–725 (2014).
24. Wolf, F. A. et al. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol. 20, 59 (2019).
25. Radford, A., Metz, L. & Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. Preprint at arXiv https://arxiv.org/abs/1511.06434 (2015).
26. Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient estimation of word representations in vector space. Preprint at arXiv https://arxiv.org/abs/1301.3781 (2013).
27. Liu, M.-Y. & Tuzel, O. in Advances in Neural Information Processing Systems 29 (eds Lee, D. D. et al.) 469–477 (Curran Associates, Inc., 2016).
28. Zhu, J.-Y., Park, T., Isola, P. & Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (2017).
29. Amodio, M. & Krishnaswamy, S. MAGAN: Aligning biological manifolds. In Proceedings of the 35th International Conference on Machine Learning Vol. 80 (eds Dy, J. & Krause, A.) 215–223 (PMLR, Stockholmsmässan, Stockholm, 2018).
30. Clift, M. J. D. et al. A novel technique to determine the cell type specific response within an in vitro co-culture model via multi-colour flow cytometry. Sci. Rep. 7, 434 (2017).
31. Schubert, M. et al. Perturbation-response genes reveal signaling footprints in cancer gene expression. Nat. Commun. 9, 20 (2018).
32. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
33. Regev, A. et al. The human cell atlas. eLife 6, e27041 (2017).
34. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
35. Way, G. P. & Greene, C. S. Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. Pac. Symp. Biocomput. 23, 80–91 (2018).
36. Smillie, C. S. et al. Rewiring of the cellular and inter-cellular landscape of the human colon during ulcerative colitis. Preprint at bioRxiv https://doi.org/10.1101/455451 (2018).
37. McInnes, L. & Healy, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at arXiv https://arxiv.org/abs/1802.03426v2 (2018).
38. Doersch, C. Tutorial on variational autoencoders. Preprint at arXiv https://arxiv.org/abs/1606.05908 (2016).
39. Park, J.-E., Polanski, K., Meyer, K. & Teichmann, S. A. Fast batch alignment of single cell transcriptomes unifies multiple mouse cell atlases into an integrated landscape. Preprint at bioRxiv https://doi.org/10.1101/397042 (2018).
40. Pedregosa, F. et al. Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

Acknowledgements

We are grateful to all members of the Theis lab, in particular, D.S. Fischer for early comments on predicting across species. M.L. is grateful for valuable feedback from L. Haghverdi regarding batch effect removal. F.A.W. acknowledges discussions with N. Stranski on responding and non-responding cells and support by the Helmholtz Postdoc Program, Initiative and Networking Fund of the Helmholtz Association. This work was supported by BMBF grant nos. 01IS18036A and 01IS18053A, by the German Research Foundation within the Collaborative Research Center 1243, Subproject A17, by the Helmholtz Association (Incubator grant sparse2big, grant no. ZT-I-0007) and by the Chan Zuckerberg Initiative DAF (advised fund of Silicon Valley Community Foundation, no. 182835).

Author information

Authors and Affiliations

Helmholtz Zentrum München – German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany
Mohammad Lotfollahi, F. Alexander Wolf & Fabian J. Theis
School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany
Mohammad Lotfollahi & Fabian J. Theis
Department of Mathematics, Technical University of Munich, Munich, Germany
Fabian J. Theis

Contributions

M.L. performed the research, implemented the models and analyzed the data. F.A.W. conceived the project with contributions from M.L. and F.J.T. F.A.W. and F.J.T. supervised the research. All authors wrote the manuscript.

Corresponding authors

Correspondence to F. Alexander Wolf or Fabian J. Theis.

Ethics declarations

Competing interests

F.J.T. reports receiving consulting fees from Roche Diagnostics GmbH and Cellarity Inc., and ownership interest in Cellarity, Inc.

Additional information

Peer review information: Nicole Rusk was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Fig. 1 Linearity of the latent space.

(a) Building a linear classifier based on the dot product between the difference vector δ and the latent representation of each cell. (b) Dot product results between the latent representation of all cells and their corresponding difference vector δ for each condition show that the two conditions are approximately linearly separable using a dot product classifier. (c) Cosine similarity of \(\delta_{\mathrm{stim}-k}\) and \(\delta_{\mathrm{celltype}-ij}\) with δ, where \(\delta_{\mathrm{celltype}-ij} = \mathrm{avg}\left(z_{\mathrm{celltype}=i}\right) - \mathrm{avg}\left(z_{\mathrm{celltype}=j}\right)\) and \(\delta_{\mathrm{stim}-k} = \mathrm{avg}\left(z_{\mathrm{stim},\,\mathrm{celltype}=k}\right) - \mathrm{avg}\left(z_{\mathrm{ctrl},\,\mathrm{celltype}=k}\right)\) for all seven cell types present in the Kang et al.³ (n=18,868) dataset (z denotes the latent representation of all cells with the corresponding label). The first and second violin plots have n=21 and n=7, respectively. The third violin plot shows pairwise (n=499,500) cosine similarity for a set of 1,000 random samples from a 100-dimensional standard normal distribution. Vertical axis: cosine similarity. Horizontal axis: dot product results in different scenarios as described before.
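The quantities in Supplementary Fig. 1 can be illustrated with a minimal NumPy sketch. The latent codes below are synthetic stand-ins (random vectors with an artificial shift), not scGen output, and the names `z_ctrl`, `z_stim`, `shift` and the midpoint threshold are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 100  # latent dimensionality, matching the 100-dimensional null in panel c

# Synthetic stand-ins for latent codes; real codes would come from a trained encoder.
shift = np.zeros(dim)
shift[:5] = 3.0  # hypothetical perturbation direction
z_ctrl = rng.standard_normal((500, dim))
z_stim = rng.standard_normal((500, dim)) + shift

# Difference vector between the condition means.
delta = z_stim.mean(axis=0) - z_ctrl.mean(axis=0)

# Dot-product classifier: project every cell onto delta and split at the
# midpoint of the two projected means (the threshold choice is an assumption).
proj_ctrl = z_ctrl @ delta
proj_stim = z_stim @ delta
threshold = 0.5 * (proj_ctrl.mean() + proj_stim.mean())
accuracy = 0.5 * ((proj_ctrl < threshold).mean() + (proj_stim >= threshold).mean())


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Null distribution: pairwise cosine similarity of 1,000 random normal vectors.
samples = rng.standard_normal((1000, dim))
unit = samples / np.linalg.norm(samples, axis=1, keepdims=True)
null_cos = (unit @ unit.T)[np.triu_indices(1000, k=1)]  # 1000 * 999 / 2 pairs
```

The number of unordered pairs among 1,000 samples is 499,500, which is the n quoted for the third violin plot.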

Supplementary Fig. 2 Distribution-matching comparison between different models.

(a–e) Mean and variance matching comparison between scGen and four alternative models for CD4-T (control (n=2437), predicted (n=2437) and real stimulated (n=3127)) cells shows scGen outperforms other models. Similarly, by comparing UMAP visualizations, one can see the predictions by scGen have more overlap with ground truth cells whereas predictions from other models lie far from real stimulated cells. (f) Ground truth mean and variance between control and stimulated CD4-T cells. R² denotes squared Pearson correlation between ground truth and predicted values.
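A minimal sketch of the mean and variance matching score, assuming R² is computed per gene between summary statistics of predicted and real cells. The expression matrices are simulated, and `squared_pearson`, `real` and `pred` are illustrative names rather than part of the scGen package.

```python
import numpy as np


def squared_pearson(x, y):
    """Squared Pearson correlation between two 1-D arrays."""
    r = np.corrcoef(x, y)[0, 1]
    return float(r ** 2)


rng = np.random.default_rng(1)
n_genes = 200

# Hypothetical per-gene means and standard deviations shared by both groups.
gene_means = rng.uniform(0.0, 3.0, size=n_genes)
gene_sds = rng.uniform(0.2, 1.0, size=n_genes)

# Simulated (cells x genes) matrices; the two groups need not have equal size.
real = rng.normal(gene_means, gene_sds, size=(300, n_genes))
pred = rng.normal(gene_means, gene_sds, size=(250, n_genes))

# Compare per-gene summary statistics rather than individual cells.
r2_mean = squared_pearson(pred.mean(axis=0), real.mean(axis=0))
r2_var = squared_pearson(pred.var(axis=0), real.var(axis=0))
```

Because the comparison is made on per-gene statistics, the predicted and real populations can contain different numbers of cells, as they do in the caption.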

Supplementary Fig. 3 Graphical pipeline of two alternative approaches to predict unseen single-cell perturbations.

(a) CVAE pipeline at test time to predict an unseen condition. To predict cells in condition 1, we feed all cells present in condition 0 with the inverse label 1 concatenated (shown with the + symbol) to the data matrix. This informs the model that these cells are from condition 1. Therefore, the model changes the condition of the input cells from 0 to 1. (b) The style transfer GAN to transform one condition to another. This is possible by learning a joint two-way mapping in an adversarial learning setting. There are two generators: G_0–1, which transforms cells from condition 0 to 1, and G_1–0, which does the same task in the reverse direction. Two discriminators, denoted by D₀ and D₁, are trained to distinguish real cells from fake cells generated by G_1–0 and G_0–1, respectively.
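The label concatenation in panel a can be sketched as follows. This only illustrates how a condition label is appended to a data matrix; `with_condition_label` is a hypothetical helper, and how a particular CVAE implementation encodes or consumes the label (for example one-hot encoding, or feeding it to the decoder as well) is not specified here.

```python
import numpy as np


def with_condition_label(x, label):
    """Append a constant condition label as an extra column of a cells x genes matrix."""
    x = np.asarray(x, dtype=float)
    labels = np.full((x.shape[0], 1), float(label))
    return np.concatenate([x, labels], axis=1)


# Toy matrix of three cells from condition 0 with two genes each.
x_cond0 = np.arange(6, dtype=float).reshape(3, 2)

# Training input: cells carry their true label 0.
train_input = with_condition_label(x_cond0, 0)

# Test-time input: the same cells with the inverse label 1, asking the model
# to render them as if they belonged to condition 1.
test_input = with_condition_label(x_cond0, 1)
```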

Supplementary Fig. 4 Latent space comparison.

(a–c) UMAP visualization of the latent space representation for PBMCs from the Kang et al.³ dataset (n=18,868). For scGen (VAE) and CVAE we used the bottleneck layer, but for the style transfer GAN we used the discriminator’s penultimate output as the input for the UMAP algorithm.

Supplementary Fig. 5 scGen captures cell type specific response patterns.

(a) Violin plot for the top 10 specific response genes from Kang et al.³ (n=18,868) out of 500 DEGs according to the gene specificity score (Supplementary Note 6) across control (c), real stimulated (r.s), and predicted stimulated (p.s) for different cell types. Vertical axis: expression distribution for top specific genes. Horizontal axis: control, real and predicted distribution by scGen for different cell types; n shows the number of samples for each violin plot. (b) Box plots of the top 500 DEGs ordered by the gene specificity score. Each bin is composed of 50 genes and each point in the bin shows the error between the average expression of that gene within a cell type and the average prediction by scGen for that cell type. In total, each box plot is derived from 50 (number of genes) × 7 (number of cell types) points (n=350) and the error is \(\frac{\left| x - x_{\mathrm{pred}} \right|}{\max\left( x, 1 \right)}\). Box plots indicate the median (center lines) and interquartile range (hinges); whiskers represent min and max values. (c) Predictions using linear PCA + Vec. Arithm. show how this linear model fails to capture specific responses. Note how the model increases APOBEC3A in all cell types whereas scGen upregulates it only in responsive cell types. A similar phenomenon occurs in CD4-T cells, in which scGen only upregulates responsive genes whereas the linear model upregulates all genes. Horizontal axis: control, real and predicted distribution by scGen for different cell types; n shows the number of samples for each violin plot. (d) A box plot similar to b depicts how a linear model yields a larger error in bin-0, which includes the top 50 genes with cell type specific differential expression patterns. Each box plot is derived from n=350 points as explained in (b).
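The error measure and binning described for panels b and d can be sketched in NumPy. The expression values are simulated, the genes are assumed to be already sorted by the specificity score, and `relative_error` is an illustrative name; the gene specificity score itself (Supplementary Note 6) is not reproduced here.

```python
import numpy as np


def relative_error(x, x_pred):
    """Elementwise |x - x_pred| / max(x, 1)."""
    x = np.asarray(x, dtype=float)
    x_pred = np.asarray(x_pred, dtype=float)
    return np.abs(x - x_pred) / np.maximum(x, 1.0)


rng = np.random.default_rng(2)
n_cell_types, n_genes, bin_size = 7, 500, 50
n_bins = n_genes // bin_size

# Hypothetical average expression per (cell type, gene), with noisy predictions.
x = rng.uniform(0.0, 5.0, size=(n_cell_types, n_genes))
x_pred = x + rng.normal(0.0, 0.2, size=x.shape)

err = relative_error(x, x_pred)

# Split the gene axis into consecutive bins of 50 genes, then pool the seven
# cell types so that each bin holds 50 x 7 = 350 points.
binned = err.reshape(n_cell_types, n_bins, bin_size)
points_per_bin = binned.transpose(1, 0, 2).reshape(n_bins, -1)
medians = np.median(points_per_bin, axis=1)
```

Dividing by max(x, 1) keeps the error relative for well-expressed genes while avoiding inflated values for genes whose average expression is below 1.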

Supplementary Fig. 6 scGen captures cell type specific response patterns in two datasets of intestinal epithelial cells.

(a,b) Violin plots for the top 10 specific response genes out of 500 (top 250 upregulated and top 250 downregulated) DEGs according to the gene specificity score (Supplementary Note 6) for the H. poly (a) and Salmonella (b) datasets from Haber et al. across control (c), real stimulated (r.s), and predicted stimulated (p.s) for different cell types. Vertical axis: expression distribution for top 10 specific genes. Horizontal axis: control, real and predicted distribution by scGen for different cell types; n shows the number of samples for each violin plot.

Supplementary Fig. 7 scGen performs robustly when holding out more than one cell type.

(a-c) Predicting IFN-β stimulated CD4-T and F-Mono cells from the Kang et al.³ dataset in different scenarios with different numbers of held out cell types. The first panel shows a UMAP visualization of the position of held out cells (n=11,265, 10,261 and 8,622 for a-c, respectively). Other panels show mean gene expression of all genes and a violin plot (vertical axis: expression distribution for top specific genes; horizontal axis: control, real and predicted distribution by scGen) for ISG15, the top response gene after stimulation with IFN-β, for CD4-T (control (n=2,437), predicted (n=2,437) and real stimulated (n=3,127)) and F-Mono (control (n=1,100), predicted (n=2,437) and real stimulated (n=3,127)). R² denotes squared Pearson correlation between ground truth and predicted values. Shaded lines depict the 95 percent confidence interval for the regression estimate. The regression line is shown in blue.

Supplementary Fig. 8 scGen enables the generation of intermediary cells between two conditions.

(a-b) PCA visualization of 2,000 generated intermediary TA (Haber et al.) and CD4-T (Kang et al.³) cells between control and predicted cells. (c) Change in the top five upregulated and downregulated genes as we move from control (n=350) to predicted Salmonella infected cells (n=350). (d) Similarly, variation of the top 10 IFN-β marker genes while transitioning from control (n=2,437) to predicted IFN-β stimulated cells (n=2,437).
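One way to picture intermediary cells is as points on the straight line between a control cell and its predicted counterpart in latent space. The sketch below assumes this linear interpolation with a scaling factor between 0 and 1; the latent codes and `delta` are synthetic stand-ins, `interpolate` is a hypothetical helper, and the decoding step back to gene expression is omitted.

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 100

z_ctrl = rng.standard_normal((350, dim))  # stand-in latent codes of control cells
delta = np.full(dim, 0.5)                 # stand-in difference vector


def interpolate(z, delta, alphas):
    """Return z + alpha * delta for every alpha, shape (len(alphas), n_cells, dim)."""
    alphas = np.asarray(alphas, dtype=float)
    return z[None, :, :] + alphas[:, None, None] * delta[None, None, :]


# alpha = 0 reproduces the control cells, alpha = 1 the fully shifted cells.
alphas = np.linspace(0.0, 1.0, 5)
z_path = interpolate(z_ctrl, delta, alphas)
```

Passing each slice of `z_path` through a trained decoder would give expression profiles whose marker genes change gradually along the path, which is the kind of trend panels c and d display.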

Supplementary Fig. 9 Alternative vector arithmetics for cross-species prediction.

(a-f) Prediction of rat_LPS by adding difference vectors estimated using rat and mouse, where δ_LPS = mouse_LPS − mouse_control and δ_species = rat_control − mouse_control. (g-h) Prediction of rat_LPS by adding δ_average to rat_control, where δ_average = avg(z_LPS, all species) − avg(z_control, all species). R² denotes squared Pearson correlation between real LPS cells (n=9,423) and predicted LPS (n=8,421) values.
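The difference vectors in panels a-f can be combined in several ways. The sketch below uses synthetic latent codes with purely additive species and condition effects, which is an idealizing assumption; the three combinations shown are one reading of the caption, not a statement of which variant each panel uses, and `simulate` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(4)
dim = 20


def simulate(species_shift, condition_shift, n=400):
    """Synthetic latent codes with additive species and condition effects."""
    return rng.normal(0.0, 0.1, size=(n, dim)) + species_shift + condition_shift


lps_shift = rng.normal(size=dim)
mouse_shift = rng.normal(size=dim)
rat_shift = rng.normal(size=dim)
zero = np.zeros(dim)

z_mouse_ctrl = simulate(mouse_shift, zero)
z_mouse_lps = simulate(mouse_shift, lps_shift)
z_rat_ctrl = simulate(rat_shift, zero)
z_rat_lps = simulate(rat_shift, lps_shift)  # held out, used only for checking

# Difference vectors as defined in the caption.
delta_lps = z_mouse_lps.mean(axis=0) - z_mouse_ctrl.mean(axis=0)
delta_species = z_rat_ctrl.mean(axis=0) - z_mouse_ctrl.mean(axis=0)

# Three ways to arrive at a rat_LPS estimate.
pred_from_rat_ctrl = z_rat_ctrl + delta_lps
pred_from_mouse_lps = z_mouse_lps + delta_species
pred_from_mouse_ctrl = z_mouse_ctrl + delta_lps + delta_species
```

With exactly additive effects all three routes land on the same mean. In a trained model the latent effects are only approximately additive and the decoder is nonlinear, so the routes can give different predictions.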

Supplementary Fig. 10 scGen removes batch effects.

(a) UMAP visualization of four technically diverse pancreatic datasets (n=14,693) with their corresponding batch and cell types. We report average silhouette width (ASW) for batches in the original data (ASW = 0.2130, lower is better for batch-effect evaluation). (b) Data corrected by scGen mixes shared cell types from different studies while preserving study specific cell types as independent (ASW = −0.0917). ASW was calculated based on the n=14,693 cells with their study labels.
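The average silhouette width can be sketched with a small NumPy implementation applied to batch labels. This is an illustrative version for small inputs on simulated two-dimensional data, not the code used in the paper; in practice `sklearn.metrics.silhouette_score` from scikit-learn computes the same quantity.

```python
import numpy as np


def average_silhouette_width(x, labels):
    """Mean silhouette coefficient with Euclidean distance (needs >= 2 labels)."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise distance matrix; fine for small n, too costly for large data.
    dist = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1))
    unique = np.unique(labels)
    scores = np.zeros(len(x))
    for i in range(len(x)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton groups get a silhouette of 0 by convention
        # Mean distance to the other members of the same group (self distance is 0).
        a = dist[i, own].sum() / (n_own - 1)
        # Smallest mean distance to any other group.
        b = min(dist[i, labels == other].mean() for other in unique if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())


rng = np.random.default_rng(5)
batch = np.repeat([0, 1], 100)

# Two batches that are clearly separated versus two that are fully mixed.
separated = rng.normal(size=(200, 2)) + np.where(batch[:, None] == 1, 6.0, 0.0)
mixed = rng.normal(size=(200, 2))

asw_separated = average_silhouette_width(separated, batch)
asw_mixed = average_silhouette_width(mixed, batch)
```

When the labels are batches, a high value means cells cluster by batch, so values near zero or below indicate better mixing, consistent with the "lower is better" reading in the caption.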

Supplementary Fig. 11 Comparison of existing batch effect removal methods for integrating four different pancreatic datasets.

(a) Original data. (b) scGen. (c) Scanorama. (d) MultiCCA. (e) MNN. (f) BBKNN. ASW was calculated based on the n=14,693 cells with their study labels.

Supplementary Fig. 12 scGen integrates eight mouse single-cell atlases with 114,600 cells.

(a) UMAP visualization of eight different datasets with their corresponding study, cell type and organ labels. ASW was calculated based on the n=57,300 randomly subsampled cells with their study labels. (b) scGen merges the data by connecting the similar cell types according to their cell labels while having lower ASW (−0.28649).

Supplementary Fig. 13 Biased sampling effect.

(a) Comparison between biased and unbiased predictions for CD4-T cells using PCA and vector arithmetics. R² denotes squared Pearson correlation between real stimulated cells (n=2,437) and predicted (n=3,127) values. Shaded lines depict the 95 percent confidence interval for the regression estimate. The regression line is shown in blue.

Supplementary information

Supplementary Figs. 1–13, Supplementary Notes 1–9 and Supplementary Tables 1–3.

About this article

Cite this article

Lotfollahi, M., Wolf, F.A. & Theis, F.J. scGen predicts single-cell perturbation responses. Nat Methods 16, 715–721 (2019). https://doi.org/10.1038/s41592-019-0494-8

Received: 07 January 2019
Accepted: 17 June 2019
Published: 29 July 2019
Issue Date: August 2019
DOI: https://doi.org/10.1038/s41592-019-0494-8
