# scGen predicts single-cell perturbation responses

## Abstract

Accurately modeling cellular response to perturbations is a central goal of computational biology. While such modeling has been based on statistical, mechanistic and machine learning models in specific settings, no generalization of predictions to phenomena absent from training data (out-of-sample) has yet been demonstrated. Here, we present scGen (https://github.com/theislab/scgen), a model combining variational autoencoders and latent space vector arithmetics for high-dimensional single-cell gene expression data. We show that scGen accurately models perturbation and infection response of cells across cell types, studies and species. In particular, we demonstrate that scGen learns cell-type- and species-specific responses, implying that it captures features that distinguish responding from non-responding genes and cells. With the upcoming availability of large-scale atlases of organs in a healthy state, we envision scGen becoming a tool for experimental design through in silico screening of perturbation response in the context of disease and drug treatment.

## Data availability

All datasets analyzed in this manuscript are public and have been published in other papers. They are referenced in the manuscript and can be downloaded from https://github.com/theislab/scgen-reproducibility.

## Code availability

The software is available at https://github.com/theislab/scgen. The code to reproduce the results of the paper is also available at https://github.com/theislab/scgen-reproducibility.

## References

1. Stubbington, M. J. T., Rozenblatt-Rosen, O., Regev, A. & Teichmann, S. A. Single-cell transcriptomics to explore the immune system in health and disease. Science 358, 58–63 (2017).
2. Angerer, P. et al. Single cells make big data: New challenges and opportunities in transcriptomics. Curr. Opin. Syst. Biol. 4, 85–91 (2017).
3. Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 36, 89–94 (2018).
4. Haber, A. L. et al. A single-cell survey of the small intestinal epithelium. Nature 551, 333–339 (2017).
5. Hagai, T. et al. Gene expression variability across cells and species shapes innate immunity. Nature 563, 197–202 (2018).
6. Dixit, A. et al. Perturb-Seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell 167, 1853–1866 (2016).
7. Adamson, B. et al. A multiplexed single-cell CRISPR screening platform enables systematic dissection of the unfolded protein response. Cell 167, 1867–1882 (2016).
8. Datlinger, P. et al. Pooled CRISPR screening with single-cell transcriptome readout. Nat. Methods 14, 297–301 (2017).
9. Kharchenko, P. V., Silberstein, L. & Scadden, D. T. Bayesian approach to single-cell differential expression analysis. Nat. Methods 11, 740–742 (2014).
10. Vallejos, C. A., Marioni, J. C. & Richardson, S. BASiCS: Bayesian analysis of single-cell sequencing data. PLoS Comput. Biol. 11, e1004333 (2015).
11. Fröhlich, F. et al. Efficient parameter estimation enables the prediction of drug response using a mechanistic pan-cancer pathway model. Cell Syst. 7, 567–579.e6 (2018).
12. Choi, K., Hellerstein, J., Wiley, S. & Sauro, H. M. Inferring reaction networks using perturbation data. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/351767v1 (2018).
13. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
14. Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Single-cell RNA-seq denoising using a deep count autoencoder. Nat. Commun. 10, 390 (2019).
15. Ding, J., Condon, A. & Shah, S. P. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat. Commun. 9, 2002 (2018).
16. Ghahramani, A., Watt, F. M. & Luscombe, N. M. Generative adversarial networks simulate gene expression and predict perturbations in single cells. Preprint at bioRxiv https://doi.org/10.1101/262501 (2018).
17. Marouf, M. et al. Realistic in silico generation and augmentation of single cell RNA-seq data using generative adversarial neural networks. Preprint at bioRxiv https://doi.org/10.1101/390153 (2018).
18. Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 37, 547–554 (2019).
19. Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at arXiv https://arxiv.org/abs/1312.6114 (2013).
20. Sohn, K., Lee, H. & Yan, X. in Advances in Neural Information Processing Systems 28 (eds Cortes, C. et al.) 3483–3491 (Curran Associates, Inc., 2015).
21. Abadi, M. et al. Tensorflow: a system for large-scale machine learning. Preprint at arXiv https://arxiv.org/abs/1605.08695v2 (2016).
22. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
23. Bendall, S. C. et al. Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development. Cell 157, 714–725 (2014).
24. Wolf, F. A. et al. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol. 20, 59 (2019).
25. Radford, A., Metz, L. & Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. Preprint at arXiv https://arxiv.org/abs/1511.06434 (2015).
26. Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient estimation of word representations in vector space. Preprint at arXiv https://arxiv.org/abs/1301.3781 (2013).
27. Liu, M.-Y. & Tuzel, O. in Advances in Neural Information Processing Systems 29 (eds Lee, D. D. et al.) 469–477 (Curran Associates, Inc., 2016).
28. Zhu, J.-Y., Park, T., Isola, P. & Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (2017).
29. Amodio, M. & Krishnaswamy, S. MAGAN: Aligning biological manifolds. In Proceedings of the 35th International Conference on Machine Learning Vol. 80 (eds Dy, J. & Krause, A.) 215–223 (PMLR, Stockholmsmässan, Stockholm, 2018).
30. Clift, M. J. D. et al. A novel technique to determine the cell type specific response within an in vitro co-culture model via multi-colour flow cytometry. Sci. Rep. 7, 434 (2017).
31. Schubert, M. et al. Perturbation-response genes reveal signaling footprints in cancer gene expression. Nat. Commun. 9, 20 (2018).
32. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
33. Regev, A. et al. The human cell atlas. eLife 6, e27041 (2017).
34. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
35. Way, G. P. & Greene, C. S. Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. Pac. Symp. Biocomput. 23, 80–91 (2018).
36. Smillie, C. S. et al. Rewiring of the cellular and inter-cellular landscape of the human colon during ulcerative colitis. Preprint at bioRxiv https://doi.org/10.1101/455451 (2018).
37. McInnes, L. & Healy, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at arXiv https://arxiv.org/abs/1802.03426v2 (2018).
38. Doersch, C. Tutorial on variational autoencoders. Preprint at arXiv https://arxiv.org/abs/1606.05908 (2016).
39. Park, J.-E., Polanski, K., Meyer, K. & Teichmann, S. A. Fast batch alignment of single cell transcriptomes unifies multiple mouse cell atlases into an integrated landscape. Preprint at bioRxiv https://doi.org/10.1101/397042 (2018).
40. Pedregosa, F. et al. Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

## Acknowledgements

We are grateful to all members of the Theis lab, in particular, D.S. Fischer for early comments on predicting across species. M.L. is grateful for valuable feedback from L. Haghverdi regarding batch effect removal. F.A.W. acknowledges discussions with N. Stranski on responding and non-responding cells and support by the Helmholtz Postdoc Program, Initiative and Networking Fund of the Helmholtz Association. This work was supported by BMBF grant nos. 01IS18036A and 01IS18053A, by the German Research Foundation within the Collaborative Research Center 1243, Subproject A17, by the Helmholtz Association (Incubator grant sparse2big, grant no. ZT-I-0007) and by the Chan Zuckerberg Initiative DAF (advised fund of Silicon Valley Community Foundation, no. 182835).

## Author information

M.L. performed the research, implemented the models and analyzed the data. F.A.W. conceived the project with contributions from M.L. and F.J.T. F.A.W. and F.J.T. supervised the research. All authors wrote the manuscript.

Correspondence to F. Alexander Wolf or Fabian J. Theis.

## Ethics declarations

### Competing interests

F.J.T. reports receiving consulting fees from Roche Diagnostics GmbH and Cellarity Inc., and ownership interest in Cellarity, Inc.

Peer review information: Nicole Rusk was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Integrated supplementary information

### Supplementary Fig. 1 Linearity of the latent space.

(a) Building a linear classifier based on the dot product between the difference vector δ and the latent representation of each cell. (b) Dot products between the latent representation of each cell and the corresponding difference vector δ for each condition show that the two conditions are approximately linearly separable using a dot-product classifier. (c) Cosine similarity of $$\delta_{\mathrm{stim}-k}$$ and $$\delta_{\mathrm{celltype}-ij}$$ with δ, where $$\delta_{\mathrm{celltype}-ij} = \mathrm{avg}(z_{\mathrm{celltype}=i}) - \mathrm{avg}(z_{\mathrm{celltype}=j})$$ and $$\delta_{\mathrm{stim}-k} = \mathrm{avg}(z_{\mathrm{stim},\,\mathrm{celltype}=k}) - \mathrm{avg}(z_{\mathrm{ctrl},\,\mathrm{celltype}=k})$$, for all seven cell types present in the Kang et al. (ref. 3) dataset (n=18,868); z denotes the latent representation of all cells with the corresponding label. The first and second violin plots have n=21 and n=7, respectively. The third violin plot shows the pairwise (n=499,500) cosine similarity for a set of 1,000 random samples from a 100-dimensional standard normal distribution. Vertical axis: cosine similarity. Horizontal axis: dot product results in different scenarios as described before.
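The dot-product classifier and the random-vector baseline described in this caption can be illustrated with a minimal NumPy sketch. It runs on synthetic latent vectors rather than the output of a trained scGen encoder; the names `z_ctrl`, `z_stim` and `shift`, and the midpoint threshold used for classification, are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 100  # latent size; matches the 100-dimensional baseline in panel c

# Synthetic stand-ins for encoder outputs (assumption: real z comes from a trained model).
shift = np.zeros(dim)
shift[0] = 6.0  # hypothetical perturbation direction
z_ctrl = rng.standard_normal((500, dim))
z_stim = rng.standard_normal((500, dim)) + shift

# Difference vector between the condition means.
delta = z_stim.mean(axis=0) - z_ctrl.mean(axis=0)

# Score each cell by its dot product with delta; the midpoint threshold is an assumed rule.
score_ctrl = z_ctrl @ delta
score_stim = z_stim @ delta
threshold = 0.5 * (score_ctrl.mean() + score_stim.mean())
accuracy = 0.5 * ((score_ctrl < threshold).mean() + (score_stim > threshold).mean())


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Random baseline: all pairwise cosine similarities among 1,000 standard normal vectors.
samples = rng.standard_normal((1000, dim))
unit = samples / np.linalg.norm(samples, axis=1, keepdims=True)
pairwise = (unit @ unit.T)[np.triu_indices(1000, k=1)]

print(f"accuracy={accuracy:.3f}, pairs={pairwise.size}")
```

The upper triangle of a 1,000 × 1,000 similarity matrix has 1,000 × 999 / 2 = 499,500 entries, which matches the n reported for the third violin plot.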

### Supplementary Fig. 2 Distribution-matching comparison between different models.

(a–e) Mean and variance matching comparison between scGen and four alternative models for CD4-T cells (control (n=2,437), predicted (n=2,437) and real stimulated (n=3,127)) shows that scGen outperforms the other models. Similarly, comparing UMAP visualizations shows that the predictions by scGen overlap more with ground truth cells, whereas predictions from other models lie far from real stimulated cells. (f) Ground truth mean and variance between control and stimulated CD4-T cells. $$R^2$$ denotes the squared Pearson correlation between ground truth and predicted values.
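The squared Pearson correlation used for the mean and variance comparisons can be computed per gene as in the following sketch. The data are synthetic and the names `real`, `pred`, `gene_means` and `gene_sds` are hypothetical; this illustrates the metric only and is not the authors' evaluation code.

```python
import numpy as np


def r_squared(x, y):
    """Squared Pearson correlation between two 1-D arrays."""
    r = np.corrcoef(x, y)[0, 1]
    return float(r ** 2)


rng = np.random.default_rng(1)
n_genes = 200
gene_means = rng.uniform(0.0, 5.0, size=n_genes)  # hypothetical per-gene expression levels
gene_sds = rng.uniform(0.5, 2.0, size=n_genes)    # hypothetical per-gene spread

# Synthetic cells x genes matrices; the two groups may differ in cell count.
real = gene_means + gene_sds * rng.standard_normal((300, n_genes))
pred = gene_means + gene_sds * rng.standard_normal((250, n_genes))

# Compare per-gene means and per-gene variances across the two groups.
r2_mean = r_squared(real.mean(axis=0), pred.mean(axis=0))
r2_var = r_squared(real.var(axis=0), pred.var(axis=0))
print(f"R2 (mean)={r2_mean:.3f}, R2 (variance)={r2_var:.3f}")
```

Because the statistic is computed on per-gene summaries, the predicted and real populations do not need the same number of cells.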

### Supplementary Fig. 3 Graphical pipeline of two alternative approaches to predict unseen single-cell perturbations.

(a) CVAE pipeline at test time to predict an unseen condition. To predict cells in condition 1, we feed all cells present in condition 0 with the inverse label 1 concatenated (shown with the + symbol) to the data matrix. This informs the model that these cells are from condition 1. Therefore, the model changes the condition of the input cells from 0 to 1. (b) The style transfer GAN to transform one condition to another. This is made possible by learning a joint two-way mapping in an adversarial learning setting. There are two generators: G0–1, which transforms cells from condition 0 to 1, and G1–0, which performs the same task in the reverse direction. Two discriminators, denoted by D0 and D1, are trained to distinguish real cells from fake cells generated by G1–0 and G0–1, respectively.
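The label concatenation in panel a can be illustrated as follows. This is a minimal sketch of the input construction only, with a hypothetical helper `with_condition_label` and random data; the CVAE itself is not shown.

```python
import numpy as np


def with_condition_label(x, label):
    """Append a constant condition-label column to a cells x genes matrix."""
    labels = np.full((x.shape[0], 1), label, dtype=x.dtype)
    return np.concatenate([x, labels], axis=1)


# Four hypothetical condition-0 cells with three genes each.
x_cond0 = np.random.default_rng(2).random((4, 3))

train_input = with_condition_label(x_cond0, 0.0)  # true label, as seen during training
test_input = with_condition_label(x_cond0, 1.0)   # inverse label, used to request condition 1
print(test_input.shape)
```

The expression values are unchanged; only the appended label column differs between the training and test inputs.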

### Supplementary Fig. 4 Latent space comparison.

(a–c) UMAP visualization of the latent space representation for PBMCs from the Kang et al. (ref. 3) dataset (n=18,868). For scGen (VAE) and CVAE we used the bottleneck layer, but for the style transfer GAN we used the discriminator’s penultimate output as the input for the UMAP algorithm.

### Supplementary Fig. 5 scGen captures cell type specific response patterns.

(a) Violin plots for the top 10 specific response genes from Kang et al. (ref. 3) (n=18,868), out of 500 DEGs, according to the gene specificity score (Supplementary Note 6), across control (c), real stimulated (r.s) and predicted stimulated (p.s) cells for different cell types. Vertical axis: expression distribution for the top specific genes. Horizontal axis: control, real and predicted distributions by scGen for different cell types; n shows the number of samples for each violin plot. (b) Box plots of the top 500 DEGs ordered by the gene specificity score. Each bin is composed of 50 genes, and each point in a bin shows the error between the average expression of that gene within a cell type and the average prediction by scGen for that cell type. In total, each box plot is derived from 50 (number of genes) × 7 (number of cell types) points (n=350), and the error is $$\frac{\left| x - x_{\mathrm{pred}} \right|}{\max\left( x, 1 \right)}$$. Box plots indicate the median (center lines) and interquartile range (hinges); whiskers represent the minimum and maximum values. (c) Predictions using linear PCA + Vec. Arithm. show how this linear model fails to capture specific responses. Note how the model increases APOBEC3A in all cell types, whereas scGen upregulates it only in responsive cell types. A similar phenomenon occurs in CD4-T cells, in which scGen upregulates only responsive genes, whereas the linear model upregulates all genes. Horizontal axis: control, real and predicted distributions by scGen for different cell types; n shows the number of samples for each violin plot. (d) A box plot similar to b shows that the linear model yields a larger error in bin-0, which includes the top 50 genes with cell type specific differential expression patterns. Each box plot is derived from n=350 points as explained in (b).
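The error measure and the binning behind panels b and d can be sketched as below. The expression values are synthetic, the genes are assumed to be already ordered by the gene specificity score (which is defined in Supplementary Note 6 and not reproduced here), and the variable names are hypothetical.

```python
import numpy as np


def relative_error(x, x_pred):
    """Elementwise |x - x_pred| / max(x, 1)."""
    x = np.asarray(x, dtype=float)
    x_pred = np.asarray(x_pred, dtype=float)
    return np.abs(x - x_pred) / np.maximum(x, 1.0)


n_cell_types, n_genes, bin_size = 7, 500, 50
rng = np.random.default_rng(3)

# Synthetic mean expression per cell type and gene; columns assumed sorted by specificity.
real_mean = rng.uniform(0.0, 4.0, size=(n_cell_types, n_genes))
pred_mean = real_mean + rng.normal(0.0, 0.1, size=real_mean.shape)

err = relative_error(real_mean, pred_mean)  # shape (7, 500)

# Split the gene axis into 10 bins of 50, then pool cell types within each bin.
bins = err.reshape(n_cell_types, n_genes // bin_size, bin_size)
per_bin = bins.transpose(1, 0, 2).reshape(n_genes // bin_size, -1)
print(per_bin.shape)  # (10, 350): 50 genes x 7 cell types per box plot
```

Dividing by max(x, 1) keeps the error bounded for genes with near-zero mean expression, where a plain relative error would blow up.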

### Supplementary Fig. 6 scGen captures cell type specific response patterns in two datasets of intestinal epithelial cells.

(a,b) Violin plots for the top 10 specific response genes out of 500 DEGs (top 250 upregulated and top 250 downregulated) according to the gene specificity score (Supplementary Note 6) for the H. poly (a) and Salmonella (b) datasets from Haber et al. (ref. 4) across control (c), real stimulated (r.s) and predicted stimulated (p.s) cells for different cell types. Vertical axis: expression distribution for the top 10 specific genes. Horizontal axis: control, real and predicted distributions by scGen for different cell types; n shows the number of samples for each violin plot.

### Supplementary Fig. 7 scGen performs robustly when holding out more than one cell type.

(a–c) Predicting IFN-β stimulated CD4-T and F-Mono cells from the Kang et al. (ref. 3) dataset in different scenarios with different numbers of held out cell types. The first panel shows a UMAP visualization of the position of held out cells (n=11,265, 10,261 and 8,622 for a–c, respectively). The other panels show the mean gene expression of all genes and a violin plot (vertical axis: expression distribution for top specific genes; horizontal axis: control, real and predicted distribution by scGen) for ISG15, the top response gene after stimulation with IFN-β, for CD4-T (control (n=2,437), predicted (n=2,437) and real stimulated (n=3,127)) and F-Mono (control (n=1,100), predicted (n=2,437) and real stimulated (n=3,127)) cells. $$R^2$$ denotes the squared Pearson correlation between ground truth and predicted values. Shaded lines depict the 95 percent confidence interval for the regression estimate. The regression line is shown in blue.

### Supplementary Fig. 8 scGen enables the generation of intermediary cells between two conditions.

(a–b) PCA visualization of 2,000 generated intermediary TA (Haber et al., ref. 4) and CD4-T (Kang et al., ref. 3) cells between control and predicted cells. (c) Change in the top five upregulated and downregulated genes as we move from control (n=350) to predicted Salmonella infected cells (n=350). (d) Similarly, variation of the top 10 IFN-β marker genes while transitioning from control (n=2,437) to predicted IFN-β stimulated cells (n=2,437).
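One simple way to obtain intermediary cells between control and predicted cells is to move along the difference vector in latent space in small steps. The sketch below assumes that this linear interpolation is how intermediates are produced, which the caption does not state explicitly; `interpolate_latent` and the synthetic latent vectors are hypothetical, and decoding back to gene expression with a trained decoder is not shown.

```python
import numpy as np


def interpolate_latent(z_ctrl, delta, n_steps):
    """Return alphas in [0, 1] and the latent batches z_ctrl + alpha * delta."""
    alphas = np.linspace(0.0, 1.0, n_steps)
    path = np.stack([z_ctrl + a * delta for a in alphas])
    return alphas, path


rng = np.random.default_rng(4)
z_ctrl = rng.standard_normal((5, 8))  # five hypothetical control cells, 8 latent dimensions
delta = rng.standard_normal(8)        # hypothetical perturbation difference vector

alphas, path = interpolate_latent(z_ctrl, delta, n_steps=11)
print(path.shape)  # (11, 5, 8): steps x cells x latent dimensions
```

With alpha = 0 the cells are unchanged, and with alpha = 1 they coincide with the fully shifted prediction.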

### Supplementary Fig. 9 Alternative vector arithmetics for cross-species prediction.

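(a–f) Prediction of $$\mathrm{rat}_{\mathrm{LPS}}$$ by adding difference vectors estimated using rat and mouse, where $$\delta_{\mathrm{LPS}} = \mathrm{mouse}_{\mathrm{LPS}} - \mathrm{mouse}_{\mathrm{control}}$$ and $$\delta_{\mathrm{species}} = \mathrm{rat}_{\mathrm{control}} - \mathrm{mouse}_{\mathrm{control}}$$. (g–h) Prediction of $$\mathrm{rat}_{\mathrm{LPS}}$$ by adding $$\delta_{\mathrm{average}}$$ to $$\mathrm{rat}_{\mathrm{control}}$$, where $$\delta_{\mathrm{average}} = \mathrm{avg}(z_{\mathrm{LPS,\,all\ species}}) - \mathrm{avg}(z_{\mathrm{control,\,all\ species}})$$. $$R^2$$ denotes the squared Pearson correlation between real LPS cells (n=9,423) and predicted LPS (n=8,421) values.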
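The two difference vectors in this caption can be combined in two ways: shifting rat control cells by the LPS vector, or shifting mouse LPS cells by the species vector. The sketch below uses synthetic latent clouds (all names, sizes and the noise level are assumptions) to show that both routes yield the same mean prediction.

```python
import numpy as np

rng = np.random.default_rng(5)
dim = 10
base = rng.standard_normal(dim)           # hypothetical mouse control centre
lps_shift = rng.standard_normal(dim)      # hypothetical LPS effect
species_shift = rng.standard_normal(dim)  # hypothetical mouse-to-rat offset


def cloud(center, n=200):
    """Synthetic latent cells scattered tightly around a centre."""
    return center + 0.05 * rng.standard_normal((n, dim))


mouse_control = cloud(base)
mouse_lps = cloud(base + lps_shift)
rat_control = cloud(base + species_shift)

# Difference vectors as defined in the caption, computed on latent means.
delta_lps = mouse_lps.mean(axis=0) - mouse_control.mean(axis=0)
delta_species = rat_control.mean(axis=0) - mouse_control.mean(axis=0)

# Two routes to predicted rat LPS cells.
rat_lps_from_rat = rat_control + delta_lps
rat_lps_from_mouse = mouse_lps + delta_species

gap = np.abs(rat_lps_from_rat.mean(axis=0) - rat_lps_from_mouse.mean(axis=0)).max()
print(f"max difference between route means: {gap:.2e}")
```

The means agree because both equal avg(rat control) + avg(mouse LPS) − avg(mouse control); the two routes differ only in which cells carry the cell-to-cell variation.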

### Supplementary Fig. 10 scGen removes batch effects.

(a) UMAP visualization of four technically diverse pancreatic datasets (n=14,693) with their corresponding batch and cell types. We report average silhouette width (ASW) for batches in the original data (ASW = 0.2130, lower is better for batch-effect evaluation). (b) Data corrected by scGen mixes shared cell types from different studies while preserving study specific cell types as independent (ASW = −0.0917). ASW was calculated based on the n=14,693 cells with their study labels.
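The average silhouette width on batch labels can be computed as in the following plain NumPy sketch. It uses Euclidean distance and synthetic two-dimensional data; the distance metric, the embedding and any subsampling used in the paper are not specified in this caption, so those choices are assumptions. scikit-learn's `silhouette_score` computes the same quantity.

```python
import numpy as np


def average_silhouette_width(x, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # pairwise distance matrix
    uniq = np.unique(labels)
    s = np.zeros(x.shape[0])
    for i in range(x.shape[0]):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # silhouette is defined as 0 for singleton clusters
        a = dist[i, own].sum() / (n_own - 1)  # mean distance within own batch, excluding self
        b = min(dist[i, labels == other].mean() for other in uniq if other != labels[i])
        s[i] = (b - a) / max(a, b)
    return float(s.mean())


rng = np.random.default_rng(6)
batch = np.repeat([0, 1], 100)

separated = rng.standard_normal((200, 2))
separated[batch == 1, 0] += 50.0       # strong synthetic batch effect
mixed = rng.standard_normal((200, 2))  # batches drawn from the same distribution

asw_separated = average_silhouette_width(separated, batch)
asw_mixed = average_silhouette_width(mixed, batch)
print(f"separated={asw_separated:.3f}, mixed={asw_mixed:.3f}")
```

Values near 1 indicate that batches form separate clusters, while values near or below 0 indicate that batches are well mixed, which is why a lower ASW is better for batch-effect evaluation.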

### Supplementary Fig. 11 Comparison of existing batch effect removal methods for integrating four different pancreatic datasets.

(a) Original data. (b) scGen. (c) Scanorama. (d) MultiCCA. (e) MNN. (f) BBKNN. ASW was calculated based on the n=14,693 cells with their study labels.

### Supplementary Fig. 12 scGen integrates eight mouse single-cell atlases with 114,600 cells.

(a) UMAP visualization of eight different datasets with their corresponding study, cell type and organ labels. ASW was calculated based on the n=57,300 randomly subsampled cells with their study labels. (b) scGen merges the data by connecting the similar cell types according to their cell labels while having lower ASW (−0.28649).

### Supplementary Fig. 13 Biased sampling effect.

(a) Comparison between biased and unbiased predictions for CD4-T cells using PCA and vector arithmetics. $$R^2$$ denotes the squared Pearson correlation between real stimulated cells (n=2,437) and predicted (n=3,127) values. Shaded lines depict the 95 percent confidence interval for the regression estimate. The regression line is shown in blue.

## Supplementary information

Supplementary Figs. 1–13, Supplementary Notes 1–9 and Supplementary Tables 1–3.

Lotfollahi, M., Wolf, F.A. & Theis, F.J. scGen predicts single-cell perturbation responses. Nat. Methods 16, 715–721 (2019). https://doi.org/10.1038/s41592-019-0494-8