Abstract
Analysis across a growing number of single-cell perturbation datasets is hampered by poor data interoperability. To facilitate development and benchmarking of computational methods, we collect a set of 44 publicly available single-cell perturbation–response datasets with molecular readouts, including transcriptomics, proteomics and epigenomics. We apply uniform quality control pipelines and harmonize feature annotations. The resulting information resource, scPerturb, enables development and testing of computational methods, and facilitates comparison and integration across datasets. We describe energy statistics (E-statistics) for quantification of perturbation effects and significance testing, and demonstrate E-distance as a general distance measure between sets of single-cell expression profiles. We illustrate the application of E-statistics for quantifying similarity and efficacy of perturbations. The perturbation–response datasets and E-statistics computation software are publicly available at scperturb.org. This work provides an information resource for researchers working with single-cell perturbation data and recommendations for experimental design, including optimal cell counts and read depth.
Data availability
The website scperturb.org hosts the harmonized datasets in the following formats: scRNA-seq and antibody-based protein datasets as .h5ad files; scATAC-seq datasets as multiple feature-matrix definitions, each offered as a separate download. RNA data are archived at https://doi.org/10.5281/zenodo.7041848 and ATAC data at https://doi.org/10.5281/zenodo.7058381.
Dataset access details: AdamsonWeissman2016 (ref. 7): GSE90546 on GEO (ref. 55); AissaBenevolenskaya2021 (ref. 68): GSE149383 on GEO; ChangYe2021 (ref. 69): E-MTAB-10698 on ArrayExpress (ref. 70); DatlingerBock2017 (ref. 1): GSE92872 on GEO; DatlingerBock2021 (ref. 71): GSE168620 on GEO; DixitRegev2016 (ref. 2): GSE90063 on GEO; FrangiehIzar2021 (ref. 6): SCP1064 on the Broad Single Cell Portal, https://singlecell.broadinstitute.org/single_cell/study/SCP1064/multi-modal-pooled-perturb-cite-seq-screens-in-patient-models-define-novel-mechanisms-of-cancer-immune-evasion; GasperiniShendure2019 (ref. 54): GSE120861 on GEO; GehringPachter2019 (ref. 19): https://doi.org/10.22002/D1.1311 on CaltechDATA; Liscovitch-BrauerSanjana2021 (ref. 59): GSE161002 on GEO; McFarlandTsherniak2020 (ref. 72): https://doi.org/10.6084/m9.figshare.5863776.v1 on figshare; MimitouSmibert2021 (ref. 73): GSE156476 on GEO; NormanWeissman2019 (ref. 49): GSE133344 on GEO; PapalexiSatija2021 (ref. 9): GSE153056 on GEO; PierceGreenleaf2021 (ref. 42): data deposited on AWS, with URIs listed in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137922/bin/41467_2021_23213_MOESM9_ESM.xlsx; ReplogleWeissman2022 (ref. 20): processed single-cell data from gwps.wi.mit.edu; SchiebingerLander2019 (ref. 23): GSE106340 and GSE115943 on GEO; SchraivogelSteinmetz2020 (ref. 74): GSE135497 on GEO; ShifrutMarson2018 (ref. 75): GSE119450 on GEO; SrivatsanTrapnell2020 (ref. 52): GSE139944 on GEO; TianKampmann2019 (ref. 76): GSE152988 on GEO, with mappings from kampmannlab.ucsf.edu/crop-seq; TianKampmann2021 (ref. 21): GSE124703 on GEO; WeinrebKlein2020 (ref. 77): GSE140802 on GEO; XieHon2017 (ref. 78): GSE81884 on GEO; ZhaoSims2021 (ref. 79): GSE148842 on GEO.
Code availability
Open-access source code is available at https://github.com/sanderlab/scPerturb/. We compiled a corresponding Python package, scperturb, for computing E-statistics (E-distance and E-testing) on single-cell data; it is published on PyPI at https://pypi.org/project/scperturb/. Access details for the original publication of each dataset are available in the scPerturb GitHub repository (https://github.com/sanderlab/scPerturb) in the subfolder 'dataset_processing'.
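To make the two E-statistics named above concrete, the following is a minimal standard-library sketch of an energy distance between two sets of profiles and a permutation test built on it. It is not the scperturb package's code or API: the function names (mean_pairwise_distance, energy_distance, permutation_pvalue) are hypothetical, the formula follows the general Székely and Rizzo form of the energy distance, and the choice of plain Euclidean distance on raw coordinates is an assumption, since this section does not state which metric or embedding the authors use.

```python
import math
import random
from itertools import product


def mean_pairwise_distance(xs, ys):
    """Mean Euclidean distance over all pairs (x, y), x from xs and y from ys."""
    total = sum(math.dist(x, y) for x, y in product(xs, ys))
    return total / (len(xs) * len(ys))


def energy_distance(xs, ys):
    """Energy distance 2*delta_XY - sigma_X - sigma_Y between two point sets.

    delta_XY is the mean between-set distance; sigma_X and sigma_Y are the
    mean within-set distances (assumption: Euclidean metric).
    """
    delta_xy = mean_pairwise_distance(xs, ys)
    sigma_x = mean_pairwise_distance(xs, xs)
    sigma_y = mean_pairwise_distance(ys, ys)
    return 2.0 * delta_xy - sigma_x - sigma_y


def permutation_pvalue(xs, ys, n_permutations=200, seed=0):
    """Permutation test: share of label shufflings with distance >= observed."""
    rng = random.Random(seed)
    observed = energy_distance(xs, ys)
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled_x, shuffled_y = pooled[:len(xs)], pooled[len(xs):]
        if energy_distance(shuffled_x, shuffled_y) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)
```

Each argument is a list of equal-length coordinate tuples, one per cell, for example low-dimensional embeddings of perturbed and control cells. The all-pairs computation is quadratic in the number of cells, so real datasets would call for a vectorized implementation.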
References
Datlinger, P. et al. Pooled CRISPR screening with single-cell transcriptome readout. Nat. Methods 14, 297–301 (2017).
Dixit, A., Parnas, O., Li, B. & Chen, J. Perturb-seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell 167, 1853–1866 (2016).
Jaitin, D. A. et al. Dissecting immune circuits by linking CRISPR-pooled screens with single-cell RNA-seq. Cell 167, 1883–1896 (2016).
Gilbert, L. A. et al. Genome-scale CRISPR-mediated control of gene repression and activation. Cell 159, 647–661 (2014).
Wessels, H.-H. et al. Efficient combinatorial targeting of RNA transcripts in single cells with Cas13 RNA Perturb-seq. Nat. Methods 20, 86–94 (2023).
Frangieh, C. J. et al. Multimodal pooled Perturb-CITE-seq screens in patient models define mechanisms of cancer immune evasion. Nat. Genet. 53, 332–341 (2021).
Adamson, B. et al. A multiplexed single-cell CRISPR screening platform enables systematic dissection of the unfolded protein response. Cell 167, 1867–1882 (2016).
Rubin, A. J. et al. Coupled single-cell CRISPR screening and epigenomic profiling reveals causal gene regulatory networks. Cell 176, 361–376 (2019).
Papalexi, E. et al. Characterizing the molecular regulation of inhibitory immune checkpoints with multimodal single-cell screens. Nat. Genet. 53, 322–331 (2021).
Gross, T., Wongchenko, M. J., Yan, Y. & Blüthgen, N. Robust network inference using response logic. Bioinformatics 35, i634–i642 (2019).
Pratapa, A., Jalihal, A. P., Law, J. N., Bharadwaj, A. & Murali, T. M. Benchmarking algorithms for gene regulatory network inference from single-cell transcriptomic data. Nat. Methods 17, 147–154 (2020).
Gross, T. & Blüthgen, N. Identifiability and experimental design in perturbation studies. Bioinformatics 36, i482–i489 (2020).
Bertin, P. et al. RECOVER: sequential model optimization platform for combination drug repurposing identifies novel synergistic compounds in vitro. Preprint at https://doi.org/10.48550/arXiv.2202.04202 (2022).
Franz, A. et al. Molecular response to PARP1 inhibition in ovarian cancer cells as determined by mass spectrometry based proteomics. J. Ovarian Res. 14, 140 (2021).
Preuer, K. et al. DeepSynergy: predicting anti-cancer drug synergy with deep learning. Bioinformatics 34, 1538–1546 (2018).
Kharchenko, P. V. The triumphs and limitations of computational methods for scRNA-seq. Nat. Methods 18, 723–732 (2021).
Burkhardt, D. B. et al. Quantifying the effect of experimental perturbations at single-cell resolution. Nat. Biotechnol. 39, 619–629 (2021).
Dann, E., Henderson, N. C., Teichmann, S. A., Morgan, M. D. & Marioni, J. C. Differential abundance testing on single-cell data using k-nearest neighbor graphs. Nat. Biotechnol. 40, 245–253 (2022).
Gehring, J., Park, J. H., Chen, S., Thomson, M. & Pachter, L. Highly multiplexed single-cell RNA-seq by DNA oligonucleotide tagging of cellular proteins. Nat. Biotechnol. 38, 35–38 (2020).
Replogle, J. M. et al. Mapping information-rich genotype–phenotype landscapes with genome-scale Perturb-seq. Cell 185, 2559–2575 (2022).
Tian, R. et al. Genome-wide CRISPRi/a screens in human neurons link lysosomal failure to ferroptosis. Nat. Neurosci. 24, 1020–1034 (2021).
Chen, W. S. et al. Uncovering axes of variation among single-cell cancer specimens. Nat. Methods 17, 302–310 (2020).
Schiebinger, G. et al. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell 176, 928–943 (2019).
Lotfollahi, M., Naghipourfar, M., Theis, F. J. & Wolf, F. A. Conditional out-of-distribution generation for unpaired data using transfer VAE. Bioinformatics 36, i610–i617 (2020).
Przybyla, L. & Gilbert, L. A. A new era in functional genomics screens. Nat. Rev. Genet. 23, 89–103 (2022).
Forcato, M., Romano, O. & Bicciato, S. Computational methods for the integrative analysis of single-cell data. Brief. Bioinform. 22, 20–29 (2021).
Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).
Duan, B. et al. Model-based understanding of single-cell CRISPR screening. Nat. Commun. 10, 2233 (2019).
Jin, K. et al. CellDrift: inferring perturbation responses in temporally-sampled single cell data. Brief. Bioinform. 23, bbac324 (2022).
Lotfollahi, M., Wolf, F. A. & Theis, F. J. scGen predicts single-cell perturbation responses. Nat. Methods 16, 715–721 (2019).
Stathias, V. et al. LINCS Data Portal 2.0: next generation access point for perturbation–response signatures. Nucleic Acids Res. 48, D431–D439 (2020).
Tsherniak, A. et al. Defining a cancer dependency map. Cell 170, 564–576 (2017).
Lance, C. et al. Multimodal single cell data integration challenge: results and lessons learned. Preprint at https://doi.org/10.1101/2022.04.11.487796 (2022).
Svensson, V., da Veiga Beltrame, E. & Pachter, L. A curated database reveals trends in single-cell transcriptomics. Database 2020, baaa073 (2020).
Broad Institute. Single Cell Portal. https://singlecell.broadinstitute.org/single_cell (2022).
Ji, Y., Lotfollahi, M., Wolf, F. A. & Theis, F. J. Machine learning for perturbational single-cell omics. Cell Syst. 12, 522–537 (2021).
Fischer, D. S. et al. Sfaira accelerates data and model reuse in single cell genomics. Genome Biol. 22, 248 (2021).
Chan Zuckerberg CELLxGENE Discover. Cellxgene Data Portal. https://cellxgene.cziscience.com/
Mimitou, E. P. et al. Multiplexed detection of proteins, transcriptomes, clonotypes and CRISPR perturbations in single cells. Nat. Methods 16, 409–412 (2019).
Chen, H. et al. Assessment of computational methods for the analysis of single-cell ATAC-seq data. Genome Biol. 20, 241 (2019).
Granja, J. M. et al. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat. Genet. 53, 403–411 (2021).
Pierce, S. E., Granja, J. M. & Greenleaf, W. J. High-throughput single-cell chromatin accessibility CRISPR screens enable unbiased identification of regulatory networks in cancer. Nat. Commun. 12, 2969 (2021).
Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).
Cusanovich, D. A. et al. Multiplex single-cell profiling of chromatin accessibility by combinatorial cellular indexing. Science 348, 910–914 (2015).
Schep, A. N., Wu, B., Buenrostro, J. D. & Greenleaf, W. J. chromVAR: inferring transcription-factor-associated accessibility from single-cell epigenomic data. Nat. Methods 14, 975–978 (2017).
Luecken, M. D. & Theis, F. J. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol. Syst. Biol. 15, e8746 (2019).
Haque, A., Engel, J., Teichmann, S. A. & Lönnberg, T. A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications. Genome Med. 9, 75 (2017).
Székely, G. J. & Rizzo, M. L. Energy statistics: a class of statistics based on distances. J. Stat. Plan. Inference 143, 1249–1272 (2013).
Norman, T. M. et al. Exploring genetic interaction manifolds constructed from rich single-cell phenotypes. Science 365, 786–793 (2019).
Schroder, K., Hertzog, P. J., Ravasi, T. & Hume, D. A. Interferon-γ: an overview of signals, mechanisms and functions. J. Leukoc. Biol. 75, 163–189 (2004).
Jung, S. & Marron, J. S. PCA consistency in high dimension, low sample size context. Ann. Stat. 37, 4104–4130 (2009).
Srivatsan, S. R. et al. Massively multiplex chemical transcriptomics at single-cell resolution. Science 367, 45–51 (2020).
Yao, D. et al. Scalable genetic screening for regulatory circuits using compressed Perturb-seq. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01964-9 (2023).
Gasperini, M. et al. A genome-wide framework for mapping gene regulation via cellular genetic screens. Cell 176, 377–390 (2019).
Barrett, T. et al. NCBI GEO: archive for functional genomics data sets – update. Nucleic Acids Res. 41, D991–D995 (2013).
Gatto, L. et al. Initial recommendations for performing, benchmarking and reporting single-cell proteomics experiments. Nat. Methods 20, 375–386 (2023).
Tian, L., Chen, F. & Macosko, E. Z. The expanding vistas of spatial transcriptomics. Nat. Biotechnol. 41, 773–782 (2023).
Bredikhin, D., Kats, I. & Stegle, O. MUON: multimodal omics analysis framework. Genome Biol. 23, 42 (2022).
Liscovitch-Brauer, N. et al. Profiling the genetic determinants of chromatin accessibility with scalable single-cell CRISPR screens. Nat. Biotechnol. 39, 1270–1277 (2021).
Vierstra, J. et al. Global reference mapping of human transcription factor footprints. Nature 583, 729–736 (2020).
Satpathy, A. T. et al. Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat. Biotechnol. 37, 925–936 (2019).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
Bairoch, A. The Cellosaurus, a cell-line knowledge resource. J. Biomol. Tech. 29, 25–38 (2018).
Mölder, F. et al. Sustainable data analysis with Snakemake. F1000Res. 10, 33 (2021).
Rizzo, M. L. & Székely, G. J. Energy distance. WIREs Comput. Stat. 8, 27–38 (2016).
Dhapola, P. et al. Scarf enables a highly memory-efficient analysis of large-scale single-cell genomics data. Nat. Commun. 13, 4616 (2022).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Aissa, A. F. et al. Single-cell transcriptional changes associated with drug tolerance and response to combination therapies in cancer. Nat. Commun. 12, 1628 (2021).
Chang, M. T. et al. Identifying transcriptional programs underlying cancer drug response with TraCe-seq. Nat. Biotechnol. 40, 86–93 (2022).
Parkinson, H. et al. ArrayExpress: a public database of microarray experiments and gene expression profiles. Nucleic Acids Res. 35, D747–D750 (2007).
Datlinger, P. et al. Ultra-high-throughput single-cell RNA sequencing and perturbation screening with combinatorial fluidic indexing. Nat. Methods 18, 635–642 (2021).
McFarland, J. M. et al. Multiplexed single-cell transcriptional response profiling to define cancer vulnerabilities and therapeutic mechanism of action. Nat. Commun. 11, 4296 (2020).
Mimitou, E. P. et al. Scalable, multimodal profiling of chromatin accessibility, gene expression and protein levels in single cells. Nat. Biotechnol. 39, 1246–1258 (2021).
Schraivogel, D. et al. Targeted Perturb-seq enables genome-scale genetic screens in single cells. Nat. Methods 17, 629–635 (2020).
Shifrut, E. et al. Genome-wide CRISPR screens in primary human T cells reveal key regulators of immune function. Cell 175, 1958–1971 (2018).
Tian, R. et al. CRISPR interference-based platform for multimodal genetic screens in human iPSC-derived neurons. Neuron 104, 239–255 (2019).
Weinreb, C., Rodriguez-Fraticelli, A., Camargo, F. D. & Klein, A. M. Lineage tracing on transcriptional landscapes links state to fate during differentiation. Science 367, eaaw3381 (2020).
Xie, S., Duan, J., Li, B., Zhou, P. & Hon, G. C. Multiplexed engineering and analysis of combinatorial enhancer activity in single cells. Mol. Cell 66, 285–299 (2017).
Zhao, W. et al. Deconvolution of cell type-specific drug responses in human tumor tissue with single-cell RNA-seq. Genome Med. 13, 82 (2021).
Acknowledgements
The authors appreciate informative conversations with Y. Ji of the Fabian Theis laboratory, helpful code suggestions from G. Wong, and computational support from A. Kollasch of the Debora Marks laboratory. The authors also appreciate preprint review comments from Arcadia Science’s preprint review initiative (G. P. Way, N. Davidson, E. Serrano, P. Hicks, J. Tomkinson, D. Bunten). This work was supported by the National Resource for Network Biology (NRNB, P41GM103504 to C.Sa.), the Wellcome Leap ∆Tissue Program (to C.Sa., L.J.S., D.S.M.), the Deutsche Forschungsgemeinschaft (DFG, RTG2424 CompCancer to N.B.), Einstein Stiftung Berlin (Einstein Visiting Fellow program, to C.Sa., N.B.), and the Intramural Research Program of the National Library of Medicine, National Institutes of Health (to A.L.). Computation was in part performed on the HPC for Research cluster of the Berlin Institute of Health. Figures 1 and 4b were created with BioRender.com.
Author information
Contributions
The project was conceptualized by C.Sa., N.B., A.L. and B.Y. Data were curated by T.D.G., S.P., C.Sh., T.G. and S.G. Formal analysis and methodology development were carried out by S.P., T.D.G. and C.Sa. Funding acquisition was done by N.B., D.S.M., L.J.S. and C.Sa. Software development was carried out by J.M., S.P. and T.D.G. Supervision was provided by N.B., A.L., J.P.T.-K., C.Sa., D.S.M. and L.J.S. The original draft was written by T.D.G., S.P., C.Sh., T.G. and J.P.T.-K. Writing review and editing were done by L.J.S., C.Sa., N.B. and A.L.
Ethics declarations
Competing interests
J.P.T.-K. and T.G. are employees of Relation Therapeutics. C.Sa. is on the science advisory board of Cytoreason Ltd. D.S.M. serves as an advisor for Dyno Therapeutics, Octant, Jura Bio, Tectonic Therapeutic, and Genentech, and is a co-founder of Seismic Therapeutic. All other authors have no competing interests.
Peer review
Peer review information
Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.
Extended data
Extended Data Fig. 1 Number of cells per dataset by submission date.
The number of published single-cell perturbation datasets increases rapidly around 2019. We speculate that the slight decrease in dataset numbers after 2021 suggested by the plot is due to the ongoing impact of reduced research activity in the earlier phases of the COVID-19 pandemic.
Extended Data Fig. 2 Harmonization and analysis workflow.
Perturbation datasets of various modality types, each containing single-cell molecular profiles for at least two perturbations and one control condition (for example, unperturbed cells), were identified in a literature search. Data were obtained from public repositories, and metadata (such as guide identity) from the supplements of the corresponding papers. Datasets were reprocessed to standardize annotations and analyzed in parallel. All datasets are now available for download from scperturb.org, along with visualizations and summarizing information.
Extended Data Fig. 3 Pairwise E-distances for NormanWeissman2019 dataset.
E-distances between all pairs of perturbations in the NormanWeissman2019 dataset. The color scale is clipped at the 5% highest and lowest percentiles. Clusters of similar perturbations are visible, for example a cluster of strongly acting perturbations targeting CEBPA at the top.
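As an illustration of the color-scale clipping mentioned in this caption, the sketch below clips the entries of a distance matrix to a percentile range before plotting. It assumes that the caption's wording corresponds to clipping at the 5th and 95th percentiles of all matrix entries; the function name clip_to_percentiles and the nested-list matrix layout are hypothetical and not taken from the authors' code.

```python
import statistics


def clip_to_percentiles(matrix, lower_pct=5, upper_pct=95):
    """Clip every entry of a nested-list matrix to a percentile range.

    Returns the clipped matrix and the (low, high) bounds that were used.
    Assumption: percentiles are taken over all entries of the matrix.
    """
    values = [v for row in matrix for v in row]
    # quantiles(n=100) returns the 99 cut points for the 1st..99th percentiles.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    clipped = [[min(max(v, low), high) for v in row] for row in matrix]
    return clipped, (low, high)
```

Clipping only changes how extreme values are displayed, so a few very large distances do not compress the color range for the rest of the heatmap; the underlying distances are left untouched.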
Supplementary information
Supplementary Information
Supplementary Fig. 1, Supplemental Note including figures and sections 1–9.
Supplementary Tables 1–5
Supplementary Table 1: Dataset metadata and description of source data papers. Supplementary Table 2: Description of scperturb-formatted gene and cell metadata. Supplementary Table 3: E-statistic results for filtered perturbations across datasets in database. Supplementary Table 4: Drug perturbations appearing in multiple datasets. Supplementary Table 5: Gene perturbations appearing in multiple datasets.
About this article
Cite this article
Peidli, S., Green, T.D., Shen, C. et al. scPerturb: harmonized single-cell perturbation data. Nat Methods 21, 531–540 (2024). https://doi.org/10.1038/s41592-023-02144-y