To the Editor — Methods for analyzing single-cell data1,2,3,4 perform a core set of computational tasks. These tasks include dimensionality reduction, cell clustering, cell-state annotation, removal of unwanted variation, analysis of differential expression, identification of spatial patterns of gene expression, and joint analysis of multi-modal omics data. Many of these methods rely on likelihood-based models to represent variation in the data; we refer to these as ‘probabilistic models’. Probabilistic models provide principled ways to capture uncertainty in biological systems and are convenient for decomposing the many sources of variation that give rise to omics data5.
Despite the appeal of probabilistic models, several obstacles impede their community-wide adoption. The first obstacle, from the perspective of the end user, is the difficulty of implementing and running such models. Because probabilistic models are often implemented using Python machine-learning libraries, users frequently have to interact with interfaces and objects that are lower level than those used in popular environments for single-cell data analysis such as Bioconductor6, Seurat7 or Scanpy8.
A second obstacle relates to the development of new probabilistic models. From the perspective of developers, there are many necessary routines to implement in support of a probabilistic model, including data handling, tensor computations, training routines that handle device management (for example, GPU (graphics processing unit) computing), and the underlying optimization, sampling and numerical procedures. Although higher-level machine-learning packages that automate some of these routines (for example, PyTorch Lightning9 or Keras10) are becoming popular, they do not work seamlessly with single-cell omics data.
To address these limitations, we present scvi-tools (https://scvi-tools.org/), a Python library for deep probabilistic analysis of single-cell omics data. From the end user’s perspective (Supplementary Note 1), scvi-tools offers standardized access to methods for many single-cell data analysis tasks, such as integration of single-cell RNA sequencing (scRNA-seq) data (scVI11 or scArches12), annotation of single-cell profiles (CellAssign13 or scANVI14), deconvolution of bulk spatial transcriptomics profiles (Stereoscope15 or DestVI16), doublet detection (Solo17) and multi-modal analysis of CITE-seq (cellular indexing of transcriptomes and epitopes by sequencing) data (totalVI18).
In the broader analysis pipeline, scvi-tools sits downstream of initial quality control (QC)-driven preprocessing and generates outputs that may be further interpreted via general single-cell analysis tools (Fig. 1a). At its core, scvi-tools implements several key functionalities that are accessible across data modalities, such as differential analysis and dataset integration. All 14 models (Supplementary Table 1) currently implemented in scvi-tools interact with Scanpy through the annotated dataset (AnnData19) format, and the models share a consistent user interface (Fig. 1b). The scvi-tools library also has an interface with R such that each model may be used in Seurat or Bioconductor pipelines.
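The consistent user interface described above can be illustrated with a minimal sketch. It assumes the class-based `setup_anndata` entry point of recent scvi-tools releases and an AnnData object whose `.X` holds raw counts; the helper name `fit_scvi` and its arguments are hypothetical and not part of the library.

```python
from typing import Optional


def fit_scvi(adata, batch_key: Optional[str] = None, max_epochs: Optional[int] = None):
    """Fit scVI on raw counts stored in `adata.X`; return (model, latent).

    `fit_scvi` is an illustrative wrapper, not a function provided by scvi-tools.
    """
    # Imported lazily so that defining this helper does not require scvi-tools.
    import scvi

    # Register the AnnData object so the model knows where the counts
    # and (optionally) the batch labels are stored.
    scvi.model.SCVI.setup_anndata(adata, batch_key=batch_key)

    # Construct and train the model; max_epochs=None lets scvi-tools
    # choose the number of epochs.
    model = scvi.model.SCVI(adata)
    model.train(max_epochs=max_epochs)

    # One low-dimensional embedding per cell, returned as a NumPy array.
    latent = model.get_latent_representation()
    return model, latent
```

The returned embedding can be stored in `adata.obsm` (for example, under a key such as `"X_scVI"`) and passed to Scanpy via `sc.pp.neighbors(adata, use_rep="X_scVI")` for clustering and visualization. Other models in the library are expected to follow the same setup, construct and train pattern, with model-specific arguments at the setup step.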
We also illustrate two new features of scvi-tools applicable to several types of omics data. The first feature offers the ability to remove unwanted variation due to multiple nuisance factors simultaneously, including both discrete (for example, batch category) and continuous (for example, percent mitochondrial reads) factors. In Supplementary Note 2, we apply this in the context of an scRNA-seq dataset of Drosophila wing development that suffered from nuisance variation due to cell cycle, sex and replicate. The second feature extends several scvi-tools integration methods to iteratively integrate new ‘query’ data into a pretrained ‘reference’ model via the recently proposed scArches neural network architecture surgery12. This feature is particularly useful for incorporating new samples into an analysis without having to reprocess the entire set of samples. Supplementary Note 3 presents a case study of applying this approach with totalVI by projecting data from patients with COVID-19 into an atlas of immune cells.
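Both features can be sketched with scVI as follows. This is a rough illustration under stated assumptions rather than the analyses of Supplementary Notes 2 and 3 (the latter uses totalVI): the helper names and any column names passed to them are hypothetical, and the architecture settings for the reference model follow the scvi-tools reference-mapping tutorials as we understand them.

```python
from typing import Optional, Sequence


def fit_with_nuisance_covariates(
    adata,
    categorical_covariates: Optional[Sequence[str]] = None,
    continuous_covariates: Optional[Sequence[str]] = None,
):
    """Train scVI while conditioning on several nuisance factors at once."""
    import scvi  # lazy import: defining the helper does not require scvi-tools

    # Covariate names refer to columns of adata.obs (hypothetical examples:
    # "replicate" and "sex" as categorical, "pct_counts_mt" as continuous).
    scvi.model.SCVI.setup_anndata(
        adata,
        categorical_covariate_keys=list(categorical_covariates) if categorical_covariates else None,
        continuous_covariate_keys=list(continuous_covariates) if continuous_covariates else None,
    )
    model = scvi.model.SCVI(adata)
    model.train()
    return model


def fit_reference(adata, batch_key: str):
    """Train a reference model with settings intended for later query mapping."""
    import scvi

    scvi.model.SCVI.setup_anndata(adata, batch_key=batch_key)
    # Assumed settings, taken from the library's scArches tutorials.
    model = scvi.model.SCVI(
        adata,
        use_layer_norm="both",
        use_batch_norm="none",
        encode_covariates=True,
    )
    model.train()
    return model


def map_query(query_adata, reference_model, max_epochs: int = 200):
    """Integrate new query cells into a pretrained reference model."""
    import scvi

    # Align the query genes with those the reference was trained on (in place).
    scvi.model.SCVI.prepare_query_anndata(query_adata, reference_model)
    # Build a query model that reuses the reference weights.
    query_model = scvi.model.SCVI.load_query_data(query_adata, reference_model)
    # Short fine-tuning run; weight decay is switched off so that the
    # transferred weights are not shrunk during the update.
    query_model.train(max_epochs=max_epochs, plan_kwargs={"weight_decay": 0.0})
    return query_model, query_model.get_latent_representation()
```

The covariate example and the reference-mapping example are kept separate on purpose: whether reference mapping supports models registered with extra covariates may depend on the installed scvi-tools version, so the reference here is conditioned on the batch key only.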
From the perspective of a methods developer, scvi-tools offers a set of building blocks that make it easy to implement new models and modify existing models with minimal code overhead (Fig. 2a,b and Supplementary Note 4). These building blocks use popular libraries, such as AnnData19, PyTorch20, PyTorch Lightning9 and Pyro21, and facilitate probabilistic model design with neural network components and GPU acceleration. This allows methods developers to focus primarily on developing probabilistic models instead of on data management, model training and user-interface code. We show how these building blocks can be used for efficient model development through a reimplementation of Stereoscope, which achieves a substantial reduction in code complexity (Fig. 2c–e and Supplementary Note 5). This example illustrates the broad scope of analyses that may be powered by scvi-tools.
On the scvi-tools documentation website, we feature the application programming interface (API) reference of each model, as well as tutorials describing the functionality of each model and its interaction with other single-cell tools. We also make these tutorials available via Google Colab, which provides a free computing environment with GPU access and can support even large-scale analyses.
In the development of scvi-tools, we aimed to bridge the gap that exists between the single-cell software ecosystem and the contemporary machine-learning frameworks for constructing and deploying this class of models. Thus, developers can now expect to build models that are immediately accessible to end users in the single-cell community while continuing to rely on popular machine-learning libraries. On our documentation website, we provide a series of tutorials on building a model with scvi-tools, walking through the steps of data management, module construction and model development. We also built a template repository on GitHub that enables developers to quickly create a Python package that uses unit testing, automated documentation and popular code styling libraries. This repository demonstrates how the scvi-tools building blocks can be used for external model deployment. We anticipate that most models built with scvi-tools will be deployed in this way as independent packages while adhering to standard API and coding conventions, which will make them more readily accessible for new users.
As scvi-tools remains under active development, end users can expect that scvi-tools will continually evolve, adding support for new models, new workflows and new features. We anticipate that these resources will serve the single-cell community by facilitating the prototyping of new models, creating a standard for the deployment of probabilistic analysis software and enhancing the scientific discovery pipeline.
1. Svensson, V., da Veiga Beltrame, E. & Pachter, L. Database 2020, baaa073 (2020).
2. Lee, J., Hyeon, D. Y. & Hwang, D. Exp. Mol. Med. 52, 1428–1442 (2020).
3. Wagner, A., Regev, A. & Yosef, N. Nat. Biotechnol. 34, 1145–1160 (2016).
4. Zappia, L., Phipson, B. & Oshlack, A. PLOS Comput. Biol. 14 (2018).
5. Lopez, R., Gayoso, A. & Yosef, N. Mol. Syst. Biol. 16, e9198 (2020).
6. Gentleman, R. C. et al. Genome Biol. 5, R80 (2004).
7. Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Nat. Biotechnol. 33, 495–502 (2015).
8. Wolf, F. A., Angerer, P. & Theis, F. J. Genome Biol. 19, 15 (2018).
9. Falcon, W. & The PyTorch Lightning team. PyTorch Lightning (Version 1.4). https://doi.org/10.5281/zenodo.3828935 (2019).
10. Chollet, F. et al. Keras. https://keras.io (2015).
11. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Nat. Methods 15, 1053–1058 (2018).
12. Lotfollahi, M. et al. Nat. Biotechnol. 40, 121–130 (2022).
13. Zhang, A. W. et al. Nat. Methods 16, 1007–1015 (2019).
14. Xu, C. et al. Mol. Syst. Biol. 17, e9620 (2021).
15. Andersson, A. et al. Commun. Biol. 3, 565 (2020).
16. Lopez, R. et al. Preprint at bioRxiv https://doi.org/10.1101/2021.05.10.443517 (2021).
17. Bernstein, N. J. et al. Cell Syst. 11, 95–101.e5 (2020).
18. Gayoso, A. et al. Nat. Methods 18, 272–282 (2021).
19. Angerer, P., Wolf, A., Virshup, I. & Rybakov, S. AnnData. GitHub https://github.com/theislab/anndata (2019).
20. Paszke, A. et al. Adv. Neural Inf. Process. Syst. 32, 8026–8037 (2019).
21. Bingham, E. et al. J. Mach. Learn. Res. 20, 1–6 (2019).
We acknowledge members of the Streets and Yosef laboratories for general feedback. We thank all the GitHub users who contributed code to scvi-tools over the years. We thank Nicholas Everetts for help with the analysis of the Drosophila data. We thank David Kelley and Nick Bernstein for help implementing Solo. We thank Marco Wagenstetter and Sergei Rybakov for help with the transition of the scGen package to use scvi-tools, as well as feedback on the scArches implementation. We thank Hector Roux de Bézieux for insightful discussions about the R ecosystem. We thank Kieran Campbell and Allen Zhang for clarifying aspects of the original CellAssign implementation. We thank the Pyro team, including Eli Bingham, Martin Jankowiak and Fritz Obermeyer, for help integrating Pyro in scvi-tools. Research reported in this manuscript was supported by the NIGMS of the National Institutes of Health under award number R35GM124916 and by the Chan-Zuckerberg Foundation Network under grant number 2019-02452. O.C. is supported by the EPSRC Centre for Doctoral Training in Modern Statistics and Statistical Machine Learning (EP/S023151/1, studentship 2420649). A.G. is supported by NIH Training Grant 5T32HG000047-19. A.S. and N.Y. are Chan Zuckerberg Biohub investigators.
V.S. is a full-time employee of Serqet Therapeutics and has ownership interest in Serqet Therapeutics. F.J.T. reports consulting fees from Roche Diagnostics GmbH and Cellarity Inc., and ownership interest in Cellarity, Inc. N.Y. is an advisor to and/or has equity in Cellarity, Celsius Therapeutics and Rheos Medicines. The remaining authors declare no competing interests.
Peer review information
Nature Biotechnology thanks Martin Hemberg and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
About this article
Cite this article
Gayoso, A., Lopez, R., Xing, G. et al. A Python library for probabilistic analysis of single-cell omics data. Nat Biotechnol 40, 163–166 (2022). https://doi.org/10.1038/s41587-021-01206-w