Abstract
The increasing generation of population-level single-cell atlases has the potential to link sample metadata with cellular data. Constructing such references requires integration of heterogeneous cohorts with varying metadata. Here we present single-cell population level integration (scPoli), an open-world learner that incorporates generative models to learn sample and cell representations for data integration, label transfer and reference mapping. We applied scPoli on population-level atlases of lung and peripheral blood mononuclear cells, the latter consisting of 7.8 million cells across 2,375 samples. We demonstrate that scPoli can explain sample-level biological and technical variations using sample embeddings revealing genes associated with batch effects and biological effects. scPoli is further applicable to single-cell sequencing assay for transposase-accessible chromatin and cross-species datasets, offering insights into chromatin accessibility and comparative genomics. We envision scPoli becoming an important tool for population-level single-cell data integration facilitating atlas use but also interpretation by means of multiscale analyses.
Main
The advancements in single-cell technologies have enabled the generation of datasets comprising information from millions of cells. These datasets, also called ‘atlases’, include data from different conditions and individuals and offer precious insight into cellular processes and states in different scenarios. Consortia such as the Human Cell Atlas^{1} and the Human BioMolecular Atlas Program^{2} aim to generate organ- and body-level atlases that allow one to study human organs from development to aging in healthy and disease samples. A possibility opened by these atlases is that of meta-analyses relating cell types and states with biological conditions or demographics metadata^{3,4}.
Performing meta-analysis on an atlas requires learning a joint representation of all datasets correcting batch effects between them^{5,6,7}. Tremendous efforts have been made to solve the data integration problem for single-cell RNA sequencing datasets using approaches ranging from statistical^{8,9,10,11} and graph-based^{12,13,14} methods to deep learning models^{5,15,16,17}. Nonetheless, single-cell data integration remains challenging^{18}, especially in the case of many datasets with a variety of technical and biological properties.
Many analyses can be accelerated by mapping data on top of an atlas. Algorithms for efficient use of reference atlases are known as reference mapping methods^{19,20,21}, which build upon data integration algorithms to update an existing atlas by integrating a query dataset. Transferring information from reference to query enables efficient annotation of the query cells^{4,20,22,23}.
Existing deep learning integration methods^{6} rely on one-hot-encoded (OHE) vectors to represent conditions^{15,24}. This encoding does not allow a downstream interpretation of the effect of each sample on the mapping. Additionally, in the presence of many unique condition categories, the number of conditional inputs can become close or equal to the number of gene expression measurements, leading the model to produce inaccurate data representations^{25}. Among current reference-building methods^{10,12,24,26}, only scANVI and Seurat v3 offer cell type classification coupled with a reference mapping algorithm^{19,21}. Yet, while they can integrate annotated data to extend the reference, this requires retraining, which is time-consuming and sometimes impossible due to data sharing restrictions.
In this Article, we introduce single-cell population level integration (scPoli), a semi-supervised conditional generative model and open-world learner^{5,27} combined with advances in meta-learning^{28} that is able to learn representations for both cells and samples (or other batch covariates). scPoli offers a cell-level and a sample-level view of the dataset enabling multiscale analysis: the simultaneous exploration of sample and cell representations.
scPoli uses prototype-based cell label transfer and is augmented with an uncertainty estimation mechanism. We demonstrate that scPoli is competitive in data integration and cell annotation with other methods across six datasets. We further showcase the features of our model by integrating a lung atlas and performing reference mapping for two queries. We show potential use cases of condition embeddings such as sample classification and data integration workflow guidance. Finally, we build a reference of 7.8 million peripheral blood mononuclear cells (PBMCs) from 2,375 samples and explore the sample-level representation obtained with scPoli.
Results
scPoli learns joint cell and sample representations
The variation of gene expression (x_{i}) in a dataset can be ascribed to batch effects and biological signals. Similarly to other conditional models^{15,29}, scPoli aims to regress out batch effects in a nonlinear fashion by means of a conditional variable (s_{i}) representing batch while retaining biological information. Moreover, scPoli posits that cell identities (k_{i}) can be represented with learnable cell type prototypes^{28} modeled using latent cell representations (z_{i}) (Fig. 1a). scPoli, therefore, introduces two modifications to the conditional variational autoencoder (CVAE) architecture widely used for data integration^{5,15,24} and perturbation modeling^{16,29} in single-cell genomics. These modifications are (1) the replacement of OHE vectors with continuous vectors of fixed dimensionality to represent the conditional term, and (2) the use of cell type prototypes to enable label transfer.
CVAE-based methods encode each condition by means of OHE representations \({{\bf{c}}}_{1:N}^{{{\mathrm{OHE}}}}\in {{\mathbb{R}}}^{N}\), where N is the number of conditions. These are concatenated to the input, and an additional neuron for each condition is added to the first layer of the encoder neural network. In the case of thousands of conditions to be integrated, this approach leads to an increase in the number of total trainable parameters that can slow down training. scPoli uses learnable condition embeddings \({{\bf{s}}}_{1:N}^{{{\mathrm{emb}}}}\in {{\mathbb{R}}}^{E}\) of fixed dimensionality E. These are concatenated to the input and learned at training time (Fig. 1a). As a result, scPoli is more scalable than a CVAE in scenarios where many conditions are to be integrated (see ‘Scalability analysis’ in Methods). Furthermore, these condition embeddings capture meaningful representations of each condition and can be analyzed, providing insight into large-scale studies. scPoli can also be used to integrate multiple conditions simultaneously. This is achieved by modeling condition covariates using independent embeddings. scPoli can perform reference mapping by freezing the weights of the model trained on the reference and learning a new set of M embeddings to accommodate the query data conditions (Fig. 1b).
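The scaling argument above can be made concrete with a small parameter count. The following is a minimal sketch under stated assumptions, not the scPoli implementation: the function name `first_layer_params`, the 2,000 input genes, the hidden width of 512 and the embedding size E = 5 are all illustrative choices; only the 2,375 conditions come from the PBMC atlas described in this work.

```python
def first_layer_params(n_genes, n_conditions, hidden, emb_dim=None):
    """Count weights tied to the conditional input of the first encoder layer.

    Hypothetical helper for illustration; biases are ignored.
    """
    if emb_dim is None:
        # One-hot encoding: one extra input neuron per condition.
        return (n_genes + n_conditions) * hidden
    # Learnable embeddings: E extra input neurons plus an N x E lookup table.
    return (n_genes + emb_dim) * hidden + n_conditions * emb_dim


# Assumed sizes: 2,000 genes, 512 hidden units, 2,375 conditions, E = 5.
ohe = first_layer_params(2000, 2375, 512)
emb = first_layer_params(2000, 2375, 512, emb_dim=5)
print(f"OHE: {ohe:,} weights; embeddings: {emb:,} weights")
```

With these assumed sizes the one-hot variant needs roughly twice as many first-layer weights, and the gap widens linearly with the number of conditions, whereas the embedding variant only grows by E parameters per added condition.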
The second addition to CVAE models is the incorporation of prototypes used in meta-learning^{30}. These allow efficient learning across tasks and datasets and have been used for cell type classification^{26}. scPoli models prototypes using the average latent representation of cells belonging to the same cell type and leverages them to transfer annotations and improve data integration by means of an additional term in the learning objective called prototype loss. This term encourages the model to reduce the distance between the latent representation of a cell and its prototype (Methods). We show this leads to better preservation of biological signals. Unlabeled cells are classified by comparing distances to the prototypes, and the label of the closest prototype is assigned as a predicted cell type label. We also exploit the distance of each cell to its closest prototype as a proxy for uncertainty. Finally, prototypes enable extending an initial reference atlas with novel cell types from a labeled query without retraining the reference model as opposed to existing methods^{24}.
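The prototype mechanism described above can be sketched in a few lines. This is an illustrative NumPy example operating on a fixed latent space, assuming Euclidean distance; the function names `fit_prototypes` and `predict_with_uncertainty` and the toy coordinates are hypothetical, and the sketch omits the prototype loss used during training.

```python
import numpy as np


def fit_prototypes(z, labels):
    """Return class names and one prototype (mean latent vector) per class."""
    labels = np.asarray(labels)
    classes = sorted(set(labels.tolist()))
    prototypes = np.stack([z[labels == c].mean(axis=0) for c in classes])
    return classes, prototypes


def predict_with_uncertainty(z_query, classes, prototypes):
    """Assign each query cell the label of its closest prototype.

    The distance to that prototype is returned as an uncertainty proxy.
    """
    # Pairwise Euclidean distances, shape (n_query, n_prototypes).
    dists = np.linalg.norm(z_query[:, None, :] - prototypes[None, :, :], axis=2)
    closest = dists.argmin(axis=1)
    return [classes[i] for i in closest], dists.min(axis=1)


# Toy latent space with two labeled cell types (made-up coordinates).
z_ref = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
ref_labels = ["alpha", "alpha", "delta", "delta"]
classes, prototypes = fit_prototypes(z_ref, ref_labels)

z_query = np.array([[0.0, 1.0], [9.0, 11.0]])
pred, uncertainty = predict_with_uncertainty(z_query, classes, prototypes)
```

Because prediction needs only the stored prototypes, a sketch like this also shows why label transfer does not require access to the reference cells themselves.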
We illustrate a standard scPoli workflow on a collection of nine pancreas studies (see ‘Benchmark datasets’ in Methods) in Fig. 1c–j. We build an integrated reference on seven datasets. We use two datasets (celseq and celseq2) as an unlabeled query and map them onto the reference data. We can observe that query cells are mapped onto the reference data (mean integration score of 0.86) and that most cells are classified correctly with an accuracy of 80%. A cluster of cells (beta cells) that was removed from the reference dataset to mimic an unknown cell type scenario is correctly identified. After a principal component analysis (PCA) we can observe that the condition embeddings learned by scPoli capture similarities between the integrated samples.
scPoli accurately integrates datasets and transfers annotations
To understand how well scPoli integrates single-cell datasets, and how accurately it transfers cell type annotations, we benchmarked our model against other methods for data integration and label transfer. We included in this comparison both deep learning models (scVI^{15}, scANVI^{24} and MARS^{26}) and other types of methods (Seurat v3^{12}, Symphony^{20} and a linear support vector machine (SVM)). Out of these models, only our method, scANVI and Seurat v3 tackle both data integration and label transfer, while the others perform exclusively integration (scVI and Symphony) or classification (MARS and SVM). All of these models, except for MARS and Symphony, were part of the Luecken et al.^{6} data integration benchmark, where they came out as top performers.
We tested these methods on six datasets, spanning a variety of scenarios (see ‘Benchmark datasets’ in Methods) (Supplementary Fig. 1). For each dataset a set of studies to use as reference was picked, while the rest was used as query. To quantify the performance of data integration we used metrics to assess biological conservation and batch correction proposed in ref. ^{6} (Methods).
We found that scPoli outperformed the next best-performing model (scANVI) by 5.06% in data integration (Fig. 2a). When we looked separately at batch correction and biological conservation metrics, we observed that scPoli preserved biologically meaningful signals better than other methods. To understand whether the improvements stemmed from the use of condition embeddings or from the inclusion of the prototype loss, we benchmarked two variants of our model. We included a scPoli model with standard OHE vectors to represent batch, and a scPoli model trained without prototype loss. We found the prototype loss to be the driver of the improvement in biological conservation (Fig. 2b).
To assess the quality of the classification achieved during reference mapping we used two metrics: the weighted-averaged and the macro-averaged F1 score, with the latter being more sensitive to underrepresented cell types. We observed that scPoli outperformed all methods except for the linear SVM on the weighted F1 metric (Fig. 2c). Out of the models that are capable of both data integration and cell type classification, scPoli came out on top (Fig. 2d). When looking at the macro-averaged F1 score, scPoli showed comparable performance to Seurat v3 and a sizeable improvement over scANVI, indicating better performance on underrepresented cell types (Fig. 2e,f). The SVM was the best-performing method according to this metric, a result corroborated by previous work in the field^{31}.
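To illustrate why the macro-averaged F1 score is more sensitive to underrepresented cell types than the weighted average, the following self-contained sketch computes both on a toy example. The helper `f1_scores` and the labels are hypothetical and written for illustration only; they are not the benchmarking code used in this work.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro F1, support-weighted F1) over the classes in y_true."""
    support = Counter(y_true)
    per_class = {}
    for c in support:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom else 0.0
    macro = sum(per_class.values()) / len(per_class)
    weighted = sum(per_class[c] * n for c, n in support.items()) / len(y_true)
    return macro, weighted


# Toy example: a common cell type ("a") and a rare one ("b").
# One of the two rare cells is misclassified as the common type.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]
macro_f1, weighted_f1 = f1_scores(y_true, y_pred)
```

Here the single error on the rare class lowers the macro average (about 0.80) far more than the weighted average (about 0.89), because every class contributes equally to the former regardless of its size.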
Furthermore, scPoli’s integration performance and label transfer accuracy were stable across runs and different dataset sizes (Methods and Supplementary Fig. 2).
scPoli enables interpretable atlas-level integration
We showcase the data integration capability and quality of label transfer yielded by scPoli on the Human Lung Cell Atlas (HLCA)^{4}, a curated collection of 46 datasets of the human lung, with samples from 444 individuals. The atlas is divided into a core collection of data, which comprises data from 166 samples and 11 datasets, and an extended one that includes the remaining data. Following the work in the original study, we used the HLCA core data for reference building. We integrated data at the sample level to obtain a better resolution of the condition embeddings and allow interpretation using sample metadata. For prototype training we used the finest annotations, resulting in 58 prototypes. We observed that scPoli successfully integrated data from different studies (Fig. 3a) while maintaining structure among the known cell identities (Fig. 3b). We further assessed the quality of data integration and compared it against scANVI. To keep the comparison consistent, we also trained scANVI at the sample level. scPoli yielded an integration that preserved biologically meaningful variation better than scANVI, while achieving similar performance in batch correction (Fig. 3c).
When looking at the first two principal components (PCs) of the sample embeddings we found that samples from the same studies grouped together (Fig. 3d and Supplementary Fig. 3a). We additionally found metadata that covaried with the sample representation. These included information regarding properties of the sample such as subject type (for example, donor or alive) (Fig. 3d) and anatomical location (Supplementary Fig. 3b). Other covariates (for example, sex or ethnicity) appeared to be mixed in this space (Supplementary Fig. 3c,d), indicating that the main drivers behind batch effects are likely to be related to the nature of the tissue, its processing and other technical factors.
scPoli propagates high-resolution annotations
One of the main hurdles in atlas building comes from discrepancies in annotation terms across datasets. scPoli and other methods that leverage these annotations require label harmonization before use, which in turn requires expert knowledge. Nonetheless, scPoli can work on multiple levels of annotation (for example, from coarse to fine) and can propagate labels to underclustered datasets during reference building. scPoli is able to model multiple sets of prototypes, one for each level of annotation. This prior information is then leveraged by optimizing the prototype loss on each set of labeled prototypes. To simulate such a scenario, we integrated the HLCA, but this time using both a coarse and a fine annotation (Supplementary Fig. 4a–c). Additionally, for one dataset in the reference (Krasnow2020), we kept only the coarse annotation. We then used scPoli to propagate high-resolution labels to these cells, obtaining an overall accuracy of 84.4% (Supplementary Fig. 4d).
scPoli enables query-to-reference mapping and label transfer
After building a reference using the HLCA core dataset, we mapped a group of healthy samples (Meyer, 2021)^{32} (Fig. 3f). These data consist of six samples and contain nine cell identities not present in the reference. As a proxy for uncertainty in cell type prediction, we use the Euclidean distance from the closest prototype in the reference. Similarly to the original HLCA study, in which a k-nearest neighbor (kNN) graph-based uncertainty was used, we noticed that cells that lay in regions of transitions between cell types displayed the highest levels of uncertainty, as well as cells whose identities were not present in the reference data (Fig. 3g). We considered all cells with an uncertainty higher than the 90% quantile as unknown (Supplementary Fig. 5a) and inspected the classification performance by cell type (Supplementary Fig. 5b). A subset of novel cells were successfully detected as unknown, especially chondrocytes, erythrocytes and myelinating and non-myelinating Schwann cells. Natural killer T cells, gamma-delta T cells and regulatory T cells were not detected as unknown and were mostly classified as either CD4 T cells or CD8 T cells, which could also be a result of overclustering in the original atlas^{33}. Overall, scPoli achieved an accuracy of 75%, outperforming the model used in the original study, which yielded an accuracy of 69%. Label transfer in scPoli happens without the need for the reference dataset. scPoli transfers labels by comparing distances to a small set of prototypes that are obtained during the reference building step and stored within the reference model. This constitutes a big advantage in cases where the reference data cannot be shared. Furthermore, we observed that scPoli is more robust at detecting unknown cells than the methodology involving a kNN graph and scANVI.
We compared the ratio of true predictions across different thresholds for unknown cell type detection for three models and scPoli consistently obtained better accuracy (Supplementary Fig. 5c).
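The quantile-based rule for flagging unknown cells can be sketched as follows. This is an illustrative NumPy example rather than the scPoli code path: the function name `flag_unknown` and the toy uncertainties are hypothetical, and the default of 0.9 mirrors the 90% quantile used above.

```python
import numpy as np


def flag_unknown(pred_labels, uncertainty, q=0.9):
    """Relabel cells whose uncertainty exceeds the q-quantile as 'unknown'."""
    threshold = np.quantile(uncertainty, q)
    return [
        "unknown" if u > threshold else label
        for label, u in zip(pred_labels, uncertainty)
    ]


# Toy example: ten predicted cells with increasing uncertainty.
labels = ["CD4 T cell"] * 10
uncertainty = np.arange(10) / 10
flagged = flag_unknown(labels, uncertainty)
```

Sweeping `q` over a range of values reproduces the kind of threshold comparison described above; since the shape of the uncertainty distribution is dataset-dependent, inspecting it before fixing a threshold is advisable.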
To see how scPoli would perform when mapping a query dataset from a different condition than the one in reference, we mapped data collected from cancer patients. These data contain two cell identities not present in the reference (cancer and erythrocytes). We observed that scPoli successfully mapped the query dataset (Supplementary Fig. 6a). Since this query has a much coarser cell type annotation, we mapped the labels obtained with scPoli to the cell types present in the query via a mapping obtained from the authors of the study. We observed that almost all cancer cells mapped to a cluster whose label prediction had high uncertainty and was classified as unknown (Supplementary Fig. 6b,c). We observed that 85% of cancer cells and 98% of erythrocytes were identified as unknown (Supplementary Fig. 6d). Also in this case, scPoli obtained better accuracy across different thresholds for unknown cell type detection compared to a kNN classifier and scANVI (Supplementary Fig. 6e).
scPoli enables multiscale classification of cells and samples
We tested unlabeled sample classification as a potential use case for condition embeddings. We integrated a coronavirus disease 2019 (COVID-19) PBMC dataset by Su et al.^{34} with large biological signals using the sample covariate as condition. The data contain 559,517 cells from 270 samples across various states of COVID-19. We selected 30 random samples and set their cell type annotations and sample annotations as unknown. We then integrated the data and propagated the labels. We first assessed the quality of integration and label transfer, which achieved an accuracy of 90% (Fig. 4a,b).
The sample-level metadata are organized into four classes: healthy, mild, moderate and severe. The sample embeddings of the reference data showed variation associated with this phenotypic covariate in the first two PCs (adjR^{2}_{PC1} 0.41, adjR^{2}_{PC2} 0.16 obtained with an ordinal regression; the adjusted R^{2} is reported to take into account the different numbers of available predictors) (Fig. 4c). We, therefore, proceeded to classify the disease state of the query samples using a kNN classifier trained on the reference sample embeddings. This classification yielded an accuracy of 73%. We compared the performance and stability of this sample classification against that obtained with a classifier trained on sample pseudobulks, and one trained on vectors of average latent expression per sample after integration with scVI. We did so by splitting the training data into labeled and unlabeled in a five-fold cross-validation setting. The classifier trained on scPoli sample embeddings achieved better accuracy and F1 scores than the other classifiers, with weighted F1 scores of 65.3 ± 2.5%, 59.3 ± 1.3% and 58.8 ± 2.1% for scPoli sample embeddings, mean gene expression and mean scVI latent expression, respectively (Fig. 4e).
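Sample classification from embeddings, as described above, amounts to a standard kNN vote in the sample embedding space. The sketch below is a minimal NumPy illustration under stated assumptions: the function name `knn_predict`, the choice of k = 3 and the two-dimensional toy embeddings are hypothetical and do not reflect the settings used in this work.

```python
from collections import Counter

import numpy as np


def knn_predict(train_emb, train_labels, query_emb, k=3):
    """Classify each query sample by majority vote among its k nearest
    reference samples (Euclidean distance in embedding space)."""
    predictions = []
    for q in query_emb:
        dists = np.linalg.norm(train_emb - q, axis=1)
        nearest = np.argsort(dists)[:k]
        votes = Counter(train_labels[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions


# Toy reference embeddings for six samples with known disease state.
train_emb = np.array(
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
)
train_labels = ["healthy"] * 3 + ["severe"] * 3
query_emb = np.array([[0.2, 0.2], [5.5, 5.5]])
predicted_state = knn_predict(train_emb, train_labels, query_emb)
```

The same routine can be applied to pseudobulk vectors or mean latent vectors per sample, which is the kind of comparison reported above.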
scPoli supports experimental design in integration workflows
To understand the relationship between technical and phenotypic factors in scPoli’s sample embedding we integrated another COVID-19 dataset consisting of 222,003 cells (Schulte-Schrepping et al.)^{35}, with 99 samples from 65 patients in two different cohorts (Fig. 5a,b) obtained from a multitude of experiments with different technical properties. When we examined the sample embedding, we observed the major sources of variation to be related to technical factors, suggesting that these play a bigger role in batch effects. Indeed, the first two PCs of the sample embedding were explained by the experiment (adjR^{2}_{PC1} 0.94, adjR^{2}_{PC2} 0.97) (Fig. 5c) and the cohort from which the samples were obtained (adjR^{2}_{PC1} 0.73) (Fig. 5d), rather than the disease state (adjR^{2}_{PC1} 0.00) (Fig. 5e). We could find association with disease state, but only in further PCs (adjR^{2}_{PC2} 0.41, Fig. 5e,f).
These analyses suggest that, while in more focused studies where technical factors are controlled and kept consistent across samples, biological signals represent the main source of sample-level heterogeneity; in bigger-scale collections of data with a variety of technical factors, these variations will dominate. This led us to speculate that, since sample embeddings identify major sources of variations in the data, they could also guide the choice of the conditions to integrate in a data integration workflow. We proceeded to integrate the data using two other models conditioned on the covariates that showed association with the first PC of the embedding space (experiment and cohort).
Additionally, we trained a model with a dummy batch covariate, which was equal for all cells. In this case, scPoli will leverage exclusively cell type annotations and the prototype loss to perform integration. We observed that the cohort- and experiment-level models displayed a similar quality of integration despite the reduction in the number of conditions to integrate (2 for the cohort covariate, 16 for experiment, compared to 99 for sample) (Supplementary Fig. 7). All models conditioned on an actual batch variable outperformed the one trained on a dummy covariate. This demonstrates an important use case of this sample-level representation. Revealing the main sources of undesired batch effects can in fact lead to faster data integration workflows and potentially improve the quality of the integrated cell representations by selecting the most appropriate batch covariate.
scPoli can model multiple batch covariates
scPoli can model multiple batch covariates using independent embeddings, which are then concatenated to the gene expression input. Doing so will yield a cell embedding and an embedding space for each batch condition. We applied this workflow on the Schulte-Schrepping dataset, where we integrated the data using both ‘experiment’ and ‘sample’ covariates (Supplementary Fig. 8a,b). This new model yielded experiment-level embeddings that varied according to cohort and disease information in the first two PCs (Supplementary Fig. 8c–e). In addition to this, scPoli produced sample embeddings where the experiment-level variation was mitigated in comparison to the one observed in the model conditioned only on samples, suggesting that scPoli disentangled experiment-level variation from sample-level variation (Supplementary Fig. 8f–h).
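The concatenation of independent covariate embeddings to the expression input can be sketched as below. This is a schematic NumPy illustration, not the scPoli model code: the function name `build_conditional_input`, the embedding sizes and the toy index vectors are assumptions, and in a trained model the embedding tables would be learnable parameters rather than fixed arrays.

```python
import numpy as np


def build_conditional_input(x, covariate_indices, embedding_tables):
    """Concatenate gene expression with one embedding per batch covariate.

    x: (n_cells, n_genes) expression matrix.
    covariate_indices: one integer array per covariate, each (n_cells,).
    embedding_tables: one (n_categories, emb_dim) table per covariate.
    """
    parts = [x]
    for idx, table in zip(covariate_indices, embedding_tables):
        parts.append(table[idx])  # look up each cell's embedding row
    return np.concatenate(parts, axis=1)


# Toy setup: 4 cells, 3 genes, 2 experiments (dim 2), 3 samples (dim 5).
x = np.ones((4, 3))
experiment_table = np.arange(4, dtype=float).reshape(2, 2)
sample_table = np.zeros((3, 5))
experiment_idx = np.array([0, 1, 1, 0])
sample_idx = np.array([0, 1, 2, 2])

encoder_input = build_conditional_input(
    x, [experiment_idx, sample_idx], [experiment_table, sample_table]
)
```

Keeping one table per covariate is what yields a separate embedding space for each batch condition, which can then be inspected independently.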
scPoli can integrate data and transfer labels across species
Integrating data collected from different species can be a challenging task for data integration models. To understand how scPoli would perform in such a scenario, we built a reference from cells collected from the primal frontal cortex of marmoset and mouse using ortholog genes^{36}. scPoli was able to integrate the data from the two species and map cells of the same cell type together (Supplementary Fig. 9a,b). We then performed reference mapping and label transfer using cells from human data as a query. scPoli was able to perform query-to-reference mapping across different species and yielded an overall label transfer accuracy of 86% (Supplementary Fig. 9c–e).
scPoli can be extended to further data modalities
In this work, we focus on the applications of scPoli on scRNA-seq data; nonetheless, with the appropriate adaptations, scPoli can be applied to other modalities. To demonstrate this, we used scPoli to integrate a set of single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq) samples by modeling the likelihood of the input data using a Poisson distribution, as proposed in ref. ^{37}. We tested this on the NeurIPS 2021 multimodal data integration dataset^{38}, from which we used the scATAC-seq modality. scPoli integrated data from different samples (Supplementary Fig. 10a) and yielded condition embeddings that captured similarities between samples generated at the same site (Supplementary Fig. 10b). We assessed the quality of this integration by comparing it against that of PeakVI^{39}, a CVAE-based method for scATAC-seq data integration, and found that scPoli achieved comparable performance (Supplementary Fig. 10c).
scPoli scales to datasets with thousands of samples
We further leveraged scPoli to build a PBMC atlas comprising roughly 7.8 million cells from 2,375 samples, 1,977 subjects and 25 datasets. We obtained the integrated cell representation (Fig. 6a,b and Supplementary Fig. 11a) and the sample embeddings, which we analyzed to examine the dominant sources of variations across samples. We observed that, while most of the variance was explained by the first PC of this space, substantial signal was still present in the further PCs, suggesting that scPoli makes full use of this space to encode information useful for batch correction (Supplementary Fig. 11b). We found the first PC to be mainly associated with the dataset of origin of the sample (Fig. 6c and Supplementary Fig. 11c,d). We used a linear model to quantify this association, which yielded an adjusted R^{2} of 0.97. We also observed a strong association with sequencing assay (adjR^{2} 0.86) (Fig. 6d and Supplementary Fig. 11g,h) and moderate association with disease phenotype (adjR^{2} 0.59) (Fig. 6e and Supplementary Fig. 11e,f). When we looked at how information such as sex and ethnicity mapped onto the sample embedding obtained with scPoli, we observed no clear patterns in the embedding space (Supplementary Fig. 11i–n). We compared the structure in scPoli sample embeddings with that obtained on vectors of average gene expression by sample, and observed that, while some of the patterns showed similarities, scPoli was more sensitive to differences between datasets and preserved more structure overall (Supplementary Fig. 12).
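The association between a PC of the sample embedding and a categorical covariate, quantified above with a linear model and an adjusted R^{2}, can be illustrated with the sketch below. It assumes an ordinary least-squares fit on a one-hot-encoded covariate with an intercept, for which the fitted values are simply the group means; the function name `adjusted_r2_categorical` and the toy values are hypothetical, and the exact model specification used in this work may differ.

```python
import numpy as np


def adjusted_r2_categorical(scores, groups):
    """Adjusted R^2 of a linear model predicting PC scores from one
    categorical covariate (fitted values equal the per-group means)."""
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    levels = np.unique(groups)

    fitted = np.empty_like(scores)
    for level in levels:
        mask = groups == level
        fitted[mask] = scores[mask].mean()

    ss_res = ((scores - fitted) ** 2).sum()
    ss_tot = ((scores - scores.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot

    n = scores.size
    p = levels.size - 1  # number of predictors besides the intercept
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Toy PC1 scores for six samples from two datasets.
pc1 = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
dataset = ["A", "A", "A", "B", "B", "B"]
adj_r2 = adjusted_r2_categorical(pc1, dataset)
```

The adjustment penalizes covariates with many levels, which matters when comparing, for example, dataset of origin against a binary covariate such as sex.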
To understand whether the sample embeddings reflected any gene expression patterns across samples, we computed the Pearson correlation between the mean expression of each gene in the various samples and the PC scores of the embeddings. We thus obtained lists of significantly correlated genes with each PC (P < 0.01) and filtered them for correlation coefficients larger than 0.3 in absolute value. We found the biggest number of correlated genes with the first PC. This number did not decrease regularly going further through the PCs, and we found PC2, PC4 and PC5 to also have a substantial number of correlated genes (Fig. 6f). When we looked at which genes were most strongly correlated with PC1, we observed a strong presence of ribosomal genes: 14 out of the 15 top negatively correlated genes. This was reflected also in a general association of PC1 with the mean ribosomal gene fraction of each sample (Supplementary Fig. 13a). On the other hand, we did not observe a clear association with mitochondrial gene fraction (Supplementary Fig. 13b). We performed biological process and pathway enrichment analyses of the genes correlated with PC1 and found terms related to the immune and stress response, cytokine signaling, neutrophil degranulation and cell activation (Fig. 6g). The top negatively correlated gene with PC1 was RPL31 (r = −0.93), which is a ribosomal gene involved in the cellular response to stress; the second top positively correlated gene, on the other hand, was RHOA (r = 0.85), a gene involved in the immune response and that we observed to be more highly expressed in disease samples (Supplementary Fig. 13c). These findings reflect the associations found with both technical and phenotypic covariates and PC1.
When we looked at the patterns of expression of these genes in the sample and cell embedding spaces, we observed that scPoli successfully mixed these signals in the integrated cell representation but also offered the unique feature of exploring them in the sample embedding space where they were preserved (Fig. 6h,i). A similar enrichment analysis with PC2- and PC4-correlated genes revealed terms related to RNA and DNA metabolism in the first case (Supplementary Fig. 13d), and response to stress and cytokine production in the second (Supplementary Fig. 13f). We did not perform this analysis with PC3, due to the low number of correlated genes. We show the expression patterns in both cell and sample embedding spaces of genes associated with PC2 (SSR2) and PC4 (TSLPY2) in Supplementary Fig. 13f,g.
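The gene-to-PC correlation screen described above can be sketched with NumPy as follows. This is an illustration under stated assumptions: the function name `correlated_genes`, the gene names and the toy values are hypothetical, and the sketch applies only the effect-size filter (|r| > 0.3), leaving out the significance test (P < 0.01), which would require an additional statistics routine.

```python
import numpy as np


def correlated_genes(mean_expr, pc_scores, gene_names, min_abs_r=0.3):
    """Pearson r between per-sample mean expression of each gene and the
    scores of one PC; keep genes with |r| above the cutoff.

    mean_expr: (n_samples, n_genes); pc_scores: (n_samples,).
    Genes with constant expression are assumed to be removed beforehand.
    """
    x = mean_expr - mean_expr.mean(axis=0)
    y = pc_scores - pc_scores.mean()
    r = (x * y[:, None]).sum(axis=0) / np.sqrt(
        (x ** 2).sum(axis=0) * (y ** 2).sum()
    )
    return {g: float(v) for g, v in zip(gene_names, r) if abs(v) > min_abs_r}


# Toy data: four samples, three hypothetical genes.
pc_scores = np.array([0.0, 1.0, 2.0, 3.0])
mean_expr = np.array(
    [
        [0.0, 3.0, 1.0],
        [2.0, 2.0, -1.0],
        [4.0, 1.0, -1.0],
        [6.0, 0.0, 1.0],
    ]
)
hits = correlated_genes(mean_expr, pc_scores, ["gene_up", "gene_down", "gene_flat"])
```

In this toy example the first gene rises with the PC, the second falls and the third is uncorrelated, so only the first two pass the filter; the resulting lists are what would feed an enrichment analysis.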
We believe that the multiscale representation obtained by scPoli could represent a useful tool for researchers to understand which genes drive batch effects the most or are affected by technical factors in the data generation process.
Discussion
We have presented scPoli, a generative model for data integration, label transfer and reference mapping. scPoli learns representations of the input data at different scales by learning cell and sample embeddings. This enables multiscale analyses whereby the user can explore sample information in a dedicated latent space, while still having access to an integrated single-cell object. By freezing the weights of the model and learning new embeddings, scPoli is able to quickly map newly generated data onto a previously built reference.
We have shown that scPoli is competitive with state-of-the-art methods for data integration and label transfer. Thanks to the use of cell type prototypes, scPoli consistently preserved biological information better than other methods. Moreover, scPoli performs label transfer in a privacy-aware fashion without the need for the reference data. We illustrated the features of our model by integrating the HLCA. scPoli outperformed the model used in the original study in data integration and yielded a sample latent representation that reflected similarities between different samples. We also showcased our model’s reference mapping capabilities by mapping an unlabeled query dataset of healthy samples and one of cancer samples. Furthermore, we demonstrated the viability of scPoli as a scATAC-seq data integration method and its ability to perform integration and reference mapping across species.
To understand better the information captured by the sample embeddings and potential use cases we investigated three further datasets. Our findings suggest that in smaller-scale studies scPoli reveals phenotypical sources of variation and can enable multiscale classification of both cell types and samples. Nonetheless, as the complexity and number of samples increase, the sample embeddings obtained with our model are more likely to reflect variations of technical nature. In these cases, scPoli’s sample embeddings can help identify the main sources of technical variation driving the batch effects. This can be used to guide data integration workflows by identifying the most appropriate covariates to use as batch condition and to discover gene expression patterns across samples associated with batch effects and technical and phenotypic factors. Furthermore, scPoli can integrate and model multiple batch covariates, which will yield multiple batch embedding spaces and can improve interpretability by disentangling dataset-level variation from sample-level variation.
scPoli, like other methods for data integration that leverage CVAE, provides the user with a lower-dimensional single-cell integrated object and not a corrected count matrix. Moreover, the quality of the integration will be a function of the number of samples that can be used in the reference-building step. As the model integrates samples with different technical or phenotypical characteristics, it does a better job at regressing out the batch effects.
A limitation of models that make use of cell type information is the need for high-quality and harmonized annotation across datasets. scPoli is also susceptible to this and requires cell type label harmonization before reference building.
We found that the use of prototypes improves the preservation of biological information. We use distances between these prototypes and a latent cell embedding to transfer labels and to yield an uncertainty associated with each transferred label. While latent representations obtained with VAEs are learned on smoother manifolds than those of vanilla autoencoders, this linear approximation remains suboptimal. Nonetheless, this approach has been used in foundational work in generative modeling research^{40,41,42}. This limitation becomes relevant for our uncertainty, whose distribution can vary substantially in different scenarios. Therefore, we recommend that users visualize these distributions and manually choose the best threshold for detecting unknown cells.
We recommend care when interpreting the sample-level representations obtained with scPoli. The main sources of variation between samples will change across datasets. Different covariates are likely to explain these variations in different datasets. This will determine which are the most sensible use cases for sample embeddings.
We believe scPoli will be useful as a tool for data integration and reference mapping given its improvements in the conservation of biological signals. Furthermore, we expect scPoli’s sample-level embeddings to provide researchers with another point of view over large-scale datasets, and to pave the way to multiscale analyses that investigate and link patterns at different scales. Single-cell atlasing is entering the stage of population-level studies, which calls for models that operate across this level of variation.
Methods
scPoli
scPoli is a semi-supervised generative deep learning method comprising two components: an unsupervised backbone based on CVAEs^{27} and a cell type supervised component leveraging prototype networks^{28}. In the following, we first outline the data generation process and describe the different inputs for the model. We then discuss the main components of scPoli and the training procedure.
Notation
We denote a collection of single-cell data from different samples \(\{{X}_{1},\ldots ,{X}_{n}\}\) with sample labels \(\{{c}_{1},\ldots ,{c}_{n}\}\). Within each sample we have \(X=\{{{\bf{x}}}_{1},\ldots ,{{\bf{x}}}_{j}\}\) cells with cell type annotations \(\{{a}_{1},\ldots ,{a}_{j}\}\) and sample annotation c. A batch of data passed to the model during training will contain the single-cell gene expression of a random sample of cells from the training data, plus the sample and cell type labels associated with each cell \(\{{{\bf{x}}}_{i},{c}_{i},{a}_{i}\}\). When we pass multiple condition covariates and cell type annotations to the model this becomes \(\{{{\bf{x}}}_{i},{{\bf{c}}}_{i},{{\bf{a}}}_{i}\}\).
CVAEs
Variational autoencoders (VAEs)^{40} are generative models that employ variational inference and deep neural networks to learn the underlying distribution of the data they are trained on. These models consist of an encoder network that parameterizes the latent variational distribution of the data and a decoder network that samples from this distribution and maps the data back to the input space. CVAEs^{15} are an extension of such models in which the input data is conditioned on another random variable. CVAEs aim to maximize the likelihood of the data, formulated following the Bayes chain rule as
\[{p}_{\theta }\left({\bf{x}}\mid {\bf{c}}\right)=\int {p}_{\theta }\left({\bf{x}}\mid {\bf{z}},{\bf{c}}\right)\,{p}_{\theta }\left({\bf{z}}\mid {\bf{c}}\right)\,{\rm{d}}{\bf{z}}\qquad (1)\]
where x is the input gene expression data, z is the latent variable that is assumed to parameterize the latent distribution of the data, c is the condition variable, and θ denotes the parameters of the model. Since the likelihood as formulated above is in most cases intractable, VAEs use amortized inference by approximating the posterior distribution by means of a neural network \({p}_{\theta }\left({\bf{z}}\mid {\bf{x}},{\bf{c}}\right)\sim {q}_{\phi }\left({\bf{z}}\mid {\bf{x}},{\bf{c}}\right)\), where ϕ are the parameters of said network. The loss used to optimize these models aims to jointly reduce the reconstruction error between input and output and make \({q}_{\phi }\left({\bf{z}}\mid {\bf{x}},{\bf{c}}\right)\) as close as possible to \({p}_{\theta }\left({\bf{z}}\mid {\bf{x}},{\bf{c}}\right)\).
The resulting loss function, also known as evidence lower bound, is formulated as
where θ and ϕ are the parameters of the decoder and encoder networks, respectively, \({\mathbb{E}}\) is the expectation and D_{KL} indicates the Kullback–Leibler divergence.
The first term of the loss formulated above is also known as reconstruction loss, and it might take different forms depending on the generative process of the input data. In the case of count data, we assume a negative binomial distribution as input; thus the likelihood will follow:
In the case of scATAC-seq data we assume a Poisson distribution:
The negative log-likelihood of the appropriate distribution is used as reconstruction loss during training.
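The two reconstruction losses can be illustrated with a minimal sketch. The parameterization of the negative binomial by a mean `mu` and an inverse dispersion `theta` is an assumption made here for illustration (it is a common choice in single-cell models), and the function names are hypothetical rather than taken from the scPoli code base.

```python
import math

def nb_nll(x, mu, theta):
    """Negative log-likelihood of a count x under a negative binomial
    with mean mu and inverse dispersion theta (assumed parameterization)."""
    log_lik = (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )
    return -log_lik

def poisson_nll(x, lam):
    """Negative log-likelihood of a count x under a Poisson with rate lam."""
    return lam - x * math.log(lam) + math.lgamma(x + 1)

def reconstruction_loss(counts, means, nll, **kwargs):
    """Sum the per-feature negative log-likelihoods for one cell."""
    return sum(nll(x, m, **kwargs) for x, m in zip(counts, means))

# Gene expression counts would use nb_nll, fragment counts poisson_nll.
rna_loss = reconstruction_loss([0, 3, 1], [0.5, 2.0, 1.0], nb_nll, theta=2.0)
atac_loss = reconstruction_loss([0, 1, 2], [0.5, 1.0, 1.5], poisson_nll)
```

For large `theta` the negative binomial term approaches the Poisson term, which is one way to sanity-check an implementation.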
Condition embeddings
The architectural backbone of scPoli builds upon CVAEs, but with an important modification. While in a standard CVAE different conditions are represented by means of fixed OHE vectors \({{\bf{c}}}_{i}\in {{\mathbb{Z}}}^{N}\) that are concatenated to the input \({{\bf{x}}}_{i}\in {{\mathbb{R}}}^{G}\), scPoli uses learnable embeddings \({{\bf{s}}}_{i}\in {{\mathbb{R}}}^{E}\) of fixed dimensionality E to represent conditions. The learning objective for this network is akin to that of a standard CVAE, but the embeddings s are optimized during training as parameters of the model using backpropagation:
Condition embeddings are implemented using the torch.nn.Embedding class, which takes as input an index indicating the condition and outputs the learned embedding. These embeddings are randomly initialized and optimized together with the rest of the trainable parameters of the network by minimizing the loss function used to train the model.
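The difference between the two ways of representing a condition can be sketched in plain Python. The class below is a hypothetical stand-in for an embedding lookup table, written without any deep learning library so that it stays self-contained; in the actual model the table entries would be updated by backpropagation, which is not shown here.

```python
import random

class ConditionEmbedding:
    """Lookup table with one vector of size `dim` per condition.
    Hypothetical stand-in for an embedding layer; no training logic."""
    def __init__(self, n_conditions, dim, seed=0):
        rng = random.Random(seed)
        # Random initialization, one row per condition.
        self.weight = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(n_conditions)]

    def __call__(self, index):
        return self.weight[index]

def one_hot(index, n_conditions):
    """Fixed one-hot vector, as used by a standard CVAE."""
    return [1.0 if j == index else 0.0 for j in range(n_conditions)]

def build_input(x, condition_vector):
    """Concatenate gene expression with the condition representation."""
    return list(x) + list(condition_vector)

x = [0.0] * 5                       # toy expression vector, G = 5
embed = ConditionEmbedding(n_conditions=100, dim=3)
scpoli_style = build_input(x, embed(7))        # length G + E
cvae_style = build_input(x, one_hot(7, 100))   # length G + N
```

With 100 conditions, the embedded input has 8 entries while the one-hot input has 105, which is the source of the parameter savings discussed in the scalability analysis.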
Prototypes for label transfer
scPoli posits that cell identities can be represented using prototypes. For each labeled cell type in the data, a prototype \({{\bf{p}}}_{k}\in {{\mathbb{R}}}^{D}\) is initialized, where D is the dimensionality of the latent space of the CVAE model and k represents the cell type label. Prototypes are computed by averaging the latent representation of the data points belonging to each particular cell type:
\[{{\bf{p}}}_{k}=\frac{1}{\left|{\mathscr{K}}\left(k\right)\right|}\sum _{i\in {\mathscr{K}}\left(k\right)}{{\bf{z}}}_{i}\qquad (6)\]
where z_{i} is the latent representation of cell i, k is the cell type label and \({\mathscr{K}}\left(k\right)\) is the set of indices of cells belonging to cell type k:
\[{\mathscr{K}}\left(k\right)=\left\{i:{a}_{i}=k\right\}\qquad (7)\]
Additionally, when the model is trained on partially labeled data, a set of unlabeled prototypes is initialized after clustering the unlabeled data using the Louvain algorithm implementation in Scanpy. Assuming that each unlabeled cell is now assigned a cluster j, unlabeled prototypes are computed using:
\[{{\bf{p}}}_{j}=\frac{1}{\left|{\mathscr{J}}\left(j\right)\right|}\sum _{i\in {\mathscr{J}}\left(j\right)}{{\bf{z}}}_{i}\qquad (8)\]
where z_{i} is the latent representation of cell i, j is the unlabeled cluster label and \({\mathscr{J}}\left(j\right)\) is the set of indices of cells belonging to cluster j:
\[{\mathscr{J}}\left(j\right)=\left\{i:{\rm{cluster}}\left(i\right)=j\right\}\qquad (9)\]
Unlabeled prototypes offer good reference points for downstream analyses and novel cell type annotation but are not used for the prototype loss computation.
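As an illustration of the averaging step, the sketch below computes one prototype per label from a list of latent vectors. The function name and the convention of marking unlabeled cells with `None` are assumptions made for this example. The same function applies to unlabeled prototypes if cluster identifiers are passed in place of cell type labels.

```python
from collections import defaultdict

def compute_prototypes(latents, labels):
    """Return {label: mean latent vector} over the cells carrying that label."""
    groups = defaultdict(list)
    for z, label in zip(latents, labels):
        if label is not None:          # None marks an unlabeled cell (assumption)
            groups[label].append(z)
    return {
        label: [sum(dim) / len(cells) for dim in zip(*cells)]
        for label, cells in groups.items()
    }

latents = [[0.0, 0.0], [2.0, 2.0], [4.0, 0.0], [9.0, 9.0]]
labels = ["B cell", "B cell", "T cell", None]
prototypes = compute_prototypes(latents, labels)
```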
Prototype loss
scPoli’s training objective includes a supervised term we call prototype loss. This term has the objective of pulling cells belonging to the same cell type towards their corresponding prototype in latent space.
where K is the number of cell types in the data, a_{i} is the cell type annotation of cell i and \(d\left(\cdot ,\cdot \right)\) is a distance function. The distance is computed only between the latent representation of a cell and the prototype of the cell type to which said cell belongs. In this work we formulate this distance as the Minkowski distance between a cell and its prototype. This distance is equivalent to the Euclidean distance when p = 2 and to the Manhattan distance when p = 1. Nonetheless, any distance metric could potentially be used.
scPoli can be trained on multiple sets of prototypes (for example, from multiple cell type annotations) in parallel; in this case, (10) is expanded as:
where L is the number of cell type annotations provided to the model and \({a}_{i}^{\left(l\right)}\) is the cell annotation index for cell i and annotation l.
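A minimal sketch of a prototype loss of this kind is given below. It computes the Minkowski distance between each labeled cell and the prototype of its own cell type and averages over labeled cells; the normalization by the number of labeled cells, and summing over annotation levels in the multi-annotation case, are assumptions of this example and may differ from the exact formulation used in scPoli.

```python
def minkowski(u, v, p=2):
    """Minkowski distance; p=2 is Euclidean, p=1 is Manhattan."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

def prototype_loss(latents, labels, prototypes, p=2):
    """Mean distance between each labeled cell and its own prototype.
    Cells whose label has no prototype (e.g. unlabeled cells) are skipped."""
    dists = [minkowski(z, prototypes[a], p)
             for z, a in zip(latents, labels) if a in prototypes]
    return sum(dists) / len(dists) if dists else 0.0

def multi_prototype_loss(latents, label_sets, prototype_sets, p=2):
    """Sum the loss over several annotation levels (assumed aggregation)."""
    return sum(prototype_loss(latents, labels, protos, p)
               for labels, protos in zip(label_sets, prototype_sets))

latents = [[0.0, 0.0], [0.0, 0.0]]
protos = {"B": [3.0, 4.0], "T": [0.0, 0.0]}
loss = prototype_loss(latents, ["B", "T"], protos)
```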
scPoli training
We describe below the procedure used to train an scPoli model when building a reference or performing query-to-reference mapping:

Reference building

Pretraining: initialize N condition embeddings, where N is the number of samples present in the reference. The model receives gene expression \({{\bf{x}}}_{i}^{\left({{\mathrm{ref}}}\right)}\in {{\mathbb{R}}}^{G}\) (where G is the number of input features), which is concatenated to the sample embedding of the corresponding sample \({{\bf{s}}}_{i}\in {{\mathbb{R}}}^{E}\) as input (where E is the embedding dimensionality). When multiple condition labels are passed to the model, the embeddings for each condition \(\{{{\bf{s}}}_{i}^{\left(1\right)},\ldots ,{{\bf{s}}}_{i}^{\left(k\right)}\}\) are concatenated to the input, where k is the number of condition covariates passed. The objective function \({{\mathscr{L}}}_{{{\mathrm{CVAE}}}}\) (5) is optimized on the reference dataset.

Finetuning: initialize cell type prototypes using (6) and optimize \({{\mathscr{L}}}_{{{\mathrm{CVAE}}}}+\eta {{\mathscr{L}}}_{{{\mathrm{prototype}}}}\), where η is a hyperparameter that tunes the strength of the prototype loss; store learned prototypes with the model.


Reference mapping

Pretraining: freeze the weights of the encoder and decoder networks from the reference model, and initialize M additional sample embeddings where M is the number of samples in the query. These embeddings constitute the only trainable parameters when training on query data. The model receives gene expression \({{\bf{x}}}_{i}^{({\mathrm{query}})}\in {{\mathbb{R}}}^{G}\) concatenated to sample embeddings \({{\bf{s}}}_{i}\in {{\mathbb{R}}}^{E}\) as input. The model needs to be provided with the same number of condition covariates as the reference model. The objective function \({ {\mathcal L} }_{{\mathrm{CVAE}}}\) (5) is optimized on the query dataset.

Finetuning: initialize labeled prototypes in the query dataset if any labeled cells are present. The unlabeled data are clustered using the Louvain algorithm, and unlabeled prototypes are initialized for the detected clusters using (8). Optimize \({ {\mathcal L} }_{{\mathrm{CVAE}}}+\eta { {\mathcal L} }_{{\mathrm{prototype}}}\); if all cells in the query are unlabeled, this learning objective is reduced to \({ {\mathcal L} }_{{\mathrm{CVAE}}}\). Unlabeled prototypes are not used for prototype loss computation and are only used for downstream analyses.

scPoli is optimized using Adam and a default learning rate of 0.001. The ratio of pretraining/finetuning epochs we use is 0.9, but the optimal value might vary depending on the input dataset.
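The two-phase schedule described above can be sketched as follows. The sketch assumes that the ratio refers to the fraction of all epochs spent in pretraining, which is one reading of the text; the function names are hypothetical, and the freezing of encoder and decoder weights during reference mapping is not modeled.

```python
def split_epochs(n_epochs, pretraining_ratio=0.9):
    """Split a training budget into (pretraining, finetuning) epochs.
    Assumes the ratio is the fraction of all epochs used for pretraining."""
    n_pre = int(round(n_epochs * pretraining_ratio))
    return n_pre, n_epochs - n_pre

def total_loss(cvae_loss, prototype_loss, epoch, n_pretraining, eta):
    """CVAE loss alone during pretraining; during finetuning the
    prototype loss is added with weight eta."""
    if epoch < n_pretraining:
        return cvae_loss
    return cvae_loss + eta * prototype_loss

n_pre, n_fine = split_epochs(100)
early = total_loss(1.0, 2.0, epoch=0, n_pretraining=n_pre, eta=5.0)
late = total_loss(1.0, 2.0, epoch=95, n_pretraining=n_pre, eta=5.0)
```

Under the same assumption, the computational cost analysis setting of 100 epochs with 80 pretraining epochs corresponds to `split_epochs(100, 0.8)`.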
Cell type label transfer and uncertainty quantification
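scPoli assigns to unlabeled cell i with latent representation z_{i} the cell type identity k of the closest prototype p in latent space:
\[{\hat{a}}_{i}=\underset{k\in {\mathfrak{K}}}{{\rm{arg}}\,{\rm{min}}}\;d\left({{\bf{z}}}_{i},{{\bf{p}}}_{k}\right)\qquad (12)\]
The minimum distance between the latent representation and any reference prototype is used as a proxy for uncertainty for unknown cell type detection:
\[{u}_{i}=\underset{k\in {\mathfrak{K}}}{{\rm{min}}}\;d\left({{\bf{z}}}_{i},{{\bf{p}}}_{k}\right)\qquad (13)\]
where \({\mathfrak{K}}\) is the set of cell type identities present in the data, \({\hat{a}}_{i}\) is the label assigned to cell i and u_{i} is the associated uncertainty.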
We do not fix a default value for this uncertainty above which a cell should be classified as unknown. In our experiments, we inspected the distributions of uncertainties and picked a quantile as the cutoff value. When we expected more unknown cells in the unlabeled data, for example in the cancer query, we picked lower quantiles. Another possible approach is the one proposed in ref. ^{4}, whose authors held out a few labeled datasets as queries and, after mapping them, looked for the optimal uncertainty threshold by generating a receiver operating characteristic (ROC) curve tracking correct label transfer.
This uncertainty does not have an upper bound, but we offer the option to scale and normalize it to have values between 0 and 1.
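The label transfer procedure can be sketched in a few lines. The example below uses the Euclidean distance, a quantile cutoff chosen by the user, and min-max scaling of the uncertainties; the scaling method and all function names are assumptions made for illustration and do not necessarily match the options offered by the scPoli implementation.

```python
import math

def transfer_label(z, prototypes):
    """Return (label of the nearest prototype, Euclidean distance to it)."""
    return min(
        ((k, math.dist(z, p)) for k, p in prototypes.items()),
        key=lambda pair: pair[1],
    )

def quantile(values, q):
    """Empirical quantile by linear interpolation, q in [0, 1]."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def annotate(latents, prototypes, q=0.9, unknown="unknown"):
    """Transfer labels, flagging cells whose uncertainty exceeds the q-quantile."""
    results = [transfer_label(z, prototypes) for z in latents]
    cutoff = quantile([d for _, d in results], q)
    return [(k if d <= cutoff else unknown, d) for k, d in results]

def min_max_scale(values):
    """Rescale uncertainties to [0, 1] (one possible normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

prototypes = {"B": [0.0, 0.0], "T": [10.0, 0.0]}
query = [[1.0, 0.0], [9.0, 0.0], [5.0, 30.0]]
annotated = annotate(query, prototypes, q=0.5)
```

In this toy query the third cell lies far from both prototypes, so it receives the largest uncertainty and is flagged as unknown at the chosen quantile.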
Scalability analysis
The introduction of learnable condition embeddings that replace OHE vectors to represent conditions leads to a difference in the number of trainable parameters between the scPoli CVAE and a standard CVAE network. Let G be the dimensionality of the gene expression input, E the dimensionality of condition embeddings in scPoli, N the number of conditions, D the latent space dimensionality, and H_{enc} and H_{dec} the widths of the input layers of the encoder and decoder, respectively. When comparing these two models with the same number and width of hidden layers and latent dimensionality and ignoring bias terms of the fully connected layers, scPoli introduces E × N parameters in the embedding matrix, \(\left(G+E\right)\times {H}_{{{\mathrm{enc}}}}\) parameters at the input layer of the encoder and \(\left(D+E\right)\times {H}_{{{\mathrm{dec}}}}\) at the decoder. A standard CVAE has \(\left(G+N\right)\times {H}_{{{\mathrm{enc}}}}\) and \(\left(D+N\right)\times {H}_{{{\mathrm{dec}}}}\) at the encoder and decoder input, respectively. From this, it can be derived that scPoli will have fewer trainable parameters than a standard CVAE when:
\[E\times N+\left(G+E\right)\times {H}_{{{\mathrm{enc}}}}+\left(D+E\right)\times {H}_{{{\mathrm{dec}}}}<\left(G+N\right)\times {H}_{{{\mathrm{enc}}}}+\left(D+N\right)\times {H}_{{{\mathrm{dec}}}}\qquad (14)\]
Cancelling the terms in G and D and rearranging, this inequality results in
\[N>\frac{E\left({H}_{{{\mathrm{enc}}}}+{H}_{{{\mathrm{dec}}}}\right)}{{H}_{{{\mathrm{enc}}}}+{H}_{{{\mathrm{dec}}}}-E}\qquad (15)\]
provided that \({H}_{{{\mathrm{enc}}}}+{H}_{{{\mathrm{dec}}}}>E\).
Considering the common choice of 100 or lower for the layer widths H, and the common choice of values below 25 for E, we can see that scPoli results in a lower number of parameters in the case of a relatively high number of conditions N (Supplementary Fig. 14), whereas it is comparable with a standard CVAE when only a few conditions are integrated.
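The parameter counts above can be checked numerically with a short sketch. The function names and the example values (E = 10, layer widths of 64, 2,000 input genes) are illustrative assumptions; only the counting rules come from the text, and bias terms are ignored as stated.

```python
def first_layer_params_scpoli(G, D, E, N, H_enc, H_dec):
    """Embedding matrix plus encoder and decoder input layers (no biases)."""
    return E * N + (G + E) * H_enc + (D + E) * H_dec

def first_layer_params_cvae(G, D, N, H_enc, H_dec):
    """Encoder and decoder input layers with one-hot conditions (no biases)."""
    return (G + N) * H_enc + (D + N) * H_dec

def break_even_conditions(E, H_enc, H_dec):
    """Smallest integer N for which the embedding variant has strictly
    fewer parameters, i.e. N > E * H / (H - E) with H = H_enc + H_dec."""
    H = H_enc + H_dec
    if H <= E:
        raise ValueError("no break-even point when H_enc + H_dec <= E")
    return (E * H) // (H - E) + 1

n_star = break_even_conditions(E=10, H_enc=64, H_dec=64)
saving_at_2375 = (first_layer_params_cvae(2000, 10, 2375, 64, 64)
                  - first_layer_params_scpoli(2000, 10, 10, 2375, 64, 64))
```

With these illustrative values the embedding variant becomes smaller from 11 conditions onwards, and the gap grows linearly with the number of conditions.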
Hyperparameters and training
We performed a hyperparameter search on the pancreas dataset of the benchmark datasets. We included parameters such as the depth of the encoder and decoder, the weight η for the prototype loss, the embedding dimensionality, the latent dimensionality and the KL annealing parameter. We fixed the width of the hidden layers to be the square root of the number of features in the input data, as is done in ref. ^{15}. We tried to fix as many hyperparameters as possible to keep the computational overhead within a reasonable limit. We selected the set of hyperparameters that yielded the best integration performance and then used these to obtain results for the benchmarks displayed in Fig. 2. A table with the grid of values considered during our hyperparameter search is available in Supplementary Table 1.
Hyperparameters of trained models
A table with the hyperparameters used for training the models presented in this work is available at Supplementary Table 2.
Computational cost analysis
We ran scPoli for 100 epochs (80 pretraining) on the same dataset (PBMC benchmarking data, 4,000 highly variable genes (HVG)), under- or oversampled to reach sizes of 1K, 10K, 100K and 1M cells. The model was trained with either a sparse or a dense input matrix to test the difference in memory consumption and training time between the two. We tracked the time needed to train the model for 100 epochs and the peak memory consumption during training, net of the memory needed to store the data. The experiment was run on a server with two Intel Xeon Platinum 8280L 2.70 GHz central processing units and an NVIDIA V100 graphics processing unit. The results of this analysis are available in Supplementary Fig. 15.
scATAC-seq integration
We adapted scPoli to work on scATAC-seq data by using a Poisson likelihood for modeling fragment counts. Fragment counts are obtained by aggregating odd and neighboring even reads. This approach was proposed by ref. ^{37}. As a comparison, we ran PeakVI using comparable parameters.
Association with PCs
We compute the association between a covariate and a PC by fitting a linear model with the covariate as the predictor and the values of the PC as the dependent variable. We fit this model using the lm() function in R. After fitting the model, we report the adjusted R^{2}, which is defined as:
\[{R}_{{\rm{adj}}}^{2}=1-\left(1-{R}^{2}\right)\frac{n-1}{n-p-1}\qquad (16)\]
where n is the number of samples and p is the number of predictors.
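For illustration, the sketch below reproduces this computation in Python for the simplest case of a single numeric covariate (p = 1), using ordinary least squares. It is not the R code used in the study; a categorical covariate would be expanded into several indicator predictors by lm(), which this example does not handle.

```python
def adjusted_r2(x, y):
    """Fit y = a + b*x by ordinary least squares and return the adjusted R^2
    for a single predictor (p = 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    p = 1
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

covariate = [0.0, 1.0, 2.0, 3.0]      # toy sample-level covariate
pc_values = [0.0, 1.0, 0.0, 1.0]      # toy PC coordinates of the samples
score = adjusted_r2(covariate, pc_values)
```

Note that the adjusted value can be negative when the covariate explains little variance, as in this toy example where R^{2} is 0.2 with four samples.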
Benchmarks
Integration methods
We benchmarked the data integration performance of our model against various state-of-the-art methods. These include:

scArches scVI (v0.5.3): we ran the model using default parameters;

scArches scANVI (v0.5.3): we ran the model using default parameters;

Seurat v3 (v4.0.3): we followed the tutorial^{43} and used supervised PCA for reducing the data dimensionality to 50;

Symphony (v0.1.0): we followed the vignette^{44} and ran the model using default parameters.
Cell type classification methods
We benchmarked the performance on label transfer and cell type classification against the following methods:

scArches scANVI (v0.5.3): we ran the model using default parameters;

MARS: we ran the model using default parameters;

Seurat v3 (v4.0.3): we followed the vignette^{43} and ran the model using default parameters;

SVM: we fit a LinearSVC object from scikitlearn (v0.24.2) on the reference data.
Metrics
We quantified the quality of the data integration using metrics from the scIB (v1.0.0) package by Luecken et al. Cell type ASW (average silhouette width), isolated label F1, isolated label silhouette, NMI (normalized mutual information) and ARI (adjusted Rand index) were used as biological conservation metrics. To quantify batch mixing we used PC regression, graph connectivity and batch ASW. The overall integration score is a weighted average of the average batch mixing score and the average biological conservation score, with weights of 0.4 and 0.6, respectively. Descriptions of these metrics can be found in ref. ^{6}. To quantify label transfer accuracy we used the weighted-average and macro-averaged F1 scores.
Weighted F1 score: average of the F1 scores obtained for each class, weighted by their support.
Macro F1 score: average of the F1 scores obtained for each class, without any weighting.
An overview of the used metrics can be found in Supplementary Table 3.
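The two F1 averages and the overall integration score can be written out as a short sketch. This is a plain-Python illustration of the definitions above, not the scIB or scikit-learn code used in the benchmarks, and the function names are hypothetical.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return ({class: F1}, {class: support}) over the classes in y_true."""
    support = Counter(y_true)
    scores = {}
    for c in support:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = support[c] - tp
        scores[c] = 2 * tp / (2 * tp + fp + fn)
    return scores, support

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores, _ = f1_scores(y_true, y_pred)
    return sum(scores.values()) / len(scores)

def weighted_f1(y_true, y_pred):
    """Mean of the per-class F1 scores, weighted by class support."""
    scores, support = f1_scores(y_true, y_pred)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)

def overall_score(batch_scores, bio_scores, w_batch=0.4, w_bio=0.6):
    """Weighted average of mean batch mixing and mean biological conservation."""
    mean_batch = sum(batch_scores) / len(batch_scores)
    mean_bio = sum(bio_scores) / len(bio_scores)
    return w_batch * mean_batch + w_bio * mean_bio

y_true = ["B", "B", "B", "T"]
y_pred = ["B", "B", "T", "T"]
macro, weighted = macro_f1(y_true, y_pred), weighted_f1(y_true, y_pred)
```

In this toy example the rare class has a lower F1 score, so the macro average falls below the support-weighted average, which is why both are reported.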
Datasets
The HLCA
We obtained the HLCA core dataset from the authors of the study^{4}. The dataset consists of 584,884 lung cells from 166 samples and 107 subjects. Gene expression is subset to 2,000 selected genes, which we used for model training. Different levels of cell type annotation and sample and patient metadata are curated and available.
PBMC atlas
The atlas contains 7,800,850 PBMCs from 2,375 samples, representing 25 datasets and 1,977 healthy or diseased donors. For model training and analysis, 10,000 HVGs were selected. A coarse annotation consisting of 14 cell types was used for initializing scPoli’s prototypes. The data and metadata were curated by scientists at the Chan Zuckerberg Initiative and collected from refs. ^{19,35,43,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61}.
Schulte-Schrepping et al. dataset
This is a published^{35} PBMC dataset of 65 patients with COVID-19 and healthy controls. The dataset contains 99 samples and 222,003 cells, and was downloaded as part of the Fred Hutch COVID-19 collection available at ref. ^{62}. For model training and analysis, 4,000 HVGs were used.
Su et al. dataset
This is a published^{34} PBMC dataset of 129 COVID-19 patients and 16 healthy controls. The dataset contains 270 samples and 559,517 cells, and was downloaded as part of the Fred Hutch COVID-19 collection available at ref. ^{62}. For model training and analysis, 2,000 HVGs were used.
Cross-species dataset
These data consist of transcriptomic profiles of 305,638 single nuclei from the primary motor cortex of humans, marmoset monkeys and mice^{36}. The dataset was downloaded from ref. ^{63}, which contains expression profiles based on the one-to-one orthologs (15,860 genes in total) defined in the three species. A total of 2,000 HVGs were selected on the basis of the reference datasets.
scATAC-seq dataset
We obtained the scATAC-seq data from ref. ^{64}. The data were published for the NeurIPS 2021 competition on multimodal single-cell integration. We isolated only the ATAC features and trained the model on those. We filtered out peaks that were detected in fewer than 5% of cells. After these preprocessing steps the data consisted of 69,249 cells and 16,134 features.
Benchmark datasets
All datasets used for benchmarking were obtained from ref. ^{65} unless specified otherwise. Count data were used for all datasets.
Immune
The immune PBMC dataset used for benchmarking was obtained from refs. ^{6,66}. The dataset contains 32,484 cells from 4 studies and 16 cell types. The data were subset to the 4,000 most highly variable genes before further analysis.
Pancreas
The data contain 16,382 pancreas cells from 8 different batches. The cells are annotated and assigned to 14 cell types. A total of 4,000 HVGs were used for model training and analysis.
Brain
The mouse brain dataset consists of 332,129 cells and 4 batches. Ten cell types are present. The data were subset to the 4,000 most highly variable genes before further analysis.
Endocrine
The data consist of 25,919 cells in 4 batches and 14 cell types. A total of 4,000 highly variable genes were selected for downstream analyses.
Tumor
The tumor atlas was obtained from refs. ^{67,68}. The dataset is a collection of 14 studies on various types of cancer. It contains 317,111 cells annotated in 25 cell types. We selected 4,000 HVGs for model training.
Lung
The lung data were obtained from refs. ^{6,66}. These data consist of 32,472 lung cells from 3 batches and 17 cell types. A total of 4,000 HVGs were used for model training.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All datasets analyzed in this manuscript are public and have been published in other papers. We have referenced them in the manuscript and, when necessary, made them available at http://github.com/theislab/scPoli_reproduce.
Code availability
The software is available as part of https://scarches.readthedocs.io. The code used to generate the results is available at http://github.com/theislab/scPoli_reproduce.
References
Regev, A. et al. Science forum: The Human Cell Atlas. eLife 6, e27041 (2017).
HuBMAP Consortium. The human body at cellular resolution: the NIH Human Biomolecular Atlas Program. Nature 574, 187–192 (2019).
Muus, C. et al. Single-cell meta-analysis of SARS-CoV-2 entry genes across tissues and demographics. Nat. Med. 27, 546–559 (2021).
Sikkema, L. et al. An integrated cell atlas of the lung in health and disease. Nat. Med. 29, 1563–1577 (2023).
Gayoso, A. et al. A Python library for probabilistic analysis of single-cell omics data. Nat. Biotechnol. 40, 163–166 (2022).
Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).
Argelaguet, R., Cuomo, A. S., Stegle, O. & Marioni, J. C. Computational principles and challenges in single-cell data integration. Nat. Biotechnol. 39, 1202–1215 (2021).
Welch, J. D. et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887 (2019).
Ritchie, M. E. et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
Kiselev, V. Y., Yiu, A. & Hemberg, M. Scmap: projection of single-cell RNA-seq data across data sets. Nat. Methods 15, 359–362 (2018).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).
Polański, K. et al. BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics 36, 964–965 (2020).
Haghverdi, L. et al. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Lotfollahi, M., Wolf, F. A. & Theis, F. J. scGen predicts single-cell perturbation responses. Nat. Methods 16, 715–721 (2019).
Amodio, M. et al. Exploring single-cell data with deep multitasking neural networks. Nat. Methods 16, 1139–1145 (2019).
Lähnemann, D. et al. Eleven grand challenges in single-cell data science. Genome Biol. 21, 1–35 (2020).
Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587 (2021).
Kang, J. B. et al. Efficient and precise single-cell reference atlas mapping with Symphony. Nat. Commun. 12, 1–21 (2021).
Lotfollahi, M. et al. Mapping single-cell data to reference atlases by transfer learning. Nat. Biotechnol. 40, 121–130 (2022).
Michielsen, L. et al. Single-cell reference mapping to construct and extend cell-type hierarchies. NAR Genomics and Bioinformatics 5, lqad070 (2023).
Osorio, D., McGrail, D. J., Sahni, N. & Yi, S. S. Drug combination prioritization for cancer treatment using single-cell RNA-seq based transfer learning. Preprint at bioRxiv (2022).
Xu, C. et al. Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Mol. Syst. Biol. 17, e9620 (2021).
Fetaya, E., Jacobsen, J.-H., Grathwohl, W. & Zemel, R. Understanding the limitations of conditional generative models. Preprint at https://doi.org/10.48550/arXiv.1906.01171 (2019).
Brbić, M. et al. MARS: discovering novel cell types across heterogeneous single-cell experiments. Nat. Methods 17, 1200–1206 (2020).
Sohn, K., Lee, H. & Yan, X. Learning structured output representation using deep conditional generative models. Adv. Neural Inf. Process. Syst. 28, 3483–3491 (2015).
Snell, J., Swersky, K. & Zemel, R. Prototypical networks for few-shot learning. Adv. Neural Inf. Process. Syst. 30, (2017).
Lotfollahi, M., Naghipourfar, M., Theis, F. J. & Wolf, F. A. Conditional out-of-distribution generation for unpaired data using transfer VAE. Bioinformatics 36, i610–i617 (2020).
Hospedales, T., Antoniou, A., Micaelli, P. & Storkey, A. Meta-learning in neural networks: a survey. Preprint at arXiv https://doi.org/10.48550/arXiv.2004.05439 (2020).
Köhler, N. D., Büttner, M. & Theis, F. J. Deep learning does not outperform classical machine learning for cell-type annotation. Preprint at bioRxiv https://doi.org/10.1101/653907 (2019).
Madissoon, E. et al. A spatially resolved atlas of the human lung characterizes a gland-associated immune niche. Nat. Genet. 55, 66–77 (2023).
Grabski, I. N., Street, K. & Irizarry, R. A. Significance analysis for clustering with single-cell RNA-sequencing data. Nat. Methods 20, 1196–1202 (2023).
Su, Y. et al. Multiomic immunophenotyping of COVID-19 patients reveals early infection trajectories. Preprint at bioRxiv (2020).
Schulte-Schrepping, J. et al. Severe COVID-19 is marked by a dysregulated myeloid cell compartment. Cell 182, 1419–1440 (2020).
Bakken, T. E. et al. Comparative cellular analysis of motor cortex in human, marmoset and mouse. Nature 598, 111–119 (2021).
Martens, L. D., Fischer, D. S., Theis, F. J. & Gagneur, J. Modeling fragment counts improves single-cell ATAC-seq analysis. Preprint at bioRxiv https://doi.org/10.1101/2022.05.04.490536 (2022).
Luecken, M. D. et al. A sandbox for prediction and integration of DNA, RNA, and proteins in single cells. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) (2021).
Ashuach, T., Reidenbach, D. A., Gayoso, A. & Yosef, N. PeakVI: a deep generative model for single-cell chromatin accessibility analysis. Cell Rep. Methods 2, 100182 (2022).
Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at arXiv https://doi.org/10.48550/arXiv.1312.6114 (2013).
Higgins, I. et al. beta-VAE: learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations (2017).
Radford, A., Metz, L. & Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. Preprint at arXiv https://doi.org/10.48550/arXiv.1511.06434 (2015).
Integration and label transfer. Satija Lab https://satijalab.org/seurat/archive/v3.0/integration.html
Quickstart tutorial. R Project https://cran.r-project.org/web/packages/symphony/vignettes/quickstart_tutorial.html
Szabo, P. A. et al. Longitudinal profiling of respiratory and systemic immune responses reveals myeloid cell-driven lung inflammation in severe COVID-19. Immunity 54, 797–814 (2021).
Lee, J. S. et al. Immunophenotyping of COVID-19 and influenza highlights the role of type I interferons in development of severe COVID-19. Sci. Immunol. 5, eabd1554 (2020).
Stephenson, E. et al. Single-cell multi-omics analysis of the immune response in COVID-19. Nat. Med. 27, 904–916 (2021).
Yoshida, M. et al. Local and systemic responses to SARS-CoV-2 infection in children and adults. Nature 602, 321–327 (2022).
Savage, A. K. et al. Multimodal analysis for human ex vivo studies shows extensive molecular changes from delays in blood processing. iScience 24, 102404 (2021).
Yazar, S. et al. Single-cell eQTL mapping identifies cell type–specific genetic control of autoimmune disease. Science 376, eabf3041 (2022).
Guo, C. et al. Single-cell analysis of two severe COVID-19 patients reveals a monocyte-associated and tocilizumab-responding cytokine storm. Nat. Commun. 11, 1–11 (2020).
Arunachalam, P. S. et al. Systems biological assessment of immunity to mild versus severe COVID-19 infection in humans. Science 369, 1210–1220 (2020).
Ahern, D. J. et al. A blood atlas of COVID-19 defines hallmarks of disease severity and specificity. Cell 185, 916–938 (2022).
Travaglini, K. J. et al. A molecular cell atlas of the human lung from single-cell RNA sequencing. Nature 587, 619–625 (2020).
Liu, C. et al. Time-resolved systems immunology reveals a late juncture linked to fatal COVID-19. Cell 184, 1836–1857 (2021).
Wilk, A. J. et al. A single-cell atlas of the peripheral immune response in patients with severe COVID-19. Nat. Med. 26, 1070–1076 (2020).
Ren, X. et al. COVID-19 immune features revealed by a large-scale single-cell transcriptome atlas. Cell 184, 1895–1913 (2021).
Tabula Sapiens Consortium et al. The Tabula Sapiens: a multiple-organ, single-cell transcriptomic atlas of humans. Science 376, eabl4896 (2022).
Szabo, P. A. et al. Single-cell transcriptomics of human T cells reveals tissue and activation signatures in health and disease. Nat. Commun. 10, 1–16 (2019).
van der Wijst, M. G. et al. Type I interferon autoantibodies are associated with systemic immune alterations in patients with COVID-19. Sci. Transl. Med. 13, eabh2624 (2021).
Perez, R. K. et al. Single-cell RNA-seq reveals cell type–specific molecular and genetic associations to lupus. Science 376, eabf1970 (2022).
Single-cell immunology of SARS-CoV-2 infection. Fred Hutch https://atlas.fredhutch.org/fredhutch/covid/
treeArches: reference models & latent space. Zenodo https://zenodo.org/record/6786357
Series GSE194122. NCBI https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE194122
scArchesreproducibility. GitHub https://github.com/theislab/scArchesreproducibility
Benchmarking atlas-level data integration in single-cell genomics - integration task datasets. figshare https://doi.org/10.6084/m9.figshare.12420968
Nieto, P. et al. A single-cell tumor immune atlas for precision oncology. Genome Res. 31, 1913–1926 (2021).
A single-cell tumor immune atlas for precision oncology. Zenodo https://zenodo.org/record/4263972
Acknowledgements
We thank L. Sikkema for the valuable feedback on our work with the HLCA dataset. We thank S. Rybakov for helping integrate our software to the scArches package. M.L. acknowledges financial support from the Joachim Herz Stiftung. F.J.T. acknowledges support by the Helmholtz Association’s Initiative and Networking Fund through Helmholtz AI (ZTIPF501) and by the European Union (ERC, DeepCell—101054957). A.A.M. is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation, project number 458958943). L.Z. is funded by the Bavarian Ministry of Science and the Arts in the framework of the Bavarian Research Association ‘ForInter’ (Interaction of human brain cells). Views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them.
Funding
Open access funding provided by Helmholtz Zentrum München - Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH).
Author information
Contributions
M.L. conceived the project with F.J.T. M.L., C.D. and M.W. designed the algorithm. C.D. and M.W. implemented the algorithm with contributions from A.A.M. C.D., M.W. and L.Z. curated the datasets used for the analyses in the paper. C.D., S.H.Z. and A.A.M. performed experiments and analyses. C.D. ran benchmarking experiments and wrote the paper with contributions from the other authors. M.L. and F.J.T. supervised the project.
Ethics declarations
Competing interests
F.J.T. consults for Immunai Inc., Singularity Bio B.V., CytoReason Ltd, Omniscope Ltd and Cellarity, and has ownership interest in Dermagnostix GmbH and Cellarity. M.L. consults for Santa Ana Bio, is a part-time employee at Relation Therapeutics and owns interests in Relation Therapeutics. All other authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Xiangjie Li and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–15 and Tables 1–3.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
De Donno, C., Hediyeh-Zadeh, S., Moinfar, A.A. et al. Population-level integration of single-cell datasets enables multi-scale analysis across samples. Nat Methods 20, 1683–1692 (2023). https://doi.org/10.1038/s41592-023-02035-2
DOI: https://doi.org/10.1038/s41592-023-02035-2