Architecture surgery

Our method relies on a concept known as transfer learning (TL). TL is an approach in which weights from a model trained on one task are reused, either as the weight initialization for another task or as the starting point for fine tuning on it. We introduce architecture surgery, a strategy to apply TL in the context of conditional generative models and single-cell data. Our proposed method is general and can be used to perform TL on both CVAEs and conditional generative adversarial nets67.

Let us assume that we want to train a reference CVAE model on a d-dimensional dataset (\(x \in {\Bbb R}^d\)) from n different studies (\(s \in {\Bbb R}^n\)), where \({\Bbb R}\) denotes the real numbers. We further assume that the bottleneck layer z has size k (\(z \in {\Bbb R}^k\)). Then, the input for a single cell i is x′ = x · s, where x and s are the d-dimensional gene expression profile and the n-dimensional one-hot encoding of the study label, respectively. The · symbol denotes the row-wise concatenation operation. Therefore, the model receives (d + n)-dimensional and (k + n)-dimensional vectors as inputs for the encoder and decoder, respectively. Assuming m query datasets, the target model is initialized with all the parameters from the reference model. To incorporate m new study labels, we add m new dimensions to s in both the encoder and decoder networks. We refer to these newly added study labels as s′. Next, m new randomly initialized weight vectors are added to the first layer of the encoder and of the decoder. Finally, we fine tune the new model by training only the weights connected to the last m dimensions of x′, which correspond to the new condition labels. Let p and q be the numbers of neurons in the first layer of the encoder and decoder, respectively; then, during fine tuning, only m × (p + q) parameters are trained. Let us denote the first layer of the encoder and decoder of scArches as \(f_1\) and \(g_1\), respectively, and let us further assume that ReLU activations are used in these layers. The equations for \(f_1\) and \(g_1\) are then

$$\begin{array}{l}f_1(x,s,s\prime ;\phi _x,\phi _s,\phi _{s\prime }) = {\textrm{max}}(0,\phi _x^Tx + \phi _s^Ts + \phi _{s\prime }^Ts\prime )\\ g_1(z,s,s\prime ;\theta _z,\theta _s,\theta _{s\prime }) = {\textrm{max}}(0,\theta _z^Tz + \theta _s^Ts + \theta _{s\prime }^Ts\prime ),\end{array}$$

where ϕ and θ are the parameters of the encoder and decoder, and T denotes the transpose operation. Therefore, the gradients of \(f_1\) and \(g_1\) with respect to \(\phi_{s'}\) and \(\theta_{s'}\) are

$$\begin{array}{l}\nabla_{\phi_{s'}}f_1 = \left\{\begin{array}{lr} 0 & {\mathrm{if}}\ {\phi_x^Tx + \phi_s^Ts + \phi_{s'}^Ts'} {\le} 0\\ s' & {\mathrm{otherwise}}\ \end{array} \right. \\ \nabla_{\theta_{s'}}g_1 = \left\{\begin{array}{lr} 0 & {\mathrm{if}}\ {\theta_z^Tz + \theta_s^Ts + \theta_{s'}^Ts'} {\le} 0\\ s' & {\mathrm{otherwise}} \end{array} \right. \end{array}$$

Finally, because all weights except \(\phi_{s'}\) and \(\theta_{s'}\) are frozen, we only compute the gradient of the scArches cost function with respect to \(\phi_{s'}\) and \(\theta_{s'}\):

$$\begin{array}{l}\nabla _{\phi _{s\prime }}L_{\textrm{scArches}}(x,s,s\prime ;\theta ,\phi ) = \nabla _{f_1}L_{\textrm{scArches}}(x,s,s\prime ;\theta ,\phi ) \cdot \nabla _{\phi _{s\prime }}f_1(x,s,s\prime ;\phi _x,\phi _s,\phi _{s\prime })\\ \nabla _{\theta _{s\prime }}L_{\textrm{scArches}}(x,s,s\prime ;\theta ,\phi ) = \nabla _{g_1}L_{\textrm{scArches}}(x,s,s\prime ;\theta ,\phi ) \cdot \nabla _{\theta _{s\prime }}g_1(z,s,s\prime ;\theta _z,\theta _s,\theta _{s\prime }).\end{array}$$
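The surgery step for a single first layer can be illustrated with a minimal sketch in plain Python. It is not the scArches implementation: the function names, the learning rate and the use of a plain gradient step on a loss to be minimized are assumptions made for illustration. The sketch evaluates \(f_1\), computes the gradient with respect to the new condition weights only (zero for inactive ReLU units, s′ times the upstream gradient otherwise) and leaves the frozen weights untouched.

```python
import random

def first_layer(x, s, s_new, W_x, W_s, W_new):
    """Return pre-activations and ReLU outputs of a first layer.

    Each weight matrix is stored as [input_dim][n_units], so the
    pre-activation equals W_x^T x + W_s^T s + W_new^T s'.
    """
    n_units = len(W_x[0])
    pre = [0.0] * n_units
    for vec, W in ((x, W_x), (s, W_s), (s_new, W_new)):
        for i, value in enumerate(vec):
            for j in range(n_units):
                pre[j] += value * W[i][j]
    return pre, [max(0.0, a) for a in pre]

def grad_new_weights(pre, s_new, upstream):
    """Gradient of the loss w.r.t. W_new, given upstream = dL/df_1.

    For unit j the local gradient is s' if the unit is active and 0 otherwise.
    """
    return [[s_i * (upstream[j] if pre[j] > 0 else 0.0) for j in range(len(pre))]
            for s_i in s_new]

def update_new_weights(W_new, grad, lr):
    """Gradient step on the new condition weights only (W_x, W_s stay frozen)."""
    return [[w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(W_new, grad)]

def n_trainable(m, p, q):
    """Number of parameters trained during surgery: m * (p + q)."""
    return m * (p + q)

# Hypothetical toy sizes: d genes, n reference studies, m query studies, p units.
rng = random.Random(0)
d, n, m, p = 5, 3, 1, 4
rand = lambda rows, cols: [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]
W_x, W_s, W_new = rand(d, p), rand(n, p), rand(m, p)
x = [rng.random() for _ in range(d)]
s, s_new = [0.0] * n, [1.0]  # a query cell: only the new study label is active
pre, out = first_layer(x, s, s_new, W_x, W_s, W_new)
W_new = update_new_weights(W_new, grad_new_weights(pre, s_new, [1.0] * p), lr=0.01)
```

With p = q = 128 and one query study, `n_trainable(1, 128, 128)` gives 256 trainable parameters, which illustrates why the fine-tuning step is cheap compared with training the full model.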

scArches base models

Conditional variational autoencoders

Variational autoencoders (VAEs)68 were shown to learn the underlying complex structure of data. VAEs were proposed for generative modeling of the underlying data, leveraging variational inference and neural networks to maximize the following likelihood:

$$p_\theta (X\mid S) = {\int} {p_\theta } (X\mid Z,S)p_\theta (Z\mid S)dZ,$$

where X is a random variable representing the model’s input, S is a random variable indicating various conditions, θ denotes the neural network parameters, and \(p_\theta (X\mid Z,S)\) is the output distribution from which X is reconstructed given a sampled Z. In the following equation, we use notation from ref. 29 and a tutorial from ref. 69. We approximate the posterior distribution \(p_\theta (Z|X,S)\) using the variational distribution \(q_\phi (Z|X,S)\), which is modeled by a deep neural network parameterized with ϕ:

$$\begin{aligned}L_{\textrm{CVAE}}(X,S;\phi ,\theta ) &= \log p_\theta (X\mid S) - \alpha \cdot D_{\textrm{KL}}(q_\phi (Z|X,S)||p_\theta (Z|X,S)) = \\ &= \mathbb{E}_{q_\phi (Z\mid X,S)}[\log p_\theta (X\mid Z,S)] - \alpha \cdot D_{\textrm{KL}}(q_\phi (Z|X,S)||p_\theta (Z|S)),\end{aligned}$$

where \(\theta = \{ \theta \prime ,\theta _z,\theta _s\}\) and \(\phi = \{ \phi \prime ,\phi _x,\phi _s\}\) are parameters of decoder and encoder, respectively, \({\mathbb{E}}\) is the expectation and \(D_{\textrm{KL}}\) is the Kullback-Leibler divergence scaled by parameter α. On the left-hand side, we have the log likelihood of the data and an error term that depends on the capacity of the model. The right-hand side of the above equation is also known as the evidence lower bound. CVAE70 is an extension of the VAE framework in which \(S \ne \emptyset\).

scArches trVAE

trVAE30 builds upon VAE68 with an extra regularization to further match the distribution between conditions. Following the method proposed by Lotfollahi et al.30, we use the representation of the first layer in the decoder, which is regularized by maximum mean discrepancy71. For implementation, we use multi-scale radial basis function (RBF) kernels defined as

$$k\left( { x,x\prime } \right) = \mathop {\sum }\limits_{i = 1}^l k\left( {x,x\prime ,\gamma _i} \right),$$

where \(k\left( {x,x\prime ,\gamma _i} \right) = e^{ - \gamma _i\| x - x\prime \|^2}\), \(\gamma_i\) is a hyperparameter, and l denotes the number of RBF kernels.

We will parameterize the encoder and decoder part of scArches as f ϕ and g θ , respectively. So the networks f ϕ and g θ will accept inputs x, s and z, s, respectively. Let us distinguish the first (\(g_{\theta _z,\theta _s}^{(1)}\)) and the remaining layers (\(g_{\theta \prime }^{(2)}\)) of the decoder network \(g_\theta = g_{\theta \prime }^{(2)} \circ g_{\theta _z,\theta _s}^{(1)}\). Therefore, we can define the following maximum mean discrepancy (MMD) cost function:

$$L_{\textrm{MMD}}(X,S;\phi ,\theta _z,\theta _s) = \mathop {\sum }\limits_{i \ne j}^{\ {\textrm{No. studies}}} l_{\textrm{MMD}}(g_{\theta _z,\theta _s}^{(1)}(f_\phi (X_{S = i},i),i),g_{\theta _z,\theta _s}^{(1)}(f_\phi (X_{S = j},j),j)),$$

where

$$\begin{array}{rcl}l_{\textrm{MMD}}(X,X^\prime ) & = & \frac{1}{{N_0^2}}\mathop {\sum }\limits_{n = 1}^{N_0} \mathop {\sum }\limits_{m = 1}^{N_0} k(x_n,x_m) \\ && + \frac{1}{{N_1^2}}\mathop {\sum }\limits_{n = 1}^{N_1} \mathop {\sum }\limits_{m = 1}^{N_1} k(x_n^\prime ,x_m^\prime ) - \frac{2}{{N_0N_1}}\mathop {\sum }\limits_{n = 1}^{N_0} \mathop {\sum }\limits_{m = 1}^{N_1} k(x_n,x_m^\prime ).\end{array}$$

We used the notation \(X_{S = i}\) for samples drawn from the ith study distribution in the training data. Finally, the trVAE cost function is

$$L_{\textrm{trVAE}}(X,S;\phi ,\theta ) = L_{\textrm{CVAE}}(X,S;\phi ,\theta ) - \beta \cdot L_{\textrm{MMD}}(X,S;\phi ,\theta _z,\theta _s),$$

where β is a regularization scale parameter. The gradients of the trVAE cost function with respect to \(\phi_s\) and \(\theta_s\) are

$$\begin{array}{l}\nabla _{\phi _s}L_{\textrm{trVAE}}(X,S;\theta ,\phi ) = \nabla _{\phi _s}L_{\textrm{CVAE}}(X,S;\theta ,\phi ) - \beta \cdot \nabla _{\phi _s}L_{\textrm{MMD}}(X,S;\phi ,\theta _z,\theta _s),\\ \nabla _{\theta _s}L_{\textrm{trVAE}}(X,S;\theta ,\phi ) = \nabla _{\theta _s}L_{\textrm{CVAE}}(X,S;\theta ,\phi ) - \beta \cdot \nabla _{\theta _s}L_{\textrm{MMD}}(X,S;\phi ,\theta _z,\theta _s).\end{array}$$

Therefore, \(L_{\textrm{trVAE}}\) can be optimized using stochastic gradient ascent with respect to \(\phi_s\) and \(\theta_s\), as all the other parameters are frozen.
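The MMD term can be illustrated with a minimal sketch in plain Python, assuming that the first-layer decoder representations of each study are already available as lists of vectors. The function names and the example γ values are illustrative assumptions, not those of trVAE or scArches.

```python
import math

def rbf_multiscale(x, y, gammas):
    """Multi-scale RBF kernel: sum over gamma_i of exp(-gamma_i * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return sum(math.exp(-g * sq_dist) for g in gammas)

def mmd(X0, X1, gammas):
    """Empirical l_MMD between two samples (lists of vectors)."""
    def mean_kernel(A, B):
        return sum(rbf_multiscale(a, b, gammas) for a in A for b in B) / (len(A) * len(B))
    return mean_kernel(X0, X0) + mean_kernel(X1, X1) - 2.0 * mean_kernel(X0, X1)

def mmd_over_studies(representations, gammas):
    """Sum l_MMD over all ordered pairs of distinct studies (i != j).

    `representations` maps a study label to the list of first-layer decoder
    outputs computed for the cells of that study.
    """
    studies = list(representations)
    return sum(mmd(representations[i], representations[j], gammas)
               for i in studies for j in studies if i != j)

# Hypothetical toy representations for two studies and three kernel scales.
reps = {"study_a": [[0.0, 0.1], [0.2, 0.0]], "study_b": [[1.0, 1.1], [0.9, 1.0]]}
penalty = mmd_over_studies(reps, gammas=[0.1, 1.0, 10.0])
```

The penalty is zero when two samples coincide and grows as the study-wise representations drift apart, which is the behavior that the regularizer relies on.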

scArches scVI

Lopez et al.27 developed a fully probabilistic approach, called scVI, for normalization and analysis of scRNA-seq data. scVI is also based on a CVAE, described in detail above. In contrast to the trVAE architecture, however, the decoder assumes a zero-inflated negative binomial (ZINB) distribution; therefore, the reconstruction loss differs from the MSE loss of trVAE. Another major difference is that scVI explicitly models the library size, which is needed for the ZINB loss calculation, with another shallow neural network called the library encoder. Therefore, with similar notation as above, we have the output distribution \(p(X|Z,S,L)\), where L is the scaling factor that is sampled by the outputs of the library encoder, namely the empirical mean \(L_\mu\) and the variance \(L_\sigma^2\) of the log library size per batch:

$$L \sim {\textrm{lognormal}}(L_\mu ,L_\sigma ^2).$$

When we now separate the outputs of the decoder g θ into \(g_\theta ^x\), the decoded mean proportion of the expression data, and \(g_\theta ^d\), the decoded dropout effects, we can write the ZINB mass function for \(p(X|Z,S,L)\) in the following closed form:

$$\left\{\begin{array}{l} p(X = 0 | Z, S, L) = \\ \qquad g_{\theta}^d(Z, S) + (1 - g_{\theta}^d(Z, S))\left(\frac{\Sigma}{\Sigma + L \cdot g_{\theta}^x(Z,S)} \right)^{\Sigma} \\ p(X = Y | Z, S, L) =\\ \qquad (1 - g_{\theta}^d(Z,S))\frac{\Gamma(Y + \Sigma)}{\Gamma(Y+1)\Gamma(\Sigma)}\left(\frac{\Sigma}{\Sigma + L \cdot g_{\theta}^x(Z,S)}\right)^{\Sigma}\left(\frac{L \cdot g_{\theta}^x(Z,S)}{\Sigma + L \cdot g_{\theta}^x(Z, S)}\right)^{Y},\end{array}\right.$$

where Σ is the gene-specific inverse dispersion, Γ is the gamma function, and Y represents non-zero entries drawn from a ZINB distribution. Because the evidence lower bound and therefore the optimization objective can be calculated by applying the reparameterization trick and supposing Gaussians, which is possible here because of the proposed ZINB distribution, we can write the scVI cost function as follows:

$$L_{\textrm{scVI}}(X,S;\phi ,\theta ) = L_{\textrm{CVAE}}(X,S;\phi ,\theta ) - \alpha \cdot D_{\textrm{KL}}(q_\phi (L|X,S)||p_\theta (L)).$$

Furthermore, because of the applied reparameterization trick, an automatic differentiation operator can be used, and the cost function can be optimized by applying stochastic gradient descent. For the application in scArches, we removed the library encoder and computed the library size for each batch in a closed form by summing up the counts. This does not decrease the performance of the model and accelerates the surgery step. The resulting network can then be used similarly to the trVAE network by simply retraining only the condition weights corresponding to the new batch annotations in S.
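As an illustration of the ZINB mass function, the following minimal sketch evaluates it for a single gene in plain Python. It assumes the standard negative binomial parameterization with mean μ = L · \(g_\theta^x\) and inverse dispersion Σ; the function and argument names are hypothetical and are not part of the scVI API.

```python
import math

def zinb_pmf(y, mean, inv_dispersion, dropout):
    """ZINB probability of observing count y for one gene.

    mean           -- library size times decoded mean proportion (L * g_x), > 0
    inv_dispersion -- gene-specific inverse dispersion (Sigma), > 0
    dropout        -- decoded dropout probability (g_d), in [0, 1)
    """
    theta = inv_dispersion
    # Negative binomial log-probability, computed with log-gamma for stability.
    log_nb = (math.lgamma(y + theta) - math.lgamma(y + 1) - math.lgamma(theta)
              + theta * math.log(theta / (theta + mean))
              + y * math.log(mean / (theta + mean)))
    nb = math.exp(log_nb)
    if y == 0:
        # Zeros arise either from dropout or from the negative binomial itself.
        return dropout + (1.0 - dropout) * nb
    return (1.0 - dropout) * nb

# Hypothetical values: library size 1,000, mean proportion 0.002, Sigma = 3.
p_zero = zinb_pmf(0, mean=1000 * 0.002, inv_dispersion=3.0, dropout=0.2)
```

Summing `zinb_pmf` over all counts returns 1 up to truncation error, which is a quick sanity check for any implementation of the reconstruction loss.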

scArches scANVI

scANVI is a semi-supervised method that builds on the scVI model and was described in detail by Xu et al.31. By constructing a mixture model, it is able to make use of any cell type annotations during autoencoder training to improve the latent representation of the data. In addition, scANVI is capable of labeling datasets with only some marker gene labels as well as transferring labels from a labeled dataset to an unlabeled dataset. For the training of scANVI, the authors proposed an alternating optimization of the cost function \(L_{\textrm{scANVI}}(X,S;\phi ,\theta )\) and the classification loss C, which results from a shallow neural network that serves as a classifier with a cross-entropy loss after the last softmax layer. In more detail, the cost function can be formulated in the following manner:

$$L_{\textrm{scANVI}}(X,S;\phi ,\theta ) = L_{\textrm{labeled}}(X,S,C;\phi ,\theta ) + L_{\textrm{unlabeled}}(X,S;\phi ,\theta ),$$

where C denotes the cell types in the annotated datasets, and both cost function summands \(L_{\textrm{labeled}}\) and \(L_{\textrm{unlabeled}}\) are obtained by calculations similar to those for scVI. The major difference here, however, is that the Kullback–Leibler divergence is applied to an additional latent encoder that takes cell type annotations into account. For the unlabeled case, each sample is broadcast into every available cell type. As scANVI builds on scVI, we use the same adjustments here to apply surgery. On top of that, we also freeze the classifier even for semi-supervised query data, because we want an unchanging reference performance for building a cell atlas and also want to force cells in the query data with the same cell type annotation to lie near the corresponding reference cells in the latent representation.

scArches totalVI

For the purpose of combining paired measurement of RNA and surface proteins from the same cells, such as for CITE-seq data, Gayoso et al.32 presented a deep generative model called totalVI. totalVI learns a joint low-dimensional probabilistic representation of RNA and protein measurements. For the RNA portion of the data, totalVI uses an architecture similar to that of scVI, which we discussed in detail above; but, for proteins, a new model is introduced that separates protein information into background and foreground components. With the surgery functionality of scArches added to totalVI, it is now possible to learn a joint latent space of RNA and protein data on a CITE-seq reference dataset and do surgery on a query dataset with only RNA data to impute protein data for that query dataset as well. To accomplish this goal, we again only retrain the weights that correspond to the new batch labels.

CVAEs for single-cell genomics

CVAEs were first applied to scRNA-seq data in scVI29 for data integration and differential testing. Here we focus on how CVAEs perform data integration and on potential pitfalls. These models receive a matrix of gene expression profiles for cells (X) and a label (condition) matrix (S). The condition matrix comprises a nuisance variable, which we want to regress out from the data. Labels can encode batch, technologies, disease state or other discrete variables. The CVAE model seeks to infer a low-dimensional latent space (Z) for the cells that is free of variation explained by the label variable. For example, if the labels are the experimental batches, then cells of similar type that are separated by batch effects in the original gene expression space will be aligned together. Importantly, variation attributed to the labels is regressed out only in the latent space and is still present in the output of the CVAE. Therefore, the reconstructed output will still contain batch effects. Additionally, while autoencoder-based data-integration methods were shown to perform best when outputting integrated embeddings, these methods can also output corrected expression matrices. This is achieved by forcing all batches to be transformed to a specific batch, as previously shown in scGen.

scArches builds upon existing CVAEs. The results of the integration heavily depend on the type of labels used as batch covariates for condition inputs. If the dataset is the batch covariate, within-dataset donor effects will not be removed, but donors become more comparable across datasets. In our COVID-19 example, the disease is used as a query and thus is not captured fully in the encoder, which is trained on data from healthy individuals. Adaptor training removes the donor- and/or dataset-specific batch effect from a disease sample but does not remove variation unseen in network training. Thus, choice of training data and choice of batch covariate are crucial to assess whether variation from disease is removed in training or not.

Overall, the choice and design of the label matrix is a crucial step for optimal outcome. The label matrix can encode one covariate (for example, batch), multiple covariates (for example, technology, cell types, disease, species,…) or a combination of covariates (for example, technology and species). However, the interpretability of the latent space will be challenging in the presence of complex label design and will require extra caution.

Model sharing

We currently support an application programming interface to upload and download model weights and data (if available) using Zenodo. Zenodo is a general-purpose open-access repository developed to enable researchers to share datasets and software. We have provided step-by-step guides for the whole pipeline from training and uploading models to downloading, updating and further sharing models. These tutorials can be found in the scArches GitHub repository (https://github.com/theislab/scarches).

Feature overlap between reference and query

An important practical challenge for reference mapping using scArches is the number of features (genes) that are shared between the query and the reference model and/or dataset. It is important to note that, with the current pipeline, the query data must have the same gene set as the reference model. Therefore, the user has to replace missing reference genes in the query with zeros. We investigated the effect of zero filling and observed that integration performance was robust when 10% (of 2,000 genes) were missing from query data. However, the performance will deteriorate with larger differences between query and reference (Supplementary Fig. 28a). We further observed good integration with 4,000 HVGs, even when 25% of genes were missing from the query data, conveying that the model would be robust if the overall number of shared genes is large (for example, 4,000 HVGs, Supplementary Fig. 28b).
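The zero-filling step can be sketched as follows, assuming that the query counts are stored as a list of rows (one per cell) alongside a list of gene names; the function name is illustrative and is not part of the scArches API.

```python
def align_to_reference(query_counts, query_genes, reference_genes):
    """Reorder query columns to the reference gene order and zero-fill gaps.

    Returns the aligned matrix and the list of reference genes that were
    missing from the query (and were therefore filled with zeros).
    """
    column = {gene: i for i, gene in enumerate(query_genes)}
    aligned = [[row[column[g]] if g in column else 0.0 for g in reference_genes]
               for row in query_counts]
    missing = [g for g in reference_genes if g not in column]
    return aligned, missing

# Hypothetical example: the query lacks one of three reference genes.
aligned, missing = align_to_reference(
    query_counts=[[5.0, 1.0], [0.0, 2.0]],
    query_genes=["Gfap", "Plp1"],
    reference_genes=["Plp1", "C1qa", "Gfap"],
)
fraction_missing = len(missing) / 3
```

Reporting the fraction of zero-filled genes makes it easy to check whether a query falls inside the range for which integration was observed to be robust.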

Evaluation metrics

Evaluation metrics and their definitions in the current paper were taken from work by Luecken et al.7, unless specifically stated otherwise.

Entropy of batch mixing

This metric43 works by constructing a fixed similarity matrix for cells. The entropy of mixing in a region of cells with c batches is defined as

$$E = - \mathop {\sum }\limits_{i = 1}^c p_i\log_c(p_i),$$

where \(p_i\) is defined as

$$p_i = \frac{{\ {\textrm{no. cells with batch}}\,i\,{\textrm{in the region}}}}{{\ {\textrm{no. cells in the region}}}}.$$

Next, we define U, a uniform random variable on the cell population. Let \(B_U\) denote, for each batch x, the frequency of that batch among the 15 nearest neighbors of the cell U. We report the entropy of this variable and then average across T = 100 measurements of U. To normalize the entropy of the batch mixing score between 0 and 1, we set the base of the logarithm to the number of batches c.
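A minimal sketch of this procedure in plain Python is given below. It uses the usual sign convention for entropy (−Σ p log p), so that the score lies between 0 and 1, and a brute-force neighbor search; the function names and the fixed random seed are assumptions made for illustration.

```python
import math
import random
from collections import Counter

def mixing_entropy(batch_labels, n_batches):
    """Entropy of the batch frequencies in a region, with log base n_batches."""
    if n_batches < 2:
        return 0.0
    total = len(batch_labels)
    return -sum((count / total) * math.log(count / total, n_batches)
                for count in Counter(batch_labels).values())

def entropy_of_batch_mixing(embedding, batches, k=15, n_samples=100, seed=0):
    """Average neighborhood entropy over randomly drawn cells."""
    rng = random.Random(seed)
    n_batches = len(set(batches))
    scores = []
    for _ in range(n_samples):
        u = rng.randrange(len(embedding))
        # Brute-force k nearest neighbors of cell u, excluding u itself.
        neighbors = sorted((i for i in range(len(embedding)) if i != u),
                           key=lambda i: math.dist(embedding[u], embedding[i]))[:k]
        scores.append(mixing_entropy([batches[i] for i in neighbors], n_batches))
    return sum(scores) / len(scores)
```

For example, `mixing_entropy(["a", "b", "a", "b"], 2)` evaluates to 1 (perfect mixing of two batches), whereas a neighborhood drawn from a single batch gives 0.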

Average silhouette width

Silhouette width measures the relationship between within-cluster distances of a cell and between-cluster distances of that cell to the closest cluster. In general, an ASW score of 1 implies clusters that are well separated, an ASW score of 0 implies overlapping clusters, and an ASW score of −1 implies strong misclassification. When we use the ASW score as a measure of biological variance, we calculate it on cell types in the following manner:

$${\textrm{ASW}}_c = \frac{{\textrm{ASW} + 1}}{2},$$

where the final score is already scaled between 0 and 1. Therefore, larger values correspond to denser clusters. In contrast to the \({\textrm{ASW}}_c\) score, we also calculate an ASW score on batches within cell clusters to obtain a measure for batch-effect removal. In this case, we again scale but also invert the ASW score to have a consistent metric comparison:

$${\textrm{ASW}}_b = 1 - {\textrm{abs}}({\textrm{ASW}}).$$

A higher final score here implies better mixing and therefore a better batch-removal effect.
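Both rescalings can be illustrated with a small, brute-force silhouette computation in plain Python. The sketch applies the two formulas above to a global ASW and does not reproduce the per-cluster handling of batches; assigning a width of 0 to cells in singleton groups is an assumed convention, and the function names are illustrative.

```python
import math

def silhouette_widths(points, labels):
    """Silhouette width of every point (0 for singleton groups, by convention)."""
    widths = []
    for i, p in enumerate(points):
        dists = {}
        for j, q in enumerate(points):
            if j != i:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.pop(labels[i], None)
        if not own or not dists:
            widths.append(0.0)
            continue
        a = sum(own) / len(own)                             # mean within-group distance
        b = min(sum(v) / len(v) for v in dists.values())    # closest other group
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths

def asw(points, labels):
    widths = silhouette_widths(points, labels)
    return sum(widths) / len(widths)

def asw_cell_type(points, cell_types):
    """ASW_c = (ASW + 1) / 2; higher means better-separated cell types."""
    return (asw(points, cell_types) + 1.0) / 2.0

def asw_batch(points, batches):
    """ASW_b = 1 - |ASW|; higher means better batch mixing."""
    return 1.0 - abs(asw(points, batches))

# Hypothetical latent space: two cell types, each containing both batches.
latent = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
bio_score = asw_cell_type(latent, ["T", "T", "B", "B"])
batch_score = asw_batch(latent, ["b1", "b2", "b1", "b2"])
```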

Normalized mutual information

We use NMI to compare the overlap of two different cell type clusterings. In detail, we computed a Louvain clustering on the latent representation of the data and compared it to the cell type labels. To obtain scores between 0 and 1, the overlap was scaled using the mean of the entropy terms for cell type and cluster labels. Therefore, an NMI score of 1 corresponds to a perfect match and good conservation of biological variance, whereas an NMI score of 0 corresponds to uncorrelated clustering.
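The scaling by the mean of the two entropy terms can be made concrete with a minimal sketch in plain Python, assuming arithmetic-mean normalization; the function names are illustrative, and returning 1 when both labelings are constant is an assumed convention.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(cell_types, clusters):
    """Mutual information divided by the mean of the two label entropies."""
    n = len(cell_types)
    joint = Counter(zip(cell_types, clusters))
    count_a, count_b = Counter(cell_types), Counter(clusters)
    mutual_info = sum((c / n) * math.log(c * n / (count_a[a] * count_b[b]))
                      for (a, b), c in joint.items())
    mean_entropy = (entropy(cell_types) + entropy(clusters)) / 2.0
    return mutual_info / mean_entropy if mean_entropy > 0 else 1.0
```

For instance, `nmi(["T", "T", "B", "B"], [0, 0, 1, 1])` evaluates to 1 (up to floating-point error), while `nmi(["T", "T", "B", "B"], [0, 1, 0, 1])` evaluates to 0.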

Adjusted Rand index

This metric considers correct clustering overlaps as well as counting correct disagreements between two clusterings. Again, similar to NMI, cell type labels in the integrated dataset are compared with Louvain clustering. The adjusted Rand index score is normalized between 0 and 1, where 1 corresponds to good conservation of biological variance and 0 corresponds to random labeling.

Principal-component regression

In contrast to principal-component analysis (PCA), we calculate a linear regression R with respect to the batch label onto each principal component. The total variance (Var) explained by the batch variable can then be formulated as follows:

$${\textrm{Var}}(X|B) = \mathop {\sum }\limits_{i = 1}^N {\textrm{Var}}(X|{\textrm{PC}}_i) \cdot R^2({\textrm{PC}}_i|B),$$

where X is the data matrix, B is the batch label, and N is the number of principal components (PC).
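As a sketch of this formula, the code below assumes that the principal-component scores and the variance explained by each component have already been computed. Because the batch label is categorical, the regression of a component onto the batch reduces to fitting one mean per batch; the function names are illustrative.

```python
def r_squared_on_batch(scores, batches):
    """R^2 of a linear regression of one PC onto the (one-hot) batch label."""
    n = len(scores)
    grand_mean = sum(scores) / n
    ss_total = sum((v - grand_mean) ** 2 for v in scores)
    if ss_total == 0:
        return 0.0
    groups = {}
    for value, batch in zip(scores, batches):
        groups.setdefault(batch, []).append(value)
    # The fitted value of each cell is the mean score of its batch.
    ss_residual = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    return 1.0 - ss_residual / ss_total

def variance_explained_by_batch(pc_scores, pc_variances, batches):
    """Sum over PCs of Var(X | PC_i) * R^2(PC_i | B)."""
    return sum(variance * r_squared_on_batch(scores, batches)
               for scores, variance in zip(pc_scores, pc_variances))

# Hypothetical example: PC1 separates the batches completely, PC2 not at all.
explained = variance_explained_by_batch(
    pc_scores=[[0.0, 0.0, 1.0, 1.0], [0.0, 1.0, 0.0, 1.0]],
    pc_variances=[3.0, 1.0],
    batches=["b1", "b1", "b2", "b2"],
)
```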

Graph connectivity

For this metric, we calculate a subset kNN graph \(G(N_c,E_c)\) for each cell type label c, such that each subset only contains cells from the given label. The total graph connectivity score can then be calculated as follows:

$$gc = \frac{1}{{|C|}}\mathop {\sum}\limits_{c \in C} {\frac{{|{\textrm{LCC}}(G(N_c,E_c))|}}{{|N_c|}}} ,$$

where C is the set of cell type labels, \(|{\textrm{LCC}}()|\) is the number of nodes in the largest connected component of the graph, and \(|N_c|\) is the number of nodes with the given cell type label. This means that we check if the graph representation of the latent representation connects all cells with the same cell type label. Therefore, a score of 1 would imply that all cells with the same cell type label are connected, which would further indicate good batch mixing. A graph in which no cells are connected would result in a score of 0.
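The score can be sketched in plain Python as follows, assuming that the kNN graph is supplied as a list of undirected edges between cell indices and that the subgraph for each label is obtained by keeping only edges between cells of that label; the function names are illustrative.

```python
from collections import defaultdict, deque

def largest_component_size(nodes, adjacency):
    """Size of the largest connected component of the subgraph induced by nodes."""
    nodes, seen, best = set(nodes), set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:  # breadth-first search restricted to the label's cells
            v = queue.popleft()
            size += 1
            for w in adjacency[v]:
                if w in nodes and w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best

def graph_connectivity(edges, labels):
    """Mean over labels of |LCC(G(N_c, E_c))| / |N_c|."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    by_label = defaultdict(list)
    for node, label in enumerate(labels):
        by_label[label].append(node)
    return sum(largest_component_size(nodes, adjacency) / len(nodes)
               for nodes in by_label.values()) / len(by_label)
```

For five cells labeled `["T", "T", "T", "B", "B"]`, the edges `[(0, 1), (1, 2), (3, 4)]` give a score of 1, whereas dropping the edge `(1, 2)` leaves one T cell disconnected and lowers the score to (2/3 + 1)/2.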

Isolated label F1

We defined isolated labels as cell type labels that are present in the least number of batches. If there are multiple isolated labels, we simply take the mean of each score. To determine how well those cell types are separated from other cell types in the latent representation, we first determine the cluster with the largest number of cells of an isolated label. Subsequently, an \(F_1\) score of the isolated label against all other labels within that cluster is computed, where the \(F_1\) score is defined as follows:

$$F_1 = 2\frac{{{\textrm{precision}} \cdot {\textrm{recall}}}}{{\textrm{precision} + {\textrm{recall}}}}.$$

This results in a score between 0 and 1 once again, where 1 implies that all cells with the isolated label are captured in the cluster.
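A minimal sketch of this computation is shown below, assuming that cluster assignments (for example, from Louvain clustering) are already available; the function name is illustrative.

```python
from collections import Counter

def isolated_label_f1(labels, clusters, isolated_label):
    """F1 of the isolated label within the cluster that holds most of its cells."""
    label_clusters = [c for l, c in zip(labels, clusters) if l == isolated_label]
    if not label_clusters:
        raise ValueError("isolated_label does not occur in labels")
    best_cluster, true_positives = Counter(label_clusters).most_common(1)[0]
    cluster_size = sum(1 for c in clusters if c == best_cluster)
    precision = true_positives / cluster_size       # purity of the cluster
    recall = true_positives / len(label_clusters)   # share of the label captured
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: two of three isolated cells share cluster 0 with one other cell.
score = isolated_label_f1(
    labels=["iso", "iso", "iso", "other", "other"],
    clusters=[0, 0, 1, 0, 1],
    isolated_label="iso",
)
```

In this example, precision and recall are both 2/3, so the score is 2/3.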

Isolated label silhouette

For this metric, we use \({\textrm{ASW}}_c\), defined above, but only on the isolated label subset of the latent representation. Scaling and meaning of the score are the same as described for ASW. If there are multiple isolated labels, we average over each score, similar to the isolated label \(F_1\) score.

kNN accuracy

We first compute the 15 nearest neighbors of each cell in the data. We then compute the ratio of the correct cell type annotations inside those 15 neighbors. This cell-wise score is then averaged over all cell types separately and then averaged over all remaining scores again to obtain a single kNN-accuracy score between 0 and 1. A higher kNN-accuracy score corresponds to better preservation of local cell type purity. This metric was inspired by a similar metric used in scANVI.
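The two-stage averaging can be sketched as follows with a brute-force neighbor search in plain Python; the function name is illustrative, and Euclidean distance in the latent space is an assumption.

```python
import math
from collections import defaultdict

def knn_accuracy(embedding, labels, k=15):
    """Mean over cell types of the mean neighborhood purity of their cells."""
    per_type = defaultdict(list)
    for i, point in enumerate(embedding):
        # Brute-force k nearest neighbors of cell i, excluding the cell itself.
        neighbors = sorted((j for j in range(len(embedding)) if j != i),
                           key=lambda j: math.dist(point, embedding[j]))[:k]
        purity = sum(labels[j] == labels[i] for j in neighbors) / len(neighbors)
        per_type[labels[i]].append(purity)
    type_means = [sum(v) / len(v) for v in per_type.values()]
    return sum(type_means) / len(type_means)

# Hypothetical one-dimensional latent space with two well-separated cell types.
latent = [[0.0], [0.1], [5.0], [5.1]]
accuracy = knn_accuracy(latent, ["T", "T", "B", "B"], k=1)
```

Averaging per cell type first prevents abundant cell types from dominating the final score.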

Visualization of integration scores

To compare performances of different models, we designed an overview table (inspired by Saelens et al.72) that displays individual integration scores as circles and aggregated scores as bars. Each individual score is minimum–maximum scaled to improve visual comparison of different models and then averaged into aggregated scores by category (batch correction and biological conservation). Finally, an overall score is calculated as a weighted sum of batch correction and bio-conservation, considering a ratio of 40:60, respectively. When shown, reference and query times are not considered in the calculation of aggregated scores. Moreover, these time values are scaled together to allow direct comparison. The overall ranking of each model, for each score, is represented by the color scheme.
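The aggregation can be sketched as follows, assuming that each metric is given as a list with one raw score per model; the function names are illustrative, and mapping a metric with identical values across models to 0 is an assumed convention.

```python
def min_max_scale(values):
    """Scale scores of one metric across models to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0] * len(values)
    return [(v - low) / (high - low) for v in values]

def aggregate(metrics):
    """Per-model mean of the min-max scaled metrics of one category."""
    scaled = [min_max_scale(values) for values in metrics.values()]
    n_models = len(scaled[0])
    return [sum(column[i] for column in scaled) / len(scaled) for i in range(n_models)]

def overall_scores(batch_metrics, bio_metrics, w_batch=0.4, w_bio=0.6):
    """Weighted sum of batch correction and bio-conservation (40:60)."""
    batch, bio = aggregate(batch_metrics), aggregate(bio_metrics)
    return [w_batch * b + w_bio * c for b, c in zip(batch, bio)]

# Hypothetical raw scores for two models.
scores = overall_scores(
    batch_metrics={"ASW_b": [0.2, 0.8]},
    bio_metrics={"NMI": [0.9, 0.5]},
)
```

Here the first model is worst on batch correction and best on bio-conservation, so it receives 0.6, and the second model receives 0.4.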

Datasets

All cell type labels and metadata were obtained from original publications unless specifically stated otherwise below.

Brain data

The mouse brain dataset is a collection of four publicly available scRNA-seq mouse brain studies1,33,34,35, for which additional information on cerebral regions was provided. We obtained the raw count matrix from Rosenberg et al.34 under GEO accession ID GSE110823, the annotated count matrix from Zeisel et al.35 from http://mousebrain.org (file name L5_all.loom, downloaded on 9 September 2019) and count matrices per cell type from Saunders et al.33 from http://dropviz.org (DGE by region section, downloaded on 30 August 2019). Data from mouse brain tissue sorted by flow cytometry (myeloid and non-myeloid cells, including the annotation file annotations_FACS.CSV) from TM were obtained from https://figshare.com (retrieved 14 February 2019). We harmonized cluster labels via fuzzy string matching and attempted to preserve the original annotation as far as possible. Specifically, we annotated ten major cell types (neuron, astrocyte, oligodendrocyte, oligodendrocyte precursor cell, endothelial cell, brain pericyte, ependymal cell, olfactory ensheathing cell, macrophage and microglia). In the case of Saunders et al.33, we used the additional annotation data table for 585 reported cell types (annotation.BrainCellAtlasSaundersversion2018.04.01.TXT, retrieved from http://dropviz.org on 30 August 2019). Among these, some cell types were annotated as ‘endothelial tip’, ‘endothelial stalk’ and ‘mural’. We examined the subset of the Saunders et al.33 dataset as follows: we clustered the cells using Louvain clustering (default resolution parameter, 1.0) and then profiled gene expression using the rank_genes_groups function in scanpy. Using marker gene expression, we assigned microglia (C1qa), oligodendrocytes (Plp1), astrocytes (Gfap, Clu) and endothelial cells (Flt1) to the subset. Finally, we applied scran73 normalization and a log(counts + 1) transformation to the count matrices. In total, the dataset consists of 978,734 cells.

Pancreas

Five publicly available pancreatic islet datasets74,75,76,77,78, with a total of 15,681 cells in raw count matrix format were obtained from the Scanorama42 dataset, which has already assigned its cell types using batch-corrected gene expression by Scanorama. The Scanorama dataset was downloaded from http://scanorama.csail.mit.edu/data.tar.gz. In the preprocessing step, raw count datasets were normalized and log transformed by scanpy preprocessing methods. Preprocessed data were used directly for the pipeline of scArches. One thousand HVGs were selected for training the model.

The human cell landscape

The HCL dataset was obtained from https://figshare.com/articles/HCL_DGE_Data/7235471. Raw count matrix data for all tissues were aggregated. A total of 277,909 cells were selected and processed using the scanpy Python package. Data were normalized using size factor normalization such that every cell had 10,000 counts and then log transformed. Finally, 5,000 HVGs were selected as per their average expression and dispersion. We used processed data directly for training scArches at the pre-training phase.

The mouse cell atlas

The mouse cell atlas dataset was obtained from https://figshare.com/articles/HCL_DGE_Data/7235471. Raw count matrix data for all tissues were aggregated together. A total of 150,126 cells were selected and processed using the scanpy Python package. Homologous genes were selected using BioMart 100 before merging with HCL data. Data were normalized together with HCL as explained before.

Immune data

The immune dataset consists of ten human samples from two different tissues: bone marrow and peripheral blood. Data from bone marrow samples were retrieved from Oetjen et al.36, while data from peripheral blood samples were obtained from 10x Genomics (https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3), Freytag et al.37, Sun et al.38 and Villani et al.79. Details on the retrieval location of datasets, the different protocols used and ways in which samples were chosen for analysis can be found in Luecken et al.7. We performed quality control separately for each sample but adopted a common strategy for normalization: all samples for which count data were available were individually normalized by scran pooling73. This excludes data from Villani et al.79, which included only TPM values. All datasets were log+1 transformed in scanpy80. Cell type labels were harmonized starting from existing annotations (Oetjen et al.36) to create a consistent set of cell identities. Well-known markers of cell types were collected and used to extend annotation to samples for which they were not previously available. When necessary, subclustering was performed to derive more precise labeling. Finally, cell populations were removed if no label could be assigned. Four thousand HVGs were selected for training.

Endocrine pancreas

The raw dataset of pancreatic endocrinogenesis (n = 22,163)45 is available at the GEO under accession number GSE132188. We considered a subset of 2,000 HVGs for training. Cell type labels were obtained from an adata object provided by the authors of scVelo46.

CITE-seq

We obtained three publicly available datasets from 10x Genomics, already curated and preprocessed as described in the totalVI study32. These data include ‘10k PBMCs from a Healthy Donor—Gene Expression and Cell Surface Protein’ (PBMC, 10k (CITE-seq)81), ‘5k PBMCs from a healthy donor with cell surface proteins (v3 chemistry)’ (PBMC, 5k (CITE-seq)82) and ‘10k PBMCs from a Healthy Donor (v3 chemistry)’ (PBMC, 10k (RNA-seq)57,83,84). Reference data included 14 proteins, and 4,000 HVGs were selected for training.

COVID-19

The COVID-19 dataset along with its metadata was downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1459261 and https://github.com/zhangzlab/covid_balf. The dataset that was used in this paper includes n = 62,469 cells. Data from lungs52,53,54, PBMCs37,38,39 and bone marrow36 were later merged with those from COVID-19 samples. Data were normalized using scanpy, and 2,000 HVGs were selected for training the model. Cell type labels were obtained from the original study.

Tabula Muris Senis

The TM Senis dataset with GEO accession number GSE132042 is publicly available at https://figshare.com/projects/Tabula_Muris_Senis/64982. The dataset contains 356,213 cells with cell type, tissue and method annotation. We normalized the data using size factor normalization with 10,000 counts for each cell. Next, we log+1 transformed the dataset and selected 5,000 HVGs according to their average expression and dispersion. All preprocessing steps were carried out using the scanpy Python package. In this study, we used a combination of sequencing technology and time point as batch covariates.

Benchmarks

Full integration methods

To make results comparable with those of the deep learning models, which had latent representations of size 10–20, we ran PCA with 20 principal components on the final results from Seurat, Scanorama and mnnCorrect before computing metrics (a similar approach is described in ref. 31).

Harmony: we used the HarmonyMatrix function from the Harmony package. We provided the function with a PCA matrix with 20 principal components computed on the gene expression matrix.

Scanorama: we used the correct_scanpy function from the Scanorama package with default parameters.

Seurat: we applied Seurat as in the walkthrough (https://satijalab.org/seurat/v3.1/integration.html) with default parameters.

Liger: we used the Liger method as in the walkthrough (https://github.com/welch-lab/liger/blob/master/vignettes/walkthrough_pbmc.pdf). We used k = 20, λ = 5 and resolution = 0.4 with other default parameters. We only scaled data as we had already preprocessed data.

Conos: we followed the Conos tutorial at https://htmlpreview.github.io/?https://raw.githubusercontent.com/kharchenkolab/conos/master/doc/walkthrough.html. Unlike the tutorial, we used our own preprocessed data for better comparisons. We used PCA space with parameters k = 30, k.self = 5, ncomps = 30, matching.method = 'mNN' and metric = 'angular' to build the graph. We set the resolution to 1 to find communities. Finally, we saved the corrected pseudo-PCA space with 20 components.

mnnCorrect: we used the mnnCorrect function from the scran package with default parameters.

Cell type-classification methods

Seurat: we followed the walkthrough (https://satijalab.org/seurat/v3.1/integration.html) and used reciprocal PCA for dimension reduction. As described in the original publication48, we examined projection scores and assigned cells with the lowest 20% of values to be ‘unknown’.

SVM: we fitted an SVM model from the scikit-learn library to the reference data and classified query cells. We assigned cells with uncertainty probability greater than 0.7 as ‘unknown’.

Logistic regression: we fitted logistic regression from the scikit-learn library to the reference data and predicted query labels.

All these methods were tested on a machine with one eight-core Intel i7-9700KQ CPU addressing 32 GB RAM and one Nvidia GTX 1080 ti (12 GB) addressing 12 GB VRAM.

Model output

Throughout this paper, all low-dimensional representations were obtained using the latent space of scArches models. The output (reconstruction) of scArches models remains confounded with the condition variables; it is therefore not suited to data-integration applications but is better suited to imputation or denoising scenarios.

Cell type annotation

To classify labels for the query dataset, we trained a weighted kNN classifier on the latent-space representation of the reference dataset. For each query cell c, we extracted its kNNs (\(N_c\)). We computed the standard deviation of the nearest distances:

$${\textrm{s.d.}}_{c,N_c} = \sqrt {\frac{{\mathop {\sum}\nolimits_{n \in N_c} {({\textrm{dist}}(c,n))^2} }}{k}},$$

where dist(c, n) is the Euclidean distance of the query cell c and its neighbors n in the latent space. Next, we applied the Gaussian kernel to distances using

$$D_{c,n,N_c} = e^{ - \frac{{\textrm{dist}(c,n)}}{{(2/{\textrm{s.d.}}_{c,N_c})^2}}}.$$

Next, we computed the probability of assigning each label y to the query cell c by normalizing across all adjusted distances using

$$p(Y = y|X = c,N_c) = \frac{{\mathop {\sum}\nolimits_{i \in N_c} {I(y^{(i)} = y) \cdot D_{c,n_i,N_c}} }}{{\mathop {\sum}\nolimits_{j \in N_c} {D_{c,n_j,N_c}} }},$$

where \(y^{(i)}\) is the label of the ith nearest neighbor and I is the indicator function. Finally, we calculated the uncertainty u for each cell c in the query dataset using its set of closest neighbors in the reference dataset (\(N_c\)). We defined the uncertainty \(u_{c,y,N_c}\) for a query cell c with label y and \(N_c\) as its set of nearest neighbors as

$$u_{c,y,N_c} = 1 - p(Y = y|X = c,N_c).$$

We reported cells with more than 50% uncertainty as unknown to detect out-of-distribution cells with new labels, which do not exist in the training data. Therefore, we labeled each cell c in the query dataset as follows:

$$\begin{array}{l}\hat y_c^\prime = {\textrm{argmin}}_y\,u_{c,y,N_c}\\ \hat{y}_c = \left\{ \begin{array}{lr} \hat{y}^\prime_c & {\mathrm{if}}\ u_{c, \hat{y}^\prime_c, N_c} {\le} 0.5 \\ {\mathrm{unknown}} & {\mathrm{o.w.}}\end{array} \right\}\end{array}$$
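The steps above can be sketched for a single query cell in plain Python. The sketch follows the formulas as written, with a brute-force neighbor search; the function name, the default k and the handling of the degenerate case in which all neighbor distances are zero are assumptions made for illustration.

```python
import math
from collections import defaultdict

def weighted_knn_label(query, ref_points, ref_labels, k=3, max_uncertainty=0.5):
    """Return (predicted label or 'unknown', uncertainty) for one query cell."""
    # k nearest reference cells in the latent space (Euclidean distance).
    neighbors = sorted(((math.dist(query, p), label)
                        for p, label in zip(ref_points, ref_labels)),
                       key=lambda pair: pair[0])[:k]
    # Standard deviation of the neighbor distances (k = number of neighbors).
    sd = math.sqrt(sum(d ** 2 for d, _ in neighbors) / len(neighbors))
    weights = defaultdict(float)
    for dist, label in neighbors:
        # Gaussian kernel; equal weights if all distances are zero (assumption).
        weights[label] += math.exp(-dist / (2.0 / sd) ** 2) if sd > 0 else 1.0
    total = sum(weights.values())
    probabilities = {label: w / total for label, w in weights.items()}
    best = max(probabilities, key=probabilities.get)  # argmin of the uncertainty
    uncertainty = 1.0 - probabilities[best]
    return (best if uncertainty <= max_uncertainty else "unknown"), uncertainty

# Hypothetical reference latent space with two cell types.
reference = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
cell_types = ["A", "A", "A", "B", "B", "B"]
label, uncertainty = weighted_knn_label([0.2, 0.2], reference, cell_types, k=3)
```

A query cell that lies equally far from neighbors of three different labels receives a probability of 1/3 for each of them, so its uncertainty of 2/3 exceeds the 50% threshold and the cell is reported as unknown.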

Protein imputation

For scArches totalVI, missing proteins for RNA-seq-only data were imputed by conditioning query cells as being in the other batches in the reference with protein data. It is possible to impute based on a specific batch or average across all batches. In the example in the paper, the average version was used.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.