Learning interpretable cellular and gene signature embeddings from single-cell transcriptomic data

The advent of single-cell RNA sequencing (scRNA-seq) technologies has revolutionized transcriptomic studies. However, large-scale integrative analysis of scRNA-seq data remains a challenge largely due to unwanted batch effects and the limited transferabilty, interpretability, and scalability of the existing computational methods. We present single-cell Embedded Topic Model (scETM). Our key contribution is the utilization of a transferable neural-network-based encoder while having an interpretable linear decoder via a matrix tri-factorization. In particular, scETM simultaneously learns an encoder network to infer cell type mixture and a set of highly interpretable gene embeddings, topic embeddings, and batch-effect linear intercepts from multiple scRNA-seq datasets. scETM is scalable to over 106 cells and confers remarkable cross-tissue and cross-species zero-shot transfer-learning performance. Using gene set enrichment analysis, we find that scETM-learned topics are enriched in biologically meaningful and disease-related pathways. Lastly, scETM enables the incorporation of known gene sets into the gene embeddings, thereby directly learning the associations between pathways and topics via the topic embeddings.


Background
Advances in high-throughput sequencing technologies [1] provide an unprecedented opportunity to profile the individual cells' transcriptomes across various biological and pathological conditions, and have spurred the creation of several atlas projects [2][3][4][5]. Emerged as a key application of single-cell RNA sequencing (scRNA-seq) data, unsupervised clustering allows for cell-type identification in a data-driven manner. Flexible, scalable, and interpretable computational methods are crucial for unleashing the full potential from the wealth of single-cell datasets and translating the transcription profiles into biological insights. Despite considerable progress made on clustering method development for scRNA-seq data analysis [6][7][8][9][10][11][12][13][14][15][16], several challenges remain.
First, compared to bulk RNA-seq, scRNA-seq data commonly exhibit higher noise levels and drop-out rates, where the data only captures a small fraction of a cell's transcriptome [17]. Changes in gene expression due to experimental design, often referred to as batch effects [18], can have a large impact on clustering [12,[18][19][20]. If not properly addressed, these technical artefacts may mask the true biological signals in cell clustering.
Second, the partitioning of the cell population alone is insufficient to produce biological interpretation. The annotations of the cell clusters require extensive manual literature search in practice and the annotation quality may be dependent on users' domain knowledge [20]. Therefore, an interpretable and flexible model is needed. In the current work, we consider model interpretability as whether the model parameters can be directly used to associate the input features with latent factors or target outcomes. Latent topic models are a popular approach in mining genomic and healthcare data [21][22][23] and are increasingly being used in the scRNAseq literature [24]. Specifically, in topic modeling, we infer the topic distribution for both the samples and genomic features by decomposing the samples-by-features matrix into samplesby-topics and topics-by-features matrices, which also be viewed as a probabilistic non-negative factorization (NMF) [25]. Importantly, the top genes under each latent topic can reveal the gene signatures for specific cellular programs, which may be shared across cell types or exclusive to a particular cell type. Traditionally, the latter are detected via differential expression analysis at individual gene levels, which has limited statistical power in scRNA-seq data analysis because of the sparse gene counts, small number of unique biological samples, and the burdens of multiple testings.
Third, model transferrability is an important consideration. We consider a model as transferable if the learned knowledge manifested as the model parameters could benefit future data modeling. In the context of scRNA-seq data analysis, it translates to learning feature representations from one or more large-scale reference datasets and applying the learned representations to a target dataset [26,27]. If the model is not further trained on the target dataset, the learning setting called zero-shot transfer learning. A model that can successfully separate cells of distinct cell types that are not present in the reference dataset implies that the model has learned some meaningful abstraction of the cellular programs from the reference dataset such that it can generalize to annotating new cell types of different kinds. An analogy would be that someone who has learned how to distinguish triangles from rectangles may also be able to distinguish squares from circles. As the number and size of scRNA-seq datasets continue to increase, there is an increasingly high demand for efficient exploitation and knowledge transfer from the existing reference datasets.
Several recent methods have attempted to address these challenges. Seurat [7] uses canonical correlation analysis to project cells onto a common embedding, then identifies, filters, scores, and weights anchor cell pairs between batches to perform data integration. Harmony [28] iterates between maximum diversity clustering and a linear batch correction based on the mixture-of-experts model. Scanorama [10] performs all-to-all dataset matching by querying nearest neighbors of a cell among all remaining batches, after which it merges the batches with a Gaussian kernel to form a single cell panorama. These methods rely on feature (gene) selection and/or dimensionality reduction methods; otherwise they can not scale to compendiumscale reference [2] or cohort-level single-cell transcriptome data [29] or are sensitive to the noise inherent to scRNA-seq count data. They are also non-transferable, meaning that the knowledge learned from one dataset cannot be easily transferred through model parameter sharing to benefit the modeling of another dataset. NMF approaches such as UNCURL [30] works only with one scRNA-seq dataset. LIGER [9] uses integrative NMF to jointly factorize multiple scRNA-seq matrices across conditions using genes as the common axis, linking cells from different conditions by a common set of latent factors also known as metagenes. LIGER is weakly transferable in the sense that the global metagenes-by-genes matrix can be recycled as initial parameters when modeling a new target dataset, whereas the cells-by-metagenes and the final metagenes-by-genes must be further computed and updated by iterative numerical optimization to fit the target dataset.
Deep learning approaches, especially autoencoders, have demonstrated promising performance in scRNA-seq data modeling. scAlign [15] and MARS [31] encode cells with nonlinear embeddings using autoencoders, which is naturally transferable across datasets. While scAlign minimizes the distance between the pairwise cell similarity at the embedding and original space, MARS looks for latent landmarks from known cell types to infer cell types of unknown cells. Variational autoencoders (VAE) [32] is an efficient Bayesian framework for approximating intractable posterior distribution using proposed distribution parameterized by neural networks. Several recent studies have tailored the original VAE framework towards modeling single-cell data. Single-cell variational inference (scVI) [6] models library size and takes into account batch effect in generating cell embeddings. scVAE-GM [11] changed the prior distribution of the latent variables in the VAE from Gaussian to Gaussian mixture model, adding a categorical latent variable that clusters cells. Lotfollahi et al. developed a VAE model called scGen to infer the expression difference due to perturbation conditions by latent space interpolation [26]. A key drawback for these VAE models is the lack of interpretability -post hoc analyses are needed to decipher the learned model parameters and distill biological meaning from the learned network parameters. To improve interpretability, Svensson et al. (2020) developed a linear decoded VAE (hereafter referred to as scVI-LD) as a part of the scVI software [14].
In this paper, we present single-cell Embedded Topic Model (scETM), a generative topic model that facilitates integrative analysis of large-scale single-cell transcriptomic data. Our key contribution is the utilization of a transferable neural-network-based encoder while having an interpretable linear decoder via a matrix tri-factorization. The scETM simultaneously learns the encoder network parameters and a set of highly interpretable gene embeddings, topic embeddings, and batch-effect linear intercepts from scRNA-seq data. The flexibility and expressiveness of the encoder network enable scETM to model extremely large scRNA-seq datasets without the need of feature selection or dimension reduction. By tri-factorizing cells-genes matrix into cells-by-topics, topics-by-embeddings, and embeddings-by-genes, we are able to incorporate existing pathway information into gene embeddings during the model training to further improve interpretability. This is a salient feature compared to related methods such as scVI-LD. It allows scETM to simultaneously discover interpretable cellular signatures and gene markers while integrating scRNA-seq data across conditions, subjects and/or experimental studies. We demonstrate that scETM offers state-of-the-art performance in clustering cells into known cell types across a diverse range of datasets with desirable runtime and memory requirements. We also demonstrate scETM's capability of effective knowledge transfer between different sequencing technologies, between different tissues, and between different species. We then use scETM to discover biologically meaningful gene expression signatures indicative of known cell types and pathophysiological conditions. We analyze scETM-inferred topics and show that several topics are enriched in cell-type-specific or disease-related pathways. Finally, we directly incorporate known pathway-gene relationships (pathway gene set databases) into scETM in the form of gene embeddings, and use the learned pathway-topic embedding to show the pathway-informed scETM (p-scETM)'s capability of learning biologically meaningful information.

Results scETM model overview
We developed scETM to model scRNA-seq data across experiments or studies, which we term as batches ( Fig. 1a and Supp. Fig. S1a). Adapted from the Embedded Topic Model (ETM) [33], scETM inherits the benefits of topic models, and is effective for handling large and heavytailed distribution of word frequency. In the context of scRNA-seq data analysis, each sampled single-cell transcriptome is provided as a vector of normalized gene counts to a two-layer fullyconnected neural network (i.e., encoder; see detailed architecture in Supp. Fig. S1c), which infers the topic mixing proportions of the cell. The trained encoder on a reference scRNA-seq data can be used to infer topic mixture of unseen scRNA-seq data collected from different tissues or species (Fig. 1b).
For interpretability, we use a linear decoder with the gene and topic embeddings as the learnable parameters. Specifically, we factorize the cells-by-genes count matrix into a cells-by-topics matrix θ (inferred by the encoder), topics-by-embedding α, and embedding-by-genes ρ matrices (Supp. Fig. S1b). This tri-factorization design allows for exploring the relations among cells, genes, and topics in a highly interpretable way. To account for biases across conditions or subjects, we introduce an optional batch correction parameter λ, which acts as a linear intercept term in the categorical softmax function to alleviate the burden of modeling batch effects from the encoder to let it focus on inferring biologically meaningful cell topic mixture θ d . The encoder and embedding learning is performed by an amortized variational inference algorithm to maximize the evidence lower bound (ELBO) of the marginal categorical likelihood of the scRNA-seq counts [32]. Compared to scVI-LD [14], the linear decoder component that learns a common embeddings for both topics and genes offers more flexibility and interpretability and overall better performance as we demonstrate next (Fig. 1c). Details of the scETM algorithm and implementation are described in Methods.
We further experimented the same scETM without the batch correction term, namely scETMλ. Compared to the λ-ablated model, the full scETM model confers higher ARI in 3 out of the 5 datasets (Table 1) and higher NMI in 4 out of the 5 datasets (Supp. Table S1). Improvement over the Human Pancreas (HP) dataset is remarkably high, implying an effective correction of the confounder due to the scRNA-seq technology differences. We observe no improvement in the AD dataset in terms of both ARI and NMI and small improvement in MDD only in terms of NMI. This implies a lesser concern of batch effects from only the individual brain sample donors, with all data being collected by the same technology in a single study.
We also evaluated the batch mixing aspect of scETM and other methods using k-nearestneighbor Batch-Effect Test (kBET) [18] ( Table 2) and examined to what extent scETM's batch mixing performance can be improved by introducing an adversarial loss term to scETM (Methods) [38]. Briefly, we used a discriminator network (a two-layer feed-forward network) to predict batch labels using the cell topic mixture embeddings generated by the encoder network, and directed the encoder network to fool the discriminator. We observe notable improvement on kBET with similar ARI and NMI scores (Table 1 and Supp. Table S1, row "scETM+adv") at the cost of up to 50% more running time. This shows the expandability of scETM. For the subsequent analyses, we opted to use the results from scETM (without the adversarial loss but with the linear batch correction λ) because of its simpler design, scalability, comparable ARI scores, and less aggressive batch correction (see below). scETM is also robust to architectural and hyperparameter changes, requiring very few or no architecture adaptation or hyperparameter tuning efforts when applied to unseen datasets (Supp. Table S2). As a result, we used the same architecture and hyperparameters for all datasets in Table 1. We also performed a comprehensive ablation analysis to validate our model choices. The ablation experiment demonstrates the necessity of the key model components, such as the batch effect correction factors λ and the batch normalization technique used in training the encoder. Normalizing gene expression scRNA-seq counts as the input to the encoder also improves the performance (Supp. Table S3).
Clustering agreement metrics are not the only metrics for evaluating scRNA-seq methods, and are not available to unannotated datasets. Therefore, we also evaluated the negative loglikelihood (NLL) on held-out samples, which is a principled way for model selection without labels. We computed the held-out (10%) NLL. We found that scETM is robust to different architectures in terms of the NLL (Supp. Table S2). We also found that ARI and NLL are modestly negatively correlated on the TM dataset (Supp. Fig. S2), implying an agreement between the two metrics although this might not be always the case since it highly depends on the cell type labels and the data quality.
To further verify the clustering performance and validate our evaluation metrics, we visualized the cell embeddings using Uniform Manifold Approximation and Projection (UMAP) [39] for some of the datasets (Supp. Fig. S3,S4,S5,S6,S7,S8, Fig. 2). Together, these results support that scETM effectively captures cell-type-specific information, while accounting for artefacts arising from individual or technological variations.

Batch overcorrection analysis
Some methods may risk over-correcting batch effects and fail to capture some aspect of biological variations. In the above analysis, we observe that some methods such as LIGER conferred competitive kBET but low ARI, suggesting potential overcorrection of batch effects. To experiment the extent of batch overcorrection by each method, we conducted two experiments using two datasets, namely the Human Pancreas (HP) dataset [7] and the Mouse Retina (MR) dataset [36,37].
For the HP data, we manually removed beta cells from all 5 batches except for batch CelSeq2, resulting in the cell type distributions shown in Supp. Table S4. We expect that, if a method is guilty of batch effect overcorrection, it would assign beta cells to other non-beta cell clusters in the latent space by forcing the alignment of different batches. Consequently, such methods would have poor clustering scores. We evaluated all of the methods on this dataset using 3 metrics: ARI, kBET, and average silhouette width (ASW) [40]. Briefly, higher ASW indicates larger distances between cell types and lower distance within cell type in the cell embedding space (Methods). We measured the overall ASW as well as the ASW for only the beta cells (i.e., ASW-beta). The results are summarized in Supp. Table S5. We observed that scETM struck a good balance between discriminating cell types (ARI: 0.9298; ASW: 0.3525; ASWbeta: 0.5370) and integrating different batches (kBET: 0.1247). Adding the adversarial loss to the scETM (i.e., scETM+adv) increased kBET from 0.1247 to 0.3445 (while maintaining ARI above 0.92) but greatly compromised ASW-beta (0.0045), suggesting a more aggressive overcorrection. Similarly, LIGER performed the best in kBET (0.5978) but conferred a much lower ARI (0.8476) and low ASW-beta (0.0912), indicating a more severe overcorrection of the batch effects (i.e., mixing beta cells with other cells).
We then visualized the clustering by UMAP to examine how the beta cells are assigned to different clusters (Supp. Fig. S9). We found that beta cells (colored in red) were clustered separately by scETM from other cell types. In contrast, beta cells were mixed up with other cell types by methods including Harmony and LIGER, which overcorrected the batch effects when integrating the five batches. Visually, we also observe that scETM+adv method moves the beta cell cluster closer to the alpha cell cluster, confirming a higher level of overcorrection compared to scETM.
The MR dataset is a collection of two independent studies on mouse retina [36,37]. Here we consider the two source studies as two batches, hereafter referred to as the Macosko batch and the Shekhar batch. Many cell types are uniquely present in the Macosko batch (Supp. Table S6). There is also a large difference in the cell proportion between the two batches. In particular, rods is only 0.35% in Shekhar but 65% in Macosko. In this scenario, we expect that methods that overcorrect the batch effects would tend to mix rods with cells of other cell types from the Shekhar batch, resulting in low ARI and high kBET. On the contrary, a desirable integration method would strike a balance between the ARI (or ASW) and kBET on this combined dataset. Therefore, this setup imposes a great challenge on the integration methods.
Overall, scETM achieved the highest ARI (0.859), reasonable ASW (0.2873), and modest kBET (0.0656), indicating its ability to capture the true biology from the data without overcorrecting the batch effects (Supp. Table S7). In contrast, LIGER is more aggressive in its batch correction, resulting in the highest kBET score of 0.176 but lowest ARI score of 0.714. We further investigated the extent of improving kBET while maintaining a high ARI score with scETM+adv. Indeed, scETM+adv conferred an increased kBET of 0.1410 and a reasonably high ARI score of 0.7720. Visualizing the clustering of each method using UMAP (Fig. 2) confirms the quantitative clustering results.
Incidentally, we also notice that scETM is not sensitive to the 669 doublelets or contaminants, all of which were from the Shekhar batch (Supp. Table S8). In contrast, if we did not filter out the doublets/contaminants, the performance of LIGER and Seurat degrades drastically possibly due to batch over-correction or failing to integrate the same cell types from different batches together.

Scalability
A key advantage of scETM is its high scalability and efficiency. We demonstrated this by comparing the run time, memory usage, and clustering performance of the state-of-the-art models using their recommended pipelines when integrating a merged dataset consisting of cells from the MDD and AD datasets (Methods). Because of the simple model design and efficient implementation (e.g., sparse matrix representation, multi-threaded data retrieval, etc; Discussion), scETM achieved the shortest run time among all deep-learning based models (Fig. 3a). Specifically, on the largest dataset (148,247 cells), scETM ran 3-4 times faster than scVI and scVI-LD, and over 10 times faster than scVAE-GM. We note that the run time largely depends on the implementation rather than the network architectures and loss functions in these deep learning methods. Harmony and Scanorama were the only methods faster than scETM, yet they both operate on a hundred principal components at most. Although for comparison purpose we used the top 3000 most variable genes for all of the methods, scETM can easily scale to all of the genes, which is more desirable because the resulting model can generalize to other datasets.
Because of the amortized stochastic variational inference [32, 41,42], scETM in principle takes linear run-time and constant memory with respect to the sample size per training epoch. The use of multi-threaded data loader to streamline the random minibatch retrieval and loading further speed up the training process in practice. In contrast, the memory requirement of Seurat increases rapidly with the number of cells, due to the vast numbers of plausible anchor cell pairs in the two brain datasets (Fig. 3b). In terms of clustering accuracy, scETM consistently confers competitive performance, whereas Harmony and Scanorama perform unstably as dataset sizes vary (Fig. 3c). UMAP visual inspection of scVAE embeddings suggests that scVAE likely suffers from under-correction of batch effects (Supp. Fig. S8). The sudden drop of LIGER's clustering performance in the largest benchmark dataset may be due to batch overcorrection.
Although it has been widely accepted by the deep learning community that computing using Graphical Processing Units (GPUs) results in ∼10× speedup over computing using CPU, the adoption of GPUs in the computational biology community is beginning to catch up. In our nonexhaustive experiment on the Mouse Pancreas dataset, training scETM for 1000 steps on the 6-core Core i7 10750H CPU requires 650 seconds (Fig. 3a), while with an Nvidia RTX 2070 laptop GPU it only takes 50 seconds -a 13× speedup over the CPU computer.

Transfer learning across single-cell datasets
A prominent feature of scETM is that its parameters, hence the knowledge of modeling scRNAseq data, are transferable across datasets. Specifically, as part of the scETM, the encoder trained on a reference scRNA-seq dataset can be applied to infer cell topic mixture of a target scRNA-seq dataset (Fig. 1b), regardless of whether the two datasets share the same cell types. As an example, we trained an scETM model on the Tabula Muris FACS dataset (TM (FACS)) (which is a subset of the TM dataset) from a multi-organ mouse single-cell atlas, and evaluated it using the MP data, which only contains mouse pancreatic islet cells. Although the two datasets were obtained using different sequencing technologies in two independent studies, the model yielded an encouragingly high ARI score of 0.941, considering that a model directly trained on MP achieves ARI 0.946. Interestingly, in the UMAP plot , the TM (FACS)pretrained model placed B cells, T cells and macrophages far away from other clusters and separated B cells and T cells from macrophages, which is not observed in the model directly trained on MP (Supp. Fig. S3,S4,S5,S6,S7; Supp. Table S9). We repeated the same experiment 3 times with different random seeds and observed consistently that B and T cells are close to each other and distant from macrophages (Supp. Fig. S10). We also experimented transfer learning by first training scETM on TM (FACS) with pancreas removed and then applied to MP dataset. The performance decreased but is still reasonably good (Supp. Table S9), demontrasting scETM's ability to transfer knowledge across tissues.
Encouraged by the above results, we then performed a comprehensive set of cross-tissue and cross-species transfer learning analysis with 6 tasks (Methods): (1) Transfer between the TM (FACS) and the MP dataset (including MP→TM (FACS)); (2) Transfer between the Human Pancreas (HP) dataset and the Mouse Pancreas (MP) dataset; (3) Transfer between the Human primary motor cortex (M1C) (HumM1C) dataset and the Mouse primary motor area (MusMOp) dataset both obtained from the Allen Brain Map data portal [43]. We chose to transfer between the human M1C and mouse MOp because of the high number of shared cell types between the brain regions of the two species. The batches for HumM1C are the two postmortem human brain M1 specimens and the two mice for MusMOp. Note that in these transfer learning tasks (A → B) we only corrected batch effects during the training on the source data A but not during the transfer to the target data B.
As a comparison, we evaluated and visualized the clustering results in all 6 transfer learning tasks using scETM, scVI-LD, and scVI ( Fig. 4; Supp. Table S10 and S11). Overall, scETM achieved the highest ARI across all tasks and competitive kBET scores. In particular, scETM trained on TM (FACS) on heterogeneous tissues clustered much better the MP cells (ARI: 0.941; kBET: 0.339) than scVI (ARI: 0.484; kBET: 0.257) and scVI-LD (ARI: 0.398; kBET: 0.256). Remarkably, scETM trained only on the MP dataset can cluster reasonably well the much larger TM single-cell data, which were collected from diverse primary tissues including pancreas. This implies that scETM does not merely learn cell-type-specific signatures but also the underlying transcriptional programs that are generalizable to unseen tissues.
In cross-species transfer learning between HP and MP, scETM captured better the conserved pancreas functions compared to scVI and scVI-LD ( Fig. 4; Supp. Table S10). On the other hand, cross-species transfer between MusMOp and HumM1C is a much more challenging task due to the evolutionarily divergent functions of the brains between the two species. Nonetheless, scETM conferred a much higher ARI of 0.696 for the MusMOp→HumM1C transfer and ARI of 0.167 for the HumM1C→MusMOp transfer. In contrast, scVI-LD and scVI did not work well on these tasks with ARI scores lower than 0.1. Since they cannot separate cells by cell types, all cells are mixed together, leading to a high kBET score. The improvements achieved by scETM over scVI(-LD) are possibly attributable to the simpler linear batch correction on the source data, jointly learning topic and gene embedding, and the topic modeling formalism, which together lead to an encoder network that is better at capturing the transferable cellular programs.

Pathway enrichment analysis of scETM topics
We next investigated whether the scETM-inferred topics are biologically relevant in terms of known gene pathways in human. One approach would be to arbitrarily choose a number of top genes under each topic and test for pathway enrichment using hypergeometric tests. This approach works well when there are asymptotic p-values at the individual gene level. In our case, each gene is characterized by the topic scores, making it difficult to systematically choose the number of top genes per topic. To this end, we resorted to Gene Set Enrichment Analysis (GSEA) [44]. Briefly, we calculated the maximum running sum of the enrichment scores with respect to a query gene set by going down the gene list that is sorted in the decreasing order by a given topic distribution β k (Methods). For each dataset, we trained a scETM with 100 topics.
For the HP dataset, each topic detected many significantly enriched pathways with Benjamini-Hochberg False Discover Rate (FDR) < 0.01 (Fig. 5a). Many of them are relevant to pancreas functions, including insulin processing (Fig. 5b), insulin receptor recycling, insulin glucose pathway, pancreatic cancer, etc (Supp. Table S12). Because scETM jointly learns both the gene embeddings and topic embeddings, we can visualize both the genes and topics in the same embedding space via UMAP (Fig. 5c). Indeed, we observe a strong co-localization of the genes in Insulin Processing pathway and the corresponding enriched topic (i.e., Topic 54).
For the AD dataset, we found topics enriched for Reactome Amyloid Fiber Formation, KEGG AD, and Deregulated CDK5 triggers multiple neurodegenerative in AD (Supp. Fig. S11; Supp. Table S13). For MDD dataset, we found enrichment for Substance/Drug Induced Depressive Disorder (Supp. Fig. S12, S13; Supp. Table S14). The full GSEA enrichment results for all 3 datasets are listed in Supp. Table S18. As a comparison, we also performed GSEA over the 100 gene loadings learned by scVI-LD (matching the 100 topics in the scETM) on these 3 datasets but found fewer relevant distinct gene sets (Supp. Table S12,S13,Supp. Table S14) or weaker statistical enrichments by GSEA (Supp. Fig. S14).

Differential scETM topics in disease conditions and cell types
We sought to discover scETM topics that are condition-specific or cell-type specific. Starting with the AD dataset, we found that the scETM-learned topics are highly selective of cell-type marker genes (Fig. 6a) and highly discriminative of cell types (Fig. 6b). To detect disease signatures, we separated the cells into the ones derived from the 24 AD subjects and the ones from the 24 control subjects. We then performed permutation tests to evaluate whether the two cell groups exhibit significant differences in terms of their topic expression (Methods). Topic 12 and 58 are differentially expressed in the AD cells and control cells (Fig. 6c, d; permutation test p-value = 1e-5). Interestingly, topic 58 is also highly enriched for mitochondrial genes. Indeed, it is known that β-amyloids selectively build up in the mitochondria in the cells of AD-affected brains [45]. For the MDD dataset, topics 1, 52, 68, 70, 86 exhibit differential expressions between the suicidal group and the healthy group (Supp. Fig. S18c) and interesting neurological pathway enrichments (Supp. Table S14).
We also identified several cell-type-specific scETM topics from the HP, AD, and MDD datasets. In HP, topics or metagenes 20, 45, 99 are up-regulated in acinar cells, topic 12 up-regulated in macrophage, topic 52 up-regulated in delta, and topics 30 and 37 are up-regulated in more than one cell types, including endothelial, stellate and others (Supp. Fig. S15). In AD, as shown by both the cell topic mixture heatmap and the differential expression analysis (Fig. 6b), topics 19, 35, 50, 69, 97 are up-regulated in oligodendrocytes, micro/macroglia, astrocytes, endothelial cells, and oligodendrocyte progenitor cells (OPCs) respectively (permutation test pvalue = 1e-5; Fig. 6b, Supp. Fig. S16). Interestingly, two subpopulations of cells from the oligodendrocytes (Oli) and excitatory (Ex) exhibit high expression of topics 12 and 58, respectively, and are primarily AD cells (Supp. Fig. S17). Among them, there is also a strong enrichment for the female subjects, which is consistent with the original finding [35].
For MDD, topics 1, 20, 59, 64 and 72 are up-regulated in astrocytes, oligodendrocytes, micro/macroglia, endothelial cells, and OPCs, respectively (Supp. Fig. S18c). This is consistent with the heatmap pattern (Supp. Fig. S18b). Several topics are dominated by long non-coding RNAs (lincRNAs) (Fig. S18a). While previous studies have suggested that lincRNAs can be cell-type-specific [46], it remains difficult to interpret them [47]. We further experimented the enrichment using only the protein coding genes, but did not find significantly more marker genes among the top 10 genes per topic (Supp. Fig. S19).

Pathway-informed scETM topics
To further improve topic interpretability, we incorporated the known pathway information to guide the learning of the topic embeddings (Fig. 7a). We denoted this scETM variant as the pathway-informed scETM or p-scETM. In particular, we fixed the gene embedding ρ to a pathwaysby-genes matrix obtained from pathDIP4 database [48,49]. We then learned only the topics embedding α, which provides direct associations between topics and pathways (Methods). We tested p-scETM on the HP, AD and MDD datasets. Without compromising the clustering performance (Supp. Table S15), p-scETM learned functionally meaningful topic embeddings ( Fig. S20; Supp. Table S16,S17). In the HP topic embeddings, we found Insulin Signaling, Nutrient Digestion and Metabolism to be the top pathways among several topics (Fig. S20a). In the MDD topic embeddings, the top pathway associated with Topic 40, Beta-2 Adrenergic Receptor Signaling, was also enriched in a MDD genome-wide association studies [50]. In the AD topic embeddings, we found the association between Topic 9 and Alzheimer Disease-Amyloid Secretase pathway.
To further demonstrate the utility of p-scETM, we also used 7481 Gene Ontology Biological Process (GO-BP) terms [51, 52] as the fixed gene embedding, which learns the topics-by-GOs topic embedding from each dataset. Under each topic, we selected the top 5 high-scoring GO-BP terms to examine their relevance to the target tissue or disease ( Fig. 7b; Supp. Table S19).

Discussion
As scRNA-seq technologies become increasingly affordable and accessible, large-scale datasets have emerged. This challenges traditional statistical approaches and calls for transferable, scalable, and interpretable representation learning methods to mine the latent biological knowledge from the vast amount of scRNA-seq data. To address these challenges, we developed scETM and demonstrated its state-of-the-art performance on several unsupervised learning tasks across diverse scRNA-seq datasets. scETM demonstrates excellent capabilities of batch effect correction and knowledge transfer across datasets.
In terms of integrating multiple scRNA-seq data from different technologies, experimental batches, or studies, we introduce a simple batch-effect bias term to correct for non-biological effects. This in general improves the cell clustering and topic quality. When using the original ETM [33], we observed that ubiquitously expressed genes such as MALAT1 tended to appear among the top genes in several topics. Our scETM corrects the background gene expressions by the gene-dependent and batch-dependent intercepts. As a result, the ubiquitously expressed genes do not dominate all topics from scETM. We also introduced a more aggressive batch correction strategy by adversarial network loss, which shows improved kBET with small trade-off for the ARI in most datasets.
In terms of scalability, although scETM is similar to other existing VAE models in terms of theoretical time and space complexity, we emphasize that implementation is also very important, especially for deep learning models. For example, scVAE-GM [11] is much slower and more memory consuming than scVI [6], while they are very similar VAE models. One of the main speedups provided by scETM comes from our implementation of a multi-threaded data loader for minibatches of cells, which does not need to be re-initialized at every training epoch as the standard PyTorch DataLoader. Compared to scVI and scVI-LD, the normalized counts in both the encoder input and the reconstruction loss used by scETM remove the need to infer the cell-specific library size variable, and the simpler categorical likelihood choice also helps reduce the computational time.
In terms of transferability, many existing integration methods require running on both reference and query datasets to perform post hoc analyses such as joint clustering and label transfer [7,9,10,28]. In contrast, our method enables a direct or zero-shot knowledge transfer of the pretrained encoder network parameters learned from a reference dataset in annotating a new target dataset without further training. We demonstrated this important aspect in cross-technology, cross-tissue, and cross-species applications, for which we achieved superior performance compared to the state-of-the-art methods.
In terms of interpretability, our quantitative experiments showed that scETM identified more relevant pathways than scVI-LD. Qualitative experiments also show that scETM topics preserve cell functional and cell-type-specific biological signals implicated in the single-cell transcriptomes. By seamlessly incorporating the known pathway information in the gene embedding, p-scETM finds biologically and pathologically important pathways by directly learning the association between the topics with the pathways via the topic embedding. Recently proposed by [54], single-cell Hierarchical Poisson Factor (scHPF) model applies hierarchical Poisson factorization to discover interpretable gene expression signatures in an attempt to address the interpretability challenge. However, compared to our model, scHPF lacks the flexibility in learning the gene embedding and incorporating existing pathway knowledge, and is not designed to account for batch effects. Moreover, scETM has the benefits of both flexibility in the neural network encoder and the interpretability in the linear decoder.
As future work, we will extend scETM in several directions. To further improve batch correction, as our current model only considers a single categorical batch variable, we can extend it to correct for multiple categorical batch variables. For a small number of categorical batch variables, we may use several sets of batch intercept terms to model them. For hierarchical batch variables, we may use a tree of batch intercept terms. For numerical batch effects such as subject age, one way is to convert them into categorical batch variables by numerical ranges. When the number of batch variables become larger, we consider three strategies. First, we can add the batch variables as the covariates in the linear regression on the gene expression and fit the linear coefficients each corresponding to a sample-dependent batch variable. Second, we can factorize the batches-by-genes into batches-by-factors and factors-by-genes. Learning the two matrices will be similar to the scETM algorithm. Third, we can extend our current scETM+adv to correct for both categorical and continuous batch variables with a discriminator network, which predicts batch effects using the encoder-generated cell topic mixture [38].
To further improve data integration, we can extend scETM to a multi-omic integration method, which can integrate scRNA-seq plus other omics such as protein expression measured in the same cells as scRNA-seq [55] or scATAC-seq measured in different cells but the same biological system [7]. In these applications, multi-modality over different omics will need to be considered to capture the intrinsic technical and biological variance of each omic while borrowing information among them.
To further improve interpretability, the original ETM used pretrained word embedding from word2vec [56] on a larger reference corpus such as Wikipedia to improve topic quality on modeling the target documents [33]. Similarly, although we demonstrated the use of existing pathway information in p-scETM, we can also pretrain our gene embeddings on PubMed articles, gene regulatory network, protein-protein interactions, or Gene Ontology graph using either gene2vec [57] or more general graph embedding approaches [58,59] [59]. We expect that the gene embedding pretrained from these (structured) knowledge graphs will further improve the efficiency and interpretability of scETM.
Together, scETM serves as a unified and highly scalable framework for integrative analysis of large-scale single-cell transcriptomes across multiple datasets. Compared to existing methods, scETM offers consistently competitive performance in data integration, transfer learning, scalability, and interpretability. The simple Bayesian model design in scETM also provides a highly expandable framework for future developments.

Methods scETM data generative process
To model scRNA-seq data distribution, we take a topic-modeling approach [60]. In our framework, each cell is considered as a "document", each scRNA-seq read (or UMI) as a "token" in the document, and the gene that gives rise to the read (or UMI) is considered as a "word" from the vocabulary of size V . We assume that each cell can be represented as a mixture of latent cell types, which are commonly referred to as the latent topics. The original LDA model [60] defines a fixed set of K independent Dirichlet distributions β over a vocabulary of size V . Following the ETM model [33], here we decompose the unnormalized topic distribution β * ∈ R K×V into the topic embedding α ∈ R K×L and gene embedding ρ ∈ R L×V , where L denotes the size of the embedding space. Therefore, the unnormalized probability of a gene belonging to a topic is proportional to the dot product between the topic embedding matrix and the gene embedding matrix. Formally, the data generating process of each scRNA-seq profile d is: 1. Draw a latent cell type proportion θ d for a cell d from logistic normal θ d ∼ LN (0, I): 2. For each gene read (or UMI) w i,d in cell d, draw gene g from a categorical distribution Cat(r d,· ): Here N d is the library size of cell d, w i,d is the index of the gene that gives rise to the i th read (or UMI) in cell d (i.e., [w i,d = g]), and y d,g is the total counts of gene g in cell d. The transcription rate r d,g is parameterized as follows: Here θ d is the 1 × K cell topic mixture for cell d, α is the global K × L cell topic embedding, ρ g is a L × 1 gene-specific embedding, and λ s(d),g is the batch-dependent and gene-specific scalar effect, where s(d) indicates the batch index for cell d. Notably, to model the sparsity of gene expression in each cell (i.e., only a small fraction of the genes have non-zero expression), we use the softmax function to normalize the transcription rate over all of the genes.

scETM model inference
In scETM, we treat the latent cell topic mixture δ d for each cell d as the only latent variable. We treat the topic embedding α, the gene-specific transcriptomic embedding ρ, and the batcheffect λ as point estimates. Let Y be the D × V gene expression matrix for D cells and V genes. The posterior distribution of the latent variables p(δ|Y) is intractable. Hence, we took a variational inference approach using a proposed distribution q(δ d ) to approximate the true posterior. Specifically, we define the following proposed distribution: Hereỹ d is the normalized counts for each gene as the raw count of the gene divided by the total counts in cell d. The function NNET(v; W) is a two-layer feed-forward neural network used to estimate the sufficient statistics of the proposed distribution for the cell topic mixture δ d .
To learn the above variational parameters W θ , we optimize the evidence lower bound (ELBO) of the log likelihood, which is equivalent to minimizing the Kullback-Leibler (KL) divergence between the true posterior and the proposed distribution: The Bayesian learning is carried out by maximizing the reconstruction likelihood with regularization in the form of KL divergence of the proposed distribution (q(δ d |y d ) = N (µ d , diag(σ d ))) from the prior (p(δ d ) = N (0, I)). For computational efficiency, we optimize ELBO with respect to the variational parameters by amortized variational inference [32, 41,42]. Specifically, we draw a sample of the latent variables from q(δ | y) for a minibatch of cells from reparameterized Gaussian proposed distribution q(δ | y) [32], which has the mean and variance determined by the NNET functions. We then use those draws as the noisy estimates of the variational expectation for the ELBO. The optimization is then carried out by back-propagating the gradients into the encoder weights and the topic and gene embeddings.

Details of training scETM
We chose the encoder for inferring the cell topic mixture to be a 2-layer neural network, with hidden sizes of 128, ReLU activations [61], 1D batch normalization [62], and 0.1 dropout rate between layers. We set the gene embedding dimension to 400, and the number of topics to 50. We optimize our model with Adam Optimizer and a 0.005 learning rate. To prevent over-regularization, we start with zero weight penalty on the KL divergence and linearly increase the weight of the KL divergence in the ELBO loss to 10 −7 during the first 1 3 epochs. With a minibatch size of 2000, scETM typically needs 5k-20k training steps to converge. We show that our model is robust to changes in the above hyperparameters (Supp. Table S2). During the evaluation, we used the variational mean of the unnormalized topic mixture µ d in q(δ d |y d ) = N (µ d , diag(σ d )) as the scETM cell topic mixture for cell d.

scETM+adv: adversarial loss for further batch correction
In the scETM+adv variant, we added a discriminator network (a two-layer fully-connected network) to predict batch labels using the unnormalized cell topic mixture embedding δ generated by the encoder network. This discriminator helps batch correction in an adversarial fashion. Specifically, in each training iteration, we first update the scETM parameters once by maximizing the ELBO plus the batch prediction cross-entropy loss from the discriminator, with a hyperparameter controlling the weight of the latter term. It should be noted that by maximizing the prediction loss the encoder network learns to produce batch agnostic cell embeddings. Then we update the discriminator network eight times by trying to minimize the cross-entropy loss in predicting the batch labels.

scETM software
We implemented scETM using the PyTorch library [63]. Our initial implementation was based on the ETM code from GitHub (adjidieng/ETM) by [33]. Since then, we completely revamped the code to substantially improve the scalability and to integrate it into the Python ecosystem. In particular, we packaged and released our code on PyPI so one can easily install the package by entering pip install scETM in the terminal. The package is integrated with scanpy [16] and tensorboard [64]. Users can view the cell, gene and topic embeddings interactively via tensorboard. For example, one can easily train a scETM as follows: from scETM import scETM, UnsupervisedTrainer model = scETM(adata.n_vars, adata.obs.batch_indices.nunique()) trainer = UnsupervisedTrainer(model, adata) trainer.train(save_model_ckpt = False) model.get_all_embeddings_and_nll(adata) The above code snippet will instantiate an scETM model object, train the model, infer the unnormalized cell topics mixture of adata and store them in adata.obsm['delta']. We can also access the gene and topic embeddings via adata.varm['rho'] and adata.uns['alpha'].

Transfer learning with scETM
When transferring from a reference dataset to a target dataset, we operate on the genes common to both datasets. For cross-species transfer, the orthologous genes based on the Mouse Genome Informatics database [65,66] are considered as common genes. The trained scETM encoder can be directly applied to an unseen target dataset, as long as the genes in the target dataset are aligned to the genes in the reference dataset. In the main text, for example, we trained scETM on a reference dataset and evaluated the scETM-encoder on a target dataset in 6 transfer learning tasks (Fig. 4).

Pathway enrichment analysis
To assess whether a topic is enriched in any known pathway, one common way is to test for Over Representation Analysis (ORA) [67]. However, ORA requires choosing a subset of genes (e.g., from differential expression analysis). While we could choose the top genes scored by each topic, it requires some arbitrary threshold to select those genes. To avoid thresholding genes, we employed Gene Set Enrichment Analysis (GSEA) [44]. GSEA calculates a running sum of enrichment scores (ES) by going down the gene list that is sorted in the decreasing order by their association statistic with a phenotype.
In our context, we treated the gene scores under each topic from the genes-by-topics matrix (i.e., β) as the association statistic. The ES for a gene set S is the maximum difference between P_hit(S,i) and P_miss(S,i), where P_hit(S,i) is the fraction of genes in S weighted by their topic scores up to gene index i in the sorted list and P_miss(S,i) is the fraction of genes not in S weighted by their topic scores up to gene index i in the sorted list. The enrichment p-value for each gene set is computed by permutation tests by randomly shuffling the gene symbols on the sorted list (while keeping the gene topic scores in the decreasing order) 1000 times to compute the null distribution of the ES for each gene set and each topic. The empirical p-value was calculated as (N'+1)/(N+1), where N' is the number of permutation trials in which ES is greater than the observed ES, and N is the total number of trials (i.e., 1000). We then corrected the p-values for multiple testing using Benjamini-Hochberg (BH) method [68].
For AD and HP datasets, we used the MSigDB Canonical Pathways gene sets [69] as the gene set database in GSEA; and for MDD, we used PsyGeNET database [70] in order to find psychiatric disease-specific associations. We also run GSEA for scVI-LD gene loadings for comparison. The detailed pathway enrichment statistic can be found in Supp. Table S18.

Differential analysis of topic expression
We aimed to identify topics that are differentially associated with known cell type labels or disease conditions. For topic k and cell label j (i.e., cell type or disease condition), we first calculated the difference of the average topic activities between the cells with label j and the cells without label j. For each permutation trial, we randomly shuffled the label assignments among cells and recalculated the difference of average topic activities from the resulting permutation. The empirical p-value was calculated as (N'+1)/(N+1), where N' is the number of permutation trials in which the difference is greater than the observed difference, and N is the total number of trials. To account for multiple hypotheses, we applied Bonferroni correction by multiplying the p-value by the product of the topic number and the number of labels. We performed N=100,000 permutations.
We determined a topic to be differentially expressed (DE) if the Bonferroni corrected q-value is lower than 0.01 and the mean difference is greater than 2 for cell-type DE topcis or 0.2 for disease DE topics. Supp. Table S20 summarizes the number of DE topics we identified for each cell type and disease conditions from the AD and MDD data. We use the PanglaoDB database [71] to find the overlap between top genes of cell-type-specific DE topics and known cell type markers.

Incorporation of pathway knowledge into the gene embeddings in p-scETM
We downloaded the pathDIP4 pathway database from [49], and the Gene Ontology (Biological Processes) (GO-BP) dataset from MSigDB v7.2 Release [69]. Pathway gene sets or GO-BP terms containing fewer than five genes were removed. We represented the pathway knowledge as a pathways-by-genes ρ matrix, where ρ ij = 1 if gene set i contains gene j, and ρ ij = 0 otherwise. We standardized each column (i.e., gene) of this matrix for numerical stability. During training the p-scETM, we fixed the gene embedding matrix ρ to the pathways-by-genes matrix.

Clustering performance benchmark and visualization
We assessed the performance of each method by three metrics: Adjusted Rand Index (ARI) [72], Normalized Mutual Information (NMI) and k-nearest-neighbor Batch-Effect Test (kBET) [18]. ARI and NMI are widely-used representatives of two families of clustering agreement measures, pair-counting and information theoretic measures, respectively. A high ARI or NMI indicates a high degree of agreement for a given clustering result against the ground-truth cell type labels. We calculated ARI and NMI using the Python library scikit-learn [73].
kBET measures how well mixed the batches are based on the local batch label distribution in randomly sampled nearest-neighbor cells compared against the global batch label distribution. Average silhouette width (ASW) [40] indicates clustering quality using cell type labels. Silhouette width (SW) of a cell i is the distance of cell i from all of the cells within the same cluster subtracted by the distance of cell i from cells in a nearest but different cluster, normalized by the maximum of these two values. ASW is the averaged SW over all the cells in a dataset and larger values indicate better clustering quality. Therefore, larger ASW indicates the higher distances between cell types and lower distance within cell type in the cell embedding space. We choose the distance function to be the euclidean distance. We adapted the Pegasus implementation [74] for kBET calculation, and set the k to 15.
All embedding plots were generated using the Python scanpy package [16]. We use UMAP [39] to reduce the dimension of the embeddings to 2 for visualization, and Leiden [75] to cluster cells by their cell embeddings produced by each method in comparison. During the clustering, we tried multiple resolution values and reported the result with the highest ARI for each method.
For reproducibility, the evaluation and the plotting steps were implemented in a single evaluate function in the scETM package, which takes in an AnnData object with cell embeddings and returns a Figure object for the ARI, NMI, kBET and embedding plot. For consistency, we used this function to evaluate all methods, including those written in R, where we used the reticulate package [76] to call our evaluate function.
We ran all methods under their recommended pipeline settings (Supplementary Methods), and we use batch correction option whenever applicable to account for batch effects. All results are obtained on a compute cluster with Intel Gold 6148 Skylake CPUs and Nvidia V100 GPUs. We limit each experiment to use 8 CPU cores, 192 GB RAM and 1 GPU.

Efficiency and scalability benchmark of the existing methods
To create a benchmark dataset for evaluating the run time of each method, we merged MDD and AD, keeping the genes that appear in both datasets. We then selected the 3000 most variable genes using scanpy's highly_variable_genes(n_top_genes=3000, flavor='seurat_v3') function, and randomly sampled 28,000, 14,000, 70,000 and 148,247 (all) cells to create our benchmark datasets. The memory requirements reported in Fig. 3 were obtained by reading the rss attribute of the named tuple returned by calling Process().memory_info() from the psutil Python package [77]. For methods based on R, we use the reticulate package [76] to call the above Python function for consistency. We used the same settings (RAM size, number of GPUs, etc) as described in the Clustering performance benchmark and visualization section throughout the experiments.

Data Availability
The datasets analysed during the current study are from publicly available repositories or data portals. The acquisition and quality control steps for all datasets are included in the supplementary information.

Code Availability
scETM source codes as well as the benchmarking workflows have been deposited at the GitHub repository (https://www.github.com/hui2000ji/scETM) and available in the zip file accompanied with the paper.

Ethics approval and consent to participate
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Consent for publication
Not applicable.

Authors' contributions
YL and JT conceived of the study. YZ, HC, YL analyzed and interpreted the data, wrote the manuscript, and wrote the code for scETM. HC optimized and completed the final scETM code. ZZ ran some initial experiments. YL and JT supervised the project. All authors approved the final manuscript.        For the HP dataset, GO-BP terms whose names include the keywords "insulin" or "pancreatic" were highlighted. For the AD dataset, GO-BP terms whose names include the keywords "amyloid" or "alzheimer" were highlighted. For MDD -all genes and MDD -coding genes only, GO-BP terms whose names include the keywords "neuron" or "G-protein" were highlighted. Tables   Transferable Interpretable MP  HP  TM  AD Table 1: Model properties and unsupervised clustering performance on data integration tasks. The clustering performance is measured by Adjusted Rand Index (ARI) between ground truth cell types and Leiden [75] clusters. scETM performances with or without the linear batch correction (scETM, scETM−λ) are both reported. scETM+adv is scETM plus adversarial network loss to further correct batch effects. Batch variables include strain, sequencing technologies ("Tech.") and individuals ("Ind."). NA is reported for models that did not converge. Experimental details are described in the Methods section. We ran Seurat on all datasets with integration turned on whenever applicable.

S1.1 Data processing
All of the single-cell datasets used in this study are from publicly available repositories or data portals. We describe below the acquisition and quality control (QC) for each of the datasets used in the current work.

S1.1.2 Mouse pancreatic islet
We obtained the mouse pancreatic islet data and ground truth cell type labels from GSE84133 (inDrops) without conducting additional QC. There are 1,886 mouse cells from two mice of different strains, ICR and C57BL/6 [34]. The cell counts from the two trains are of approximately equal proportions. In our benchmarking experiment, we treated the mouse strain as the batch variable because of the different genetic backgrounds.

S1.1.3 Major Depressive Disorder (MDD)
We obtained the 10X Genomics-based MDD snRNA-seq dataset with ground truth cell type labels from GSE144136. A strict QC step was conducted in the original empirical study by [29], where cells with fewer than 110 detected genes were removed. The top 0.5% of cells based on the total number of UMI (unique molecular identifiers) detected in each cell were also excluded because they are likely to be multiplets rather than single nuclei. No additional QC was performed. The MDD dataset consists of 78,886 cells from the dorsolateral prefrontal cortex of 34 male participants. The participants in the control group (n=17) who died due to natural cause and case group (n=17) who died by suicide were matched for age (18-87 years), postmortem interval (12-93h) and brain pH (6-7.01) [29]. The number of cells from each donor is approximately the same.

S1.1.4 Alzheimer's Disease (AD)
We obtained the droplet-based AD snRNA-seq data and the corresponding ground truth cell type labels from Synapse (https://www.synapse.org/#!Synapse:syn18485175) under the doi 10.7303/syn18485175, and the metadata from https://www.synapse.org/#!Synapse: syn3157322. A strict QC step based on UMI counts and mitochondrial ratio values was conducted in the original empirical study by Mathys et al. [35].

S1.1.5 Tabula Muris
We obtained the Tabula Muris dataset with ground truth cell type labels from FigShare (https: //figshare.com/projects/Tabula_Muris_Transcriptomic_characterization_of_20_organs_ and_tissues_from_Mus_musculus_at_single_cell_resolution/27733) for the Version 2 release [3]. This dataset includes mouse single-cell transcriptome data sequenced by two tech-nologies: microfluidic droplet-based, and fluorescence-activated cell sorting (FACS)-based. A QC cutoff was applied in the original empirical study where only cells with at least 500 genes and 50,000 reads/ 1000 UMI are kept. The droplet subset includes data for 422,803 droplets, 55,656 of which passed the QC cutoff. The FACS subset, denoted as "TM (FACS)" in the paper, contains data for 53,760 cells, 44,879 of which passed the QC cutoff.

S1.1.6 Allen Brain Atlas
We downloaded two brain datasets from Allen Brain Atlas [78] (accessed 03/21/2021). The human primary motor cortex dataset (HumM1C) includes single-nucleus transcriptomes from 76,533 nuclei from the primary motor cortex (M1C) of 2 post-mortem human brain specimens. In total, 127 transcriptomic cell types are present in this dataset. The sample processing follows the 10x Genomics pipeline. To generate the ground-truth cell labels, the default 10x Cell Ranger v3 pipeline was used except substituting the curated genome annotation used for SMART-seq v4 quantification. The mouse brain dataset includes single-cell transcriptomes from more than 20 areas of mouse cortex and the hippocampus, which has 1,093,785 cells in total. Samples were collected from male and female mice around 8 week-old, from panneuronal transgenic lines. The cell transcriptomes were sequenced using 10x Genomics. We used the subclass labels in the metadata used as ground truth cell type label in the current study. We removed the cells and nuclei with subclass label "outlier". To encourage better transfer from HumM1C, we subset the mouse brain dataset by keeping the 124,953 cells with re-gion_label "MOp", and obtain the MusMOp dataset.

S1.1.7 Mouse Retina
We obtained the MR dataset from https://hemberg-lab.github.io/scRNA.seq.datasets/ mouse/retina/, which is a collection of mouse retina scRNA-seq datasets from two independent studies, namely Macosko et al. [36] with 44808 samples, and Shekhar et al. [37] with 27499 samples. Both subsets were sequenced using Drop-seq. We kept the genes shared by the two subsets, and filtered out the 669 samples labeled as "Doublets/Contaminants" in the Shekhar batch. The merged dataset contains 71638 cells and 12333 genes.

S1.2 Experimental details of other scRNA-seq methods
Neural-network based models, including scVI, scVI-LD, scVAE-GM and scETM typically need at least 5000 gradient updates to converge. When running on small datasets, the total number of gradient updates per epoch may be very small (sometimes as low as 1). In these cases, we increase the number of epochs T to ensure the model goes through at least 12000 gradient updates, i.e. T = max(800, 12000 B N ), where N is the number of cells in the dataset and B is the mini-batch size.

S1.2.1 Seurat v3
We downloaded Seurat v3 (version 3.1.5) from CRAN [8]. We followed the steps outlined by the integration workflow (https://satijalab.org/seurat/v3.2/integration.html) which includes NormalizeData, FindVariableFeatures, FindIntegrationAnchors, and IntegrateData. To make the comparisons more equitable, we set the min.features=0 to avoid exclusion of cells. All other parameters were set as default. We noted that, with batch integration turned on, Seurat reports error in the integration step due to the high number of anchors arising from the 48 individuals (batch variable in AD), which is a known implementation issue with the standard Seurat v3 integration workflow [79]. We therefore turned off the batch integration for AD in the benchmarking experiments (see Clustering performance benchmark and visualization and Efficiency and scalability benchmark of the existing methods) and followed the steps described in the Guided Clustering Tutorial (https://satijalab.org/seurat/v3.2/pbmc3k_ tutorial.html).

S1.2.2 Scanorama
We downloaded the source code from GitHub brianhe/scanorama. We used the integrate_scanpy function for dataset integration and batch correction as suggested by the guided tutorial. All parameters were set as default. The algorithm performs a PCA on the stacked datasets and uses 100 PCs for downstream computation.

S1.2.3 Harmony
We downloaded the source code from GitHub slowkow/harmonypy suggested by the primary repository immunogenomics/harmony and followed the preprocessing (normalization and top variable gene selection) described in the publication and the integration steps in the provided tutorial. We used the run_harmony function to obtain the corrected PCA embeddings and used 50 PCs as input. All other parameters were set as default.

S1.2.4 LIGER
We used the official implementation provided on the website of the LIGER package. For the convenience of implementation, we followed the usage tutorial using Seurat Wrapper to process the raw data and then ran LIGER with default parameters.

S1.2.5 scVI/scVI-LD
We downloaded the implementation from the Github repository YosefLab/scVI. We used the default model, which has one layer for both the encoder and decoder (for scVI-LD the decoder is a latent dimensions-by-genes matrix), 128 hidden units, 10 latent dimensions and ZINB distribution for modeling the data. We chose 10 −3 as the learning rate and trained on each unprocessed dataset for 400 epochs, following the provided tutorials. We change the training batch size to 2000 for faster training. We obtained the cell embeddings via the get_latent method. We also compared the performance of scVI(-LD) with 10 and 100 latent dimensions on the five benchmark scRNA-seq datasets (Supp. Table S21). We found that in most cases, scVI(-LD) with 10 latent dimensions performs better than scVI(-LD) with 100 latent dimensions, justifying the hyperparameter choices made by the scVI(-LD) authors. When evaluating the interpretability aspect of scVI-LD, we use 100 latent dimensions for fair comparison with scETM.

S1.2.6 scVAE
We downloaded the implementation from GitHub repository scvae/scvae. We set the hidden units to be (256, 128) for the encoder. The decoder is symmetric to the encoder. Latent dimension was set to 128 to match scETM. We chose 10 −4 as the learning rate and NB distribution for modeling the data following the authors' recommendation. We trained on each unprocessed dataset for 400 epochs with batch size of 250, including a 200-epoch warm-up for the KL divergence loss. In the scalability benchmark, we disabled the time-consuming per-epoch checkpoints to match other methods. The model did not converge on the Human Pancreatic Islet dataset, where the ELBO went to infinity. It failed to extract meaningful information from the Tabula Muris dataset, resulting in an ARI of 0.0.                      Table S1: Normalized Mutual Information (NMI) between ground truth cell types and leiden clusters on 6 benchmark scRNA-seq datasets. NA is reported for models that did not converge. scETM performances with or without the linear batch correction (scETM, scETM−λ) are both reported. scETM+adv is scETM plus adversarial network loss to further correct batch effects. Batch variables include strain, sequencing technologies ("Tech.") and individuals ("Ind."). NA is reported for models that did not converge. More details are described in Table 1 0  445  0  0  0  delta  50  203  25  608  127  ductal  304  257  34  915  444  endothelial  5  21  14  235  21  epsilon  1  4  1  16  8  gamma  18  110  18  266  213  macrophage  1  15  1  55  7  mast  1  6  3  39  7  quiescent stellate  1  12  1  160  6  schwann  1  4  5 13 2    Table S10: Clustering performance on target datasets in the 6 cross-tissue and crossspecies transfer learning tasks. The clustering performance is measured by Adjusted Rand Index (ARI) between ground truth cell types and Leiden [75] clusters. The cell clusters based on the cell embedding produced by each method are also depicted in the UMAP visualization in   Table S12: Pathway enrichment statistics for Human Pancreas dataset. Pathways whose names include the keywords "insulin" or "pancreatic" are shown. For each pathway, the number of topics with significant enrichment (BH-corrected q-value < 0.01) are counted. Both scETM and scVI-LD use latent dimensions of 100 for fair comparison.  Table S13: Pathway enrichment statistics for AD dataset. Pathways whose names include the keywords "amyloid" or "alzheimer" are shown. For each pathway, the number of topics with significant enrichment (BH-corrected q-value < 0.01) are counted. Both scETM and scVI-LD use latent dimensions of 100 for fair comparison.