Abstract
Single-cell RNA-seq (scRNA-seq) is invaluable for studying biological systems. Dimensionality reduction is a crucial step in interpreting the relationships among cells in scRNA-seq data. However, current dimensionality-reduction methods are often confounded by multiple simultaneous technical and biological sources of variability, result in “crowding” of cells in the center of the latent space, or inadequately capture temporal relationships. Here, we introduce scPhere, a scalable deep generative model that embeds cells into low-dimensional hyperspherical or hyperbolic spaces to accurately represent scRNA-seq data. ScPhere addresses multilevel, complex batch factors, facilitates the interactive visualization of large datasets, resolves cell crowding, and uncovers temporal trajectories. We demonstrate scPhere on nine large datasets of complex tissues from human patients or animal development. Our results show how scPhere facilitates the interpretation of scRNA-seq data by generating batch-invariant embeddings to map data from new individuals, identifies cell types affected by biological variables, infers cells’ spatial positions in predefined biological specimens, and highlights complex cellular relations.
Introduction
Single-cell genomics—especially single-cell RNA-seq (scRNA-seq)—has opened the way to a comprehensive analysis of the relationship between cells, including their different types, states, physiological transitions, differentiation trajectories, and spatial positions^{1,2,3}. Although scRNA-seq datasets have high dimensionality, their intrinsic dimensionality is typically low, because many genes are co-expressed and a few variables, such as cell type, a gene program, or the number of detected transcripts, can explain a substantial portion of the variation in a dataset. As a result, dimensionality reduction, followed by visualization or downstream analyses, has become a key strategy for exploratory data analysis in single-cell genomics^{4,5}.
Recently, deep-learning models^{6}, especially (variational) autoencoders^{7,8,9}, have been used for dimensionality reduction prior to visualization or downstream analyses, such as clustering^{10,11,12,13,14,15}. This leverages their ability to model large-scale, high-dimensional data and their flexibility in incorporating different factors, especially batch effects, into the modeling framework. Moreover, such models can provide an end-to-end, single process for analyses that otherwise require multiple separate steps, each with its own method or algorithm, including batch correction, dimensionality reduction, and visualization.
However, standard variational autoencoders (VAEs) have several shortcomings when modeling and analyzing scRNA-seq data. First, they assume a multidimensional normal prior for the low-dimensional latent variables, which unfortunately encourages the low-dimensional representations of all cells to group in the center of the latent space, even for data consisting of distinct cell types. This is especially true if the model is trained long enough that the posterior distributions gradually approximate the prior distribution. (Cell crowding also afflicts general-purpose data visualization tools such as t-stochastic neighborhood embedding (t-SNE)^{16} once datasets grow to hundreds of thousands of cells^{17,18}.) A second challenge arises from using the cosine to measure the distance between two cells^{19,20,21} for very sparse droplet-based scRNA-seq data (>90% of genes with zero counts in a typical cell profile). Because the cosine distance between two cell vectors is their Euclidean distance after normalizing the two vectors to have a unit ℓ^{2} norm, the cells lie on the surface of a unit hypersphere with a dimensionality of D − 1, where D is the number of measured genes. Embedding data distributed on a hypersphere into a Euclidean space introduces significant distortion for commonly used dimensionality-reduction tools^{22}, and standard variational autoencoders also fail to model such data^{23}. Moreover, Euclidean geometry is not optimal for representing hierarchical, branched developmental trajectories^{24,25,26}. Third, in practice, current applications of VAEs to scRNA-seq data can only handle a single batch vector (factor), whereas biologically relevant datasets typically have multiple such factors, both technical (e.g., replicate or study) and biological (e.g., patient, tissue location, disease status).
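The relation between cosine distance and the hypersphere can be checked numerically: for unit-ℓ^{2}-norm vectors, squared Euclidean distance equals twice the cosine distance, so cosine-compared cells live on a unit hypersphere. A minimal pure-Python sketch (toy vectors, not real expression profiles):

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(u, v):
    """1 - cosine similarity; invariant to the scale of u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# For unit vectors, ||u - v||^2 = 2 - 2 u.v = 2 * cosine_distance(u, v),
# so cells compared by cosine distance lie on a unit hypersphere.
u = l2_normalize([3.0, 1.0, 0.0, 2.0])
v = l2_normalize([1.0, 0.0, 4.0, 1.0])
assert abs(euclidean(u, v) ** 2 - 2 * cosine_distance(u, v)) < 1e-12
```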
Such complex multilevel factors are not well-handled by current batch-correction methods in single-cell genomics, whether VAEs or other approaches^{5,12,27,28,29,30,31}, but addressing them is critical for integration across studies, interpretation of the impact of various factors on cells in complex tissues, and the ultimate assembly of large tissue atlases.
Here, we present alternative approaches for embedding cells into hyperspherical or hyperbolic spaces based on deep generative models, to better capture their inherent properties, tackle complex batch effects, generate references, and perform diverse analyses. For general scRNA-seq data, we minimize the distortion by embedding cells into a lower-dimensional hypersphere instead of a low-dimensional Euclidean space^{23}, using von Mises–Fisher (vMF) distributions on hyperspheres as the posteriors for the latent variables^{23,32,33}. Because the prior is a uniform distribution on a unit hypersphere, which has no center, points are no longer forced to cluster in the center of the latent space. For representation and inference of hierarchical, branched developmental trajectories, we embed cells into the hyperbolic space of the Lorentz model and visualize the embedding in a Poincaré disk^{24,25,34}. Using nine diverse datasets from human and model organisms, we demonstrate scPhere’s superior performance on key existing use cases as well as emerging applications, including processing large scRNA-seq datasets with complex multilevel batch effects, visualizing cell profiles from highly complex tissues and developmental processes, building batch-invariant reference models to which new data can be readily mapped, identifying the cells impacted by specific biological factors, and mapping cells to spatial positions. Overall, our model provides enhanced representation, complex batch correction, reference generation, visualization, and interpretation tools for single-cell genomics research.
Results
Mapping scRNA-seq data to hyperspherical or hyperbolic latent spaces
We developed scPhere (pronounced “sphere”), a deep-learning method that takes scRNA-seq count data and information about multiple known confounding factors (e.g., batches, conditions) and embeds the cells into a hyperspherical or hyperbolic latent space (Fig. 1a, “Methods”). We reasoned that scPhere would allow cells to be embedded more appropriately because they are not constrained to aggregate in the center. In cases where we expect a branching structure with a large number of trajectories, hyperbolic spaces are particularly suitable, because the exponential volume growth of hyperbolic spaces with radius gives them enough capacity to embed trees, whose numbers of nodes increase exponentially with depth. For 3D visualization, scPhere places cells on the surface of a sphere (not inside it), such that we only need to rotate the sphere to see all cells, avoiding the common challenge of exploring the interior of 3D embeddings. The scPhere package renders all 3D plots for interactive visualization of millions of cells with the fast rgl R package, exporting web graphics library (WebGL) files that can be opened in a browser for exploration. Alternatively, one can convert the 3D coordinates to 2D with various projection methods, such as the recent Equal Earth map projection^{35}.
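The 3D-to-2D conversion mentioned above can be sketched in two steps: recover longitude/latitude from a unit 3D embedding coordinate (a standard spherical-coordinate transform), then apply the Equal Earth projection. The polynomial coefficients below are those published for Equal Earth; treat the sketch as illustrative rather than scPhere's exact implementation:

```python
import math

# Equal Earth polynomial coefficients (Savric, Patterson & Jenny, 2018).
A1, A2, A3, A4 = 1.340264, -0.081106, 0.000893, 0.003796

def sphere_to_lonlat(x, y, z):
    """Convert a point on the unit sphere to (longitude, latitude) in radians."""
    return math.atan2(y, x), math.asin(z)

def equal_earth(lon, lat):
    """Project (longitude, latitude) to 2D with the Equal Earth projection."""
    theta = math.asin(math.sqrt(3.0) / 2.0 * math.sin(lat))
    t2 = theta * theta
    px = (2.0 * math.sqrt(3.0) * lon * math.cos(theta)
          / (3.0 * (A1 + 3.0 * A2 * t2 + 7.0 * A3 * t2 ** 3 + 9.0 * A4 * t2 ** 4)))
    py = theta * (A1 + A2 * t2 + A3 * t2 ** 3 + A4 * t2 ** 4)
    return px, py

# A (hypothetical) cell embedded on the sphere's equator projects onto the
# horizontal axis of the 2D map.
lon, lat = sphere_to_lonlat(0.0, 1.0, 0.0)
px, py = equal_earth(lon, lat)
assert abs(py) < 1e-12 and px > 0.0
```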
Specifically, scPhere takes as input an scRNA-seq dataset \({D}=\{({{\bf{x}}}_{i},{{\bf{y}}}_{i})\}_{i=1}^{N}\) with N cells, where \({{\bf{x}}}_{i}\) is the UMI count vector of D genes in cell i, and \({{\bf{y}}}_{i}\) is a categorical vector specifying the batch in which \({{\bf{x}}}_{i}\) is measured, and models the \({{\bf{x}}}_{i}\) UMI count distribution as governed by a latent low-dimensional random vector \({{\bf{z}}}_{i}\) and by \({{\bf{y}}}_{i}\) (Fig. 1a, “Methods”). Note that \({{\bf{y}}}_{i}\) can account for multilevel confounding factors, for example, patient, disease status, and lab protocol. The scPhere model assumes that the latent low-dimensional random vector \({{\bf{z}}}_{i}\) is distributed according to a prior, with the joint distribution of the whole model factored as \(p({{\bf{y}}}_{i}|{{\mathbf{\uptheta }}}_{i})p({{\bf{z}}}_{i}|{{\mathbf{\uptheta }}}_{i})p({{\bf{x}}}_{i}|{{\bf{y}}}_{i},{{\bf{z}}}_{i},{{\mathbf{\uptheta }}}_{i})\), where \(p({{\bf{y}}}_{i}|{{\mathbf{\uptheta }}}_{i})\) is the categorical probability mass function (constant in our case, as \({{\bf{y}}}_{i}\) is observed). For hyperspherical latent spaces, scPhere uses a uniform prior on a hypersphere for \(p({{\bf{z}}}_{i}|{{\mathbf{\uptheta }}}_{i})\); for hyperbolic latent spaces, it uses a wrapped normal distribution in hyperbolic space as the prior. For the observed raw UMI count inputs, we assume a negative-binomial distribution: \(p({{\bf{x}}}_{i}|{{\bf{y}}}_{i},{{\bf{z}}}_{i},{{\mathbf{\uptheta }}}_{i})=\mathop{\prod}\limits_{j=1}^{D}{\rm{NB}}({x}_{i,j}|{\mu }_{{{\bf{y}}}_{i},{{\bf{z}}}_{i}},{\sigma }_{{{\bf{y}}}_{i},{{\bf{z}}}_{i}})\), with parameters specified by a neural network.
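The negative-binomial likelihood above can be written in a mean–dispersion parameterization. A minimal sketch, assuming a parameterization with mean mu and inverse-dispersion r (the paper leaves the exact parameterization to the neural network, so r here is an illustrative stand-in for the second NB parameter):

```python
import math

def nb_logpmf(x, mu, r):
    """Log-pmf of a negative binomial with mean mu and inverse-dispersion r.
    The variance is mu + mu**2 / r, so larger r means closer to Poisson."""
    return (math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
            + r * math.log(r / (r + mu)) + x * math.log(mu / (r + mu)))

# Sanity check: for r = 1 the NB is geometric with success prob 1/(1 + mu).
mu = 2.0
assert abs(math.exp(nb_logpmf(0, mu, 1.0)) - 1.0 / (1.0 + mu)) < 1e-12

# The per-cell likelihood is a product over genes, i.e. a sum of log-pmfs:
counts = [0, 3, 1]            # toy UMI counts for D = 3 genes
means = [0.5, 2.0, 1.0]       # toy decoder-predicted means
loglik = sum(nb_logpmf(x, m, 1.5) for x, m in zip(counts, means))
assert loglik < 0.0
```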
The inference problem is to compute the posterior distribution \(p({{\bf{z}}}_{i}|{{\bf{y}}}_{i},{{\bf{x}}}_{i},{{\mathbf{\uptheta }}}_{i})\), which is assumed to be a von Mises–Fisher distribution for hyperspherical latent spaces and a wrapped normal distribution for hyperbolic latent spaces. Because it is intractable to compute the posterior, the scPhere model uses a variational distribution \(q({{\bf{z}}}_{i}|{{\bf{y}}}_{i},{{\bf{x}}}_{i},{{\mathbf{\upphi }}}_{i})\) to approximate the posterior (Fig. 1a, “Methods”). When a hyperspherical latent space is used, \({{\bf{x}}}_{i}\) is first log-transformed and scaled to have a unit ℓ^{2} norm for inference; otherwise, \({{\bf{x}}}_{i}\) is only log-transformed but not scaled. The parameters \({{\mathbf{\upphi }}}_{i}\) of the variational distribution are (continuous) functions of \({{\bf{x}}}_{i}\) and \({{\bf{y}}}_{i}\) parameterized by a neural network with parameter \({\mathbf{\upphi }}\). As a deep-learning model trained by mini-batch stochastic gradient descent, scPhere is especially suited to process large scRNA-seq datasets with complex multilevel batch effects and facilitates emerging applications (Fig. 1b). We provide full details in the “Methods” section.
ScPhere visualizes large datasets with multiple cell types and hierarchical structures without cell crowding
Applying scPhere to scRNA-seq data shows that its spherical latent variables help address the problem of cell crowding at the origin and that it provides excellent visualization for data exploration, with easily interpretable latent-variable posterior means of cells.
To illustrate this, we applied scPhere to six scRNA-seq datasets from human and mouse, spanning small (thousands of cells) to very large (hundreds of thousands of cells) datasets from one or multiple tissues, with from a small (two) to a very large (dozens) number of expected cell types. We compared scPhere’s visualization with a hyperspherical latent space to scPhere’s VAE with a Euclidean embedding, as well as to three major general-purpose data visualization tools commonly applied to scRNA-seq data: t-SNE^{16}, UMAP^{36}, and PHATE^{37}. The “small” datasets were: (1) a blood cell dataset^{38} with only 10 erythroid cell profiles and 2293 CD14^{+} monocytes; (2) 3314 human lung cells^{39}; (3) 1378 mouse white adipose tissue stromal cells^{40}; and (4) 1755 human splenic natural killer cells spanning four subtypes^{41}. The “large” datasets were: (1) 35,699 retinal ganglion cells in 45 cell subsets^{42}; and (2) 599,926 cells spanning 102 subsets across 59 human tissues in the Human Cell Landscape^{43}.
Applying scPhere with a hyperspherical latent space to each of the “small” datasets readily distinguished cell subsets; moreover, the posterior means of cells typically did not overlap, which helped ensure that we could discern individual cells without occlusion. In each case, cells of the same type were close to each other on the surface of a sphere, and yet two cells were generally distinguishable, even by eye (Supplementary Fig. 1a–d). Conversely, when we used a standard multivariate normal prior, the posterior means of the latent variables were centered at the origin, leading to crowding (Supplementary Fig. 1e–h). Thus, in the Euclidean space, the closer cells were to the center, the higher their density, a problem persisting in both 2D (Supplementary Fig. 1e–h) and 3D (Supplementary Fig. 1i–l), even with rotation of the 3D space. In particular, similar cell types were very close to each other in the Euclidean space (e.g., APC and FIP, Supplementary Fig. 1g), and rare cell types became “outliers” (hNK_Sp3 and hNK_Sp4, Supplementary Fig. 1h). Notably, although these datasets contained discrete cell types, even scPhere with hyperbolic latent spaces performed well (Supplementary Fig. 2a). Overall, t-SNE, UMAP, and PHATE^{37} generally worked well for these smaller datasets without batch effects (Supplementary Fig. 2b–d), with some minor challenges: e.g., mixing of mouse adipose doublets and macrophages by UMAP (Supplementary Fig. 2c), and PHATE—designed for developmental trajectories—connecting cells inaccurately when only discrete cell types are present (Supplementary Fig. 2d).
ScPhere’s advantages compared to other approaches were particularly pronounced when applied to datasets with larger numbers of cells and clusters: mouse retinal ganglion cells (RGCs)^{42} and the Human Cell Landscape^{43}. While scPhere (with either spherical or hyperbolic latent space and default parameters throughout), t-SNE, and UMAP all discerned individual cell types well among RGCs (Fig. 2a–h and Supplementary Fig. 3a–c) and the Human Cell Landscape (Fig. 2i–p and Supplementary Fig. 3d–f), scPhere best preserved the hierarchical global structure in these data, grouping together sets of clusters of different subtypes of each major type (Fig. 2e–h, m–p). For example, among RGCs, all Cartpt-RGC clusters were in one part of the scPhere embedding (Fig. 2e, f), but in different parts of the t-SNE and UMAP embeddings (Fig. 2g, h). Similarly, most of the 102 cell clusters in the Human Cell Landscape organized in the scPhere embedding by their six major cell groups (fetal stromal cells, fetal epithelial cells, adult endothelial cells, endothelial cells, adult stromal cells, and immune cells) (Fig. 2m, n), but were more spread across different parts of the t-SNE and UMAP representations (Fig. 2o, p). ScPhere outperformed the other methods in preserving the hierarchical global structure, based on global kNN accuracies on both the RGC and the HCL datasets (scPhere: 73.53% and 92.21%; t-SNE: 47.06% and 79.22%; UMAP: 52.94% and 83.12%; Supplementary Fig. 3b, e, “Methods”). Moreover, with this large number of cells, t-SNE plots were increasingly “crowded”^{17,18}, such that even very distinct cell types were very close to each other in 2D (Fig. 2k), and cells from multiple clusters appeared mixed in the UMAP (Fig. 2l), as reflected both visually and by mean Silhouette scores (Mann–Whitney U test, FDR < 0.0001, Supplementary Fig. 3f, “Methods”).
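The kNN accuracy used for these comparisons reduces to classifying each point by a majority vote of its k nearest neighbors in the embedding and reporting the fraction classified correctly. A toy sketch (synthetic 2D points standing in for embedding coordinates and cell-type labels):

```python
from collections import Counter

def knn_accuracy(train_pts, train_labels, test_pts, test_labels, k=3):
    """Fraction of test points whose k nearest training points
    (by squared Euclidean distance) share the test point's label."""
    correct = 0
    for p, label in zip(test_pts, test_labels):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), lab)
            for q, lab in zip(train_pts, train_labels)
        )
        votes = Counter(lab for _, lab in dists[:k])
        if votes.most_common(1)[0][0] == label:
            correct += 1
    return correct / len(test_pts)

# Two well-separated toy "cell types" in a 2D embedding:
train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = ["A", "A", "A", "B", "B", "B"]
test = [(0.2, 0.3), (5.4, 5.2)]
assert knn_accuracy(train, labels, test, ["A", "B"], k=3) == 1.0
```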
ScPhere did not suffer from these problems, partially because it was trained using mini-batches, while t-SNE and UMAP were learned using all the data, and their parameters (especially the perplexity parameter of t-SNE) have to be adapted to this larger number of cells; however, increasing the perplexity parameter makes t-SNE computationally expensive^{18}. (In our analyses, we used a list of perplexity parameters that already grow with the number of cells^{18}; the cell-crowding problem suggests that much larger perplexity parameters are required.) By contrast, scPhere was trained with default parameters and scales to large numbers of cells, with a time complexity that is linear in the number of input cells (Supplementary Fig. 4a–i, “Methods”). Embedding cells into a Euclidean space performed worse than embedding them into hyperspherical latent spaces in terms of discerning discrete cell types or preserving their hierarchical organization (Supplementary Fig. 3g–j), because the normal prior encourages the posterior means of cells to be centered at the origin. As expected, PHATE did not perform well for these large datasets of mostly discrete cell types (Supplementary Fig. 3k–n).
ScPhere effectively models complex multilevel batch and other variables
Single-cell profiles in realistic biological datasets are typically impacted by diverse factors, including technical batch effects across separate experiments and different lab protocols, as well as biological factors, such as inter-individual variation, sex, disease, or tissue location. However, most batch-correction methods^{5,12,27,28,29,31} handle only one batch variable (which in practice is often technical) and may not be well-suited to the increasing complexity of current datasets. ScPhere, however, can learn models of data with multiple such variables.
To assess its ability to perform batch correction, we applied scPhere to a dataset of 301,749 cells we previously profiled in a complex experimental design from the colon mucosa of 18 patients with ulcerative colitis (UC), a major type of inflammatory bowel disease (IBD), and 12 healthy individuals^{44}. In addition to each individual patient biopsy being a batch, there were many other factors to consider: individuals were either healthy or had UC; cells were collected separately from the epithelial and lamina propria fractions of each biopsy; there were two replicate biopsies for each healthy individual and a pair of inflamed and uninflamed biopsies for each UC patient (for a few UC patients, there were replicate inflamed and/or replicate uninflamed biopsies)^{44}; and, finally, samples were collected at two time periods, separated by over a year (analyzed as train and test data in the original study^{44}). Notably, these factors had a substantial impact on the cells’ profiles and on the ability to integrate the data, which required a large number of dedicated and iterative steps in the original study^{44}, with optimization for the specific dataset. To test scPhere, we applied it with default parameters, in a single end-to-end process, assessed its results biologically, and compared its performance to that of three leading batch-correction methods—Harmony^{30}, LIGER^{29}, and Seurat3 CCA^{5} (the latter two can handle only one batch factor, which we chose to be the individual^{44}, as is common practice; “Methods”).
Analyzing cells with patient origin as the batch vector not only recapitulated the main cell groups in our initial study^{44} but was highly refined, allowing us to better visually explore cellular relations (Fig. 3a–c and Supplementary Movies 1–4). For example, among the stromal and glial cells, endothelial cells and microvascular cells were close to each other and adjacent to post-capillary venules. Conversely, these distinctions can barely be discerned in a UMAP plot of the same data, where endothelial and microvascular cells were very close (Supplementary Fig. 5a; using the 20 batch-corrected components from either Harmony^{30}, Seurat3 CCA^{5}, or LIGER^{29} as inputs). Among fibroblasts, cells arranged in a manner that mirrored their position along the crypt–villus axis, from RSPO3^{+}WNT2B^{+} cells (which support the ISC niche^{44}), to WNT2B^{+} cells, to WNT5B^{+} cells. Strikingly, the inflammatory fibroblasts, which are unique to UC patients^{44}, were readily visible (Fig. 3a, light blue), and were distinct from the other fibroblasts while spanning the range of the “crypt–villus axis” (as shown experimentally^{44}). ScPhere’s batch correction on this complex dataset (30 patients with disease and location factors) performed better than Harmony, Seurat3 CCA, and LIGER based on classification accuracies of cell types for stromal, epithelial, and immune cells (Fig. 3d–f and Supplementary Figs. 6–9, by either k-nearest neighbors (kNN) or logistic regression; we omitted Seurat3 CCA results for immune cells, with >200,000 cells and 30 batches, as it failed to complete). ScPhere performed well even when using fewer latent variables, which avoids the component-collapse problem in VAEs (Supplementary Fig. 10, “Methods”).
ScPhere’s ability to correct for multiple confounding factors simultaneously (which is not readily possible with many other batch-correction methods^{5,12,27,28,29}) helps in understanding the impact of biological factors. For example, when using both patient origin and disease status (healthy, uninflamed, inflamed) as the batch vector for the stromal cells, scPhere largely merged the inflammatory fibroblasts with WNT2B^{+} fibroblasts (Fig. 3g and Supplementary Movie 2). When analyzing epithelial cells with anatomical region added as a component of the batch vector, the cells grouped solely by type (e.g., stem cells separate from TA2 cells, Fig. 3h, i), whereas without it, anatomical region dominated, with cells organized in two respective parallel tracts in some regions of the sphere (Fig. 3j, k). Cell types that were mostly from one region (e.g., tuft cells, mostly from epithelial fractions) remained grouped distinctly (Fig. 3j, k). Similarly, when we did not use disease status (healthy, uninflamed, or inflamed) as a component of the batch vector, some cell types (e.g., TA2 cells, immature enterocytes, and enterocytes) had “outliers” mapped to low-density regions of the sphere (Fig. 3l), mostly from UC samples (Fig. 3m), but the cells formed more compact clusters once disease status was included (Supplementary Fig. 5b), with good mixing between cells from different disease states and patients (Supplementary Fig. 5c, d). This suggested that those cells may be impacted by the disease.
When learning a scPhere model that included patient, disease status, and anatomical region as batch factors, epithelial (Fig. 3b) and immune (Fig. 3c) cells grouped visually by type, with accurate cell classification (Fig. 3e, f), and the influence of region, disease status, and patient was largely removed (Fig. 3i and Supplementary Fig. 5c, d). For example, epithelial cells were ordered in a manner consistent with their development (Fig. 3b and Supplementary Movie 3), and CD8^{+}IL17^{+} T cells were nestled between CD8^{+} T cells and activated CD4^{+} T cells, an intriguing placement consistent with the mixed features of those cells^{44} (Fig. 3c).
ScPhere’s advantages were particularly evident when analyzing all immune, stromal, and epithelial cells simultaneously (Fig. 3n, o, Supplementary Fig. 5e, and Supplementary Movie 4), demonstrating its capacity to embed large numbers of cells of diverse types, states, and proportions. Conversely, using t-SNE or UMAP with Harmony batch-corrected results of all cells as input led to an unsuccessful visualization (Fig. 3p, q): many cell subtypes from the same general compartment became indistinguishable (e.g., clumping of WNT2B^{+} fibroblasts, RSPO3^{+} fibroblasts, and inflammatory fibroblasts), others were inexplicably split (plasma cells, which are very abundant), and yet others from different lineages were adjacent. These results demonstrate the superior performance of scPhere compared to the combination of Harmony batch correction with t-SNE or UMAP visualization when analyzing datasets with large numbers of cells and cell types, multilevel batch effects, and complex structures (discrete cell types, continuous developmental trajectories, and dominant and rare cell subsets).
ScPhere preserves the structure of scRNA-seq data even in very low-dimensional spaces
We systematically assessed scPhere’s performance when embedding into a latent space with few dimensions, comparing the kNN classification accuracy of scPhere with a hypersphere embedding to that of scPhere with a standard normal prior and normal posteriors, which embeds cells in a Euclidean latent space, as well as to t-SNE, UMAP, and PHATE (holding out cells from one patient at a time for testing). We used the UC dataset, for each of the three major cell compartments separately, with the labels from the original study^{44}. For t-SNE, UMAP, and PHATE, we used 20D or 50D Harmony batch-corrected principal components (PCs) as inputs (as Harmony can correct multilevel batch effects and performed equally well or better than LIGER^{29} or Seurat3 CCA^{5} on this dataset; Supplementary Figs. 6–9).
When using only two dimensions (Supplementary Figs. 6–8), scPhere performed significantly better than with a Euclidean latent space across all values of k (FDR < 0.05, paired t test, two-tailed), suggesting that a hyperspherical latent space introduces less distortion and is useful for data visualization. As expected, kNN classification accuracies increased with the number of latent dimensions (Supplementary Figs. 6–8).
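For embeddings on a hypersphere, neighbor search can equivalently use the geodesic (great-circle) distance, the arccosine of the dot product of unit vectors; since the Euclidean chord length is a monotone function of the geodesic distance, both give the same nearest-neighbor ranking. A small sketch of that relation:

```python
import math

def geodesic(u, v):
    """Great-circle distance between two unit vectors on a hypersphere."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp for float safety

def chord(u, v):
    """Euclidean (chord) distance between the same two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Orthogonal unit vectors are a quarter great-circle apart:
u, v = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
assert abs(geodesic(u, v) - math.pi / 2) < 1e-12

# chord = 2 * sin(geodesic / 2): monotone, so kNN rankings agree.
assert abs(chord(u, v) - 2.0 * math.sin(geodesic(u, v) / 2.0)) < 1e-12
```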
Overall, scPhere performed as well as t-SNE and UMAP based on kNN or multinomial logistic regression classification accuracies, and it performed especially well in cases with multilevel batch effects (Supplementary Figs. 7, 9b). ScPhere with hyperspherical latent spaces of dimensionality M did systematically better than scPhere with Euclidean latent spaces of dimensionality M or M + 1 (M > 3; Supplementary Fig. 7a, b). While kNN accuracies increased for all methods at five latent dimensions, further increasing the latent dimensionality did not yield substantial improvements, and with further growth accuracies even decreased. Notably, even using a 50D latent space, the kNN accuracies from Harmony were worse than those from scPhere with a 5D latent space, suggesting that scPhere better captures the structure of scRNA-seq data with multiple batch effects. We observed similar results for stromal (Supplementary Fig. 6) and immune (Supplementary Fig. 8) cells, and when using multinomial logistic regression instead of kNN accuracies (Supplementary Fig. 9).
ScPhere’s decoder, which outputs a UMI count vector for each input cell, can be used to impute and denoise expression values, either by sampling from the negative binomial distribution or by using its mean. For example, in the original UMI count data from CD8^{+} T cells in the UC dataset, CD8A and CD8B had a Pearson correlation coefficient of only 0.27, but their decoder outputs had a Pearson correlation coefficient of 0.81 (Supplementary Fig. 11a). The CD4 gene, which is not expressed in CD8^{+} T cells, was lowly expressed in both the original data and the decoder outputs, suggesting that the decoder outputs did not introduce false positives.
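The correlation statistic used here is the ordinary Pearson coefficient between two genes' expression vectors across cells. A minimal sketch; the count vectors below are invented toy data, not the CD8A/CD8B values from the paper:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Sparse raw counts of two co-expressed genes can correlate weakly...
raw_a = [0, 2, 0, 0, 1, 0, 3, 0]
raw_b = [1, 0, 0, 2, 0, 0, 0, 1]
# ...while (hypothetical) denoised decoder means correlate strongly.
den_a = [0.8, 1.9, 0.7, 0.9, 1.2, 0.6, 2.1, 0.8]
den_b = [0.9, 1.8, 0.8, 1.0, 1.1, 0.7, 2.0, 0.9]
assert pearson(den_a, den_b) > pearson(raw_a, raw_b)
```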
Querying scPhere models to recover cells impacted by different biological factors
We next used scPhere’s ability to correct for multilevel batch effects to determine which cell types were most influenced by specific biological factors, such as disease. We performed two analyses for this task. In the first approach, based on scPhere’s ability to generate denoised outputs (Fig. 1b and Supplementary Fig. 11a), we provided both disease status (healthy, uninflamed, or inflamed) and patient as the batch vector when learning a latent embedding, and obtained denoised outputs for the cells from inflamed tissues either with the original batch vector or after artificially setting “inflamed” to “healthy” in the disease batch vector (Fig. 1b). Applied to stromal and glial cells (on a 5D hypersphere), the inflammatory fibroblasts were recovered as most influenced by inflammation, as reflected by low correlations between the two denoised outputs (Fig. 4a). In the second approach (Fig. 4b), we trained kNN classifiers using cells from both healthy and non-inflamed tissue to predict cell types for cells from inflamed tissue (in the 5D hyperspherical latent space). Cell types with low true-positive rates (TPRs) were likely the most influenced by disease (inflammation). Indeed, inflammatory fibroblasts had very low TPRs compared to other cell types (~20%, Fig. 4c), with most misclassified as WNT2B^{+} fibroblasts and ~10% as WNT5B^{+} fibroblasts (Supplementary Fig. 11b), helping assess their likely origins. The results were consistent when we considered only high-confidence cells that were correctly classified when patient was the only batch vector for the scPhere analysis (Supplementary Fig. 11c).
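The second approach reduces to computing a per-cell-type true-positive rate from the classifier's predicted versus original labels. A sketch with hypothetical labels (the misclassification pattern below is invented for illustration):

```python
from collections import defaultdict

def per_type_tpr(true_labels, predicted_labels):
    """Fraction of cells of each true type that the classifier recovers."""
    total = defaultdict(int)
    hits = defaultdict(int)
    for t, p in zip(true_labels, predicted_labels):
        total[t] += 1
        if t == p:
            hits[t] += 1
    return {t: hits[t] / total[t] for t in total}

# Hypothetical labels: inflammatory fibroblasts often misclassified as
# WNT2B+ fibroblasts, giving them a low TPR (disease-impacted cell type).
truth = ["InflamFib"] * 5 + ["WNT2B+"] * 5
pred = ["WNT2B+", "WNT2B+", "InflamFib", "WNT2B+", "WNT5B+"] + ["WNT2B+"] * 5
tpr = per_type_tpr(truth, pred)
assert tpr["InflamFib"] == 0.2 and tpr["WNT2B+"] == 1.0
```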
Batch-invariant scPhere builds atlases for annotation of unseen data
As a parametric model, scPhere can be trained to co-embed unseen (test) data into a latent space learned from training data only. To demonstrate this, we first performed a ten-fold cross-validation analysis, in which we partitioned the colon fibroblasts and glial cells into ten roughly equally sized subsamples, held out one subsample as out-of-sample evaluation data, and used the remaining nine subsamples as training data to select variable genes and learn scPhere models that embed cells on a 5D hypersphere. We then trained a kNN classifier on the 5D representations of the training data and used it to classify the 5D representations of the out-of-sample evaluation data. We repeated this process ten times, with each of the ten subsamples used exactly once as the out-of-sample validation data. The kNN classifiers had a median accuracy of 0.834–0.853 (k = 5 or 65, respectively; Supplementary Fig. 11d). By comparison, when we repeated this process using pre-computed 5D representations from all fibroblasts and glial cells, accuracy was similar (0.847–0.860; the minimal two-tailed Wilcoxon signed-rank test FDR = 0.036, and for two values of k, the FDRs were >0.05; Supplementary Fig. 11d).
Next, we used scPhere to map cells from unseen patients, a key use case as multiple studies need to be integrated, by training a “batch-invariant” scPhere model (“Methods”) that takes the gene expression vectors of cells as inputs (without batch vectors; the batch vectors were only used in the decoder part of scPhere to retain its batch-correction capabilities) and maps them to a 5D hyperspherical latent space. As a test case, we learned a batch-invariant scPhere model for stromal, epithelial, or immune cells from the 18-patient training data of the UC dataset (as in the original study^{44}) and used it to map the cells from the 12-patient test data. There were multiple technical differences between the test and training data (collected nearly 2 years apart; all test cell libraries prepared with 10× Chromium v2 chemistry, versus 15 of 18 training patients’ cell libraries with 10× Chromium v1; all test data sequenced on NextSeq, but 3 of 18 training patients on HiSeq). We then trained kNN classifiers (k = 25) (using the labels from the original study^{44}) on the 5D representations of the training data and applied them to the 5D representations of the test data. ScPhere’s mapping of the test data was highly successful (Fig. 4d–f), with accuracies similar to those obtained when applying this process to representations from all cells (all 30 patients). Specifically, batch-invariant scPhere had accuracies of 0.79, 0.83, and 0.82 for stromal, epithelial, and immune cells, respectively, whereas a model trained on the full dataset had respective accuracies of 0.80, 0.87, and 0.80 (Fig. 4g).
Clustering cells following scPhere embeddings
To demonstrate how scPhere impacts clustering analysis, we clustered (using the Louvain algorithm^{45,46}) the embeddings of cells on the surface of 5D hyperspheres and compared the results to the clusters in the original study^{44} (where only patients were used as the batch vector and variable genes were selected for each patient separately to compute a census of batch-insensitive variable genes^{44}). For example, stromal and glial cells were partitioned into 18 clusters (Supplementary Fig. 12a), largely consistent with the original analysis^{44} with some minor exceptions: RSPO3^{+} fibroblasts included cells from the original WNT2B^{+} Fos^{lo} cluster, and some of the inflammatory fibroblasts were in the WNT2B^{+} fibroblast clusters, highlighting their molecular similarity. We obtained similar results with epithelial and immune cells (Supplementary Fig. 12b, c and Supplementary Movie 5), and when we used cell embeddings on the surface of a 10D hypersphere for stromal cells (Supplementary Fig. 12d), consistent with our classification results (Fig. 3d–f). As we corrected for the influences of region, disease, and patient, some immune cells with very similar molecular profiles but preferentially associated with different regions (e.g., CD69^{−} and CD69^{+} mast cells, Supplementary Fig. 12e) or disease states (cycling monocytes and macrophages, Supplementary Fig. 12f) were merged into one cluster. Notably, rare cell types were also distinct in the low-dimensional space, including cells that were missed in the original analysis (e.g., a small cluster of platelets, Supplementary Fig. 12g, cluster 33; a B-cell cluster (34) exclusively expressing IGLC7; and a monocyte cluster (28) expressing FCGR3A and RHOC). UMAP, PHATE, and scPhere with normal latent variables (all in 5D) did not perform as well in some cases (Supplementary Fig. 13a–c), both by biological inspection and by Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI) (Supplementary Fig. 13d).
For example, M cells and TA2 cells were mixed in PHATE-based clustering (Supplementary Fig. 13b), and CD8^{+} IL17^{+} T cells and CD4^{+} activated T cells were mixed in UMAP-based clustering, as were CD4^{+} PD1^{+} cells and T_{regs} (Supplementary Fig. 13c).
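The cluster-agreement metrics used above (NMI and ARI) are standard and easy to reproduce. The sketch below computes both with scikit-learn on a synthetic stand-in for 5D scPhere embeddings; KMeans replaces Louvain purely for brevity, and all data and names here are illustrative, not from the study:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

# Toy stand-in for 5D scPhere embeddings: three well-separated groups.
rng = np.random.default_rng(0)
centers = rng.standard_normal((3, 5)) * 5
true_labels = np.repeat([0, 1, 2], 100)
embedding = centers[true_labels] + rng.standard_normal((300, 5))

# Cluster the embedding (KMeans as a simple stand-in for Louvain clustering)
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embedding)

# Agreement with the reference labels, as in Supplementary Fig. 13d
nmi = normalized_mutual_info_score(true_labels, pred)
ari = adjusted_rand_score(true_labels, pred)
print(nmi, ari)
```

Both metrics are invariant to label permutations, so the arbitrary cluster numbering returned by the algorithm does not matter.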
Inferring spatial locations by embedding cells on a sphere
The scPhere model is flexible and can be extended to additional applications, including inferring the spatial locations of cells in a tissue with an appropriate structure. To demonstrate this, we focused on cells of zebrafish embryos at 50% epiboly, which are distributed on the surface of a hemisphere (or a quarter sphere, because of the symmetry of cell distributions), with gene expression gradients across the dorsal-ventral (right to left) and marginal-animal (bottom to top) axes (Fig. 5a), as well as other, punctate or salt-and-pepper patterns^{47}.
To map cells to a quarter sphere, we forced two components of the 3D coordinates to be positive and augmented the scPhere objective function to incorporate information from landmark genes (Fig. 1b, “Methods”), such that a cell expressing a given marker gene is encouraged to be mapped within the annotated portion of the quarter sphere expressing this gene (on an 8 × 8 grid^{47}). Specifically, for a cell expressing a marker gene and mapped to the quarter sphere, this modified procedure calculates the distances between the cell’s position and the grid bins annotated as expressing that gene, and minimizes the minimum of these distances. For each cell in a minibatch, we then calculate the sum of these minimum distances over all its marker genes, and the final objective function is the original scPhere objective function plus the mean of these sums across the cells in the minibatch. Importantly, even if the landmarks themselves were not measured at single-cell resolution, scPhere only uses them as weak supervision, and maps cells continuously on the surface, rather than to bins.
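A minimal numpy sketch of the landmark penalty described above, under simplifying assumptions (Euclidean rather than great-circle distances, and hypothetical variable names; this is not scPhere's actual code):

```python
import numpy as np

def landmark_penalty(cell_pos, expressed, bin_centers, marker_to_bins):
    """Sketch of the weak-supervision landmark penalty (illustrative only).

    cell_pos:       (n_cells, 3) cell positions on the quarter sphere
    expressed:      (n_cells, n_markers) boolean, marker j detected in cell i
    bin_centers:    (n_bins, 3) 3D centers of the 8 x 8 grid bins
    marker_to_bins: per marker, indices of bins annotated as expressing it
    """
    total = 0.0
    for i in range(cell_pos.shape[0]):
        for j in np.flatnonzero(expressed[i]):
            # distances from the cell to every bin where marker j is annotated
            d = np.linalg.norm(bin_centers[marker_to_bins[j]] - cell_pos[i], axis=1)
            total += d.min()  # penalize only the nearest annotated bin
    return total / cell_pos.shape[0]  # mean over the cells in the minibatch
```

A cell sitting exactly inside an annotated region for all of its markers contributes zero penalty, so the term only pulls misplaced cells toward their expected regions.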
This simple modification enables scPhere to map the cells spatially. We trained scPhere with only 1406 zebrafish embryonic cells^{48} and 11 landmark genes^{47} (Fig. 5a, “Methods”), spanning ventral, animal-ventral, dorsal, animal-dorsal, and marginal genes (but no animal genes) on an 8 × 8 grid on the quarter sphere^{47}. The ventral marker cdx4, dorsal marker gsc, and marginal marker osr1 were expressed in their expected regions (Fig. 5b), as was the animal marker gene sox3, even though we did not use any animal genes in training (Fig. 5b). After mapping cells to the quarter sphere, we calculated the spatial gene expression patterns^{47}, and the results (Fig. 5c) matched the expected patterns (Fig. 5a). We then used the trained scPhere model to map 3820 cells from another three batches (Fig. 5d), obtaining consistent spatial patterns (Fig. 5b). Finally, from the mapped cells, we could correctly predict patterns for genes not included in the training, including salt-and-pepper patterns and random sparse patterns for “apoptotic-like” cells (Supplementary Fig. 14). Notably, this mapping approach can be extended to non-spherical shapes by transforming cells distributed on a plane to complex shapes (see “Discussion”).
Embedding cells in a hyperbolic space for trajectory discovery and interpretation
When cells are expected to follow developmental trajectories, such as from adult stem cells to differentiated cells, scPhere can embed them into a hyperbolic space of the Lorentz model^{24,25}, and optionally convert the coordinates in the Lorentz model to the Poincaré disk for 2D visualization^{34,49}. Moreover, if we position the expected root cells of the developmental process at the center of a Poincaré disk, the distance of each cell from the center can be interpreted as a pseudotime^{20,26,50}. For a specific cell type, cells progress continuously with distance and angle in the Poincaré disk. We can also encourage mapping root cells (if they are known a priori) to the origin of the Lorentz model during training.
Applying this first to colon epithelial cells, we readily discerned the developmental ordering from intestinal stem cells to terminally differentiated cells in either the Poincaré disk (Fig. 6a), with stem cells at the center of the disk for intuitive interpretation, or the Lorentz model (Supplementary Fig. 15a): the two major cell development trajectories were clearly delineated (Fig. 6a, arrows connecting median coordinates of cells of different types), and M cells and Best^{+} enterocytes were close to each other. PHATE^{37} visualization using the 5D representations of cells in the Lorentz model as inputs recapitulated the results from the 2D representations (Supplementary Fig. 15b). In contrast, developmental trajectories were less apparent when we embedded cells in Euclidean space (Fig. 6b) or when we applied PHATE multidimensional scaling to the 5D representations of cells in Euclidean space (Supplementary Fig. 15c), with cells in the two major developmental branches lying close to each other. 2D visualization with t-SNE, UMAP, and PHATE was reasonable (Supplementary Fig. 15d–f), although t-SNE produced some small spurious clusters, in UMAP cycling TAs were intermediate between stem cells and secretory TAs (which can differentiate directly), and in PHATE several cell types were merged (M cells and TA2 cells, tuft and enteroendocrine cells).
Next, we analyzed 86,024 C. elegans embryonic cells^{51} collected along a time course from <100 min to >650 min after each embryo’s first cleavage, finding that cells were ordered neatly in the latent space by both time and lineage, from a clearly discernible root at 100–130 min at the center of the Poincaré disk (cells from <100 min were mostly unfertilized germline cells, “Methods”) to cells from >650 min near the border of the Poincaré disk (Fig. 6c, d and Supplementary Fig. 16), or away from the origin in the Lorentz model (Supplementary Fig. 17a, b). Within the same cell type, cells were ordered by embryo time in the Poincaré disk (Fig. 6d) or in the Lorentz model (Supplementary Fig. 17a, b). After first appearing along a developmental trajectory, cells of the same type progressed with embryo time, forming a continuous trajectory occupying a range of angles. For example, cells of the body wall muscle (BWM, the most abundant cell type in this dataset, Supplementary Fig. 16) first appeared at embryo time 130–170 min in a separable position (bottom left of the Poincaré disk, Fig. 6e), and then “advanced” toward the bottom right of the Poincaré disk in a continuous progression aligned with embryo time (i.e., from 170–210 min to >650 min) and lineage (i.e., from first- and second-row BWMs (MS lineage) to anterior (MS to D lineage) and posterior BWMs (C lineage)^{51}, Supplementary Fig. 17c). Moreover, different cell types (e.g., ciliated amphid neurons, ciliated non-amphid neurons, hypodermis, G2 and W blasts, seam cells, and body wall muscle) that appeared at slightly different embryonic time points had their origins around the same region and progressed with embryonic time in a similar way, forming continuous trajectories at different angles and/or distance ranges from the center (Fig. 6d, arrows). Accordingly, cells’ distances to the origin were correlated with their embryonic time (Pearson correlation coefficient = 0.55, Supplementary Fig. 17d).
For a few rare cell types that appeared relatively late in a developmental trajectory, such as coelomocytes (appearing at 270–330 min), distances to the origin could be negatively correlated with embryonic time, and re-centering their embeddings can help interpret their trajectories “locally”.
These patterns are harder to discern in UMAP, t-SNE, or PHATE (Fig. 6e, f, with 50 batch-corrected PCs from Harmony as inputs; Supplementary Figs. 18a, b and 19), where cells from consecutive time points were compacted, cells that appeared early were relatively distant from each other in the embeddings, and temporal progression did not follow a single direction. Moreover, when we quantified time continuity by comparing k-nearest-neighbor time-point classification accuracies (in a tenfold cross-validation analysis), accuracies from scPhere (in 2D) were higher than those from t-SNE, UMAP, and PHATE (in 2D, Supplementary Fig. 18c). Thus, a scPhere model with a hyperbolic latent space learned smooth (in time) and interpretable cell trajectories and helped represent developmental and other temporal processes.
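The time-continuity metric is straightforward to reproduce. Below is a sketch using scikit-learn's k-NN classifier with tenfold cross-validation on a synthetic 2D embedding whose radius grows with time, mimicking a root-to-border temporal progression (all data here are illustrative, not from the study):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 2D embedding: five time bins, radius increases with time.
rng = np.random.default_rng(1)
time_labels = np.repeat(np.arange(5), 200)
radius = 0.2 * time_labels + rng.normal(0, 0.03, 1000)
angle = rng.uniform(0, 2 * np.pi, 1000)
emb = np.c_[radius * np.cos(angle), radius * np.sin(angle)]

# k-NN time-point classification accuracy under tenfold cross-validation
acc = cross_val_score(KNeighborsClassifier(n_neighbors=15), emb, time_labels, cv=10)
print(acc.mean())
```

When consecutive time points occupy adjacent regions of the embedding, the k-NN accuracy is high; compacted or shuffled time points lower it, which is what makes this a useful continuity score.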
Discussion
We introduced scPhere, a deep generative model to embed single cells on hyperspheres or in hyperbolic spaces to enhance exploratory data analysis and visualization of cells from single-cell studies, especially those with complex, multilevel batch factors. ScPhere provides more readily interpretable representations and avoids occlusion, as we demonstrate in diverse systems, and, when embedding cells in hyperbolic spaces, it helps in studying developmental trajectories. In the latter case, in addition to providing compelling visualizations compared to state-of-the-art methods, placing root cells at the center of a Poincaré disk yields a natural definition of pseudotime as the distance to the center, with cells of a given type progressing continuously with distance and angle in the Poincaré disk.
A major advantage of scPhere is in effectively accounting for multilevel, complex batch effects, which we show disentangles cell types from patient, disease, and location variables. We can harness this ability in several ways: to visualize and cluster cells while controlling for one or more factors and examining the influence of any combination of them; to investigate which cell types are most affected by a factor (e.g., disease status or location); or to generate batch-invariant reference embeddings, to which additional data from new individuals, samples, or conditions can be mapped. In this study, we parameterized the dispersion parameters of the negative-binomial distributions as functions of cell count vectors; for some tasks, we may instead treat the dispersion parameters as fixed values (not functions of cell count vectors) and optimize them directly. ScPhere’s ability to handle complex batch factors is an advantage over previous methods for batch correction (e.g., SAUCIE^{27}, scVI^{12}, LIGER^{29}, Seurat3 CCA^{5}, fastMNN^{19}, Scanorama^{28}, and Conos^{31}), which handle only one batch vector. Indeed, in our benchmarking with IBD cells from 30 patients, three disease statuses, and two spatial locations, scPhere performed better than state-of-the-art batch-correction methods such as Harmony, Seurat3 CCA, and LIGER. In the future, we can leverage supervised information to further estimate the uncertainty of aligning cells across batches. In addition, as a parametric model, scPhere can naturally co-embed unseen (test) data into a latent space learned from training data only, and can denoise expression data successfully.
ScPhere is especially useful for analyzing large scRNA-seq datasets: it is efficient, scaling linearly with the number of input cells; it does not suffer from “cell crowding” even with large numbers of input cells; and it better preserves hierarchical, global structures in data than competing methods. Finally, by learning a “batch-invariant” encoder that takes gene expression as input to produce latent embeddings, it forms a reference onto which newly profiled cells from future studies can be mapped and annotated^{52}. This is another major advantage over nonparametric methods such as t-SNE, UMAP, and Poincaré maps, which have no natural way to embed new data, especially in the presence of batch effects, and have scalability issues. These features should make scPhere well suited to the challenge of building comprehensive reference maps, in health, such as the Human Cell Atlas^{3}, as well as in disease, such as the Human Tumor Atlas Network^{53}.
The scPhere model is robust to hyperparameters. Here, we used the same hyperparameters for scPhere analyses of all nine datasets (ranging from ~1000 to >300,000 cells), whereas previous studies^{54} showed that classical variational autoencoders can be sensitive to hyperparameters. ScPhere’s robustness may stem from the robust negative-binomial distribution used to model UMI counts, or from the use of non-Euclidean latent spaces, which help solve the cell-crowding problem in the latent space.
One key extension we have shown for scPhere is modifying it to map cells spatially. As our first example, we mapped zebrafish embryonic cells to a quarter sphere to infer the spatial locations of cells in a tissue, because a sphere is an appropriate model at this developmental phase. The only extra input we provided was the (binned) spatial expression patterns of a handful of landmark genes^{47}. The resulting model retains scPhere’s scalability and parametric nature, which allows mapping new cells. Importantly, this approach can be readily extended to other tissues with complex, non-spherical shapes (e.g., the mouse hippocampus) by transforming cells distributed on a plane to such complex shapes, using methods such as normalizing flows^{55}. Our approach to spatial mapping is distinct in that we use the global shape of the physical space as a constraint, whereas most approaches do not consider it at all, and those that do, such as novoSPARC^{56}, incorporate only continuity assumptions, which cannot capture many spatial patterns.
ScPhere can be extended in several other ways. When cell-type annotations or cell-type marker genes are available for some of the analyzed cells, we can include semi-supervised learning to annotate cell types^{57,58}. Although scPhere showed promising denoising results, further studies are required to explore its abilities in imputing missing counts in scRNA-seq data and removing ambient RNA contamination^{59,60}. Given the rapid development of spatial transcriptomics^{61,62}, single-cell ATAC-seq^{63,64}, and other complementary measurements, scPhere can be extended for integrative analysis of multimodal data. We can also learn discrete hierarchical trees for better interpreting developmental trajectories, use more complex topological latent spaces such as tori with diffusion VAEs^{65}, and even learn optimal latent spaces using mixed-curvature VAEs^{66}. Additional developments can extend scPhere to model perturbation data. Moreover, there are not yet many tools for processing data distributed in hyperbolic space, such as efficient k-NN search tools, and future studies can address this gap. Given its scope, flexibility, and extensibility, we foresee that scPhere will be a valuable tool for large-scale single-cell and spatial genomics studies.
Methods
Mapping scRNA-seq data to a hyperspherical latent space
ScPhere receives as input a scRNA-seq dataset \(D=\{({{\bf{x}}}_{i},{{\bf{y}}}_{i})\}_{i=1}^{N}\), where \({{\bf{x}}}_{i}\in {{\mathbb{R}}}^{D}\) is the gene expression vector of cell i, D is the number of measured genes, \({{\bf{y}}}_{i}\) is a categorical variable vector specifying the batch in which \({{\bf{x}}}_{i}\) was measured, and N is the number of cells. Although \({{\bf{x}}}_{i}\) is high-dimensional, its intrinsic dimensionality is typically much lower. We therefore assume that the distribution of \({{\bf{x}}}_{i}\) is governed by a much lower-dimensional vector \({{\bf{z}}}_{i}\), and the joint distribution is factorized as follows (Fig. 1a):

\(p({{\bf{x}}}_{i},{{\bf{y}}}_{i},{{\bf{z}}}_{i}\,|\,{{\mathbf{\uptheta }}}_{i})=p({{\bf{x}}}_{i}\,|\,{{\bf{y}}}_{i},{{\bf{z}}}_{i},{{\mathbf{\uptheta }}}_{i})\,p({{\bf{y}}}_{i}\,|\,{{\mathbf{\uptheta }}}_{i})\,p({{\bf{z}}}_{i}\,|\,{{\mathbf{\uptheta }}}_{i})\)
Here \(p({{\bf{y}}}_{i}\,|\,{{\mathbf{\uptheta }}}_{i})\) is a categorical distribution, and \(p({{\bf{z}}}_{i}\,|\,{{\mathbf{\uptheta }}}_{i})\) is the prior distribution for \({{\bf{z}}}_{i}\) (\({{\bf{z}}}_{i}\in {{\mathbb{R}}}^{M},{{\bf{z}}}^{T}{\bf{z}}=1,M\ll D\)), which is assumed to be a uniform distribution on a hypersphere with density \({\left(\frac{2{\pi }^{M/2}}{\varGamma (M/2)}\right)}^{-1}\). For notational simplicity, we use the bold symbol \({{\mathbf{\uptheta }}}_{i}\) to represent the parameters of each distribution; e.g., the parameters \({{\mathbf{\uptheta }}}_{i}\) in \(p({{\bf{y}}}_{i}\,|\,{{\mathbf{\uptheta }}}_{i})\) and \(p({{\bf{z}}}_{i}\,|\,{{\mathbf{\uptheta }}}_{i})\) denote the parameters of two different distributions.
For scRNA-seq data, the observed Unique Molecular Identifier (UMI) count of gene j in cell i has typically been assumed to follow a zero-inflated negative-binomial (ZINB) distribution^{11,12,67}. However, a recent study suggests that zero inflation is an artifact of normalizing UMI counts^{68}, and negative-binomial distributions generally fit UMI counts well^{69,70,71}. We therefore assume a negative-binomial distribution for the observations in this study:

\({{\bf{x}}}_{i}\,|\,{{\bf{y}}}_{i},{{\bf{z}}}_{i}\sim {\rm{NB}}({\mu }_{{y}_{i},{{\bf{z}}}_{i}},{\sigma }_{{y}_{i},{{\bf{z}}}_{i}})\)
The negative-binomial parameters, mean \({\mu }_{{y}_{i},{{\bf{z}}}_{i}} > 0\) and dispersion \({\sigma }_{{y}_{i},{{\bf{z}}}_{i}} > 0\), are specified by a model neural network (decoder), which can capture complex nonlinear relationships between the latent variables and the observations.
We next want to compute the posterior distribution \(p({{\bf{z}}}_{i}\,|\,{{\bf{y}}}_{i},{{\bf{x}}}_{i},{{\mathbf{\uptheta }}}_{i})\), which is assumed to be a von Mises–Fisher (vMF) distribution on a unit hypersphere of dimensionality \(M-1\): \({{\mathbb{S}}}^{M-1}=\{{\bf{z}}\,|\,{\bf{z}}\in {{\mathbb{R}}}^{M},{{\bf{z}}}^{T}{\bf{z}}=1\}\). Because the model is parameterized by a neural network, exact inference is intractable, so we turn to variational inference to find a \(q({{\bf{z}}}_{i}\,|\,{{\bf{y}}}_{i},{{\bf{x}}}_{i},{{\mathbf{\upphi }}}_{i})\) that approximates the posterior. In addition, the number of parameters to estimate grows with the number of cells, because each cell has a “local” distribution with parameter \({{\mathbf{\upphi }}}_{i}\). To scale to large datasets, variational autoencoders use an inference neural network (encoder, with a fixed number of parameters) to output the “local” parameter \({{\mathbf{\upphi }}}_{i}\) of each cell. The learning objective is therefore to find the model and inference neural network parameters that maximize the evidence lower bound (ELBO):

\({\mathscr{L}}=-{\mathbb{K}}{\mathbb{L}}\left(q({{\bf{z}}}_{i}\,|\,{{\bf{y}}}_{i},{{\bf{x}}}_{i},{{\mathbf{\upphi }}}_{i})\,\Vert \,p({{\bf{z}}}_{i}\,|\,{{\mathbf{\uptheta }}}_{i})\right)+{{\mathbb{E}}}_{q({{\bf{z}}}_{i}|{{\bf{y}}}_{i},{{\bf{x}}}_{i},{{\mathbf{\upphi }}}_{i})}\left[{\log}\,p({{\bf{x}}}_{i}\,|\,{{\bf{y}}}_{i},{{\bf{z}}}_{i},{{\mathbf{\uptheta }}}_{i})\right]\quad (1)\)
The Kullback–Leibler (\({\mathbb{K}}{\mathbb{L}}\)) divergence^{72} in Eq. (1) can be calculated analytically (below). We use Monte Carlo integration (sampling from the vMF distribution \(q({{\bf{z}}}_{i}\,|\,{{\bf{y}}}_{i},{{\bf{x}}}_{i},{{\mathbf{\upphi }}}_{i})\)) to calculate the second term.
To make scPhere robust to small perturbations (e.g., in sequencing depth) and to stabilize training, we add a penalty term to the objective function in Eq. (1). Specifically, for each gene expression vector \({{\bf{x}}}_{i}\), we downsample \({{\bf{x}}}_{i}\) by keeping 80% (downsampling ratio 20%) of its UMIs to produce \({\hat{{\bf{x}}}}_{i}\). The latent representations of \({{\bf{x}}}_{i}\) and \({\hat{{\bf{x}}}}_{i}\) are \({{\bf{z}}}_{i}\) and \({\hat{{\bf{z}}}}_{i}\), respectively. The penalty term is defined as \(\mathop{\sum }\nolimits_{j=1}^{M}{({z}_{i,j}-{\hat{z}}_{i,j})}^{2}\), as we want \({{\bf{z}}}_{i}\) and \({\hat{{\bf{z}}}}_{i}\) to be close. Even when increasing the downsampling ratio to 50% or 80%, scPhere (with hyperspherical latent spaces, in which the distance between two points is at most 2) produced similar results, as reflected by k-NN accuracies (Supplementary Fig. 20a). k-NN accuracies were lowest when this penalty term was removed (i.e., downsampling ratio = 0). For hyperbolic latent spaces, adding this term helps stabilize training; otherwise, the ELBO is more likely to become NaN during training. For example, for the mouse retina neuron dataset, the ELBO became NaN after training ~80,000 minibatches without this penalty term.
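The downsampling penalty can be sketched as binomial thinning of the UMI vector followed by a squared distance between the two embeddings. The toy encoder below is a hypothetical stand-in for scPhere's encoder, used only to make the sketch runnable:

```python
import numpy as np

rng = np.random.default_rng(0)

def downsample_umis(x, keep=0.8):
    """Binomially thin UMI counts, keeping each UMI with probability `keep`
    (an illustrative stand-in for the downsampling described in the text)."""
    return rng.binomial(x, keep)

def toy_encoder(x):
    """Hypothetical stand-in for scPhere's encoder: project log1p counts onto
    the unit hypersphere, mimicking a hyperspherical latent space."""
    z = np.log1p(x).astype(float)
    return z / np.linalg.norm(z)

x = rng.poisson(5, size=200)        # one cell's UMI counts over 200 genes
z = toy_encoder(x)
z_hat = toy_encoder(downsample_umis(x))

penalty = np.sum((z - z_hat) ** 2)  # the extra term added to the ELBO
print(penalty)
```

On a unit hypersphere the Euclidean distance between two embeddings is at most 2, so the penalty is bounded by 4, which keeps its scale comparable to the ELBO terms.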
The von Mises–Fisher (vMF)^{73} distribution represents angular observations as points on the surface of a unit-radius hypersphere. Let \({\bf{z}}\) be an M-dimensional random vector with unit norm (\({{\bf{z}}}^{T}{\bf{z}}=1\)); then its probability density function is:

\({\rm{vMF}}({\bf{z}}\,|\,{\mathbf{\upmu }},\kappa )={C}_{M}(\kappa ){e}^{\kappa {{\mathbf{\upmu }}}^{T}{\bf{z}}},\quad {C}_{M}(\kappa )=\frac{{\kappa }^{M/2-1}}{{(2\pi )}^{M/2}{I}_{M/2-1}(\kappa )}\)
where \({{\mathbf{\upmu }}}^{T}{\mathbf{\upmu }}=1\) is the mean direction vector (not the mean) and κ ≥ 0 is the concentration parameter. The greater the value of κ, the higher the concentration of the distribution around the mean direction vector μ. When κ = 0, \({\rm{vMF}}({\bf{z}}\,|\,{\mathbf{\upmu }},0)={\left(\frac{2{\pi }^{M/2}}{\varGamma (M/2)}\right)}^{-1}\) is the uniform distribution on the unit hypersphere \({{\mathbb{S}}}^{M-1}\). \({C}_{M}(\kappa )\) is a constant normalization factor and \({I}_{\nu }(\cdot )\) is the modified Bessel function of the first kind of order \(\nu\)^{74}: \({I}_{\nu }(\kappa )={\left(\frac{\kappa }{2}\right)}^{\nu }\mathop{\sum }\nolimits_{t=0}^{\infty }\frac{{({\kappa }^{2}/4)}^{t}}{t!\,\varGamma (\nu +t+1)}\). The Gamma function is defined as \(\varGamma (x)={\int }_{0}^{\infty }{s}^{x-1}{e}^{-s}{ds}\).
For random vectors distributed on the surface of a hypersphere, a natural prior is the uniform distribution, i.e., the vMF distribution with zero concentration, \({\rm{vMF}}({\bf{z}}\,|\,{\mathbf{\upmu }},0)\). In this case, the Kullback–Leibler (\({\mathbb{K}}{\mathbb{L}}\)) divergence^{72} can be written in closed form:

\({\mathbb{K}}{\mathbb{L}}\left({\rm{vMF}}({\mathbf{\upmu }},\kappa )\,\Vert \,{\rm{vMF}}(\cdot ,0)\right)=\kappa \frac{{I}_{M/2}(\kappa )}{{I}_{M/2-1}(\kappa )}+{\log}\,{C}_{M}(\kappa )-{\log}\left(\frac{\varGamma (M/2)}{2{\pi }^{M/2}}\right)\quad (2)\)

with \({C}_{M}(\kappa )\) the vMF normalization constant defined above.
Notice that Eq. (2) is independent of the mean direction vector \({\mathbf{\upmu }}\) (as \({{\mathbf{\upmu }}}^{T}{\mathbf{\upmu }}=1\)), so we only need to take the derivative of Eq. (2) w.r.t. \(\kappa\) during optimization. In other words, minimizing the \({\mathbb{K}}{\mathbb{L}}\) divergence only forces the concentration parameter \(\kappa\) to be close to zero, without any force on the mean direction vector. This is different from using a location-scale family of priors, such as a standard normal prior, where the prior encourages the posterior means of all points to be close to zero. When \(\nu \ll \kappa\), \({I}_{\nu }(\kappa )\) overflows quite rapidly as \(\kappa\) grows. To avoid numeric overflow, we use the exponentially scaled modified Bessel function \({e}^{-\kappa }{I}_{\nu }(\kappa )\) in calculations (the scaling is motivated by the asymptotic expansion \({I}_{\nu }(\kappa )\sim {e}^{\kappa }{(2\pi \kappa )}^{-1/2}\mathop{\sum}\nolimits_{t}{\alpha }_{t}(\nu ){\kappa }^{-t}\) for \(\kappa \to \infty\)^{75}). The first-order derivative of the exponentially scaled modified Bessel function is

\(\frac{d}{d\kappa }\left[{e}^{-\kappa }{I}_{\nu }(\kappa )\right]={e}^{-\kappa }\left({I}_{\nu +1}(\kappa )+\left(\frac{\nu }{\kappa }-1\right){I}_{\nu }(\kappa )\right)\)
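Assuming SciPy is available, the first-order derivative of the exponentially scaled modified Bessel function \(e^{-\kappa}I_{\nu}(\kappa)\) can be checked against a central finite difference (`scipy.special.ive` computes exactly this scaled function):

```python
import numpy as np
from scipy.special import ive  # exponentially scaled Bessel: e^{-k} I_v(k)

def ive_grad(v, k):
    # d/dk [e^{-k} I_v(k)] = e^{-k} (I_{v+1}(k) + (v/k - 1) I_v(k)),
    # using the recurrence I_v'(k) = I_{v+1}(k) + (v/k) I_v(k)
    return ive(v + 1, k) + (v / k - 1.0) * ive(v, k)

# compare the analytic derivative to a central finite difference
v, k, h = 2.5, 10.0, 1e-5
numeric = (ive(v, k + h) - ive(v, k - h)) / (2 * h)
print(abs(ive_grad(v, k) - numeric))
```

Working with the scaled function keeps all intermediate values of moderate magnitude even for large \(\kappa\), which is the point of the trick described above.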
Previous work has used the vMF distribution as the latent distribution for variational autoencoders^{25,32,33}, but only the spherical variational autoencoder^{23} learns the concentration parameter \(\kappa\).
Samples from vMF distributions can be obtained through a rejection sampling scheme^{76,77}. The algorithm is based on the theorem^{77} that an \(M\)-dimensional vector \({\bf{z}}={(\sqrt{1-{\omega }^{2}}{{\bf{v}}}^{T},\omega )}^{T}\) has a vMF distribution with direction vector \({(0,\ldots ,0,1)}^{T}\in {{\mathbb{S}}}^{M-1}\) and concentration parameter \(\kappa\) if \(\omega\) has the following univariate density function:

\(f(\omega )=\frac{{e}^{\kappa \omega }{(1-{\omega }^{2})}^{(M-3)/2}}{{C}_{\kappa }B\left(\frac{1}{2},\frac{M-1}{2}\right)}\quad (3)\)
where \({\bf{v}}\) is uniformly distributed on \({{\mathbb{S}}}^{M-2}\), \({C}_{\kappa }\) is a normalization term such that \(f(\omega )\) is a legitimate density function, and \(B(x,y)=\frac{\varGamma (x)\varGamma (y)}{\varGamma (x+y)}\) is the Beta function. The vector \({\bf{v}}\) can be sampled by drawing from a standard normal distribution in \(M-1\) dimensions and normalizing the resulting sample to unit length.
We then use rejection sampling to sample \(\omega\) from the univariate distribution in Eq. (3). The envelope function used for rejection sampling is defined as

\(g(\omega )=\frac{2{b}^{(M-1)/2}{(1-{\omega }^{2})}^{(M-3)/2}}{B\left(\frac{M-1}{2},\frac{M-1}{2}\right){\left((1+b)-(1-b)\omega \right)}^{M-1}}\quad (4)\)
where the term^{78} \(b=\frac{M-1}{2\kappa +\sqrt{4{\kappa }^{2}+{(M-1)}^{2}}}\). To sample from \(g(\omega )\), we can first sample \(\varepsilon \sim {\rm{Beta}}(\frac{M-1}{2},\frac{M-1}{2})\) and pass the sample \(\varepsilon\) through the invertible function \(h(\varepsilon )=\frac{1-(1+b)\varepsilon }{1-(1-b)\varepsilon }\). We can easily prove that \(\omega =h(\varepsilon )\) is distributed according to Eq. (4), based on the rule for transforming a continuous random variable with an invertible function. A sample \(\omega\) is accepted if \(\kappa \omega +(M-1){\log}(1-{x}_{0}\omega )-c\ge {\log}(u)\), where \({x}_{0}=\frac{1-b}{1+b}\), \(c=\kappa {x}_{0}+(M-1){\log}(1-{x}_{0}^{2})\), and \(u\) is sampled from a continuous uniform distribution with support \((0,1)\). The vector \({{\bf{z}}}^{\prime}={(\sqrt{1-{\omega }^{2}}{{\bf{v}}}^{T},\omega )}^{T}\) is then a sample from \({\rm{vMF}}({{\bf{z}}}^{\prime}\,|\,{{\bf{e}}}_{1},\kappa )\), where \({{\bf{e}}}_{1}={(0,\ldots ,0,1)}^{T}\in {{\mathbb{S}}}^{M-1}\). We can then reflect \({{\bf{z}}}^{\prime}\) with the Householder matrix \({\bf{I}}-2{\bf{u}}{{\bf{u}}}^{T}\) to get a sample from \({\rm{vMF}}({\bf{z}}\,|\,{\mathbf{\upmu }},\kappa )\)^{23}, where \({\bf{I}}\) is the identity matrix of rank \(M\) and \({\bf{u}}=\frac{{{\bf{e}}}_{1}-{\mathbf{\upmu }}}{\parallel {{\bf{e}}}_{1}-{\mathbf{\upmu }}\parallel }\), with \(\parallel \cdot \parallel\) the Euclidean norm. Overall, samples from a Beta distribution are transformed and accepted or rejected by the rejection sampling scheme, then combined with samples \({\bf{v}}\) from a uniform distribution on \({{\mathbb{S}}}^{M-2}\), and the combined samples are further transformed to generate samples from the desired vMF distribution. Remarkably, previous work has shown that the reparameterization trick still holds for these samples^{23} and can be used to optimize the vMF parameters \({\mathbf{\upmu }}\) and \(\kappa\), which are the outputs of the inference neural network (encoder).
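Putting the pieces together, a compact (unvectorized) reference implementation of the rejection sampler follows; this is an illustrative sketch of the scheme described above, not scPhere's code:

```python
import numpy as np

def sample_vmf(mu, kappa, n, rng):
    """Draw n samples from vMF(mu, kappa) via Beta-based rejection sampling
    plus a Householder reflection (illustrative reference implementation)."""
    mu = np.asarray(mu, dtype=float)
    m = mu.size
    b = (m - 1) / (2 * kappa + np.sqrt(4 * kappa**2 + (m - 1) ** 2))
    x0 = (1 - b) / (1 + b)
    c = kappa * x0 + (m - 1) * np.log(1 - x0**2)
    out = np.empty((n, m))
    for i in range(n):
        while True:  # rejection loop for the scalar omega
            eps = rng.beta((m - 1) / 2, (m - 1) / 2)
            omega = (1 - (1 + b) * eps) / (1 - (1 - b) * eps)
            u = rng.uniform()
            if kappa * omega + (m - 1) * np.log(1 - x0 * omega) - c >= np.log(u):
                break
        v = rng.standard_normal(m - 1)  # uniform direction on S^{m-2} ...
        v /= np.linalg.norm(v)          # ... via a normalized Gaussian
        z = np.append(np.sqrt(1 - omega**2) * v, omega)  # vMF about e1 = (0,...,0,1)
        e1 = np.zeros(m)
        e1[-1] = 1.0
        w = e1 - mu
        if np.linalg.norm(w) > 1e-12:   # Householder reflection sending e1 to mu
            w /= np.linalg.norm(w)
            z = z - 2 * (w @ z) * w
        out[i] = z
    return out

rng = np.random.default_rng(0)
mu = np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0)
samples = sample_vmf(mu, kappa=100.0, n=300, rng=rng)
print(samples.mean(axis=0) @ mu)  # concentrates near 1 for large kappa
```

For large \(\kappa\) the samples cluster tightly around the mean direction, while \(\kappa = 0\) would reduce to the uniform distribution on the hypersphere.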
For visualization purposes, we typically set \(M=3\). Then the univariate density function becomes \(f(\omega )=\frac{{e}^{\kappa \omega }}{{C}_{\kappa }B(\frac{1}{2},1)}=\frac{\kappa }{{e}^{\kappa }-{e}^{-\kappa }}{e}^{\kappa \omega }=\frac{\kappa }{2\,{\rm{sinh}}(\kappa )}{e}^{\kappa \omega }\), where \({\rm{sinh}}(\cdot )\) is the hyperbolic sine function. We can directly draw samples from this density by transforming a sample \(\xi\), generated from a continuous uniform distribution \(\xi \sim {\rm{Unif}}(0,1)\), using the inverse of the cumulative distribution function \(F(t)={\int }_{-1}^{t}\frac{\kappa }{2\,{\rm{sinh}}(\kappa )}{e}^{\kappa \omega }d\omega =\frac{1}{2\,{\rm{sinh}}(\kappa )}({e}^{\kappa t}-{e}^{-\kappa })\). Specifically, solving \(F(\omega )=\xi\) gives the sample \(\omega ={F}^{-1}(\xi )=\frac{1}{\kappa }{\log}\left({e}^{-\kappa }+2\,{\rm{sinh}}(\kappa )\,\xi \right)\).
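For the \(M=3\) case, the inverse-CDF sampler takes only a few lines. The empirical mean can be checked against the analytic mean \(\coth(\kappa )-1/\kappa\), which follows from integrating \(\omega f(\omega )\) by parts (a small illustrative check, not scPhere's code):

```python
import numpy as np

rng = np.random.default_rng(0)
kappa = 4.0

# draw xi ~ Unif(0,1) and invert F(t) = (e^{kappa t} - e^{-kappa}) / (2 sinh kappa)
xi = rng.uniform(size=100_000)
omega = np.log(np.exp(-kappa) + 2.0 * np.sinh(kappa) * xi) / kappa

# analytic mean of omega under f(omega) = kappa e^{kappa omega} / (2 sinh kappa)
analytic_mean = 1.0 / np.tanh(kappa) - 1.0 / kappa
print(omega.mean(), analytic_mean)
```

Because the inverse CDF is available in closed form here, no rejection step is needed, which is why the \(M=3\) case used for visualization is particularly cheap.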
Poincaré ball and Lorentz model of the hyperbolic space
The Poincaré ball model represents the hyperbolic space as the interior of a unit ball in Euclidean space: \({\mathbb{P}}=\{{\bf{z}}\in {{\mathbb{R}}}^{M+1}\,|\,\parallel {\bf{z}}\parallel < 1,{z}_{0}=0\},M\in {{\mathbb{Z}}}^{+}\), where \({\bf{z}}={({z}_{0},\ldots ,{z}_{M})}^{T}\). The distance between two points \({{\bf{z}}}_{1},{{\bf{z}}}_{2}\in {\mathbb{P}}\) is defined as:

\(d({{\bf{z}}}_{1},{{\bf{z}}}_{2})={{\rm{cosh}}}^{-1}\left(1+2\frac{\parallel {{\bf{z}}}_{1}-{{\bf{z}}}_{2}{\parallel }^{2}}{(1-\parallel {{\bf{z}}}_{1}{\parallel }^{2})(1-\parallel {{\bf{z}}}_{2}{\parallel }^{2})}\right)\)
where \({{\rm{cosh}}}^{-1}(z)={\rm{ln}}(z+\sqrt{{z}^{2}-1})\) is the inverse hyperbolic cosine function, which is monotonically increasing for \(z\ge 1\), and \(\parallel \cdot \parallel\) is the Euclidean norm. Notice that \({{\rm{cosh}}}^{-1}(1+z)={\rm{ln}}(1+z+\sqrt{{z}^{2}+2z})\), which approaches \(\sqrt{2z}\) as \(z\to 0\) and \({\rm{ln}}(2z)\) as \(z\to +\infty\). When both \({{\bf{z}}}_{1}\) and \({{\bf{z}}}_{2}\) are close to the origin, \(d({{\bf{z}}}_{1},{{\bf{z}}}_{2})\approx {{\rm{cosh}}}^{-1}(1+2\parallel {{\bf{z}}}_{1}-{{\bf{z}}}_{2}{\parallel }^{2})\approx 2\parallel {{\bf{z}}}_{1}-{{\bf{z}}}_{2}\parallel\). Therefore, the Poincaré ball model resembles Euclidean geometry near the center of the unit hyperball. The induced norm of a point \({\bf{z}}\in {\mathbb{P}}\) is

\(\parallel {\bf{z}}{\parallel }_{{\mathbb{P}}}=d({\bf{0}},{\bf{z}})={{\rm{cosh}}}^{-1}\left(1+2\frac{\parallel {\bf{z}}{\parallel }^{2}}{1-\parallel {\bf{z}}{\parallel }^{2}}\right)=2\,{{\rm{tanh}}}^{-1}(\parallel {\bf{z}}\parallel )\)
As \({\bf{z}}\) moves away from the origin and approaches the border (\(\parallel {\bf{z}}\parallel \to 1\)), the induced norm \(\parallel {\bf{z}}{\parallel }_{{\mathbb{P}}}\) grows exponentially. Hyperbolic geometry is therefore useful for representing data with an underlying, approximately hierarchical structure.
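A small helper makes the near-origin and near-border behavior of the Poincaré distance concrete (illustrative code using the standard Poincaré-ball distance formula):

```python
import numpy as np

def poincare_distance(z1, z2):
    """d(z1, z2) = arccosh(1 + 2 |z1 - z2|^2 / ((1 - |z1|^2)(1 - |z2|^2))),
    the standard Poincare-ball distance."""
    sq = np.sum((z1 - z2) ** 2)
    denom = (1.0 - np.sum(z1 ** 2)) * (1.0 - np.sum(z2 ** 2))
    return np.arccosh(1.0 + 2.0 * sq / denom)

# near the origin the geometry is almost Euclidean: d ~ 2 |z1 - z2|
a, b = np.array([0.01, 0.0]), np.array([0.0, 0.02])
d_near = poincare_distance(a, b)

# near the border, distances blow up even over tiny Euclidean gaps
d_far = poincare_distance(np.zeros(2), np.array([0.999, 0.0]))
print(d_near, 2 * np.linalg.norm(a - b), d_far)
```

This is the property exploited for trajectories: root cells near the center live in nearly Euclidean geometry, while differentiated cells near the border gain abundant room to spread out.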
The Lorentz model is a model of the hyperbolic space whose points satisfy \({{\mathbb{H}}}^{M}=\{{\bf{z}}\in {{\mathbb{R}}}^{M+1}\,|\,{z}_{0} > 0,\langle {\bf{z}},{\bf{z}}\rangle _{{\mathbb{H}}}=-1\}\), where \(\langle {\bf{z}},{{\bf{z}}}^{\prime}\rangle _{{\mathbb{H}}}=-{z}_{0}{z}_{0}^{\prime}+\mathop{\sum }\nolimits_{i=1}^{M}{z}_{i}{z}_{i}^{\prime}\) is the Lorentzian inner product (or Minkowski inner product when \({\bf{z}}\in {{\mathbb{R}}}^{4}\)). The special one-hot vector \({{\mathbf{\upmu }}}_{0}={(1,0,\ldots ,0)}^{T}\) is the origin of the hyperbolic space. The distance between two points of the Lorentz model is defined as:

\({d}_{{\mathbb{H}}}({\bf{z}},{{\bf{z}}}^{\prime})={{\rm{cosh}}}^{-1}\left(-\langle {\bf{z}},{{\bf{z}}}^{\prime}\rangle _{{\mathbb{H}}}\right)\)
The tangent space of \({{\mathbb{H}}}^{M}\) at a point \({\mathbf{\upmu }}\in {{\mathbb{H}}}^{M}\) is defined as \({{\mathscr{T}}}_{{\mathbf{\upmu }}}{{\mathbb{H}}}^{M}:=\{{\bf{z}}\,|\,\langle {\mathbf{\upmu }},{\bf{z}}\rangle _{{\mathbb{H}}}=0\}\), i.e., all the vectors based at \({\mathbf{\upmu }}\) that are orthogonal to \({\mathbf{\upmu }}\) under the Lorentzian inner product. A point \({({z}_{0},{z}_{1},\ldots ,{z}_{M})}^{T}\) in the Lorentz model can be conveniently mapped to the Poincaré ball^{21} for visualization:

\({({z}_{0},{z}_{1},\ldots ,{z}_{M})}^{T}\mapsto {\left(0,\frac{{z}_{1}}{1+{z}_{0}},\ldots ,\frac{{z}_{M}}{1+{z}_{0}}\right)}^{T}\)
We discard the first element as it is a constant of zero.
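The Lorentz-model quantities above can be sketched in a few lines. The example checks that the Lorentz-model geodesic distance from the origin agrees with the Poincaré-ball distance \(2\,{\tanh}^{-1}(\parallel \cdot \parallel )\) after the mapping (illustrative code, not scPhere's):

```python
import numpy as np

def lorentz_inner(z1, z2):
    """Lorentzian inner product: <z, z'>_H = -z0*z0' + sum_i zi*zi'."""
    return -z1[0] * z2[0] + np.dot(z1[1:], z2[1:])

def lorentz_distance(z1, z2):
    return np.arccosh(-lorentz_inner(z1, z2))

def to_poincare(z):
    """Map a Lorentz-model point to the Poincare ball: zi / (1 + z0)."""
    return z[1:] / (1 + z[0])

# a point at geodesic distance t from the origin mu0 = (1, 0)
t = 1.5
mu0 = np.array([1.0, 0.0])
z = np.array([np.cosh(t), np.sinh(t)])   # on the hyperboloid: <z, z>_H = -1
p = to_poincare(z)

print(lorentz_distance(mu0, z))          # distance in the Lorentz model
print(2 * np.arctanh(np.linalg.norm(p))) # same distance in the Poincare ball
```

The agreement of the two printed values reflects that the mapping is an isometry, so pseudotime (distance from the origin/center) can be read off in either model.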
We used wrapped normal priors and wrapped normal posteriors defined in the Lorentz model to embed cells into a hyperbolic space^{25,34,79}. A wrapped normal distribution in \({{\mathbb{H}}}^{M}\) is constructed by first defining a normal distribution on the tangent space \({{\mathscr{T}}}_{{{\mathbf{\upmu }}}_{0}}{{\mathbb{H}}}^{M}\) (a Euclidean subspace of \({{\mathbb{R}}}^{M+1}\)) at the origin \({{\mathbf{\upmu }}}_{0}={(1,0,\ldots ,0)}^{T}\) of the hyperbolic space. Samples from this normal distribution are parallel-transported to the desired locations and then projected onto the hyperbolic space^{25}.
We used a set of invertible functions to transform samples from a normal distribution \({\mathscr{N}}({\bf{z}}\,|\,{\bf{0}},{{\bf{I}}}_{M}{\boldsymbol{\sigma }})\) in \({{\mathbb{R}}}^{M}\) into samples from a wrapped normal distribution in \({{\mathbb{H}}}^{M}\) with mean \({\mathbf{\upmu }}\), where \({\boldsymbol{\sigma }}\in {{\mathbb{R}}}^{M}\) holds the standard deviations of components \({z}_{1}\) to \({z}_{M}\), and \({{\bf{I}}}_{M}\) is the identity matrix in \({{\mathbb{R}}}^{M}\)^{25,55}. First, let \({{\bf{z}}}_{0}={(0,{{\bf{z}}}_{0}^{{\prime} T})}^{T}\), which can be considered a sample vector from \({{\mathscr{T}}}_{{{\mathbf{\upmu }}}_{0}}{{\mathbb{H}}}^{M}\), where \({{\bf{z}}}_{0}^{\prime}\) is sampled from \({\mathscr{N}}({\bf{z}}\,|\,{\bf{0}},{{\bf{I}}}_{M}{\boldsymbol{\sigma }})\). Next, \({{\bf{z}}}_{0}\) is parallel-transported to a vector \({{\bf{z}}}_{1}\) in the tangent space \({{\mathscr{T}}}_{{\mathbf{\upmu }}}{{\mathbb{H}}}^{M}\) at \({\mathbf{\upmu }}\), in a parallel manner (i.e., \({{\bf{z}}}_{1}\) and \({{\bf{z}}}_{0}\) point in the same direction relative to the geodesic between \({{\mathbf{\upmu }}}_{0}\) and \({\mathbf{\upmu }}\)) and in a norm-preserving way (i.e., \(\langle {{\bf{z}}}_{0},{{\bf{z}}}_{0}\rangle _{{\mathbb{H}}}=\langle {{\bf{z}}}_{1},{{\bf{z}}}_{1}\rangle _{{\mathbb{H}}}\))^{25,80}:

\({{\bf{z}}}_{1}={\rm{PT}}_{{{\mathbf{\upmu }}}_{0}\to {\mathbf{\upmu }}}({{\bf{z}}}_{0})={{\bf{z}}}_{0}+\frac{\langle {\mathbf{\upmu }}-\alpha {{\mathbf{\upmu }}}_{0},{{\bf{z}}}_{0}\rangle _{{\mathbb{H}}}}{\alpha +1}({{\mathbf{\upmu }}}_{0}+{\mathbf{\upmu }})\)
with \(\alpha =-\langle {{\mathbf{\upmu }}}_{0},{\mathbf{\upmu }}\rangle _{{\mathbb{H}}}\).
Finally, the exponential map^{24,25,79} projects the vector \({{\bf{z}}}_{1}\) in the tangent space \({{\mathscr{T}}}_{{\mathbf{\upmu }}}{{\mathbb{H}}}^{M}\) back to the hyperbolic space:

\({\bf{z}}={{\exp }}_{{\mathbf{\upmu }}}({{\bf{z}}}_{1})={\rm{cosh}}(\parallel {{\bf{z}}}_{1}{\parallel }_{{\mathbb{H}}}){\mathbf{\upmu }}+{\rm{sinh}}(\parallel {{\bf{z}}}_{1}{\parallel }_{{\mathbb{H}}})\frac{{{\bf{z}}}_{1}}{\parallel {{\bf{z}}}_{1}{\parallel }_{{\mathbb{H}}}}\)
such that the vector norm is preserved: \(\parallel {{\bf{z}}}_{1}{\parallel }_{{\mathbb{H}}}=\sqrt{\langle {{\bf{z}}}_{1},{{\bf{z}}}_{1}{\rangle }_{{\mathbb{H}}}}={d}_{{\mathbb{H}}}({\mathbf{\upmu }},{\bf{z}})\).
The log-likelihood after the invertible transformations can be calculated by the change-of-variables formula:

\({\log}\,p({\bf{z}})={\log}\,p({{\bf{z}}}_{0})-(M-1)\,{\log}\left(\frac{{\rm{sinh}}(\parallel {{\bf{z}}}_{1}{\parallel }_{{\mathbb{H}}})}{\parallel {{\bf{z}}}_{1}{\parallel }_{{\mathbb{H}}}}\right)\)
The encoder outputs a vector \({\bf{h}}\) in the tangent space at the origin (\({{\mathscr{T}}}_{{{\mathbf{\upmu }}}_{0}}{{\mathbb{H}}}^{M}\), so \(\parallel {\bf{h}}{\parallel }_{{\mathbb{H}}}=\parallel {\bf{h}}{\parallel }_{2}\); the first, zero element of \({\bf{h}}\) is omitted from the encoder output), which is mapped to \({{\mathbb{H}}}^{M}\) using the exponential map to get \({\mathbf{\upmu }}\):

\({\mathbf{\upmu }}={{\exp }}_{{{\mathbf{\upmu }}}_{0}}({\bf{h}})={\left({\rm{cosh}}(\parallel {\bf{h}}{\parallel }_{2}),\,{\rm{sinh}}(\parallel {\bf{h}}{\parallel }_{2})\frac{{{\bf{h}}}^{T}}{\parallel {\bf{h}}{\parallel }_{2}}\right)}^{T}\)
Given a sample \({\bf{z}}\) from the wrapped normal distribution, we need to evaluate its density \({\rm{log}}\,p\left({\bf{z}}\right)\) to calculate the KL-divergence term of the ELBO. We can use the inverse exponential map and the inverse parallel transport to compute the corresponding \({{\bf{z}}}_{1}\) and \({{\bf{z}}}_{0}\), respectively, for evaluating the density:
\[{{\bf{z}}}_{1}={{\rm{exp}}}_{{\mathbf{\upmu }}}^{-1}({\bf{z}})=\frac{{\rm{arccosh}}(\beta )}{\sqrt{{\beta }^{2}-1}}({\bf{z}}-\beta {\mathbf{\upmu }}),\quad {{\bf{z}}}_{0}={\rm{PT}}_{{\mathbf{\upmu }}\to {{\mathbf{\upmu }}}_{0}}({{\bf{z}}}_{1})={{\bf{z}}}_{1}+\frac{\langle {{\mathbf{\upmu }}}_{0}-\alpha {\mathbf{\upmu }},{{\bf{z}}}_{1}{\rangle }_{{\mathbb{H}}}}{\alpha +1}({\mathbf{\upmu }}+{{\mathbf{\upmu }}}_{0}),\]
where \(\beta =\langle {\mathbf{\upmu }},{\bf{z}}{\rangle }_{{\mathbb{H}}}\) and \(\alpha =\langle {{\mathbf{\upmu }}}_{0},{\mathbf{\upmu }}{\rangle }_{{\mathbb{H}}}\). We now have all the ingredients to compute Eq. (1) for each training point.
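As a concrete illustration, the sampling path described above (draw a tangent vector at the origin, parallel-transport it to \({\mathbf{\upmu }}\), apply the exponential map) can be sketched in NumPy. This sketch follows the Minkowski-product convention of Nagano et al.^{25}, so signs may differ from the \(\langle \cdot ,\cdot \rangle_{{\mathbb{H}}}\) notation used here; it is a minimal illustration, not the scPhere implementation.

```python
import numpy as np

def lorentz_dot(x, y):
    # Minkowski inner product: -x0*y0 + sum_i xi*yi (Nagano et al. convention)
    return -x[0] * y[0] + x[1:] @ y[1:]

def sample_wrapped_normal(mu, sigma, rng):
    """Draw one sample from a wrapped normal on the hyperboloid with mean mu."""
    M = mu.shape[0] - 1
    mu0 = np.zeros(M + 1)
    mu0[0] = 1.0                                 # origin of the Lorentz model
    v = rng.normal(scale=sigma, size=M)          # z0' ~ N(0, I_M * sigma)
    z0 = np.concatenate([[0.0], v])              # tangent vector at the origin
    alpha = -lorentz_dot(mu0, mu)
    # parallel transport z0 from the origin to the tangent space at mu
    z1 = z0 + lorentz_dot(mu - alpha * mu0, z0) / (alpha + 1.0) * (mu0 + mu)
    # exponential map: project z1 onto the hyperboloid
    r = np.sqrt(lorentz_dot(z1, z1))
    return np.cosh(r) * mu + np.sinh(r) * z1 / r

rng = np.random.default_rng(0)
z = sample_wrapped_normal(np.array([1.0, 0.0, 0.0]), np.ones(2), rng)
print(lorentz_dot(z, z))  # ≈ -1: the sample stays on the hyperboloid
```

Because parallel transport is an isometry and the exponential map sends tangent vectors onto the manifold, every sample satisfies \(\langle {\bf{z}},{\bf{z}}\rangle =-1\) up to numerical precision.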
Model structure
As single-cell data are sparse, with typically >90% of genes showing zero counts in each cell, we used softmax as the activation function to estimate the means of the negative-binomial distributions and to help the decoders generate sparse outputs. The softmax function outputs a vector of positive numbers that sum to one; this vector is multiplied by the size of a cell (the sum of UMI counts for that cell) to obtain the means of the negative-binomial distribution for each gene in that cell. We used the exponential linear unit (ELU)^{81} activation function for hidden layers, as it has been shown to improve the convergence of stochastic gradient optimization.
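A minimal NumPy sketch of this output layer (the function and variable names are ours, not scPhere's API):

```python
import numpy as np

def nb_means_from_decoder(logits, counts):
    """Turn raw decoder outputs into negative-binomial means per gene.

    logits: (cells, genes) unnormalized decoder outputs (illustrative);
    counts: (cells, genes) raw UMI count matrix, used for library sizes.
    """
    # softmax across genes: positive values that sum to one for each cell
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)
    # multiply by each cell's size (total UMI count) to get the NB means
    size = counts.sum(axis=1, keepdims=True)
    return p * size
```

By construction, the per-cell means sum to that cell's library size, and the softmax can place near-zero mass on most genes, matching the sparsity of the data.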
For all experiments, we used a three-layered encoder network (128–64–32) and a two-layered decoder network (32–128). The dimensionality of the stochastic layer z was typically two, for visualization purposes. When comparing scPhere with different latent spaces, we kept all other factors the same. We used the Adam stochastic optimization algorithm^{82} with a learning rate of 0.001. For datasets with <10,000 cells, we trained models for 2000 epochs; for datasets with >10,000 but <100,000 cells, we trained models for 500 epochs; and for the large immune cell dataset with more than 2,000,000 cells, we trained models for 250 epochs. Using the UC epithelial cells as an example, we provide the average ELBO changes over training minibatches for different latent dimensionalities. For scPhere models with different latent spaces, training was quite stable and converged rapidly, at least for the configurations we used (Supplementary Fig. 20b). For scPhere with hyperbolic latent spaces, training converged somewhat more slowly as we increased the dimensionality of the latent space, compared to training with other latent spaces.
In our current implementation, we did not use early stopping but trained scPhere for a fixed number of epochs. Larger datasets may require fewer training epochs than smaller ones (e.g., datasets with ~1000 cells). Training time also depended on the number of genes used. For example, on the IBD immune cells, training time grew linearly with the number of minibatches, taking only 2.45 min to train scPhere for 16,450 minibatches (Supplementary Fig. 4a). The equivalent number of cells can be estimated from the minibatch size (128) and the number of training minibatches (Supplementary Fig. 4b). Importantly, we obtained good embeddings of the 210,614 immune cells even when we trained the model for only ten epochs (16,450 minibatches, 2.45 min, Supplementary Fig. 4c–i). All experiments were run on a Mac desktop computer with 32 GB of RAM and a 4.2 GHz four-core Intel i7 processor with 8 MB cache; no GPUs were used.
Parameter setting for other methods used in comparisons
For t-SNE, we followed the previous approach optimized for visualizing scRNA-seq data^{18}, i.e., using PCA initialization, a high learning rate, and multi-scale similarity kernels. We also used the FIt-SNE package^{83}, as previously described^{18}.
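Comparable settings (PCA initialization and a learning rate that scales with the number of cells, after Kobak & Berens^{18}) can be approximated in Python with scikit-learn's TSNE; this is a stand-in sketch, not the FIt-SNE package used here, and the multi-scale similarity kernels are omitted:

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_embed(pcs, random_state=0):
    """Embed cells (rows) from their principal components."""
    n = pcs.shape[0]
    tsne = TSNE(
        n_components=2,
        init="pca",                          # PCA initialization helps preserve global structure
        learning_rate=max(n / 12.0, 200.0),  # high learning rate, scaled with dataset size
        random_state=random_state,
    )
    return tsne.fit_transform(pcs)

pcs = np.random.default_rng(0).normal(size=(120, 10))  # placeholder for the 50 PCs used here
emb = tsne_embed(pcs)
```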
For UMAP, we used the Seurat UMAP wrapper^{5}, with the “min.dist” parameter set to 0.3 and the “spread” parameter set to 1.
For PHATE^{37}, we followed its tutorial and used the default parameter settings.
For these three methods, we used 50 principal components as inputs by default. Because PHATE had very long run times on large datasets, for the IBD immune cells we ran PHATE only with 2D latent spaces.
For the three batch-correction methods, Harmony^{30}, Seurat3 CCA^{5}, and LIGER^{29}, we followed their tutorials and used the default parameter settings. Because both Seurat3 CCA and LIGER can handle only one batch vector, we used patient, the major batch factor for the UC data. Seurat3 CCA encountered scalability/stability issues on the immune cells (>200,000 cells and 30 batches (patients)) and failed after running for >90 h; we therefore removed Seurat3 CCA from the immune cell comparison (Supplementary Fig. 8).
Quantifying global, hierarchical structure preservation
We quantified the preservation of global, hierarchical structure for each embedding method with a “global kNN accuracy” metric. Each dataset was first condensed so that each point is the center of a cluster; kNN classifiers (leave-one-out cross-validation, with k = 3 and 5 for the RGC and HCL datasets, respectively) were then trained on the condensed data and used to classify each cluster, represented by its center, into the major cell types (groups).
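The metric can be sketched with scikit-learn as follows (the function and argument names are ours):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def global_knn_accuracy(embedding, cluster_labels, group_of_cluster, k=3):
    """Condense the embedding to one center per cluster, then score
    leave-one-out kNN classification of centers into major cell groups."""
    clusters = np.unique(cluster_labels)
    centers = np.stack([embedding[cluster_labels == c].mean(axis=0)
                        for c in clusters])
    groups = np.array([group_of_cluster[c] for c in clusters])
    knn = KNeighborsClassifier(n_neighbors=k)
    return cross_val_score(knn, centers, groups, cv=LeaveOneOut()).mean()
```

An embedding that keeps clusters of the same major group near each other in the latent space scores close to 1, while one that scatters them scores lower.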
To quantify cell crowding, we used silhouette values. When cells are more crowded, within-cluster and between-cluster distances become more similar to each other, leading to smaller silhouette values. For the HCL dataset with 599,926 cells, silhouette values were calculated from 50 repeated runs, each with 20,000 randomly sampled cells as input.
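A sketch of that computation with scikit-learn, with the subsample size and number of runs exposed as parameters:

```python
import numpy as np
from sklearn.metrics import silhouette_score

def subsampled_silhouette(embedding, labels, n_cells=20_000, n_runs=50, seed=0):
    """Average silhouette over repeated random subsamples of the cells."""
    rng = np.random.default_rng(seed)
    n = embedding.shape[0]
    scores = []
    for _ in range(n_runs):
        # sample cells without replacement and score the subsample
        idx = rng.choice(n, size=min(n_cells, n), replace=False)
        scores.append(silhouette_score(embedding[idx], labels[idx]))
    return float(np.mean(scores))
```

Subsampling keeps the pairwise-distance computation tractable for datasets with hundreds of thousands of cells.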
The embeddings were visualized on a 3D sphere using the rgl package^{84} in R, with the interactive 3D scatter plots saved as WebGL files that can be opened in a browser. The rgl package uses OpenGL as the rendering backend and can rapidly and interactively render 3D scatter plots with millions of cells in a browser.
To learn scPhere models that are invariant to batch vectors and can map cells from completely new batches, we train the scPhere encoder (the part of the model used to map new data after training) to map a gene expression vector directly to its low-dimensional representation, without using the batch vector as an input. The batch vector is used only in the decoder, which during training takes both the latent representation of a cell and its batch vector as input and outputs the recovered gene expression vector. We call this modality of scPhere, with no batch vectors for the encoder, “batch-invariant” scPhere, as it learns latent representations that are invariant to the batch vectors.
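The asymmetry between encoder and decoder inputs can be sketched as below; the weights are random placeholders rather than trained scPhere parameters, and each network is reduced to a single layer for clarity:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_batches, latent_dim = 100, 30, 2

# illustrative weights only; scPhere learns these by optimizing the ELBO
W_enc = 0.01 * rng.normal(size=(n_genes, latent_dim))
W_dec = 0.01 * rng.normal(size=(latent_dim + n_batches, n_genes))

def encode(x):
    # batch-invariant encoder: the batch vector is NOT an input,
    # so cells from entirely new batches can be mapped after training
    return np.tanh(x @ W_enc)

def decode(z, batch_onehot):
    # only the decoder conditions on the batch vector, letting it
    # explain away batch effects during training
    return np.concatenate([z, batch_onehot], axis=-1) @ W_dec

x = rng.poisson(1.0, size=(5, n_genes)).astype(float)
z = encode(np.log1p(x))                              # embedding from expression alone
b = np.eye(n_batches)[rng.integers(0, n_batches, 5)]
recon = decode(z, b)                                 # reconstruction uses z and batch
```

Because `encode` never sees the batch label, changing a cell's batch changes the reconstruction but not its position in the latent space.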
The component-collapse problem in VAEs
We examined whether scPhere with a high-dimensional latent variable (z) suffers from the “component-collapse” problem^{33}, in which the generative model (decoder) simply ignores some components of the latent variable, such that the posteriors of these components match the prior.
For Euclidean latent spaces, we observed component collapse with both 10D and 20D latent spaces, where the means of the absolute values of some components were close to zero, the mean of the standard normal distribution (Supplementary Fig. 10a). The effective numbers of components with 10D and 20D latent spaces were therefore only six and seven, respectively.
For hyperspherical latent spaces, because the prior has no center and all components share the same concentration parameter, we did not observe component collapse, so we can potentially obtain a larger number of effective latent components than with Euclidean latent spaces. However, with 20D hyperspherical latent spaces, the estimated vMF concentration parameters were lower than with latent variables on 5-spheres (Supplementary Fig. 10b), suggesting higher uncertainty in the embeddings on a 20-sphere. Moreover, some components of the latent representations became highly correlated when we embedded cells on a 20-sphere (Supplementary Fig. 10c). With hyperbolic latent spaces, we also observed collinear components, even with 5D latent spaces.
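Such diagnostics can be computed directly from the matrix of posterior means; the 0.1 magnitude cutoff and 0.9 correlation threshold below are illustrative choices, not values from this study:

```python
import numpy as np

def collapsed_components(post_means, threshold=0.1):
    """Indices of latent components whose posterior means stay near zero
    (the prior mean) across cells, i.e., components the decoder likely ignores."""
    return np.flatnonzero(np.abs(post_means).mean(axis=0) < threshold)

def correlated_pairs(post_means, r=0.9):
    """Pairs of latent components that are nearly collinear across cells."""
    c = np.corrcoef(post_means, rowvar=False)
    iu = zip(*np.triu_indices_from(c, k=1))
    return [(int(i), int(j)) for i, j in iu if abs(c[i, j]) > r]
```

Applied to a (cells × dimensions) matrix of posterior means, the first function flags collapsed components and the second flags redundant, collinear ones.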
Datasets
The cord blood mononuclear cell dataset^{27} consists of 8617 cells, including 8009 cord blood mononuclear cells and 608 mouse 3T3 fibroblasts, produced with the CITE-seq protocol^{38} on the 10× Chromium (v2) platform^{85}. We used only the 2293 CD14^{+} erythroid cells and the first 10 erythrocytes in the dataset. Following the Seurat^{5} tutorial (https://satijalab.org/seurat/v3.0/multimodal_vignette.html), we used the 2000 highly variable genes in this study.
Human splenic NK cells were from a study profiling human and mouse splenic and blood NK cells^{41}, profiled by 10× Chromium (v2). We used the 1755 human splenic NK cells from donor one. We selected 2724 highly variable genes and partitioned the 1755 cells into four groups, labeled hNK_Sp1, hNK_Sp2, hNK_Sp3, and hNK_Sp4, as in the original study^{41}.
Human lung cells were from asthma patients and healthy controls^{39}, profiled by either 10× Chromium or Drop-seq^{86}. We used the 3314 cells from a donor prepared with the Drop-seq protocol, available from GEO: GSE130148.
The mouse white adipose tissue stromal cell dataset contains 1378 cells from mouse white adipose tissue^{40} profiled by 10× Chromium (v2). In the original study, the authors only analyzed 1045 tdTomato mGFP+ cells and identified adipocyte precursor cells (APC), fibroinflammatory progenitors (FIP), committed preadipocytes, and mesothelial cells. We analyzed all the cells and further identified pericytes, macrophages, and two groups of doublets.
The retinal ganglion cell atlas dataset consists of 35,699 mouse retinal ganglion cells profiled by 10× Chromium (v2)^{42}. The original analysis identified 45 clusters, one of which consisted of two cell types.
We used 599,926 human cell landscape cells^{43} from human fetal or adult tissues profiled on the Microwell-seq platform^{87}. These cells were partitioned into 102 clusters, and 77 of the 102 clusters can be grouped into six major cell groups: fetal stromal cells, fetal epithelial cells, adult endothelial cells, endothelial cells, adult stromal cells, and immune cells.
The cells of the colon mucosa were from 68 biopsies collected from 18 ulcerative colitis patients and 12 healthy individuals^{44}, profiled by 10× Chromium (either v1 or v2). After filtering likely low-quality cells (clusters), we obtained a total of 301,749 cells (26,678 stromal cells and glia, 64,457 epithelial cells, and 210,614 immune cells, as annotated in the original study^{44}). The cells span 12 stromal cell types/states, 12 epithelial cell types/states, and 23 immune cell types/states, identified by unsupervised clustering and manual annotation^{44}. We used Seurat to select 1307, 1361, and 1068 highly variable genes for the three major cell types, respectively, for the scPhere analyses.
The C. elegans embryonic cell dataset consists of 86,024 cells^{51} profiled using 10× Chromium (v2). The embryo times were partitioned into 12 time bins, and 63.5% of the cells were assigned to 36 major cell types based on the annotation from GEO: GSE126954. We treated cells with embryo time in the range 100–130 as root cells, because cells with embryo time <100 were mostly germline cells that were also observed at embryo time >650. As the root cells accounted for only ~0.5% of all cells, it was hard for them to be mapped to the center of the Lorentz model. We therefore modified the scPhere objective function by adding a term penalizing the distance between the mean position of the root-cell embeddings and the origin of the Lorentz model.
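A sketch of such an added penalty term on the Lorentz model; projecting the Euclidean mean of the root-cell embeddings back onto the hyperboloid before measuring its distance to the origin is our simplification, and the weight is an illustrative hyperparameter:

```python
import numpy as np

def root_cell_penalty(z, is_root, weight=1.0):
    """Penalize the distance from the mean root-cell embedding to the
    origin mu0 = (1, 0, ..., 0) of the Lorentz model.

    z: (cells, M+1) embeddings on the hyperboloid; is_root: boolean mask.
    """
    m = z[is_root].mean(axis=0)
    # project the mean back onto the hyperboloid: the first coordinate
    # is determined by the remaining ones
    m[0] = np.sqrt(1.0 + (m[1:] ** 2).sum())
    # the distance of a hyperboloid point to the origin is arccosh of
    # its first coordinate
    return weight * np.arccosh(m[0])
```

Adding this term to the objective pulls the root cells toward the center of the Lorentz model, so the developmental trajectory radiates outward from the origin.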
For the zebrafish embryonic cells, we used the 5226 cells at 50% epiboly^{48}, profiled using Drop-seq^{86}. These cells came from four batches; we used the 1406 cells from one batch to train scPhere, and the trained model was used to map the remaining 3820 cells from the other three batches. We used Seurat to select 2000 variable genes, added three genes with annotated spatial locations that were not in the 2000 variable-gene list, and used the 2003 genes for analysis. To encourage cells to map to their spatial locations, we used 11 landmark genes expressed in the ventral axis: cdx4, eve1; animal ventral: bambia; dorsal: gsc, chd; animal dorsal: foxd3; marginal: osr1, lft2, lhx1a, wnt8a; and the gene ta, which is expressed in the ventral, dorsal, and marginal regions. These landmark genes were used to calculate a penalty term added to the scPhere objective function during training.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
We used publicly available datasets in this study (GEO: GSE126954^{38}, GSE119562^{41}, GSE130148^{39}, GSE111588^{40}, GSE137400^{42}, GSE134355^{43}, GSE126954^{51}, GSE106587^{48}; Single Cell Portal: SCP259). To make the results presented in this study reproducible, all processed data are available in the Single Cell Portal (SCP551).
Code availability
The scPhere software package, implemented in TensorFlow, is freely available from https://github.com/klarmancellobservatory/scPhere, and as Supplementary Software 1 accompanying this manuscript.
References
Stegle, O., Teichmann, S. A. & Marioni, J. C. Computational and analytical challenges in singlecell transcriptomics. Nat. Rev. Genet. 16, 133–145 (2015).
Wagner, A., Regev, A. & Yosef, N. Revealing the vectors of cellular identity with singlecell genomics. Nat. Biotechnol. 34, 1145–1160 (2016).
Regev, A. et al. Science forum: the human cell atlas. eLife 6, e27041 (2017).
Luecken, M. D. & Theis, F. J. Current best practices in singlecell RNAseq analysis: a tutorial. Mol. Syst. Biol. 15, e8746 (2019).
Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating singlecell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
Ching, T. et al. Opportunities and obstacles for deep learning in biology and medicine. J. R. Soc. Interface 15, 20170387 (2018).
Kingma, D. P. & Welling, M. Autoencoding variational Bayes. in International Conference on Learning Representations (ICLR, 2014).
Rezende, D. J., Mohamed, S. & Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. in Proceedings of the 31st International Conference on Machine Learning (eds Xing, E. P. & Jebara, T.) Vol. 32, 1278–1286 (PMLR, 2014).
Kingma, D. P., Mohamed, S., Rezende, D. J. & Welling, M. Semisupervised learning with deep generative models. in Advances in Neural Information Processing Systems (eds Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. & Weinberger, K. Q.) 3581–3589 (Curran Associates, Inc., 2014).
Ding, J., Condon, A. & Shah, S. P. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat. Commun. 9, 2002 (2018).
Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Singlecell RNAseq denoising using a deep count autoencoder. Nat. Commun. 10, 390 (2019).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for singlecell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Wang, D. & Gu, J. VASC: dimension reduction and visualization of singlecell RNAseq data by deep variational autoencoder. Genomics Proteom. Bioinforma. 16, 320–331 (2018).
Lotfollahi, M., Wolf, F. A. & Theis, F. J. scGen predicts singlecell perturbation responses. Nat. Methods 16, 715–721 (2019).
Grønbech, C. H. et al. scVAE: variational autoencoders for singlecell gene expression data. Bioinformatics 36, 4415–4422 (2020).
van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Amir, E. D. et al. viSNE enables visualization of high dimensional singlecell data and reveals phenotypic heterogeneity of leukemia. Nat. Biotechnol. 31, 545–552 (2013).
Kobak, D. & Berens, P. The art of using t-SNE for single-cell transcriptomics. Nat. Commun. 10, 1–14 (2019).
Haghverdi, L., Lun, A. T., Morgan, M. D. & Marioni, J. C. Batch effects in singlecell RNAsequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
Bendall, S. C. et al. Singlecell trajectory detection uncovers progression and regulatory coordination in human B cell development. Cell 157, 714–725 (2014).
Kiselev, V. Y., Yiu, A. & Hemberg, M. scmap: projection of singlecell RNAseq data across data sets. Nat. Methods 15, 359–362 (2018).
Cooley, S. M., Hamilton, T., Deeds, E. J. & Ray, J. C. J. A novel metric reveals previously unrecognized distortion in dimensionality reduction of scRNASeq data. Preprint at https://www.biorxiv.org/content/10.1101/689851v1 (2019).
Davidson, T. R., Falorsi, L., De Cao, N., Kipf, T. & Tomczak, J. M. Hyperspherical variational autoencoders. in Conference on Uncertainty in Artificial Intelligence (eds Globerson, A. & Silva, R.) 856–865 (AUAI Press Corvallis, 2018).
Nickel, M. & Kiela, D. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. in International Conference Machine Learning. (eds Jennifer, D. & Andreas, K.) Vol. 80, 3779–3788 (PMLR, 2018).
Nagano, Y., Yamaguchi, S., Fujita, Y. & Koyama, M. A wrapped normal distribution on hyperbolic space for gradient-based learning. in International Conference on Machine Learning (eds Chaudhuri, K. & Salakhutdinov, R.) 4693–4702 (PMLR, 2019).
Klimovskaia, A., LopezPaz, D., Bottou, L. & Nickel, M. Poincaré maps for analyzing complex hierarchies in singlecell data. Nat. Commun. 11, 2966 (2020).
Amodio, M. et al. Exploring singlecell data with deep multitasking neural networks. Nat. Methods 16, 1139–1145 (2019).
Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous singlecell transcriptomes using Scanorama. Nat. Biotechnol. 37, 685–691 (2019).
Welch, J. D. et al. Singlecell multiomic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887.e17 (2019).
Korsunsky, I. et al. Fast, sensitive and accurate integration of singlecell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
Barkas, N. et al. Joint analysis of heterogeneous singlecell RNAseq dataset collections. Nat. Methods 16, 695–698 (2019).
Guu, K., Hashimoto, T. B., Oren, Y. & Liang, P. Generating sentences by editing prototypes. Trans. Assoc. Comput. Linguist. 6, 437–450 (2018).
Xu, J. & Durrett, G. Spherical latent spaces for stable variational autoencoders. in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (eds Riloff, E., Chiang, D., Hockenmaier, J. & Tsujii, J.) 4503–4513 (Association for Computational Linguistics, 2018).
Mathieu, E., Le Lan, C., Maddison, C. J., Tomioka, R. & Teh, Y. W. Continuous hierarchical representations with Poincaré Variational AutoEncoders. in Advances in Neural Information Processing Systems (eds Wallach, H. et al.) Vol. 32, 12544–12555 (Curran Associates, Inc., 2019).
Šavrič, B., Patterson, T. & Jenny, B. The equal earth map projection. Int. J. Geogr. Inf. Sci. 33, 454–465 (2019).
McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at https://arxiv.org/abs/1802.03426 (2018).
Moon, K. R. et al. Visualizing structure and transitions in highdimensional biological data. Nat. Biotechnol. 37, 1482–1492 (2019).
Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865–868 (2017).
Braga, F. A. V. et al. A cellular census of human lungs identifies novel cell states in health and in asthma. Nat. Med. 25, 1153–1163 (2019).
Hepler, C. et al. Identification of functionally distinct fibroinflammatory and adipogenic stromal subpopulations in visceral adipose tissue of adult mice. eLife 7, e39636 (2018).
Crinier, A. et al. Highdimensional singlecell analysis identifies organspecific signatures and conserved NK cell subsets in humans and mice. Immunity 49, 971–986 (2018).
Tran, N. M. et al. Singlecell profiles of retinal ganglion cells differing in resilience to injury reveal neuroprotective genes. Neuron 104, 1039–1055.e12 (2019).
Han, X. et al. Construction of a human cell landscape at singlecell level. Nature 581, 303–309 (2020).
Smillie, C. S. et al. Intra- and inter-cellular rewiring of the human colon during ulcerative colitis. Cell 178, 714–730 (2019).
Blondel, V. D., Guillaume, J.L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. Theory Exp. 2008, P10008 (2008).
Levine, J. H. et al. Datadriven phenotypic dissection of AML reveals progenitorlike cells that correlate with prognosis. Cell 162, 184–197 (2015).
Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Spatial reconstruction of singlecell gene expression data. Nat. Biotechnol. 33, 495–502 (2015).
Farrell, J. A. et al. Singlecell reconstruction of developmental trajectories during zebrafish embryogenesis. Science 360, eaar3131 (2018).
Nickel, M. & Kiela, D. Poincaré embeddings for learning hierarchical representations. Adv. Neural Inf. Processing Syst. (eds Guyon, I. et al.) Vol. 30, 6341–6350 (Curran Associates, Inc., 2017).
Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381 (2014).
Packer, J. S. et al. A lineageresolved molecular atlas of C. elegans embryogenesis at singlecell resolution. Science 365, 6459 (2019).
Cao, Z.J., Wei, L., Lu, S., Yang, D.C. & Gao, G. Searching largescale scRNAseq databases via unbiased cell embedding with Cell BLAST. Nat. Commun. 11, 3458 (2020).
RozenblattRosen, O. et al. The human tumor atlas network: charting tumor transitions across space and time at singlecell resolution. Cell 181, 236–249 (2020).
Hu, Q. & Greene, C. S. Parameter tuning is a key part of dimensionality reduction via deep variational autoencoders for single cell RNA transcriptomics. in PSB (eds Altman, R. B. et al.) 362–373 (World Scientific, 2019).
Rezende, D. J. & Mohamed, S. Variational inference with normalizing flows. in Proceedings of the 32nd International Conference on Machine Learning (eds Bach, F. & Blei, D.) Vol. 37, 1530–1538 (PMLR, 2015).
Nitzan, M., Karaiskos, N., Friedman, N. & Rajewsky, N. Gene expression cartography. Nature 576, 132–137 (2019).
Xu, C. et al. Probabilistic harmonization and annotation of singlecell transcriptomics data with deep generative models. Mol. Syst. Biol. 17, e9620 (2021).
Zhang, A. W. et al. Probabilistic cell type assignment of singlecell transcriptomic data reveals spatiotemporal microenvironment dynamics in human cancers. Nat. Methods 16, 1007–1015 (2019).
Ding, J. et al. Systematic comparison of singlecell and singlenucleus RNAsequencing methods. Nat. Biotechnol. 38, 737–746 (2020).
Fleming, S. J., Marioni, J. C. & Babadi, M. CellBender removebackground: a deep generative model for unsupervised removal of background noise from scRNAseq datasets. Preprint at bioRxiv https://doi.org/10.1101/791699 (2019).
Rodriques, S. G. et al. Slideseq: A scalable technology for measuring genomewide expression at high spatial resolution. Science 363, 1463–1467 (2019).
Vickovic, S. et al. Highdefinition spatial transcriptomics for in situ tissue profiling. Nat. Methods 16, 987–989 (2019).
Satpathy, A. T. et al. Massively parallel singlecell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat. Biotechnol. 37, 925–936 (2019).
Lareau, C. A. et al. Dropletbased combinatorial indexing for massivescale singlecell chromatin accessibility. Nat. Biotechnol. 37, 916–924 (2019).
Rey, L. A. P., Menkovski, V. & Portegies, J. W. Diffusion variational autoencoders. in Proceedings of the TwentyNinth International Joint Conference on Artificial Intelligence (ed. Bessiere, C.) 2704–2710 (International Joint Conferences on Artificial Intelligence Organization, 2019).
Skopek, O., Ganea, O.E. & Bécigneul, G. Mixedcurvature variational autoencoders. in International Conference on Learning Representations (2020).
Pierson, E. & Yau, C. ZIFA: Dimensionality reduction for zeroinflated singlecell gene expression analysis. Genome Biol. 16, 241 (2015).
Townes, F. W., Hicks, S. C., Aryee, M. J. & Irizarry, R. A. Feature selection and dimension reduction for singlecell RNASeq based on a multinomial model. Genome Biol. 20, 295 (2019).
Vieth, B., Ziegenhain, C., Parekh, S., Enard, W. & Hellmann, I. powsimR: power analysis for bulk and single cell RNAseq experiments. Bioinformatics 33, 3486–3488 (2017).
Svensson, V. Droplet scRNAseq is not zeroinflated. Nat. Biotechnol. 38, 147–150 (2020).
Hafemeister, C. & Satija, R. Normalization and variance stabilization of singlecell RNAseq data using regularized negative binomial regression. Genome Biol. 20, 296 (2019).
Kullback, S. & Leibler, R. A. On information and sufficiency. Ann. Math. Stat. 22, 79–86 (1951).
Mardia, K. V. & ElAtoum, S. Bayesian inference for the von MisesFisher distribution. Biometrika 63, 203–206 (1976).
Straub, J., Campbell, T., How, J. P. & Fisher, J. W. Smallvariance nonparametric clustering on the hypersphere. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 334–342 (IEEE Computer Society, 2015). https://dblp.org/db/conf/cvpr/cvpr2015.html.
Abramowitz, M. & Stegun, I. A. Handbook of Mathematical Functions: With Formulas, Graphs, and Mathematical Tables. Vol. 55 (Courier Corporation, 1965).
Wood, A. T. Simulation of the von Mises Fisher distribution. Commun. Stat. Simul. Comput. 23, 157–164 (1994).
Ulrich, G. Computer generation of distributions on the Msphere. J. R. Stat. Soc. Ser. C. Appl. Stat. 33, 158–163 (1984).
Hornik, K. & Grün, B. movMF: an R package for fitting mixtures of von MisesFisher distributions. J. Stat. Softw. 58, 1–31 (2014).
Grattarola, D., Livi, L. & Alippi, C. Adversarial autoencoders with constantcurvature latent manifolds. Appl. Soft Comput. 81, 105511 (2019).
Bergmann, R., Fitschen, J. H., Persch, J. & Steidl, G. Priors with coupled first and second order differences for manifoldvalued image processing. J. Math. Imaging Vis. 60, 1459–1481 (2018).
Clevert, D.A., Unterthiner, T. & Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). in International Conference on Learning Representations (2016).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. in International Conference on Learning Representations (2015).
Linderman, G. C., Rachh, M., Hoskins, J. G., Steinerberger, S. & Kluger, Y. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nat. Methods 16, 243–245 (2019).
Adler, D., Nenadic, O. & Zucchini, W. Rgl: a rlibrary for 3d visualization with OpenGL. in Proceedings of the 35th Symposium of the Interface: Computing Science and Statistics, Salt Lake City Vol. 35 (2003). http://rgl.neoscientists.org/arc/doc/RGL_INTERFACE03.pdf.
Zheng, G. X. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
Macosko, E. Z. et al. Highly parallel genomewide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
Han, X. et al. Mapping the mouse cell atlas by microwellSeq. Cell 172, 1091–1107 (2018).
Acknowledgements
We thank Jennifer Rood for helpful comments, Leslie Gaffney for help with figure preparation, Inbal Benhar and Karthik Shekhar for the RGC data analysis, and Jeffrey A. Farrell for mapping zebrafish embryonic cells. This work was supported by the Klarman Cell Observatory, HHMI, the Food Allergy Science Initiative, the Manton Foundation, the NIH BRAIN Initiative (1U19 MH114821), and an NIH/National Institute of Diabetes and Digestive and Kidney Diseases grant (1RC2DK114784).
Author information
Contributions
J.D. and A.R. developed the model. J.D. conducted experimental analyses with guidance from A.R. J.D. and A.R. interpreted the results and wrote the manuscript.
Ethics declarations
Competing interests
A.R. is a founder and equity holder of Celsius Therapeutics, an equity holder in Immunitas Therapeutics and until August 31, 2020 was a SAB member of Syros Pharmaceuticals, Neogene Therapeutics, Asimov, and ThermoFisher Scientific. From August 1, 2020, A.R. is an employee of Genentech, a member of the Roche Group. J.D. declares no competing interests.
Additional information
Peer review information Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ding, J. & Regev, A. Deep generative model embedding of single-cell RNA-Seq profiles on hyperspheres and hyperbolic spaces. Nat. Commun. 12, 2554 (2021). https://doi.org/10.1038/s41467-021-22851-4