Structure-preserving visualization for single-cell RNA-Seq profiles using deep manifold transformation with batch-correction

Xu, Yongjie; Zang, Zelin; Xia, Jun; Tan, Cheng; Geng, Yulan; Li, Stan Z.

doi:10.1038/s42003-023-04662-z

Download PDF

Article
Open access
Published: 04 April 2023

Structure-preserving visualization for single-cell RNA-Seq profiles using deep manifold transformation with batch-correction

Communications Biology volume 6, Article number: 369 (2023) Cite this article

3798 Accesses
1 Citations
12 Altmetric
Metrics details

Subjects

Abstract

Dimensionality reduction and visualization play an important role in biological data analysis, such as data interpretation of single-cell RNA sequences (scRNA-seq). It is desired to have a visualization method that can not only be applicable to various application scenarios, including cell clustering and trajectory inference, but also satisfy a variety of technical requirements, especially the ability to preserve inherent structure of data and handle with batch effects. However, no existing methods can accommodate these requirements in a unified framework. In this paper, we propose a general visualization method, deep visualization (DV), that possesses the ability to preserve inherent structure of data and handle batch effects and is applicable to a variety of datasets from different application domains and dataset scales. The method embeds a given dataset into a 2- or 3-dimensional visualization space, with either a Euclidean or hyperbolic metric depending on a specified task type with type static (at a time point) or dynamic (at a sequence of time points) scRNA-seq data, respectively. Specifically, DV learns a structure graph to describe the relationships between data samples, transforms the data into visualization space while preserving the geometric structure of the data and correcting batch effects in an end-to-end manner. The experimental results on nine datasets in complex tissue from human patients or animal development demonstrate the competitiveness of DV in discovering complex cellular relations, uncovering temporal trajectories, and addressing complex batch factors. We also provide a preliminary attempt to pre-train a DV model for visualization of new incoming data.

Deep generative model embedding of single-cell RNA-Seq profiles on hyperspheres and hyperbolic spaces

Article Open access 05 May 2021

A topology-preserving dimensionality reduction method for single-cell RNA-seq data using graph autoencoder

Article Open access 08 October 2021

Clustering single-cell RNA sequencing data via iterative smoothing and self-supervised discriminative embedding

Article 04 June 2024

Introduction

The advent of technologies that interrogate genome-scale molecular information at single-cell resolution, such as single-cell RNA sequencing and mass cytometry, provides important insight into the comprehensive analysis of cellular differentiation and the relationship between cells¹. Although single-cell RNA sequences (scRNA-seq) data have high dimensionality, their intrinsic dimensionality is typically low because many genes are co-expressed and the droplet-based scRNA-seq is very sparse (> 90% genes with zero counts in a typical cell profile). Therefore, dimensionality reduction and visualization methods play an important role in interpreting scRNA-seq datasets, such as extracting effective information, intuitively understanding data distribution, and interpreting the relationship between cells^2,3.

In this paper, we address the following application scenarios: Firstly, we develop a machine learning method with the ability to preserve geometric structure of the high dimensional scRNA-seq data in the dimensionality reduced space and visualization of scRNA-seq data that can be applied to both cell clustering and trajectory inference tasks. These two scenarios are closely related yet have different technical goals: (1) For cell clustering is to explore the relationship between different cell types at a given time^{4,5,6,7,8,9,10,11,12,13,14,15,16,17}, which we call the static (at a time point) scenario. It is to learn a low-dimensional embedding in which cells belonging to the same type should be close to each other whereas those of different types be away from each other. (2) For trajectory inference, or the dynamic (at a sequence of time points) scenario, the learning is to uncover temporal trajectories of cells^{17,18,19,20,21}, which characterizes the transition process of immature cells into mature cells with specific types. Secondly, the still another task we are addressing is (3) performing batch correction to build low-dimensional representations of the biological contents of cells disentangled from the technical variations¹⁷. Thirdly, based on the above tasks, we make another two preliminary attempts including (4) building a “batch invariant” model to embed new incoming data impacted by diverse factors¹⁷ and (5) building a pre-trained model to embed new incoming heterogeneous data. In contrast, the current methods are inflexible and generally follow a uniform assumption when facing with different application scenarios and unable to accommodate these requirements in a unified framework.

Traditional linear/nonlinear dimensionality reduction methods have grown explosively during the last decades, including local geometric structure preservation and global geometric structure preservation methods. The former aims to find a subspace by preserving the local geometric structure such as locally linear embedding (LLE)⁴, Laplacian eigenmaps (LE)⁵ and stochastic neighbor embedding (SNE)⁶. The latter tries to preserve the global characteristics of input data in a low dimensional subspace, such as principal component analysis (PCA)⁷, isometric mapping (ISOMAP)¹⁸, diffusion map (DM)¹⁹ and PHATE²⁰. These methods are often insufficient to mine underlying biological information as they consider global or local structure preservation alone. Furthermore, t-distributed stochastic neighbor embedding (t-SNE)⁸ and uniform manifold approximation and projection (UMAP)⁹ based on manifold learning have demonstrated excellent performance in capturing complex local and global geometric structures of biological data. However, both of them suffer from several limitations. Firstly, t-SNE is not robust in the presence of technical noise and tends to form spurious clusters from randomly distributed data points, producing misleading results that may hinder biological interpretation¹⁶. Meanwhile, t-SNE preserves the local clustering structures, but the global structures such as inter-cluster relationships and distances cannot be reliably preserved. Secondly, the addition of new data points to existing embeddings is infeasible due to the non-parametric nature of t-SNE and UMAP. Instead, they need to be rerun on the combined dataset, which is computationally expensive and is not scalable. Thirdly, the “cell-crowding” problem (e.g., t-SNE), “cell-mixing” (e.g., UMAP) problem, and the lack of batch correction capability will affect the effectiveness of data visualization when handling large-scale datasets with hundreds of thousands of cells²².

In recent years, deep neural networks (DNNs)²³ have been utilized as effective non-linear dimensionality reduction and visualization tools for processing large datasets, incorporating different factors, and improving the scalable ability of models. This field mainly involves two mainstream directions, including (1) Deep manifold learning methods, such as parametric UMAP¹⁰, Markov-Lipschitz deep learning (MLDL)¹¹, deep manifold transformation (DMT)¹², deep local-flatness manifold embedding (DLME)²⁴, EVNet¹³, unified dimensional reduction neural-network (UDRN)¹⁴ and IVIS¹⁵, and (2) Deep reconstruction learning methods, which covers various (variational) autoencoders^16,25,26. Generally speaking, the latter seeks to reconstruct the input data distribution and often ignores the importance of intrinsic geometric structure in input data. In contrast, The former preserves the geometric structure of raw data as much as possible, which is beneficial for mining the underlying information of biological data but lacks batch correction ability. Specifically, these methods usually suffer three fundamental issues: (1) High distortion embedding problem. Most methods assume the embedding space is Euclidean, which is not enough for modeling and analyzing dynamic scRNA-seq data because the Euclidean geometry is not optimal for representing the hierarchical and branched developmental trajectories. As shown in Bourgain’s theorem, Euclidean space is unable to obtain comparably low distortion for tree data, even when using an unbounded number of dimensions²⁷. For example, Poincaré maps (Poin_maps)²¹ was proposed recently to harness the power of hyperbolic geometry into the realm of dynamic scRNA-seq data analysis. (2) Deep manifold learning methods do not have the ability to preserve the geometric structure of the high dimensional scRNA-seq data and correct batch effects in an end-to-end manner. Most methods require multiple separate steps, each with its own method, including batch correction (e.g., Seurat3 CCA², Harmony²⁸, LIGER²⁹, fastMNN³⁰, Scanorama³¹, SAUCIE³², scVI²⁶ and Conos³³), dimensionality reduction, and visualization. Recently, Ding et al.¹⁷ proposed a scalable deep generative model scPhere based on variational autoencoders to embed cells into a low-dimensional hyperspherical or hyperbolic space to better capture the inherent properties of scRNA-seq data. ScPhere addresses multi-level, complex batch factors, facilitates the interactive visualization of large datasets, resolves “cell-crowding” problem, and uncovers temporal trajectories. (3) Poor flexibility for various application scenarios. Most methods tend to follow a uniform assumption for different application scenarios (e.g., static or dynamic data with or without batch effects) without considering the inherent characteristics of biological data, and the pre-trained reference model lacks suitable preprocessing steps to accommodate new incoming heterogeneous data and can only map new incoming homogeneous data.

To address the above challenges, we propose a general visualization model, deep visualization (DV), that preserves the inherent structure of scRNA-seq data and handles complex batch effects. Specifically, DV learns a structure graph based on local scale contraction to describe the relationships between cells more accurately, transforms the data into 2- or 3-dimensional embedding space while preserving the geometric structure of the data, and constructs a priori batch effect graph to correct batch effects in an end-to-end manner. For static scRNA-seq data, we minimize the structure distortion between structure graph and visualization graph in Euclidean space (DV_Eu). For dynamic data, to better represent and infer underlying hierarchical and branched developmental trajectories, we embed cells to the hyperbolic space with Poincaré (DV_Poin) or Lorentz (DV_Lor) model and visualize the embeddings in a Poincaré disk. We demonstrate the superior performance of DV on critical existing cases based on nine diverse datasets from human, mouse, and model organisms, including processing large static and dynamic scRNA-seq datasets with/without complex multilevel batch effects, visualizing cell profiles from highly complex tissues and developmental processes. We also make a preliminary attempt to build a pre-trained reference model for visualization of new incoming homogeneous and heterogeneous data. Overall, our method serves as a unified solution for enhanced representation, complex batch correction, visualization, and an interpretation tool for single-cell genomics research.

Results

For the purpose of static and dynamic scRNA-seq data visualization, DV embeds the data into a 2- or 3-dimensional Euclidean or hyperbolic latent space at the end of the DNNs (Fig. 1a), in terms of the curvature characteristics of the data manifold. A Euclidean space with zero curvature is commonly adopted by most visualization methods (e.g., t-SNE, UMAP, PHATE, and IVIS) for its flatness and intuitive class boundaries, which may be sufficient for exploring the relationship between different cell types in static data. Hyperbolic embedding with negative curvature has been proposed for learning of latent representations from hierarchical textual and graph-structured data²⁷. We believe that it is suitable for dynamic data to uncover temporal trajectories. This is because in such type of data, the exponential growth of the number of leaves in a tree with respect to its depth is analogous to the exponential growth of surface area with respect to its radius. In the hyperbolic space, circle circumference and disc area grow exponentially with radius, as opposed to the Euclidean space where they only grow linearly and quadratically³⁴.

**Fig. 1: The Deep Visualization (DV) model.**

DV model assumes that a good embedding should preserve the geometric structure of scRNA-seq data as much as possible. According to manifold assumption, the observed data are low dimensional manifolds uniformly sampled in a high dimensional Euclidean space. In practice, the droplet-based scRNA-seq data usually has a large number of zero or near-zero values. The relationship between cells is difficult to be defined by vector similarity (e.g., Euclidean distance) directly for such high-dimensional sparse data. Therefore, DV learns a reliable structure graph based on local scale contraction between each cell and its corresponding augmented cells (linear interpolation between each cell and its k neighbor points) to describe the relationship between cells more accurately. Specifically, DV estimates the underlying manifold structure in four main steps (Fig. 1a): Firstly, constructing a fully connected undirected structure graph G_structure for cells and their corresponding augmentation cells based on structure embedding learned by the structure module, where each node corresponds to an individual cell and each edge has a weight (Euclidean distance between the structure embeddings of the two connected cells). The purpose is to estimate the local geometry of the underlying topological manifold. Secondly, DV learns a low-dimensional Euclidean or hyperbolic embedding per cell and constructs a fully connected undirected visualization graph G_{visualization} for cells and their corresponding augmentation cells based on visualization embedding learned by the visualization module. In detail, for Euclidean latent space, DV learns 2-dimensional embeddings and adopts Euclidean distance to describe the relationship between embeddings. For hyperbolic latent space, DV learns 2 or 3-dimensional embeddings with Poincaré or Lorentz model and adopts hyperbolic distance to describe the relationship between embeddings. Thirdly, DV converts G_structure and G_{visualization} edge weight from distance to similarity based on the student’s t-distribution. The purpose is to highlight similar pairwise nodes and weaken dissimilar pairwise nodes. These steps are commonly used in manifold learning (e.g., t-SNE, UMAP) to approximate the structure of an unknown manifold from similarities in the feature space. Finally, to preserve the geometric structure of scRNA-seq data, DV adopts the geometric structure preservation loss function to train DNNs, which minimizes the distribution discrepancy between G_structure and G_{visualization}. To make DV compatible with the batch correction ability simultaneously, we integrate the manually designed priori batch effect graph G_batch into the G_{visualization} to be learned in the training process to learn a G_{visualization} with batch effect removed. As a deep learning model trained by mini-batch stochastic gradient descent, DV is especially suited to process large scRNA-seq datasets with complex multilevel batch effects and facilitates emerging applications. We provide complete details in the “Methods” section.

DV preserves the structure of scRNA-seq data in very low-dimensional spaces in visualizing large datasets

Applying DV to scRNA-seq data, we systematically assess the visualization performance of DV embeddings in a latent space with few (2 or 3) dimensions. We compare the geometric structure preservation performance (Q_global and Q_local scores, interpreted as a scRNA-seq dataset comprises a smooth manifold and a good dimensionality reduction method would preserve local and global distances on this manifold) of DV, which embeds cells in Euclidean or hyperbolic spaces, as well as to PCA⁷, t-SNE⁸, UMAP⁹, IVIS¹⁵, PHATE²⁰, Poin_maps²¹ and hyperbolic scPhere (scPhere_wn)¹⁷. Following scPhere¹⁷, we apply DV to seven scRNA-seq datasets from human and mouse, spanning from small (thousands) to very large number (hundreds of thousands) of cells from one or multiple tissues, and with a small (two) to very large (dozens) of expected cell types. The “small” datasets are: (1) a blood cell dataset³⁵ with only 10 erythroid cell profiles and 2293 CD14+ monocytes, (2) 3314 human lung cells³⁶, (3) 1378 mouse white adipose tissue stromal cells³⁷, and (4) 1755 human splenic nature killer cells spanning four subtypes³⁸. The “large” datasets are: (1) 35,699 retinal ganglion cells (RGC) in 45 cell subsets³⁹, (2) 599,926 cells spanning 102 subsets across 59 human tissues in the human cell landscape (HCL)⁴⁰, and (3) 86,024 C. elegans embryonic cells (ELEGAN) collected along a time course from <100 min to >650 min after each embryo’s first cleavage⁴¹.

DV obtains more competitive performance compared with baseline methods on “small” datasets (Supplementary Fig. 1 and Fig. 2). It is worth noting that DV_Poin and DV_Lor with hyperbolic latent spaces even perform better than DV_Eu in some datasets, although there are discrete cell types in these datasets. Overall, these methods achieve good visualizations for these smaller datasets without batch effects but with minor challenges. For example, (1) In human lung cells, all methods mix pericyte and APC except DV_Lor (Supplementary Fig. 1m). (2) PHATE (Supplementary Figs. 2f, p) and Poin_maps (Supplementary Figs. 1i, s) – are designed for development trajectories – connecting cells inaccurately when only discrete cell types are present. (3) PCA cannot effectively capture complex data structures due to the lack of nonlinear ability, resulting in the distortion issue with the increased number of cell types (Supplementary Fig. 2d). (4) Compared with the dimensionality reduction methods based on manifold learning (e.g., DV, UMAP, and t-SNE), scPhere_wn (Supplementary Figs. 2j, t) based on variational auto-encoder does not have enough power to aggregate similar cells belonging to the same type and push away dissimilar cells belonging to different types.

DV obtains predominant advantages over baseline methods on datasets with a larger number of cells and clusters, such as mouse RGC cells and human HCL cells. As shown in Fig. 2, although t-SNE, UMAP and Poin_maps can distinguish individual cell types among RGC (Fig. 2a–h, Supplementary Figs. 3a–h) and HCL (Fig. 2i–q, Supplementary Fig. 3i–q) well, DV_Eu, DV_Poin and DV_Lor achieve superior local and global structure preservation performance (Fig. 2r, s, Supplementary Data 1). In contrast, DV_Poin and DV_Lor based on hyperbolic latent spaces can obtain better global structure preservation performance than DV_Eu based on Euclidean latent space in RGC cells, especially DV_Poin based on the Poincaré ball model. Specifically, Cartpt-RGC clusters are close together in DV_Poin hyperbolic latent space (Fig. 2j), while in other methods except Poin_maps (Fig. 2q), they are embedded in different locations of the latent space, this means that hyperbolic latent space can better preserve the hierarchical global structure of cells. In HCL cells, there are six major cell groups, including fetal stromal cells, fetal epithelial cells, adult endothelial cells, endothelial cells, adult stromal cells and immune cells, and their respective clusters are close to each other in DV_Eu (Fig. 2a), but are more dispersed in t-SNE embeddings (Fig. 2g). HCL cells contain two cell sources, including adult cells and fetal cells, their respective clusters are close to each other and have better differentiation between both sources in DV_Eu (Fig. 2a), but are more mixed in UMAP embeddings (Fig. 2h). Compared with DV_Eu, DV_Poin (Fig. 2b) and DV_Lor (Fig. 2c) have larger volume for structure storage, thus they gain more power to push away dissimilar clusters, this will help with data analysis in scenarios without priori labels/celltypes, such as visualization partition analysis (e.g., DV_Poin hyperbolic visualization space contains six major regions). Furthermore, some clusters (e.g., enterocyte cells have an immune function) belonging to adult endothelial cells are close to immune cells (mostly B cells). Meanwhile, UMAP and t-SNE are often unable to effectively visualize datasets with a larger number of cells as reflected both visually. For example, the “cell-crowding” problem existing in t-SNE (Fig. 2g), different clusters are uniform spread across the visualization space, leading to the poor ability of t-SNE to recognize the major clusters and mine the distinct cell types, and the “cell-mixing” problem existing in UMAP (Fig. 2h), different clusters are twisted and mixed. DV overcomes the above issues, because it learns a more reliable structure graph G_structure based on nonlinear DNNs and is trained using mini-batches, while t-SNE and UMAP are learned using all the data, and their hyperparameters (e.g., “perplexity” in t-SNE) have to be adapted to larger number of cells, but increasing the “perplexity” makes t-SNE computationally expensive simultaneously (Supplementary Fig. 16). DV is natural to process a large number of cells with a time complexity that is linear with the number of input cells. As expected, PCA (Fig. 2d, l), IVIS (Fig. 2e, m) and PHATE (Fig. 2f, n) do not perform well for these large datasets with mostly discrete cell types.

**Fig. 2: DV preserves local and global structures in visualizing large *static* scRNA-seq data.**

DV obtains more intuitive visualization on dynamic cells, which are expected to show developmental trajectories, such as from stem cells to mature cells (Fig. 3j–r), and explains the relationships between clusters (Fig. 3a–i). As described above, DV_Poin (Fig. 3b) and DV_Lor (Fig. 3c) have larger volumes for hierarchical structure storage, thus the dataset can be divided into multiple regions to facilitate the analysis of cell differentiation for each major cell type individually. For example, there are four major regions (muscle cells, excretory cells, pharyngeal cells, neuron cells and other individual clusters) in DV_Poin latent space if not considering NA cells. Compared with DV, data in the central region of baseline methods are faced with the severe “cell-crowding” problem (Fig. 3m–r), this is not conducive to a more accurate recognition of cell origin. However, the problem is well solved by DV, where the data in the central region can be clearly identified and assigned to different differentiation branches (Fig. 3j–l). Moreover, we can position the expected root cells of the developmental process at the center of a Poincaré disk, then the distance of each cell from the center can be thought of as a pseudo time. For a specific cell type, we can see cells progress with distance continuously in the Poincaré disk at almost a fixed angle. Fortunately, compared with scPhere_wn (Fig. 3r), DV (Fig. 3j–l) can automatically place root cells near the center of the Poincaré disk without prior knowledge of the root label, which is convenient for data analysis when unknown time labels. Furthermore, DV_Poin (Fig. 3b) and DV_Lor (Fig. 3c) alleviate the “cell-mixing” problem (e.g., seam cells and hypodermis cells are entangled) existing in DV_Eu (Fig. 3a) and UMAP (Fig. 3h). Therefore, DV embeds dynamic cells into a hyperbolic space with Lorentz or Poincaré model, which is suitable for differentiated data analysis, and optionally converts the coordinates in the Lorentz model to the Poincaré disk for 2-dimensional visualization. More importantly, DV retains the biological explanation brought by the scPhere_wn method. Specifically, in ELEGAN cells, the cells are ordered neatly in the latent space by both time and lineage, from a clearly discernible root at time 100–130 at the center of the Poincaré disk (cells from < 100 were mostly unfertilized germline cells) to cells from time > 650 near the border of the Poincaré disk or away from the origin in the Poincaré and Lorentz model (Fig. 3k, l, Supplementary Fig. 13). Within the same cell type, cells are ordered by embryo time in the Poincaré disk (Fig. 3b, c). After first appearing along a developmental trajectory, cells of the same type progress with embryo time, forming a continuous trajectory occupying a range of angles¹⁷. Moreover, different cell types (e.g., ciliated amphid neurons, ciliated nonamphid neurons, hypodermis, seam cells and body wall muscle) that appear at slightly different embryonic time points, have their origins around the same region and progress with embryonic time in a similar way, forming a continuous trajectory but at a different angle and/or distance ranges from the center¹⁷. These patterns are harder to discern in IVIS (Fig. 3n), PHATE (Fig. 3o), t-SNE (Fig. 3p) and UMAP (Fig. 3q), where cells from consecutive time points are compacted, cells that appear early are relatively distant from each other in the embeddings, and temporal progression is not in the same direction. Thus, the DV model with a hyperbolic latent space learns smooth (in time) and interpretable cell trajectories.

**Fig. 3: DV shows developmental trajectories in visualizing large scRNA-seq *dynamic* data.**

DV effectively models complex, multilevel batch, and other variables

In realistic biological datasets, scRNA-seq profiles are typically impacted by diverse factors, including technical batch effects in separate experiments and different lab protocols, as well as biological factors, such as inter-individual variation, sex, disease or tissue location. However, most batch-correction methods can handle only one batch variable and may not be well-suited to the increasing complexity of current datasets. Applying DV to scRNA-seq data with multiple known confounding factors (e.g., batches and conditions), we systematically assess the performance of DV embeddings in a latent space with few (2, 3) or low (5, 10, 20) dimensions by comparing the geometric structure preservation performance (Q_global and Q_local scores, interpreted as a complex multi-batch scRNA-seq dataset comprises multiple smooth manifold and a good dimensionality reduction method will preserve local and global distances on each manifold after removing batch effect) and classification performance (ACC_mvo score, interpreted as a good batch correction method will integrate different manifolds) of DV, which embeds cells in Euclidean space for static cells and hyperbolic space for dynamic cells, as well as to Euclidean scPhere (scPhere_normal), hyperspherical scPhere (scPhere_vmf), scPhere_wn and other visualization methods, including t-SNE, UMAP, IVIS and PHATE (with 50 principal components, batch-corrected by Harmony or scVI). Following scPhere¹⁷, we apply DV to a dataset of 301,749 cells profiled in a complex experimental design from the colon mucosa of 18 patients with ulcerative colitis (UC), a major type of inflammatory bowel disease (IBD), and 12 healthy individuals⁴². The static datasets are: (1) 26,678 stromal cells and glia (12 cell types), and (2) 210,614 immune cells (23 cell types). The dynamic dataset are: (1) 64,457 epithelial cells (12 cell types), and (2) 86,024 ELEGAN cells (12 time states).

DV_Eu obtains more competitive performance compared with baseline methods on static datasets with a larger number of cells and multiple confounding factors, such as stromal dataset (30 patients with patient origin and disease status factors) and immune dataset (30 patients with patient origin, disease status and location factors). Analyzing cells with the patient origin and disease status (healthy, uninflamed and inflamed) as the batch vector, not only recapitulates the main cell groups, but also allows us to better visually explore cellular relations (Supplementary Fig. 4a–i) and finds cell groups related to disease directly. For example, in the stromal dataset, the postcapillary venules cells, endothelial cells and microvascular cells are close to each other, and adjacent to the pericyte (Supplementary Fig. 4a). Conversely, these distinctions can barely be discerned in a UMAP (Supplementary Fig. 4f) and Poin_maps (Supplementary Fig. 4g) plot of the same data, where endothelial and microvascular cells are very close. Among fibroblasts, cells are arranged in a manner that mirrors their position along the crypt-villus axis, from RSPO3+ cells, to WNT2B+ cells, to WNT5B+ cells. Strikingly, the inflammatory fibroblasts, which are unique to UC patients and are independent of the patient origin, are readily visible (Supplementary Fig. 4a, pale green) and are both distinct from the other fibroblasts, while spanning the range of the “crypt-villus axis”¹⁷. Considering the ability to integrate disease status, DV_Eu merges part inflammatory fibroblasts with WNT2B+ fibroblasts (Supplementary Fig. 4a). When learning a DV_Eu model that includes patient origin, disease status, and anatomical region as the batch vectors, immune cells groups visually by cell type (Fig. 4a), and the influence of patient origin, disease status and region is largely removed. For example, the CD8+IL17+ T cells are nestled between CD8+ T cells and activated CD4+ T cells in a manner that was intriguing and consistent with the mixed features of those cells (Fig. 4a)¹⁷. In terms of evaluation criteria, on the one hand, DV_Eu (batch correction on 2-dimensional embeddings) obtains competitive local (Q_local score) and global (Q_global score) geometric structure preservation performance compared with UMAP, IVIS and PHATE based on Harmony or scVI (batch correction on 50 principal components) and outperforms scPhere_normal and t-SNE on the stromal dataset (Supplementary Fig. 4j, k) and immune dataset (Fig. 4i, j, Supplementary Data 2). Moreover, DV_Eu achieves better performance when batch correction on low (5, 10 and 20) dimensional embeddings. On the other hand, based on 5-nearest neighbors (5-NN) classification accuracy of cell types, DV_Eu obtains competitive batch correction visualization performance (ACC_mvo score on 2-dimensional embeddings) compared with IVIS, PHATE, t-SNE, UMAP and Poin_maps combined with Harmony or scVI on the stromal dataset (Supplementary Fig. 4l) and immune dataset (Fig. 4k, Supplementary Data 3). Moreover, DV_Eu obtains a significant batch correction advantage (ACC_mvo score on 5, 10 and 20-dimensional embeddings) over scPhere method on low dimensional embeddings (Fig. 4l, Supplementary Fig. 4m, Supplementary Data 3). It is worth noting that while some methods outperform DV on the Q_local, Q_global or ACC_mvo score on 2-dimensional embeddings, they are not always stable. For example, IVIS and PHATE combined Harmony obtains good Q_local and Q_global scores on the stromal dataset, while their ACC_mvo scores are very poor.

**Fig. 4: DV addresses complex technical and biological batch for visualization and analysis in colon biopsies (SCP259) from healthy individuals and UC immune patients.**

DV_Poin and DV_Lor obtain more intuitive visualization and competitive performance compared with baseline methods on dynamic datasets, such as epithelial dataset (30 patients with patient origin, disease status and location factors) and ELEGAN dataset (7 batches). In epithelial cells, we readily discern developmental ordering from intestinal stem cells to terminally differentiated cells in the Poincaré disk (Fig. 5b, c), with stem cells at the center of the disk for intuitive interpretation: one trajectory is from stem cells to secretory TA cells, to immature goblet cells, to goblet cells, the other trajectory is from stem cells to TA2 cells, to immature enterocyte cells, to enterocyte cells. In contrast, developmental trajectories are less apparent when we embed cells in Euclidean space of scPhere_normal (Fig. 5f). The 2-dimensional visualization embeddings of t-SNE (Fig. 5g), UMAP (Fig. 5h), DV_Eu (Fig. 5a), IVIS (Fig. 5d), PHATE (Fig. 5e), scPhere_wn (Fig. 5j) and scPhere_vmf (Fig. 5i) are reasonable, although the t-SNE has some small spurious clusters, goblet cells have one spurious cluster close to enterocytes in DV_Eu, IVIS, scPhere_wn and scPhere_vmf, several cell types (M-cells and TA2 cells, tuft and enteroendocrine cells) are merged in PHATE, and the developmental trajectories are less apparent when the cell types are missing in DV_Eu, scPhere_vmf and scPhere_wn. In addition, DV_Poin and DV_Lor obtain a significant local (Fig. 5n, Supplementary Data 2) and global (Fig. 5o, Supplementary Data 2) geometric structure preservation advantage compared with other methods in Euclidean space, and this advantage increases significantly when the dimension of embedding space is increased. In ELEGAN cells, the cellular relations and developmental trajectories in DV and scPhere_wn have minor changes when compared Supplementary Fig. 5 (with batch correction) with Fig. 3 (without batch correction), which indicates that the batch effect has little influence on this dataset, but the former still mitigate the less apparent batch effect problem existing in some cell types (e.g., abarpaaa lineage, ciliated non amphid neurons and ciliated amphid neurons). Moreover, the advantages of DV_Poin and DV_Lor (Supplementary Fig. 14) in this dataset described in the previous section (Supplementary Fig. 13) still remain when compared with t-SNE, UMAP and scPhere (Supplementary Fig. 15).

**Fig. 5: DV addresses complex technical and biological batch for visualization and analysis in colon biopsies (SCP259) from healthy individuals and UC epithelial patients.**

DV_Eu achieves impressive results on all stromal, epithelial, and immune cells simultaneously (Supplementary Fig. 6a), demonstrating its capacity to embed large numbers of cells of diverse types, states and proportions. The 2-dimensional visualization embeddings of t-SNE (Supplementary Fig. 6b) and UMAP (Supplementary Fig. 6c) using Harmony batch-corrected results accounting for the patient status as inputs are reasonable. However, when removing the PCA preprocessing, t-SNE (Supplementary Fig. 6g) and UMAP (Supplementary Fig. 6h) fail to an unsuccessful visualization, while DV_Eu (Supplementary Fig. 6f) can achieve competitive performance compared with scPhere_wn (Supplementary Fig. 6e) and scPhere_vmf (Supplementary Fig. 6i). For example, there are a lot of noise points existing in t-SNE embeddings, which illustrates that t-SNE can not distinguish different cell types, and UMAP suffers from a severe confusion issue among different clusters. Moreover, this also reflects that the Harmony method can not effectively remove the batch effect problem in sparse data. For example, the plasma cells are separated in UMAP embeddings. Overall, these results demonstrate the superior performance of DV_Eu compared to the combination of Harmony’s batch correction and t-SNE or UMAP’s visualization through multiple experiments on large datasets with a large number of cells and cell types, multilevel batch effects, and complex structures.

Pre-trained reference DV builds atlases for visualization and annotation of new incoming data

As a parametric model, we can train DV to co-embed new incoming/testing data to a latent space learned from training data only. We use DV to map cells from new incoming patients, one critical homogeneous application case as multiple studies need to be integrated by training a “batch-invariant” DV model, the other critical heterogeneous application case as a new study can be explored by a pre-trained DV model. Then, DV takes the gene expression vectors or PCA principal components of cells as inputs and maps them to a 2-dimensional Euclidean latent space to achieve data visualization and annotation. Therefore, we conduct preliminary experiments to explore the scalable ability of DV to train general models based on large-scale datasets.

For homogeneous case (training data and testing data share the same gene names and numbers), we learn a “batch-invariant” DV_Eu model for stromal, epithelial and immune cells from 18 patients training data of the UC dataset and use it to visualize the cells from 12 patients testing data directly. Then we train k-nearest neighbor (k-NN) classifiers (k = 5) on 2-dimensional embeddings of the training data and apply the learned k-NN classifiers to 2-dimensional embeddings of the testing data. The experiment results demonstrate that DV’s embeddings of testing data with high quality. For the stromal dataset (Supplementary Fig. 7), DV_Eu (72.84%) obtains better classification accuracy compared with scPhere_normal (71.25%), scPhere_wn (70.25%) and scPhere_vmf (70.24%), it significantly improves the precision of most categories, especially on inflammatory fibroblasts cells, while the precision of RSPO3+ is unsatisfactory. For the immune dataset (Fig. 6, Supplementary Data 4), DV_Eu (82.17%) outperforms scPhere_normal (78.23%), scPhere_wn (79.37%) and scPhere_vmf (78.58%) in terms of classification accuracy, especially on DC1 cells, DC2 cells, macrophages and cycling monocytes, while the precision decreases in inflammatory monocytes, ILC cells and DC8+IEL cells. For epithelial cells (Supplementary Fig. 8), DV and scPhere achieve similar classification accuracy, but DV possesses a better visualization ability to show two main developmental trajectories.

**Fig. 6: Using learned DV model to visualize cells, pinpoint cell types impacted by biological factors, and generate a “batch-invariant” reference atlas.**

For heterogeneous case (training data and testing data share different gene names and numbers), we design a series of critical preprocessing methods (Fig. 1c) to overcome the heterogeneous problem, including a heterogeneous correction module (the same genes in training data and testing data are selected, for testing data, the same genes are maintained as original value, missing genes are set to 0, and redundant genes are removed), normalization, log scaling, standardization module (mean and standard deviation learned on training data is used to scale testing data), PCA module (PCA model learned on training data is used to map testing data as 50 principal components). We learn a pre-trained DV_Eu model for HCL cells from 43 tissues training data, then use it to map the testing data, including the HCL cells from 28 tissues and the mouse cell atlas (MCA) cells. The experiment results demonstrate that DV’s embeddings of test data with high quality even in heterogeneous cases. For example, the underlying biological information analysis in the HCL dataset mentioned above still remains, and DV_Eu obtains 72.82% classification accuracy (Supplementary Fig. 9c, only assessing the cell types presented in the training data) when data is collected in the same species and profiled by the same platform (e.g., Microwell-Seq platform). Moreover, the pre-trained DV can still locate some important clusters (e.g., Erythroid cells, Macrophage, Monocyte, B cells and Fasciculata cells, Supplementary Fig. 9d) corresponding to the training dataset when conducting cross-species experiments (e.g., training on HCL and testing on MCA). This verifies the analysis in the original study⁴⁰ that the major cell types in mammalian organs are similar.

Discussion

We propose the DV model to embed static and dynamic scRNA-seq cells in low-dimensional Euclidean and hyperbolic spaces to enhance exploratory data analysis and visualization of cells from single-cell studies, especially with complex multilevel batch factors. DV provides more readily interpretable representations and avoids “cell-crowding” and “cell-mixing” problems. When embedding dynamic cells in hyperbolic spaces, it helps to study developmental trajectories. In this case, DV_Poin and DV_Lor can place root cells near the center of a Poincaré disk automatically (distance to the center can be used as a natural definition for pseudo time). They can divide cells into multiple regions to facilitate the analysis of cell differentiation for each major cell type individually (the cells of specified type progress continuously with distance and angle in the Poincaré disk).

The main advantage of DV is to realize the geometric structure preservation of scRNA-seq data and accounting for multilevel complex batch effects simultaneously, which disentangles cell types from patients, diseases and location variables. This is an important advantage over other DNNs methods, which fail to combine geometric structure preservation with batch correction capability. To prove it, we evaluate the effectiveness of three main components (Supplementary Fig. 17), including visualization module, structure module and batch correction module. We can harness these abilities in several ways to meet different task requirements based on the inherent characteristics of biological data: to visualize static or dynamic cells directly when dataset without batch effects, to visualize static or dynamic cells considering one factor or combination of them for dataset with multilevel complex batch effects, to investigate which cell types are most affected by a factor, or to generate general reference model, which can map new incoming homogeneous or heterogeneous data to an existing embeddings and annotate cell types. DV’s ability to handle complex batch factors is an advantage over previous methods for batch correction, which handle only one batch vector. Indeed, in our benchmarking with IBD cells with 30 patients with three disease statuses, DV performs better than state-of-the-art batch correction methods such as Harmony, scVI and scPhere. In the future, we can leverage supervised information to construct a more reliable priori batch effect graph. In addition, as a parametric model, DV can naturally co-embed homogeneous or heterogeneous new incoming/testing data to a latent space learned from training data only.

DV is especially suitable for analyzing large scRNA-seq datasets: its running time scales linearly with the number of input cells (Supplementary Fig. 16). It alleviates “cell-crowding” and “cell-mixing” issues when handling with large numbers of input cells, and it can preserve local and hierarchical global geometric structures of data better than baseline methods thought its runtime slightly longer than the other methods. Finally, by learning a “batch-invariant” model that takes gene expressions or principal components as inputs to learn latent embeddings, it forms a reference to visualize and annotate new profiled cells from future studies. This is an important advantage over nonparametric methods such as t-SNE, UMAP, PHATE and Poin_maps, which do not have the ability to embed new data, especially in the presence of batch effects.

DV converges rapidly and is robust to hyperparameters. For DV models with different latent spaces (e.g., Euclidean or hyperbolic space), training is quite stable and converges rapidly. DV completes fitting with 100 epochs, and its morphology is consistent with that of 300 epochs (Supplementary Fig. 10a–j, k–t). Moreover, we adjust the hyperparameters according to the proposed typical value ranges. It can be observed that the influence of hyperparameters is limited (Supplementary Fig. 11a–f, g–m). Even if the visualization results change, it does not affect the underlying biological significance in dynamic scRNA-seq datasets (Supplementary Fig. 12a–g, h–n).

DV can be extended in several other ways. When cell type annotations or cell type marker genes for some of the analyzed cells are available, we can include semi-supervised learning to annotate cell types. Given the rapid development of spatial transcriptomics, single-cell ATACseq and other complementary measurements, DV can be extended for the integrative analysis of multimodal data. DV can also learn discrete hierarchical trees for better interpreting developmental trajectories using hyperbolic neural networks. Given its scope, flexibility, and extensibility, we foresee that DV will be a valuable tool for large-scale single-cell and spatial genomics studies.

Methods

Data preprocessing

The raw sequencing data is preprocessed with a series of pipelines as a common practice. The preprocessing steps consist of normalization (in the summed value), log scaling, PCA and batch correction. Among others, batch correction is crucial for complex data from multi-batches and is one of the focuses in this article. Moreover, all the compared methods but scPhere are preprocessed by default (normalization, log scaling, and PCA).

When the dimensionality of data exceeds 50 dimensions, PCA is usually applied to alleviate “the curse of dimensionality”, such as in t-SNE⁸, UMAP⁹, IVIS¹⁵, PHATE²⁰ and Poin_maps²¹. We keep top 50 principal components as the input data to all the compared methods but scPhere while the log scaling of raw data as inputs. The proposed DV method is accounting for the above two preprocessing methods.

For the traditional dimensionality reduction methods, such as PCA, t-SNE, UMAP, IVIS, PHATE and Poin_maps, require multiple separate steps (batch correction, dimensionality reduction and visualization) to achieve visualization, each with its own method or algorithm. For the batch-correction methods, such as Harmony, scVI, Seurat3 CCA and LIGER methods can only handle one batch vector, we use the patient status as the batch label. For Harmony, we use the code in “scanpy” package⁴³ and use the default parameter settings (e.g., dimension is set to 50). For scVI, we used the code in “scvi-tools” package⁴⁴ and use the default parameter settings (e.g., dimension is set to 10). For Seurat3 CCA and LIGER, we remove them from the experimental comparison considering their poor effect in the scPhere paper¹⁷. However, DV and scPhere provide an end-to-end, single process that can achieve visualization and multi-batch correction simultaneously.

We compare the proposed DV with seven previous methods, namely, PCA, t-SNE, UMAP, PHATE, Poin_maps, IVIS and scPhere. We used the code in the “scikit-learn” package⁴⁵ for PCA, t-SNE, UMAP and PHATE, and their tutorial and released code for Poin_maps, IVIS and scPhere.

DV overview

DV receives a scRNA-seq dataset ${{{{{{{\mathcal{D}}}}}}}}={\left\{\left({{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}},{{{{{{{{\bf{y}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}\right)\right\}}_{i = 1}^{n}$ as input, where ${{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}\in {{\mathbb{R}}}^{d}$ is the gene expression vector of cell i, y_i is a categorical variable vector specifying the batch label (multi-hot encoding) in which x_i is measured, d is the number of measured genes, and n is the number of cells.

For scRNA-seq data, the observed unique molecular identifier (UMI) counts of cells are sparse. Therefore, linear data augmentation (e.g., linear mixup) is adopted to improve the stability and generalization of the model, and it generates augmented data by a convex combination between cells and their k neighbors based on a input graph G_input (k-NN graph constructed on input data):

$$\begin{array}{rcl}{\hat{{{{{{{{\bf{x}}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}=\left(1-{r}_{u}\right)\cdot {{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}+{r}_{u}\cdot {{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}},&&where\,\,{{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}}\in {{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}^{N\left({{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}\right)}\\ {\hat{{{{{{{{\bf{y}}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}=\left(1-{r}_{u}\right)\cdot {{{{{{{{\bf{y}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}+{r}_{u}\cdot {{{{{{{{\bf{y}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}},&&where\,\,{{{{{{{{\bf{y}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}}\in {{{{{{{{\bf{y}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}^{N\left({{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}\right)}\end{array}$$

(1)

where $N\left({{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}\right)$ denotes the k-NN list of the data point x_i, r_u denotes the linear combination parameter and is sampled from the uniform distribution $U\left(0,{p}_{u}\right)$, p_u is a hyperparameter and is set to 1. Now, we obtain the updated dataset $\widetilde{{{{{{{{\mathcal{D}}}}}}}}}={\left\{\left({\tilde{{{{{{{{\bf{x}}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}},{\tilde{{{{{{{{\bf{y}}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}\right)\right\}}_{i = 1}^{a\times b}={\left\{\left({{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}},{{{{{{{{\bf{y}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}\right)\right\}}_{i = 1}^{b}\cup {\left\{\left({\hat{{{{{{{{\bf{x}}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}},{\hat{{{{{{{{\bf{y}}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}\right)\right\}}_{i = b+1}^{(a+1)\times b}$ combined original dataset ${{{{{{{\mathcal{D}}}}}}}}$ with augmented dataset $\widehat{{{{{{{{\mathcal{D}}}}}}}}}={\left\{\left({\hat{{{{{{{{\bf{x}}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}},{\hat{{{{{{{{\bf{y}}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}\right)\right\}}_{i = 1}^{a\times b}$ as the input data, where a is the data augmentation number of each cell, ${\hat{{{{{{{{\bf{x}}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}$ and ${\hat{{{{{{{{\bf{y}}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}$ are the augmented gene expression vector of cell i and the corresponding batch categorical variable vector, respectively, and here b = n due to DV is trained by mini-batch stochastic gradient descent.

Visualization module

Although x_i is high-dimensional, its intrinsic dimensionality is typically much lower. Manifold learning assumes that a decent embedding should preserve the geometric structure of data as much as possible. Therefore, we optimize DV based on the geometric structure preservation loss function, which minimizes the distribution discrepancy between G_structure and G_{visualization} in the form of fuzzy sets cross entropy (two-way divergency):

$${L}_{GSP}=\mathop{\sum }\limits_{i,j=1}^{(a+1)\times b}{{{{{{{{\bf{u}}}}}}}}}_{{{{{{{{\bf{ij}}}}}}}}}^{{{{{{{{\bf{st}}}}}}}}}\log \frac{{{{{{{{{\bf{u}}}}}}}}}_{{{{{{{{\bf{ij}}}}}}}}}^{{{{{{{{\bf{st}}}}}}}}}}{{{{{{{{{\bf{u}}}}}}}}}_{{{{{{{{\bf{ij}}}}}}}}}^{{{{{{{{\bf{vi}}}}}}}}}}+\left(1-{{{{{{{{\bf{u}}}}}}}}}_{{{{{{{{\bf{ij}}}}}}}}}^{{{{{{{{\bf{vi}}}}}}}}}\right)\log \frac{\left(1-{{{{{{{{\bf{u}}}}}}}}}_{{{{{{{{\bf{ij}}}}}}}}}^{{{{{{{{\bf{vi}}}}}}}}}\right)}{\left(1-{{{{{{{{\bf{u}}}}}}}}}_{{{{{{{{\bf{ij}}}}}}}}}^{{{{{{{{\bf{st}}}}}}}}}\right)}$$

(2)

where b is the number of batch size, ${{{{{{{{\bf{u}}}}}}}}}_{{{{{{{{\bf{ij}}}}}}}}}^{{{{{{{{\bf{st}}}}}}}}}$ is the undirectional similarity between structure embedding ${{{{{{{{\bf{z}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}^{{{{{{{{\bf{st}}}}}}}}}$ and ${{{{{{{{\bf{z}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}}^{{{{{{{{\bf{st}}}}}}}}}$ learned by structure module, and ${{{{{{{{\bf{u}}}}}}}}}_{{{{{{{{\bf{ij}}}}}}}}}^{{{{{{{{\bf{vi}}}}}}}}}$ is the undirectional similarity between visualization embedding ${{{{{{{{\bf{z}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}^{{{{{{{{\bf{vi}}}}}}}}}$ and ${{{{{{{{\bf{z}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}}^{{{{{{{{\bf{vi}}}}}}}}}$ learned by visualization module. The undirectional similarity u_ij is defined as following:

$${{{{{{{{\bf{u}}}}}}}}}_{{{{{{{{\bf{ij}}}}}}}}}={{{{{{{{\bf{u}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}| {{{{{{{\bf{j}}}}}}}}}+{{{{{{{{\bf{u}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}| {{{{{{{\bf{i}}}}}}}}}-2{{{{{{{{\bf{u}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}| {{{{{{{\bf{j}}}}}}}}}{{{{{{{{\bf{u}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}| {{{{{{{\bf{i}}}}}}}}}$$

(3)

where u_j∣i is a directional similarity (edge weigt of graph) converted from the Euclidean or hyperbolic distance $D({\tilde{{{{{{{{\bf{z}}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}},{\tilde{{{{{{{{\bf{z}}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}})$ between embedding ${\tilde{{{{{{{{\bf{z}}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}$ and embedding ${\tilde{{{{{{{{\bf{z}}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}}$ and adopts normalized squared t-distribution:

$${{{{{{{{\bf{u}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}| {{{{{{{\bf{i}}}}}}}}}\left(\nu \right)=g(D({\tilde{{{{{{{{\bf{z}}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}},{\tilde{{{{{{{{\bf{z}}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}})| \nu )={C}_{\nu }{\left(1+\frac{D({\tilde{{{{{{{{\bf{z}}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}},{\tilde{{{{{{{{\bf{z}}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}})}{\nu }\right)}^{-(\nu +1)}$$

(4)

where ν is the degrees of freedom in the t-distribution, ν^st in ${{{{{{{{\bf{u}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}| {{{{{{{\bf{i}}}}}}}}}^{{{{{{{{\bf{st}}}}}}}}}$ is set to 100, ν^vi in ${{{{{{{{\bf{u}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}| {{{{{{{\bf{i}}}}}}}}}^{{{{{{{{\bf{vi}}}}}}}}}$ is a hyperparameter, and

$${C}_{\nu }=2\pi {\left(\frac{\Gamma \left(\frac{\nu +1}{2}\right)}{\sqrt{\nu \pi }\Gamma \left(\frac{\nu }{2}\right)}\right)}^{2}$$

(5)

is the normalizing function of v.

Structure module

Especially, given that the observed UMI counts of cells are sparse, the relationship between cells is difficult to be defined by vector similarity (e.g., Euclidean distance) directly. Therefore, to estimate the local geometries of the underlying topological manifold and construct a more reliable G_structure to describe the relationship between cells, we make a local scale contraction for Euclidean distance between structure embedding ${{{{{{{{\bf{z}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}^{{{{{{{{\bf{st}}}}}}}}}$ of each cell and structure embedding ${\hat{{{{{{{{\bf{z}}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}^{{{{{{{{\bf{st}}}}}}}}}$ of its corresponding augmented cells for G_structure, and the corresponding ${{{{{{{{\bf{u}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}| {{{{{{{\bf{i}}}}}}}}}^{{{{{{{{\bf{st}}}}}}}}}$ is redefined as:

$${{{{{{{{\bf{u}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}| {{{{{{{\bf{i}}}}}}}}}^{{{{{{{{\bf{st}}}}}}}}}\left(\nu \right)=g\left(D\left({{{{{{{{\bf{z}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}^{{{{{{{{\bf{st}}}}}}}}},{\hat{{{{{{{{\bf{z}}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}^{{{{{{{{\bf{st}}}}}}}}}\right)| \nu \right)={C}_{\nu }{\left(1+\frac{{D}_{{\mathbb{E}}}\left({{{{{{{{\bf{z}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}^{{{{{{{{\bf{st}}}}}}}}},{\hat{{{{{{{{\bf{z}}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}^{{{{{{{{\bf{st}}}}}}}}}\right)/\gamma }{\nu }\right)}^{-(\nu +1)}$$

(6)

where γ is the local scale contraction coefficient.

Batch correction module

In addition, to remove batch effect problem simultaneously, we introduce a priori batch effect graph G_batch constructed on batch categorical variable vector based on Euclidean distance ${D}_{{\mathbb{E}}}({\tilde{{{{{{{{\bf{y}}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}},{\tilde{{{{{{{{\bf{y}}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}})$ merged with G_{visualization} in the training process (Fig. 1b). Therefore, the ${{{{{{{{\bf{u}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}| {{{{{{{\bf{i}}}}}}}}}^{{{{{{{{\bf{vi}}}}}}}}}$ is redefined as:

$${{{{{{{{\bf{u}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}| {{{{{{{\bf{i}}}}}}}}}^{{{{{{{{\bf{vi}}}}}}}}}\left(\nu \right)=g(D({\tilde{{{{{{{{\bf{z}}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}^{{{{{{{{\bf{vi}}}}}}}}},{\tilde{{{{{{{{\bf{z}}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}}^{{{{{{{{\bf{vi}}}}}}}}})| \nu )={C}_{\nu }{\left(1+\frac{D({\tilde{{{{{{{{\bf{z}}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}^{{{{{{{{\bf{vi}}}}}}}}},{\tilde{{{{{{{{\bf{z}}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}}^{{{{{{{{\bf{vi}}}}}}}}})+\beta \cdot {D}_{{\mathbb{E}}}({\tilde{{{{{{{{\bf{y}}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}},{\tilde{{{{{{{{\bf{y}}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}})}{\nu }\right)}^{-(\nu +1)}$$

(7)

where β represents the importance of G_batch.

Poincaré ball and Lorentz model of the hyperbolic space

A Riemannian manifold $({\mathbb{M}},g)$ is a real and smooth manifold equipped with an inner product ${g}_{{{{{{{{\bf{x}}}}}}}}}:{{\mathbb{T}}}_{{{{{{{{\bf{x}}}}}}}}}{\mathbb{M}}\times {{\mathbb{T}}}_{{{{{{{{\bf{x}}}}}}}}}{\mathbb{M}}\to {\mathbb{R}}$ at each point ${{{{{{{\bf{x}}}}}}}}\in {\mathbb{M}}$, which is called a Riemannian metric and allows us to define the geometric properties of a space such as angles and the length of a curve. We introduce two commonly used hyperbolic manifolds compared with the Euclidean manifold:

The Euclidean manifold is a manifold with zero curvature. The metric tensor is defined as ${g}^{{\mathbb{E}}}={{{{{{{\rm{diag}}}}}}}}([1,1,\ldots ,1])$. The closed-form distance, i.e., the length of the geodesic, which is a straight line in Euclidean space, between two points is given as:

$${D}_{{\mathbb{E}}}({{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}},{{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}})=\sqrt{{({{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}-{{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}})}^{T}({{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}-{{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}})}$$

(8)

The exponential map of the Euclidean manifold is defined as:

$${\exp }_{{{{{{{{\bf{x}}}}}}}}}({{{{{{{\bf{v}}}}}}}})={{{{{{{\bf{x}}}}}}}}+{{{{{{{\bf{v}}}}}}}}$$

(9)

The Poincaré ball model with constant negative curvature − K(K > 0) corresponding to the Riemannian manifold $({\mathbb{P}},{g}_{{{{{{{{\bf{x}}}}}}}}}^{{\mathbb{P}}})$, where ${\mathbb{P}}=\left\{{{{{{{{\bf{x}}}}}}}}\in {{\mathbb{R}}}^{d}: \parallel \!\!{{{{{{{\bf{x}}}}}}}}\!\! \parallel < \frac{1}{K}\right\}$ is an open ball. The metric tensor is defined as ${g}_{{{{{{{{\bf{x}}}}}}}}}^{{\mathbb{P}}}={\left({\lambda }_{{{{{{{{\bf{x}}}}}}}}}^{K}\right)}^{2}{g}^{{\mathbb{E}}}$, where ${\lambda }_{{{{{{{{\bf{x}}}}}}}}}^{K}=\frac{2}{1+K{\left\Vert {{{{{{{\bf{x}}}}}}}}\right\Vert }^{2}}$ is the conformal factor and g^E is the Euclidean metric tensor. The origin of ${\mathbb{P}}$ is ${{{{{{{\bf{o}}}}}}}}=(0,...,0)\in {{\mathbb{R}}}^{d}$. The distance between two points ${{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}},{{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}}\in {\mathbb{P}}$ is given as:

$${D}_{{\mathbb{P}}}({{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}},{{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}})=\frac{1}{\sqrt{K}}{{{{{{{\rm{arcosh}}}}}}}}\left(1+2K\frac{{\Vert {{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}-{{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}}\Vert }^{2}}{(1-K{\Vert {{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}\Vert }^{2})(1-K{\Vert {{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}}\Vert }^{2})}\right)$$

(10)

For any point ${{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}\in {\mathbb{P}}$, the exponential map ${\exp }_{{{{{{{{\rm{x}}}}}}}}}:{{\mathbb{T}}}_{{{{{{{{\rm{x}}}}}}}}}{\mathbb{P}}\to {\mathbb{P}}$ is defined for the tangent vector v ≠ 0 and the point x_j ≠ 0 as:

$${\exp }_{{{{{{{{\bf{x}}}}}}}}}^{K}({{{{{{{\bf{v}}}}}}}})={{{{{{{\bf{x}}}}}}}}{\oplus }_{K}\left(\tanh \left(\frac{\sqrt{K}{\lambda }_{{{{{{{{\bf{x}}}}}}}}}^{K}\parallel \!\!{{{{{{{\bf{v}}}}}}}}\!\! \parallel }{2}\right)\frac{{{{{{{{\bf{v}}}}}}}}}{\sqrt{K}\parallel \!\!{{{{{{{\bf{v}}}}}}}}\!\! \parallel }\right)$$

(11)

where ⊕ _K is the Möbius addition for any ${{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}},{{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}}\in {\mathbb{P}}$:

$${{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}{\oplus }_{K}{{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}}=\frac{(1+2K\langle {{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}},{{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}}\rangle -K{\Vert {{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}}\Vert }^{2}){{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}+(1-K{\Vert {{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}\Vert }^{2}){{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}}}{1+2K\langle {{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}},{{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}}\rangle +{K}^{2}{\Vert {{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}}\Vert }^{2}{\Vert {{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}}\Vert }^{2}}$$

(12)

The Lorentz model avoids numerical instabilities that may arise with the Poincaré distance (mostly due to the division). Let ${{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}},{{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}}\in {{\mathbb{R}}}^{d+1}$, then the Lorentzian scalar product is defined as:

$${\langle {{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}},{{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}}\rangle }_{{\mathbb{L}}}=-{({{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}})}_{0}{({{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}})}_{0}+\mathop{\sum }\limits_{i=1}^{d}{({{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}})}_{d}{({{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}})}_{d}$$

(13)

The Lorentz model with constant negative curvature − K(K > 0) corresponding to the Riemannian manifold $\left({\mathbb{L}},{g}_{{{{{{{{\bf{x}}}}}}}}}^{{\mathbb{L}}}\right)$, where ${\mathbb{L}}=\left\{{{{{{{{\bf{x}}}}}}}}\in {{\mathbb{R}}}^{d+1}:{\langle {{{{{{{\bf{x}}}}}}}},{{{{{{{\bf{x}}}}}}}}\rangle }_{{\mathbb{L}}}=-1,{{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{0}}}}}}}}} \, > \, 0\right\}$ and where ${g}^{{\mathbb{L}}}={{{{{{{\rm{diag}}}}}}}}([-1,1,\ldots ,1])$. The induced distance function is given as:

$${D}_{{\mathbb{L}}}({{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}},{{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}})=\sqrt{K}{{{{{{{\rm{arcosh}}}}}}}}(-{\langle {{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{i}}}}}}}}},{{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{j}}}}}}}}}\rangle }_{{\mathbb{L}}}/K)$$

(14)

The exponential map ${\exp }_{{{{{{{{\bf{x}}}}}}}}}:{{\mathbb{T}}}_{{{{{{{{\bf{x}}}}}}}}}{\mathbb{L}}\to {\mathbb{L}}$ is defined as:

$${\exp }_{{{{{{{{\bf{x}}}}}}}}}^{K}({{{{{{{\bf{v}}}}}}}})=\cosh \left(\frac{{\left\Vert {{{{{{{\bf{v}}}}}}}}\right\Vert }_{{\mathbb{L}}}}{\sqrt{K}}\right){{{{{{{\bf{x}}}}}}}}+\sqrt{K}\sinh \left(\frac{{\left\Vert {{{{{{{\bf{v}}}}}}}}\right\Vert }_{{\mathbb{L}}}}{\sqrt{K}}\right)\frac{{{{{{{{\bf{v}}}}}}}}}{{\left\Vert {{{{{{{\bf{v}}}}}}}}\right\Vert }_{{\mathbb{L}}}}$$

(15)

where ${\left\Vert {{{{{{{\bf{v}}}}}}}}\right\Vert }_{{\mathbb{L}}}=\sqrt{{\langle {{{{{{{\bf{v}}}}}}}},{{{{{{{\bf{v}}}}}}}}\rangle }_{{\mathbb{L}}}}$. The origin, i.e., the zero vector in Euclidean space and the Poincaré ball, is equivalent to (1, 0, . . . , 0) in the Lorentz model. The point ${\left({{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{0}}}}}}}}},{{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{1}}}}}}}}},\ldots ,{{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{n}}}}}}}}}\right)}^{T}$ between Poincaré ball and Lorentz model can be conveniently converted:

$${p}_{{\mathbb{L}}\to {\mathbb{P}}}\left({{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{0}}}}}}}}},{{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{1}}}}}}}}},\ldots ,{{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{n}}}}}}}}}\right)=\frac{\left({{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{1}}}}}}}}},\ldots ,{{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{n}}}}}}}}}\right)}{\sqrt{K}{{{{{{{{\bf{x}}}}}}}}}_{{{{{{{{\bf{0}}}}}}}}}+1}$$

(16)

Model structure

We use the Leaky rectified linear units (LeakyReLU) activation functions for hidden layers. The gradient can also be calculated where the input of LeakyReLU activation function is less than zero during the backpropagation process, avoiding the jagged problem in the direction of the gradient. Meanwhile, we use batch normalization (BN) for hidden layers. In the training process of DNNs, the input of each layer can keep the same distribution.

For all experiments, we use a six-layered neural network, including a structure module (d/50-500-300-100) and a visualization module (100-300-100-3/2). The dimensionality of the visualization embedding layer was typically 2/3 for visualization purposes. When comparing DV based on different dimensions of latent spaces (e.g., 2/3, 5, 10 and 20), we keep all other factors the same. We use the Adam stochastic optimization algorithm and train model for 300 epochs. For our current implementation, we do not introduce an early stopping but train DV for a given number of epochs. We can obtain a good embedding when we only train the model for 50 epochs. We run all the experiments using a Ubuntu server and a single V100 GPU with 32GB memory.

Choices of hyperparameters

To ensure that each method achieves its optimal performance, we use the grid search method to find the optimal hyperparameter. For t-SNE, the search space of “perplexity” is {5, 10, 15, 20, 30, 40}. For UMAP, the search space of “min_dist” is {0.1, 0.3, 0.5, 0.7, 0.9}. For IVIS, the search space of “k” is {2, 4, 8, 16, 32, 64}. For Poin_maps, the search space of “knn” is {15, 20, 30}, “sigma” is {1, 2} and “gamma” is {1, 2}. For PHATE, we use the default hyperparameter settings. For scPhere, we use the default hyperparameter setting according to the released code.

In the following, we discuss the function of different hyperparameters in DV and propose typical value ranges. The “learning rate” adjusts the objective function converge to the local minimum in the proper time and the search space is typically set to {1e⁻³, 5e⁻³}. The “batch size” improves the memory utilization and the accuracy of gradient descent direction, and the search space is typically set to {500, 1000, 2000}. The “v^vi” controls the magnitude of edge weight in G_{visualization} and the search space is typically set to {1e⁻³, 5e⁻³, 1e⁻²}. The “γ” controls the local scale contraction level between each cell and its corresponding augmented cells and the search space is typically set to {10, 1000, 1e⁵}. The “β” controls how much batch information need to remove in loss function and the search space is typically set to {1e⁻², 1, 100}. The hyperparameters used by DV_Eu, DV_Poin and DV_Lor for each experimental dataset are shown in Supplementary Tabs. 1-3, respectively.

Quantifying global/hierarchical and local structure preservation

To quantitatively compare the performance of different visualization methods, we use the scale-independent quality criteria proposed by Lee and Verleysen⁴⁶ following Poin_maps²¹. The main idea of this method is that a good dimensionality reduction method should have good preservation of local and global distances on the manifold, e.g., close neighbors should be placed close to each other while maintaining large distances between distant points. Therefore, they proposed to use two scalar quality criteria Q_local and Q_global focusing separately on low- and high-dimensional qualities of the embeddings. The quantities of Q_local and Q_global range from 0 (bad) to 1 (good) and reflect how well are local and global properties of the dataset are preserved in the embeddings. To estimate distances in the high-dimensional space, we use Euclidean distances estimated as the length in a full-connected graph. For Poin_maps, we follow its released code and use the geodesic distances estimated as the length of a shortest-path in a k-nearest neighbors graph. For distances in the low-dimensional space, we use Euclidean distances except DV_Poin, DV_Lor, and Poin_maps methods, for which we use hyperbolic distances. Furthermore, for datasets with batch effect problem, we evaluate the geometric structure preservation performance of each batch separately, calculate the Q_local and Q_global between the input data without batch correction and output embeddings, and use the boxplots for statistics.

To learn a DV model that is invariant to the batch vectors and can be used to map cells from completely new batches, we use DV model to map a gene expression vector to the low-dimensional representation directly without using the batch vector as an input to the model when training the DV model. The batch vector is only used in the objective function that takes both the latent representation of a cell and its cell batch vector to construct the latent geometric structure to preserve the semantic geometric structure during training DV. We call this modality of DV with no batch vectors for the “batch-invariant” DV, as it learns latent representations that are invariant to the batch vectors.

Statistics and reproducibility

The details about experimental design and statistics used in different data analyses performed in this study are given in the respective sections of results and methods.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

We use publicly available datasets in this study (GEO: GSE126954³⁵, GSE119562³⁸, GSE130148³⁶, GSE111588³⁷, GSE137400³⁹, GSE134355⁴⁰, GSE126954⁴¹; Single Cell Portal: SCP259, SCP551. To make the results presented in this study reproducible, all processed data are available in Single Cell Portal SCP1873.

Code availability

The DV software package, implemented in Pyotrch, is available free from https://github.com/Westlake-AI/DV, and as a Supplementary Software 1 accompanying this manuscript.

References

Stegle, O., Teichmann, S. A. & Marioni, J. C. Computational and analytical challenges in single-cell transcriptomics. Nat. Rev. Genet. 16, 133–145 (2015).
Article CAS PubMed Google Scholar
Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
Article CAS PubMed PubMed Central Google Scholar
Luecken, M. D. & Theis, F. J. Current best practices in single-cell rna-seq analysis: a tutorial. Mol. Syst. Biol. 15, e8746 (2019).
Article PubMed PubMed Central Google Scholar
Roweis, S. T. & Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326 (2000).
Article CAS PubMed Google Scholar
Belkin, M. & Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15, 1373–1396 (2003).
Article Google Scholar
Hinton, G. & Roweis, S. T. Stochastic neighbor embedding. In NIPS, vol. 15, 833-840 (Citeseer, 2002).
Wold, S., Esbensen, K. & Geladi, P. Principal component analysis. Chemometr. Intell. Lab. Syst. 2, 37–52 (1987).
Article CAS Google Scholar
Van der Maaten, L. & Hinton, G. Visualizing data using t-sne. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Becht, E. et al. Dimensionality reduction for visualizing single-cell data using umap. Nat. Biotechnol. 37, 38–44 (2019).
Article CAS Google Scholar
Sainburg, T., McInnes, L. & Gentner, T. Q. Parametric umap embeddings for representation and semisupervised learning. Neural Comput. 33, 2881–2907 (2021).
PubMed PubMed Central Google Scholar
Li, S. Z., Zang, Z. & Wu, L. Markov-lipschitz deep learning. arXiv preprint arXiv:2006.08256 (2020).
Li, S. Z., Zang, Z. & Wu, L. Deep manifold transformation for dimension reduction. arXiv preprint arXiv:2010.14831 (2020).
Zang, Z. et al. Evnet: An explainable deep network for dimension reduction. IEEE Transactions on Vis. & Comput. Graph. 1–18 (2022).
Zang, Z. et al. Udrn: unified dimensional reduction neural network for feature selection and feature projection. Neural Networks 161, 626–637 (2023).
Szubert, B., Cole, J. E., Monaco, C. & Drozdov, I. Structure-preserving visualisation of high dimensional single-cell datasets. Sci. Rep. 9, 1–10 (2019).
Article CAS Google Scholar
Ding, J., Condon, A. & Shah, S. P. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat. Commun. 9, 1–13 (2018).
Article Google Scholar
Ding, J. & Regev, A. Deep generative model embedding of single-cell rna-seq profiles on hyperspheres and hyperbolic spaces. Nat. Commun. 12, 1–17 (2021).
Article Google Scholar
Tenenbaum, J. B., De Silva, V. & Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323 (2000).
Article CAS PubMed Google Scholar
Coifman, R. R. & Lafon, S. Diffusion maps. Appl. Comput Harmonic Anal. 21, 5–30 (2006).
Article Google Scholar
Moon, K. R. et al. Visualizing structure and transitions in high-dimensional biological data. Nat. Biotechnol. 37, 1482–1492 (2019).
Article CAS PubMed PubMed Central Google Scholar
Klimovskaia, A., Lopez-Paz, D., Bottou, L. & Nickel, M. Poincaré maps for analyzing complex hierarchies in single-cell data. Nat. Commun. 11, 1–9 (2020).
Article Google Scholar
Kobak, D. & Berens, P. The art of using t-sne for single-cell transcriptomics. Nat. Commun. 10, 1–14 (2019).
Article CAS Google Scholar
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Article CAS PubMed Google Scholar
Zang, Z. et al. Dlme: Deep local-flatness manifold embedding. In European Conference on Computer Vision, 576-592 (Springer, 2022).
Wang, D. & Gu, J. Vasc: dimension reduction and visualization of single-cell rna-seq data by deep variational autoencoder. Genomics, Proteom. Bioinforma. 16, 320–331 (2018).
Article Google Scholar
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. methods 15, 1053–1058 (2018).
Article CAS PubMed PubMed Central Google Scholar
Peng, W., Varanka, T., Mostafa, A., Shi, H. & Zhao, G. Hyperbolic deep neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with harmony. Nat. methods 16, 1289–1296 (2019).
Article CAS PubMed PubMed Central Google Scholar
Welch, J. D. et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887 (2019).
Article CAS PubMed PubMed Central Google Scholar
Haghverdi, L., Lun, A. T., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell rna-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
Article CAS PubMed PubMed Central Google Scholar
Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous single-cell transcriptomes using scanorama. Nat. Biotechnol. 37, 685–691 (2019).
Article CAS PubMed PubMed Central Google Scholar
Amodio, M. et al. Exploring single-cell data with deep multitasking neural networks. Nat. Methods 16, 1139–1145 (2019).
Article CAS PubMed Google Scholar
Barkas, N. et al. Joint analysis of heterogeneous single-cell rna-seq dataset collections. Nat. Methods 16, 695–698 (2019).
Article CAS PubMed PubMed Central Google Scholar
Cannon, J. W., Floyd, W. J., Kenyon, R. & Parry, W. R. et al. Hyperbolic geometry. Flavors Geom. 31, 2 (1997).
Google Scholar
Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865–868 (2017).
Article CAS PubMed PubMed Central Google Scholar
Vieira Braga, F. A. et al. A cellular census of human lungs identifies novel cell states in health and in asthma. Nat. Med. 25, 1153–1163 (2019).
Article CAS PubMed Google Scholar
Hepler, C. et al. Identification of functionally distinct fibro-inflammatory and adipogenic stromal subpopulations in visceral adipose tissue of adult mice. Elife 7, e39636 (2018).
Article PubMed PubMed Central Google Scholar
Crinier, A. et al. High-dimensional single-cell analysis identifies organ-specific signatures and conserved nk cell subsets in humans and mice. Immunity 49, 971–986 (2018).
Article CAS PubMed PubMed Central Google Scholar
Tran, N. M. et al. Single-cell profiles of retinal ganglion cells differing in resilience to injury reveal neuroprotective genes. Neuron 104, 1039–1055 (2019).
Article CAS PubMed PubMed Central Google Scholar
Han, X. et al. Construction of a human cell landscape at single-cell level. Nature 581, 303–309 (2020).
Article CAS PubMed Google Scholar
Packer, J. S. et al. A lineage-resolved molecular atlas of c. elegans embryogenesis at single-cell resolution. Science 365, eaax1971 (2019).
Article CAS PubMed PubMed Central Google Scholar
Smillie, C. S. et al. Intra-and inter-cellular rewiring of the human colon during ulcerative colitis. Cell 178, 714–730 (2019).
Article CAS PubMed PubMed Central Google Scholar
Wolf, F. A., Angerer, P. & Theis, F. J. Scanpy: large-scale single-cell gene expression data analysis. Genome Biol. 19, 1–5 (2018).
Article Google Scholar
Gayoso, A. et al. A python library for probabilistic analysis of single-cell omics data. Nat. Biotechnol. 40, 163–166 (2022).
Article CAS PubMed Google Scholar
Pedregosa, F. et al. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Google Scholar
Lee, J. A. & Verleysen, M. Scale-independent quality criteria for dimensionality reduction. Pattern Recognit. Lett. 31, 2248–2257 (2010).
Article Google Scholar

Download references

Acknowledgements

This work is supported in part by the Science and Technology Innovation 2030 - Major Project (No. 2021ZD0150100) and National Natural Science Foundation of China (No. U21A20427).

Author information

Authors and Affiliations

Zhejiang University, Hangzhou, 310058, China
Yongjie Xu, Zelin Zang, Jun Xia & Cheng Tan
AI Division, School of Engineering, Westlake University, Hangzhou, 310024, China
Yongjie Xu, Zelin Zang, Jun Xia, Cheng Tan, Yulan Geng & Stan Z. Li

Authors

Yongjie Xu
View author publications
You can also search for this author in PubMed Google Scholar
Zelin Zang
View author publications
You can also search for this author in PubMed Google Scholar
Jun Xia
View author publications
You can also search for this author in PubMed Google Scholar
Cheng Tan
View author publications
You can also search for this author in PubMed Google Scholar
Yulan Geng
View author publications
You can also search for this author in PubMed Google Scholar
Stan Z. Li
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

S.Z.L. proposed this research, Y.X. Z.Z., and S.Z.L. developed the method, Y.X., J.X., C.T., and Y.G. collected the datasets, Y.X. conceived the experiment and wrote the manuscript with guidance from S.Z.L., and Y.X. conducted the experiment and analyzed the results. All authors discussed the results, revised the draft manuscript, and read and approved the final manuscript.

Corresponding author

Correspondence to Stan Z. Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Communications Biology thanks Mengwei Li and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Mireya Plass and Manuel Breuer. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Peer Review File

Supplementary Information

Description of Additional Supplementary Files

Supplementary Data 1

Supplementary Data 2

Supplementary Data 3

Supplementary Data 4

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Xu, Y., Zang, Z., Xia, J. et al. Structure-preserving visualization for single-cell RNA-Seq profiles using deep manifold transformation with batch-correction. Commun Biol 6, 369 (2023). https://doi.org/10.1038/s42003-023-04662-z

Download citation

Received: 03 July 2022
Accepted: 06 March 2023
Published: 04 April 2023
DOI: https://doi.org/10.1038/s42003-023-04662-z

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.