A topology-preserving dimensionality reduction method for single-cell RNA-seq data using graph autoencoder

Dimensionality reduction is crucial for the visualization and interpretation of high-dimensional single-cell RNA sequencing (scRNA-seq) data. However, preserving the topological structure among cells in low-dimensional space remains a challenge. Here, we present the single-cell graph autoencoder (scGAE), a dimensionality reduction method that preserves topological structure in scRNA-seq data. scGAE builds a cell graph and uses a multitask-oriented graph autoencoder to simultaneously preserve the topological structure information and feature information of scRNA-seq data. We further extended scGAE for scRNA-seq data visualization, clustering, and trajectory inference. Analyses of simulated data showed that scGAE accurately reconstructs developmental trajectories and separates discrete cell clusters under different scenarios, outperforming recently developed deep learning methods. Furthermore, application of scGAE to empirical data showed that scGAE provides novel insights into cell developmental lineages and preserves inter-cluster distances.

Single-cell RNA sequencing (scRNA-seq) is an ideal approach for investigating cell-to-cell variation. Conventional dimensionality reduction techniques, such as principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) 1 , have been applied to scRNA-seq data for visualization and downstream analyses, significantly increasing our understanding of cellular heterogeneity and developmental processes. The recent emergence of massively parallel scRNA-seq, such as droplet platforms, has enabled interrogation of millions of cells in complex biological systems [2][3][4][5] , offering great potential for dissecting tissue and cellular microenvironments, identifying rare or new cell types, inferring developmental lineages, and elucidating the mechanisms of cellular responses to stimulation 6 . However, the data generated by massively parallel scRNA-seq suffer from high dropout rates and high noise with complex structure, which poses a series of challenges for dimensionality reduction. In particular, preserving the complex topological structure among cells is a major challenge.
Many dimensionality reduction methods have been developed or adapted for scRNA-seq data analyses in the past several years. Recently developed competitive methods include DCA 7 , scVI 8 , scDeepCluster 9 , PHATE 10 , SAUCIE 11 , scGNN 12 , ZINB-WaVE 13 and Ivis 14 . Among them, deep learning methods showed the greatest potential. For instance, DCA, scDeepCluster, Ivis, and SAUCIE adapted the autoencoder to denoise, visualize, and cluster scRNA-seq data. However, these deep learning-based models embedded only the distinct cell features while ignoring cell-cell relationships, which limited their ability to reveal the complex topological structure among cells and made it difficult for them to elucidate developmental trajectories. The recently proposed graph autoencoder 15 is promising because it preserves long-distance relationships among data points in a latent space. In this study, we developed the single-cell graph autoencoder (scGAE), which improves the graph autoencoder to preserve the global topological structure among cells. We further extended scGAE for visualization, trajectory inference, and clustering. Analyses of simulated and empirical data showed that scGAE outperformed the other competitive methods.

Results
The model architecture of scGAE. scGAE combines the advantages of a deep autoencoder and a graphical model to embed the topological structure of high-dimensional scRNA-seq data into a low-dimensional space (Fig. 1). After obtaining the normalized count matrix, scGAE builds the adjacency matrix among cells with the K-nearest-neighbor algorithm. The encoder maps the count matrix to a low-dimensional latent space by graph attentional layers 16 . scGAE decodes the embedded data with a feature decoder and a graph decoder. The feature decoder reconstructs the count matrix to preserve the feature information; the graph decoder recovers the adjacency matrix to preserve the topological structure information. It decodes the embedded data to spaces with the same dimensions as the original data by minimizing the distance between the input data and the reconstructed data (see "Methods"). We use deep clustering to learn the data embedding and perform cluster assignment simultaneously 17 , generating a clustering-friendly latent representation (Supplementary Fig. S1). The implementation and usage of scGAE can be found on GitHub: https://github.com/ZixiangLuo1161/scGAE.

Visualization of scGAE embedded data and comparison to other methods. To systematically evaluate the performance of scGAE, we summarized four representative scenarios (scenario1: cells in continuous differentiation lineages; scenario2: cells in differentiation lineages where cells concentrate at the center of each branch; scenario3: distinct cell populations with apparent differences; and scenario4: distinct cell populations with small population differences) (Fig. 2 left). We used Splatter 18 and PROSSTT 19 to simulate scRNA-seq data in the four scenarios. For scGAE, the data were visualized by tSNE after being projected to a latent space. Compared with other methods, scGAE better captured the complex structures in the data (Fig. 2). In scenario1 and scenario2, scGAE almost entirely reproduced the differentiation lineages (Fig. 2a,b), while other methods only revealed some local structures and failed to exhibit the overall structure of the simulated data. The results of tSNE and SAUCIE exhibited distinct clusters but lost the lineage relationships in scenario2. In scenario3 and 4, scGAE almost perfectly preserved the compact cell clusters and inter-cluster distances in the simulated data, while the clusters inferred by other methods were dispersed and the topological structure among these clusters was not preserved (Fig. 2c,d). Only scGAE separated all the clusters, while the other methods mixed different types of cells when the differences between clusters were small (Fig. 2d). Based on these observations, scGAE faithfully reproduced the differentiation lineages and distinct clusters in the simulated data, indicating that scGAE outperforms other competitive methods in restoring the relationships between cells.
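As a hedged illustration of the visualization step described above (scGAE embeds cells into an intermediate latent space, and tSNE produces the 2-D view), the sketch below uses a random matrix as a stand-in for a scGAE latent embedding; it is not the authors' implementation.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for a 10-dimensional scGAE latent embedding of 300 cells
h = rng.normal(size=(300, 10))

# Project the latent embedding to 2-D for plotting, as done after scGAE
emb2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(h)
```

In the actual pipeline, `h` would come from the trained graph-attention encoder rather than a random generator.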
Trajectory inference and cell clustering based on scGAE embedded data. We further quantitatively evaluated the performance of scGAE on trajectory inference tasks. scGAE and the other competitive methods were used to perform dimensionality reduction on the developmental lineage data simulated by PROSSTT (scenario1 and 2). We conducted trajectory inference on these embedded data using DPT 20 . The Kendall correlation coefficient 21 between the inferred trajectories and the ground truth was calculated to measure their similarity. Because scDeepCluster is a clustering method, we did not include it in the trajectory inference tasks. The results showed that scGAE, scGNN, and scVI recovered the original trajectory better than the other competitive methods in both scenario1 and 2 (Fig. 3a,b). Compared with scenario1, the data are not uniformly distributed along the developmental trajectory in scenario2. Most methods had a lower Kendall correlation there, but the two graph neural network-based methods and scVI still performed well, suggesting that a graph-based structure can preserve the relationships among data well. Next, we evaluated the performance of scGAE and other competitive methods on cell clustering tasks with data simulated by Splatter (scenario3 and 4). We performed Louvain clustering on these embedded data. Normalized mutual information (NMI) was used to measure the difference between the inferred clusters and the ground truth. The results showed that scGAE was the best among these methods (Fig. 3c,d, Supplementary Fig. S2). Although scVI, ZINB-WaVE, and scGNN performed well for trajectory inference (Fig. 3a,b), they got low scores in the cell clustering task (Fig. 3c,d). The inconsistency between the data structure imposed by existing methods and the simulated data structure might contribute to the differences in performance. Some methods, such as scGAE, make no prior assumptions about the data, which may facilitate their performance in all cases.
Also, different data preprocessing approaches might affect the results. For methods that take normalized data as input, we normalized the data using the Seurat R package, while the three methods whose performance dropped the most accept only raw data as input. Moreover, when the data are noisy, scGAE outperforms these three methods in low-dimensional cell clustering. This may be because scGAE optimizes the clustering and the latent representation simultaneously, in one shot.
It indicates that the cells in Ba/Eo/MaP may have different differentiation potentials at an early phase, while the other competitive methods did not identify the subpopulations in Ba/Eo/MaP (Supplemental Figs. S3a, S4a), supporting that scGAE has the highest statistical power to identify substructure in scRNA-seq data.

scGAE preserved topological structure among human pancreatic cell populations. The function of the pancreas hinges on complex interactions among distinct cell types and cell populations. We re-analyzed the scRNA-seq data of human pancreatic cells from Baron et al. 28 . Although the pancreatic cell subpopulations identified by scGAE are the same as in the original study, we found that the distances and topological structures among cell types inferred by scGAE better fit our knowledge (Fig. 4c). For instance, activated stellate and quiescent stellate cells show similar expression profiles and phenotypes 29 ; scGAE revealed the close relationship between these two cell populations better than the other methods (Fig. 4d and Supplemental Figs. S3b, S4b). scGAE also preserved the short distance between the two ductal subtypes, while some methods, including tSNE, projected them farther apart. Moreover, scGAE clearly separated the other cell populations, while SAUCIE, Ivis, and PHATE mixed some of the clusters. Overall, scGAE preserved the topological structure among different cell populations, which greatly benefits our understanding of cellular relationships.

Discussion
Because of the high noise of scRNA-seq data and the complexity of cellular relationships, preserving the topological structure of scRNA-seq data in a low-dimensional space is still a challenge. We proposed scGAE, a promising topology-preserving dimensionality reduction method. It generates a low-dimensional representation that better preserves both the global and local structure of the high-dimensional scRNA-seq data. The key innovation of scGAE is to embed the structure information and feature information simultaneously using a multitask graph autoencoder. It is suitable for analyzing data in both lineages and clusters. The learned latent representation benefits various downstream analyses, including clustering, trajectory inference, and visualization. The analyses of both simulated data and empirical data suggested that scGAE accurately preserves the topological structure of the data. scGNN 12 is another tool that utilizes a graph autoencoder for dimensionality reduction of single-cell RNA-seq data. scGAE is designed to perform dimensionality reduction while remaining friendly to further clustering and trajectory inference. scGNN is designed to perform multiple tasks, modeling heterogeneous cell-cell relationships and their underlying complex gene expression patterns. It consists of four types of autoencoders with appropriate regularizations and iterations among these autoencoders. From the performance perspective, scGAE and scGNN perform similarly on trajectory inference, while scGAE performs better on clustering. From the computational perspective, the running time of scGAE is much shorter than that of scGNN, and its memory cost is slightly lower. This is due to the iterative process in scGNN, which is more time-consuming and requires more computational resources.
As an early study adapting the graph autoencoder for dimensionality reduction of scRNA-seq data, this approach is likely to be significantly improved in the future. First, because the complex data structure is hard to embed directly into two-dimensional space by a graph autoencoder, we embedded the scRNA-seq data into an intermediate dimension and used tSNE to visualize the embedded data in two dimensions. However, tSNE focuses more on local information and sometimes fails to correctly recover the global structure, which may distort the topological structure in the data. A better visualization method is needed to preserve the topological structure of scRNA-seq data. Second, the graph in scGAE is constructed by the K-nearest neighbor (KNN) algorithm, which relies on a predefined parameter K. However, the optimal K varies among datasets and among different parts of a dataset. Constructing an optimal graph is challenging because of the difficulty in determining a suitable K, which is a potential direction for future work. Third, scGAE has a moderate time cost but a relatively high memory cost compared with statistical models and deep learning methods without graph-based layers (Supplementary Figs. S5-S7). This is caused by the recursive neighborhood expansion across layers in graph neural networks 30 . In the future, we will investigate more efficient architectures, such as GNNs with graph sampling 30 , to reduce the time and memory cost.

Methods
Joint graph autoencoder. The graph autoencoder is a type of artificial neural network for unsupervised representation learning on graph-structured data 15 . The graph autoencoder often has a low-dimensional bottleneck layer, so it can be used as a model for dimensionality reduction. Let the inputs be single-cell graphs with node matrices X and adjacency matrices A. In our joint graph autoencoder 31 , there is one encoder E for the whole graph and two decoders, D_X and D_A, for nodes and edges, respectively. In practice, we first encode the input graph into a latent variable h = E(X, A), and then we decode h into the reconstructed node matrix X_r = D_X(h) and the reconstructed adjacency matrix A_r = D_A(h). The objective of the learning process is to minimize the reconstruction loss

L_r = ||X − X_r||^2 + λ||A − A_r||^2,

where the weight λ is a hyper-parameter. In our experiments, λ is set to 0.6.
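A minimal numpy sketch of this joint reconstruction loss, assuming squared Frobenius norms for both terms and the stated default λ = 0.6 (the function name and toy matrices are illustrative, not from the scGAE codebase):

```python
import numpy as np

def reconstruction_loss(X, X_r, A, A_r, lam=0.6):
    """Joint loss: feature reconstruction plus weighted graph reconstruction."""
    feature_term = np.sum((X - X_r) ** 2)  # ||X - X_r||_F^2
    graph_term = np.sum((A - A_r) ** 2)    # ||A - A_r||_F^2
    return feature_term + lam * graph_term

X = np.ones((4, 3)); X_r = np.zeros((4, 3))  # ||X - X_r||_F^2 = 12
A = np.eye(4);       A_r = np.eye(4)         # perfect graph reconstruction
loss = reconstruction_loss(X, X_r, A, A_r)   # 12 + 0.6 * 0 = 12.0
```

In training, both terms would be minimized jointly by gradient descent over the encoder and the two decoders.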
We used the Python package Spektral 32 to implement our model. Many types of graph neural networks can be used as the encoder or decoder. Here, to extract the features of a node with the aid of its neighbors, we apply graph attention layers by default in the encoder. Other graph neural networks, such as GCN 33 , GraphSAGE 34 , and TAGCN 35 , can also be used as the encoder in scGAE. The feature decoder D_X is a four-layer fully connected neural network with 64, 256, and 512 nodes in its hidden layers.
The edge decoder consists of a fully connected layer followed by the composition of quadratization and activation:

A_r = σ(ZZ^T),

where Z = σ(Wh) is the output of a fully connected layer with weight matrix W, and σ(x) = max(0, x) is the rectified linear unit.
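A small numpy sketch of this inner-product ("quadratization") edge decoder; the weight matrix here is random and hypothetical, and rows of `h` are treated as latent points (so the linear map is written `h @ W` rather than `Wh`):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def edge_decoder(h, W):
    """Decode latent points h (cells x d) into a reconstructed adjacency:
    Z = relu(h W), then A_r = relu(Z Z^T)."""
    Z = relu(h @ W)
    return relu(Z @ Z.T)

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 3))  # 5 cells in a 3-d latent space
W = rng.normal(size=(3, 4))  # hypothetical fully connected weights
A_r = edge_decoder(h, W)     # symmetric, non-negative 5 x 5 matrix
```

The quadratic form ZZ^T makes the reconstructed adjacency symmetric by construction, matching an undirected cell graph.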
Deep-clustering embedding. Motivated by Yang et al. 36 , we use a two-stage method. The first stage pre-trains scGAE by minimizing L_r. The resulting neural network parameters are used to initialize the second stage, which we call alter-training. The loss function in the alter-training stage combines the reconstruction error L_r and a clustering cost L_c = L_c(h, µ):

L = L_r + γ L_c,

where µ is a collection of clustering centroids and γ is a hyper-parameter set to 2.5 in our experiments.
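The clustering cost used here is the soft-assignment KL loss detailed in the next paragraphs (a Student's-t assignment Q and a DEC-style target P). A numpy sketch under those standard definitions (not the authors' code):

```python
import numpy as np

def soft_assignment(h, mu):
    """Student's t-distribution Q: q_ij proportional to (1 + ||h_i - mu_j||^2)^-1."""
    d2 = ((h[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # squared distances
    q = 1.0 / (1.0 + d2)
    return q / q.sum(axis=1, keepdims=True)  # normalize over centroids

def target_distribution(q):
    """DEC-style target P: p_ij proportional to q_ij^2 / f_j, f_j = sum_i q_ij."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def clustering_cost(h, mu):
    """L_c = KL(P || Q), summed over cells and centroids."""
    q = soft_assignment(h, mu)
    p = target_distribution(q)
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(1)
h = rng.normal(size=(20, 2))                   # embedded points
mu = np.array([[0.0, 0.0], [3.0, 3.0]])        # two centroids
q = soft_assignment(h, mu)
p = target_distribution(q)
```

Squaring q sharpens high-confidence assignments, and dividing by the soft cluster frequency f_j keeps large clusters from dominating the loss.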
The alter-training consists of the following two steps, performed alternately: 1. Given a collection of clustering centroids µ, update the network parameters by minimizing L; 2. Compute the embedded data h using the updated network, and cluster in the embedded space to obtain new centroids µ. In experiments, we use the pre-trained network to generate the initial embedded data, which are clustered by Louvain 37 to obtain the initial centroids. There are various choices for the loss L_c and the clustering algorithm in the second step 17 . In practice, we compute the new centroids µ by minimizing L_c using stochastic gradient descent. A good choice of L_c is the soft assignment loss, the KL divergence of the empirical clustering assignment distribution Q from a target distribution P. This is motivated by t-SNE 1 , which uses a proper distribution Q in the low-dimensional space in order to inherit the clustering property from the high-dimensional space. Given an embedded point h_i and a centroid µ_j, Q is defined by the Student's t-distribution

q_ij = (1 + ||h_i − µ_j||^2)^(−1) / Σ_j' (1 + ||h_i − µ_j'||^2)^(−1).

An ideal target distribution should have the following properties: (1) improve cluster purity, (2) put more emphasis on data points assigned with high confidence, and (3) prevent large clusters from distorting the hidden feature space. In experiments, following DEC 38 , we choose P as

p_ij = (q_ij^2 / f_j) / Σ_j' (q_ij'^2 / f_j'), where f_j = Σ_i q_ij.

Evaluation metrics. Clustering results are measured by normalized mutual information (NMI) 39 . Given the ground-truth class assignment U and the clustering assignment V on n data points, NMI measures the agreement of the two assignments, ignoring permutations:

NMI(U, V) = 2 I(U, V) / (H(U) + H(V)),

where I(U, V) is the mutual information and H(·) is the entropy. Trajectory inference results are measured by the Kendall correlation coefficient. We define an order on the set of observations (x_1, y_1), (x_2, y_2), . . . , (x_n, y_n): any pair of observations (x_i, y_i) and (x_j, y_j), where i < j, is said to be concordant if either both x_i > x_j and y_i > y_j hold or both x_i < x_j and y_i < y_j hold; otherwise, the pair is said to be discordant. Denoting the number of concordant pairs by N_conco and the number of discordant pairs by N_discon, the Kendall correlation coefficient is defined as

τ = (N_conco − N_discon) / (n(n − 1)/2).

Data simulation. We simulated five scRNA-seq datasets using the Splatter R package (data1, data3, and data4) and the PROSSTT Python package (data2 and data5). The cells in data1 and data5 follow a linear distribution along the developmental trajectory. The cells in data2 follow a skewed distribution in which cells concentrate at the center of each branch. The cells in data3 and data4 form distinct clusters with moderate and small cluster differences, respectively. All datasets have 2000 cells and 5000 genes. Data1, data2, data3, and data4 were simulated for scenario1 to scenario4 for data visualization. Data5, data2, data3, and data4 were used to evaluate scGAE on trajectory inference and cell clustering tasks.
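The two evaluation metrics above, NMI and Kendall's τ, have standard library implementations; a small sketch with toy labels and pseudotimes (scikit-learn's NMI uses the arithmetic-mean normalization 2 I / (H(U) + H(V)) by default):

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import normalized_mutual_info_score

# Clustering: NMI between ground-truth labels and inferred clusters
truth = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 0, 0, 2, 2]  # same partition up to a label permutation
nmi = normalized_mutual_info_score(truth, pred)  # 1.0: permutations ignored

# Trajectory: Kendall correlation between true and inferred pseudotime
true_time = np.array([0.0, 0.1, 0.2, 0.3, 0.4])
inferred = np.array([0.05, 0.12, 0.19, 0.33, 0.41])  # same ordering
tau, _ = kendalltau(true_time, inferred)  # 1.0: all pairs concordant
```

In the evaluations above, `pred` would come from Louvain clustering and `inferred` from DPT pseudotime on the embedded data.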
Data preprocessing. The scRNA-seq data preprocessing was conducted using scTransform 40 in the Seurat package 41 . The pre-processed count matrix was used to construct the single-cell graph, where the nodes represent cells and the edges represent the relationships between cells. The cell graph is built by the K-nearest neighbor (KNN) algorithm 42 in the Scikit-learn Python package 43 . The default K is predefined as 35 in this study and adjusted according to the datasets in our experiments. The generated adjacency matrix is a 0-1 matrix, where 1 represents a connection and 0 represents no connection.
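The graph construction step maps directly onto scikit-learn's `kneighbors_graph`; a sketch with a toy matrix standing in for the preprocessed counts (the data here are random, not from the study):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
counts = rng.normal(size=(100, 50))  # toy stand-in: 100 cells x 50 genes

# Binary KNN adjacency: A[i, j] = 1 iff cell j is among cell i's K neighbors
K = 35
A = kneighbors_graph(counts, n_neighbors=K, mode="connectivity").toarray()
```

With `mode="connectivity"` the result is exactly the 0-1 adjacency matrix described above; `mode="distance"` would instead store edge lengths.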
Empirical scRNA-seq data. We analyzed two different scRNA-seq datasets, the HSPCs data and the pancreatic cells data, which represent cells in developmental lineages and cells in distinct clusters, respectively. The HSPCs data are single-cell transcriptomes of FACS-sorted CD34+ cells from human bone marrow mononuclear cells, accessible in the National Genomics Data Center (HRA000084) and described in our previous study 5 . The pancreatic cells data contain 10,000 single-cell transcriptomes with 14 distinct cell clusters, downloaded from GEO (GSE84133) 28 .
Competitive methods. Nine competitive methods, namely scDeepCluster, DCA, scVI, PCA, Ivis, SAUCIE, scGNN, ZINB-WaVE, and PHATE, were compared with scGAE. Among these methods, scDeepCluster, DCA, scVI, Ivis, scGNN, and SAUCIE are deep learning-based and showed the greatest potential. These methods usually generate hidden variables for downstream analyses, including visualization, clustering, and trajectory inference. The raw count matrix was used as input for DCA, scVI, scGNN, ZINB-WaVE, and scDeepCluster. For methods that take normalized data as input (scGAE, SAUCIE, PCA, Ivis, and PHATE), scTransform was used for data preprocessing. Each tool was run following its manual with default parameters. For SAUCIE, Ivis, and DCA, we first performed PCA to reduce the dimension to 100, 50, and 32 PCs, respectively. Ivis, SAUCIE, and PHATE directly generate two-dimensional embeddings, and cell clustering and trajectory inference were performed on these two-dimensional embeddings. scGNN and ZINB-WaVE generated 128- and 10-dimensional embeddings, respectively. Both scGAE and PCA embedded the simulated data into ten dimensions and the empirical data into 20 dimensions because of the complex structure of the empirical data. We performed tSNE to visualize the data for these methods.