Discovering cell types using manifold learning and enhanced visualization of single-cell RNA-Seq data

Identifying relevant disease modules such as target cell types is a significant step in studying diseases. High-throughput single-cell RNA-Seq (scRNA-seq) technologies have advanced in recent years, enabling researchers to investigate cells individually and understand their biological mechanisms. Computational techniques such as clustering are the most suitable approach in scRNA-seq data analysis when the cell types have not been well characterized. These techniques can be used to identify groups of cells that belong to a specific cell type based on their similar gene expression patterns. However, due to the sparsity and high dimensionality of scRNA-seq data, classical clustering methods are not efficient. Therefore, the use of non-linear dimensionality reduction techniques to improve clustering results is crucial. We introduce a method that identifies representative clusters of different cell types by combining non-linear dimensionality reduction techniques and clustering algorithms. We assess the impact of different dimensionality reduction techniques combined with clustering on thirteen publicly available scRNA-seq datasets of different tissues, sizes, and technologies. We further performed gene set enrichment analysis to evaluate the proposed method's performance. Our results show that modified locally linear embedding combined with independent component analysis yields overall the best performance relative to existing unsupervised methods across different datasets.


Materials and methods
The block diagram of the proposed pipeline is depicted in Fig. 1. First, the scRNA-seq data is pre-processed based on the number of cells and the number of genes. Highly variable genes are extracted as part of the feature selection step after normalization and scaling of the filtered data. Linear regression is one of the most widely used methods to regress out potential sources of variation present in the data, based on the total counts per cell and the mitochondrial percentage, as discussed in Refs. 11,14. The data obtained at this point is then processed to reduce the feature space to two or three dimensions; afterward, k-means clustering is applied. In addition, we performed ICA on the lower-dimensional data, followed by k-means clustering, to achieve improved visualization and meaningful clusters.
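The three computational steps of the pipeline (non-linear reduction, ICA, clustering) can be sketched with scikit-learn. This is a minimal illustration on a random stand-in matrix, not the exact configuration used in our experiments; the parameter values shown are placeholders:

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.decomposition import FastICA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy stand-in for a filtered, log-normalized expression matrix (cells x genes).
X = rng.normal(size=(300, 50))

# 1) Non-linear dimensionality reduction with modified LLE.
mlle = LocallyLinearEmbedding(n_neighbors=12, n_components=3, method="modified")
X_mlle = mlle.fit_transform(X)

# 2) ICA on the low-dimensional embedding for enhanced visualization.
X_ica = FastICA(n_components=2, random_state=0).fit_transform(X_mlle)

# 3) k-means clustering on the final representation.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_ica)
```

In the actual pipeline, the input would be the filtered, log-normalized HVG matrix, and the number of neighbors, clusters, and components are tuned per dataset (see Parameter Optimization).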

Datasets.
To evaluate the performance of the proposed method, a total of thirteen benchmark datasets were used, which include single-cell gene expression profiles. The details of all datasets used in this work are given in Table 1. They vary in size, tissue (pancreas, lung, peripheral blood), sequencing protocol (three different protocols), and species (human and mouse). The peripheral blood dataset, 3k PBMC from a healthy donor, was downloaded from the 10X Genomics portal 15 . The pancreas datasets include Baron (GSE84133) 16 , Muraro (GSE85241) 17 , Segerstolpe (EMTAB-5061) 18 , Xin (GSE81608) 19 , and Wang 20 . In the lung datasets, H1299 scRNA-seq (GSE148729) 21 and Calu3 scRNA-seq (GSE148729) 21 , different cell lines were infected with SARS-CoV-1 and SARS-CoV-2 and sequenced at different time points. These datasets are unlabeled, with no background knowledge available about the data; in this case, we analyzed the data and provided useful information about the unknown cell populations. On the other hand, cell-type annotations for PBMC, Baron, Segerstolpe, and Wang were provided with the data. We used these annotations as "background knowledge" during the evaluation of the clustering method. To make a fair assessment of the clustering methods, the cell annotations were removed before clustering; the labels were then used to compare the biological results.
Data pre-processing and quality control. A common practice for generating raw RNA-seq data is to use next-generation sequencing technologies to create read count matrices. The read count matrix contains gene names and their expression levels across individual cells. Before analyzing scRNA-seq data, one needs to ensure that gene expressions and cells are of standard quality. We followed a typical scRNA-seq analysis workflow, including quality control, as described in 14,22 . A Python package, Scanpy, is used to perform the pre-processing and quality control steps. Based on the expression levels, we filtered out weakly expressed genes and low-quality cells to which fewer reads are mapped, as shown in Fig. 1, the first step of pre-processing. Low-quality cells that are dead, degraded, or damaged during sequencing are characterized by a low number of expressed genes. Genes expressed in fewer than three cells and cells with fewer than 200 expressed genes are removed. We also investigated the distribution of the data (Fig. 2) as a data-specific quality-control step and filtered out low-quality cells and genes. We removed cells with a high percentage of mitochondrial gene counts, as mitochondrial genes do not contribute significant information to the downstream analysis 22,23 . Also, some cells present large total counts compared with other cells, indicating potential sources of variation; to reduce this effect, we scaled the data to unit variance. Since scRNA-seq data are expressed at different levels, normalization is essential. Normalization translates the values of numeric columns in a dataset to a common scale without distorting the ranges of values. We normalize the data using Counts Per Million (CPM) normalization combined with logarithmic scaling (Eq. 1), where totalReads is the total number of mapped reads of a sample, and readsMappedToGene is the number of reads mapped to a selected gene.
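The filtering and normalization rules above can be sketched without Scanpy; the following NumPy functions mirror the thresholds stated in the text (genes in fewer than three cells, cells with fewer than 200 expressed genes, CPM with log scaling). The function names are illustrative, not Scanpy's API:

```python
import numpy as np

def qc_filter(counts, min_cells=3, min_genes=200):
    """Remove genes expressed in fewer than min_cells cells,
    then cells with fewer than min_genes expressed genes."""
    counts = counts[:, (counts > 0).sum(axis=0) >= min_cells]  # gene filter
    counts = counts[(counts > 0).sum(axis=1) >= min_genes, :]  # cell filter
    return counts

def cpm_log_normalize(counts):
    """Counts-per-million normalization followed by log(1 + x) scaling."""
    total = counts.sum(axis=1, keepdims=True)  # totalReads per cell
    cpm = counts / total * 1e6                 # Eq. (1)
    return np.log1p(cpm)
```

In practice, the equivalent Scanpy calls are `sc.pp.filter_genes`, `sc.pp.filter_cells`, `sc.pp.normalize_total(target_sum=1e6)`, and `sc.pp.log1p`.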
At this point, we extracted highly variable genes (HVGs) as part of the feature selection step, aiming to minimize the search space; only these genes are examined in further evaluation. HVGs are genes that are expressed significantly more or less in some cells compared to others. This quality-control step helps ensure that the observed differences arise from biological variation rather than technical noise. The simplest approach to compute such variation is to quantify the variance of the expression values of each gene across all samples. A good trade-off between mean and variance helps select the subset of genes that retains useful biological knowledge while removing noise. We use log-normalized data so that clustering and dimensionality reduction operate on the same log-scaled values, keeping the analysis consistent through all steps. There are conventional approaches to finding the best threshold. The normalized dispersion is obtained by scaling with the mean and standard deviation of the dispersions for genes falling into a given bin for the mean expression (Fig. 3); this means that HVGs are selected within each bin.
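The binned-dispersion selection described above can be sketched as follows. This is an illustrative re-implementation of the idea with hypothetical function and parameter names; Scanpy's `sc.pp.highly_variable_genes` provides the production version:

```python
import numpy as np

def select_hvgs(X, n_bins=20, n_top=500):
    """Dispersion-based HVG selection: z-score the per-gene dispersion
    (variance/mean) within bins of similar mean expression, then keep
    the n_top genes with the highest normalized dispersion."""
    mean = X.mean(axis=0)
    disp = X.var(axis=0) / np.maximum(mean, 1e-12)
    # Bin genes by quantiles of their mean expression.
    edges = np.quantile(mean, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(mean, edges)
    norm_disp = np.zeros_like(disp)
    for b in np.unique(bins):
        idx = bins == b
        sd = disp[idx].std()
        norm_disp[idx] = (disp[idx] - disp[idx].mean()) / (sd if sd > 0 else 1.0)
    return np.argsort(norm_disp)[::-1][:n_top]  # indices of the HVGs
```

Binning before z-scoring is what makes the dispersion comparable between weakly and strongly expressed genes, since raw dispersion grows with mean expression.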
The top genes in the dataset, before and after normalization, are visualized in Fig. 4.
CPM = (readsMappedToGene / totalReads) × 10^6. (1)

Dimensionality reduction. The majority of real-life data is multidimensional, and most high-dimensional data is complex and sparse. Understanding the data in such dimensions is difficult, and direct visualization is not possible. Dimensionality reduction is the process of transforming data from a high-dimensional space to a low-dimensional space while retaining some of the original data's meaningful properties. Working in high-dimensional spaces may be inconvenient for various reasons; for example, data analysis is typically computationally intractable. Single-cell gene expression data is complex and should be well explored. Each gene is characterized as a dimension in a single-cell expression profile. As such, dimensionality reduction is very productive in summarizing biological attributes in fewer dimensions. Dimensionality reduction techniques are divided into linear and non-linear ones; we discuss them in detail in the following subsections.
Modified locally linear embedding. Modified LLE (MLLE) is a non-linear dimensionality reduction technique and an enhanced version of LLE. To understand MLLE, we first need to know the LLE algorithm. When used for dimensionality reduction, LLE attempts to reveal the manifold's underlying structure based on simple geometric intuitions. LLE preserves the data locality in lower dimensions because it reconstructs each sample point from its neighbors. In other words, LLE focuses on finding the lower-dimensional representation of high-dimensional data that most accurately preserves the locally linear structure of neighboring point patterns.
In the simplest formulation of LLE, it first identifies the t nearest neighbors of each data point, as measured by Euclidean distance 24 . One can choose the number of neighbors, t, based on rules, metrics, or simply at random. Consider n sample points X = {x_1, x_2, ..., x_n} in a high-dimensional space R^d. For each sample point x_i and its neighborhood set N_i = {x_t, t ∈ t_i}, one can form a t-NN graph to construct a locally linear structure at that point using a combination of reconstruction weights W = {w_ti, t ∈ t_i}, i = 1, ..., n. On this basis, each data point is viewed as a small linear patch of the sub-manifold.
To compute the weights w_ti for the linear reconstruction of each point, we minimize the reconstruction cost with respect to two constraints: (1) each data point x_i is reconstructed only from its neighbors, imposing w_ti = 0 if x_t does not belong to N_i; and (2) each row of the weight matrix sums to one, that is, Σ_{t∈t_i} w_ti = 1. The optimal weights are calculated by solving Eq. (2), the constrained least-squares problem:

ε(W) = Σ_{i=1}^{n} ‖ x_i − Σ_{t∈t_i} w_ti x_t ‖². (2)

Defining the local difference matrix G_i = [..., x_t − x_i, ...], t ∈ t_i, helps formulate the local weight vector w_i; hence, (2) can be reformulated as

min_w ‖ G_i w ‖², s.t. w^T 1_{t_i} = 1, (3)

where 1_{t_i} denotes the vector of all 1's of dimension t_i. Using this formulation, the weights can be computed via the singular value decomposition of G_i, as a solution of the linear system G_i^T G_i w = 1_{t_i}, rescaled so that the constraint holds. Finally, using the same weights computed in the input space, each high-dimensional input sample x_i is mapped to a lower-dimensional point set Y = {y_1, y_2, ..., y_n} in R^m (m ≪ d), representing the manifold's global internal coordinates, by minimizing the embedding cost

Φ(Y) = Σ_{i=1}^{n} ‖ y_i − Σ_{t∈t_i} w_ti y_t ‖². (4)
Equation (4) reflects the locality-preservation property by solving a minimization problem over the output manifold.
Regularization is a well-known problem in LLE, which manifests itself in embeddings that distort the underlying geometry of the manifold. Standard LLE uses an arbitrary regularization parameter with respect to the local trace of the weight matrix 25 . MLLE overcomes this problem by using multiple weight vectors in each neighborhood, discovering a more stable and enhanced embedding space. MLLE adjusts the reconstruction weights, which modifies the embedding cost function as follows:

Φ(Y) = Σ_{i=1}^{n} Σ_{l=1}^{s_i} ‖ y_i − Σ_{t∈t_i} w_ti^(l) y_t ‖², (5)

where w_i^(l), l = 1, ..., s_i, are the multiple weight vectors of the i-th neighborhood, and s_i is determined by the smallest right singular values of G_i. This aims to take advantage of the dense relations that exist in the embedding space 26 .
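Both the standard and modified variants are available in scikit-learn through the `method` argument of `LocallyLinearEmbedding`, which makes the comparison easy to reproduce on a toy manifold (the swiss roll here is a stand-in, not single-cell data):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# A classic curved 3-D manifold with a known 2-D parameterization.
X, _ = make_swiss_roll(n_samples=800, random_state=0)

embeddings = {}
for method in ("standard", "modified"):
    lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, method=method)
    embeddings[method] = lle.fit_transform(X)
    # reconstruction_error_ is the final value of the embedding cost function.
    print(method, lle.reconstruction_error_)
```

The modified variant typically yields a less distorted unrolling of the manifold because each neighborhood contributes several weight vectors instead of one.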
Independent component analysis. ICA is a linear dimensionality reduction method based on statistical independence. By using simple assumptions on the statistical properties of the data, ICA learns an efficient linear transformation of the data and attempts to find the underlying structures present in the data 27 . Based on the definitions given in 28 , ICA is considered a special case of projection pursuit; it is a technique for finding relevant projections of multi-dimensional data. Such projections can then be used for enhanced visualization of the clustered data. When ICA is used for visualizing the data, dimensionality reduction becomes its secondary objective 28 . Unlike other approaches, the transformation's underlying vectors are presumed to be independent of one another. It exploits the non-Gaussian structure of the data, which is crucial for retrieving the underlying components of the transformed data. ICA aims to find projections of the data that provide estimations of the independent components 28 . When dealing with noise-free data, if no assumptions are made on the data, ICA can be considered a high-performance method of exploratory data analysis 28 . Consider an observed random vector v with elements {v_1, v_2, ..., v_n}, a source vector s with elements {s_1, s_2, ..., s_n}, and a mixing matrix A with elements a_ij. ICA is a generative model, which captures how the observed data are generated by mixing the components s_i:

v = As. (6)

The independent components are latent variables, meaning they are unknown; the mixing matrix A is also assumed to be unknown, and only v is observed.
Unlike PCA, the rows of these matrices are not required to be orthogonal to each other; as such, ICA leads to more independent components than PCA. ICA recovers the structure of the data, which is hidden while being analyzed, untangling complex relationships and translating them into meaningful measurements without prior knowledge of the sources. Thus, this feature of ICA is referred to as blind source separation 29 .
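A small simulation illustrates the generative model of Eq. (6) and blind source separation: two non-Gaussian sources are mixed by a matrix A that the algorithm never sees, and FastICA recovers them, up to order and scale, from the observations alone:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
# Two non-Gaussian (uniform) sources s, mixed by an unknown matrix A: v = A s.
S = rng.uniform(-1.0, 1.0, size=(2000, 2))
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
V = S @ A.T                      # observed mixtures only

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(V)     # estimated independent components
```

Non-Gaussianity is essential here: with Gaussian sources, any rotation of the mixture is an equally valid decomposition and the components cannot be identified.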
Other dimensionality reduction methods. We compared our proposed method against other dimensionality reduction techniques, namely standard LLE, Isomap, Laplacian eigenmap, PCA, and t-SNE. Isomap stands for isometric mapping. Isomap is a non-linear dimensionality reduction method based on spectral theory that aims to preserve the geodesic distances in the lower dimension. Isomap starts by creating a neighborhood network. After that, it uses graph distance to estimate the geodesic distance among all pairs of points. The eigenvalue decomposition of the geodesic distance matrix finds the lower-dimensional embedding of the data 30 . Laplacian eigenmap is another non-linear technique. It is computationally efficient and maps nearby input patterns to nearby outputs by computing the lower-dimensional representation of a high-dimensional data set. It focuses on preserving local proximity relations among input data points 31 . t-SNE is also a non-linear dimensionality reduction technique that is commonly used for visualization and has been extensively applied in genomic data analysis and speech processing 6 . On the other hand, PCA is a popular linear technique used for feature extraction or dimensionality reduction. Given a dataset composed of d-dimensional points, PCA maps the data linearly to find a subspace in lower-dimensional space so that the dispersion of the data is maximized. It does so via eigendecomposition of the covariance matrix. The principal components (eigenvectors that correspond to the largest eigenvalues) are used to recreate a substantial portion of the original data's variance 30 .
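All of these baselines are available in scikit-learn, so a uniform comparison is a short loop; the digits dataset is used here purely as a small stand-in for an expression matrix:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, SpectralEmbedding, TSNE

X, _ = load_digits(return_X_y=True)
X = X[:500]                      # small stand-in for an expression matrix
reducers = {
    "PCA": PCA(n_components=2),
    "Isomap": Isomap(n_neighbors=10, n_components=2),
    # SpectralEmbedding is scikit-learn's Laplacian eigenmap implementation.
    "Laplacian eigenmap": SpectralEmbedding(n_components=2, n_neighbors=10),
    "t-SNE": TSNE(n_components=2, perplexity=30, random_state=0),
}
embeddings = {name: red.fit_transform(X) for name, red in reducers.items()}
```

Note that t-SNE and SpectralEmbedding only provide `fit_transform`, not a reusable `transform` for out-of-sample points, which matters when embedding new cells.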
Clustering. Performing clustering is one of the critical tasks in single-cell data analysis. Clusters are formed by grouping cells based on the similarity of their gene expression profiles. Distance functions, which take the dimensionality-reduced representations as input, are used to describe expression-profile similarity. We used the popular k-means clustering technique, an iterative algorithm that partitions the data into k separate groups C = {c_1, c_2, ..., c_k} by minimizing the within-cluster dispersion while maximizing the inter-cluster distances. The number of clusters to be formed from the data needs to be specified as an input parameter to the algorithm. k-means works in three key steps. The first step is to choose the initial centroids; the simplest method is to choose k samples from the dataset X = {x_1, x_2, ..., x_n}. Then, each point in the dataset is allocated to its nearest centroid. The next step updates each centroid by taking the mean value of all samples allocated to it. The algorithm calculates the difference between the old and new centroids and repeats the last two steps until that value falls below a certain threshold; in other words, it keeps iterating until almost no change in the centroids is observed. This process pulls the points toward their centroids, leading to a high degree of cluster compactness, that is, a minimum sum of squared errors (SSE), as shown in Eq. (7):

SSE = Σ_{j=1}^{k} Σ_{x∈C_j} ‖ x − µ_j ‖², (7)

where C_j is the j-th cluster, µ_j is the mean of the samples in C_j, and x is the corresponding sample. How to choose the best number of clusters is explained in the next subsection, Parameter Optimization.
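The three steps and the SSE objective of Eq. (7) can be written out directly. This is a didactic sketch; production code would use `sklearn.cluster.KMeans`, which adds smarter (k-means++) initialization and multiple restarts:

```python
import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    """Plain k-means: initialize from k samples, assign points to the
    nearest centroid, update centroids as cluster means, repeat."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.linalg.norm(new - centroids) < tol:  # centroids stopped moving
            break
        centroids = new
    # Final assignment and the SSE objective of Eq. (7).
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :],
                            axis=2).argmin(axis=1)
    sse = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, sse
```

Because the result depends on the random initialization, practical implementations rerun the algorithm several times and keep the solution with the lowest SSE.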
Parameter optimization. When applying MLLE, a neighborhood graph, t-NN, is created by connecting points that are close to each other. Different measures are used for this purpose, including the number of neighbors, the distance from each point to its neighbors, and others. A common measure to determine the sparsity of the neighborhood graph is the tolerance factor, which makes the graph sparser or denser. In this regard, we tested different tolerance values on each dataset and selected those that yielded the best validity-index scores, as explained in the Performance Evaluation subsection. With the aim of preserving locality, the number of nearest neighbors, t, is a crucial parameter for constructing the neighborhood graph. Another critical parameter is the number of clusters, k, in the clustering algorithm. The number of nearest neighbors, t, is examined within the range [8, 24], and the number of clusters k is assessed for each value of t, where k ranges from 4 to 14; the validity indices are calculated for each clustering.
We followed the "elbow" method to select the best combination of t and k with the highest score, considering all three validity indices, as explained in the Performance Evaluation subsection. By plotting the scores against the number of clusters on the x-axis, an elbow-shaped point is observed at a certain value of k: the scores decrease as k increases, and at the best number of clusters the plot changes rapidly, after which the trend leans toward a line almost parallel to the x-axis. The corresponding plots are provided in Supplementary Fig. S1.

Performance evaluation. Generally speaking, the best clustering is the one that maintains a high inter-cluster distance and gives the most compact clusters. In this work, we used the Silhouette coefficient 32 , an evaluation metric based on the mean distance between a sample point and all other points in the same cluster, and the mean distance between that point and the points in the nearest neighboring cluster. Consider a set of clusters C = {C_1, C_2, ..., C_k} output by a clustering algorithm, k-means in our case. The Silhouette coefficient, SH, for the i-th sample point in cluster C_j, where j = 1, ..., k, can be defined as follows:

SH(x_i) = (b_c − w_c) / max(w_c, b_c), (8)

where w_c is the mean distance between point x_i and all other points inside its cluster (within-cluster distance), and b_c is the minimum mean distance between x_i and the points of the nearest neighboring cluster (between-cluster distance), which are calculated as:

w_c = (1 / (|C_j| − 1)) Σ_{x∈C_j, x≠x_i} d(x_i, x),  b_c = min_{l≠j} (1 / |C_l|) Σ_{x∈C_l} d(x_i, x). (9)

We also used the Calinski-Harabasz (CH) and Davies-Bouldin (DB) validity indices to assess the clustering performance. The CH score is used to evaluate the model, where a higher score indicates better-defined clusters 33 .
CH is the ratio of the between-cluster and within-cluster dispersions over all clusters, calculated as follows:

CH = [tr(S_B) / (k − 1)] / [tr(S_W) / (n − k)], (10)

where n is the number of input samples, tr(S_B) is the trace of the between-group dispersion matrix, and tr(S_W) is the trace of the within-cluster dispersion matrix.
The Davies-Bouldin (DB) index 34 is another validity measure, defined as the average similarity of each cluster with its most similar cluster. DB is computed as follows:

DB = (1/k) Σ_{i=1}^{k} max_{j≠i} s_ij, (11)

where s_ij is the ratio between within-cluster and between-cluster distances, and is calculated as

s_ij = (w_i + w_j) / d_ij. (12)

The smaller the DB value, the better the clustering; as such, we aim to minimize Eq. (11). Here, d_ij is the Euclidean distance between the cluster centroids µ_i and µ_j, and w_i is the within-cluster distance of cluster C_i.
Overall, we used the Silhouette score to evaluate the clustering performance, whereas CH and DB indices were used to verify and find the optimal parameters, namely the best number of clusters for our experiments.
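A hedged sketch of this selection procedure on synthetic blobs (a stand-in for the embedded single-cell data): scan k over the range used in the text, compute all three indices with scikit-learn, and pick the k favoured by the Silhouette score:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

# Five well-separated synthetic clusters standing in for cell types.
centers = np.array([[0, 0], [8, 0], [0, 8], [8, 8], [16, 16]])
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.5,
                  random_state=0)

scores = {}
for k in range(4, 15):                       # same k range as in the text
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = {"SH": silhouette_score(X, labels),
                 "CH": calinski_harabasz_score(X, labels),
                 "DB": davies_bouldin_score(X, labels)}

best_k = max(scores, key=lambda k: scores[k]["SH"])  # highest Silhouette wins
```

Here SH and CH are maximized while DB is minimized, so in practice the three curves are inspected together when reading off the elbow.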
Cluster annotation. To validate the obtained clusters, we first identified the top 20 differentially expressed genes in each cluster based on the Wilcoxon test and considered them as marker genes that drive high separation among clusters. Marker genes are up- or down-regulated in different individual cells, pathways, or GO terms. We used Gene Set Enrichment Analysis, GSEA 35,36 , to annotate the clusters with the cell types corresponding to each group of marker genes. GSEA 37 is a computational tool that determines whether a predefined set of genes shows a statistically significant level of expression in a specific cell type, biological process, cellular component, molecular function, or biological pathway. GSEA uses MSigDB, the Molecular Signatures Database, to provide gene sets for the enrichment analysis. We also employed ToppCluster 38 for Gene Ontology (GO) analysis. ToppCluster is an online multi-gene-list functional enrichment analysis tool that identifies the GO terms and pathways associated with the top gene lists extracted from each cluster. Pathways were extracted from the MSigDB C2 BIOCARTA (V7.3) database 39 . The corresponding networks are visualized using Cytoscape 40 . We decreased the minimum number of genes present in the corresponding annotations to achieve a better visualization.
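The marker-gene ranking step can be sketched with SciPy's rank-sum test. This is an illustrative one-vs-rest version with a hypothetical function name; in the pipeline itself, Scanpy's `sc.tl.rank_genes_groups(method="wilcoxon")` performs this:

```python
import numpy as np
from scipy.stats import ranksums

def top_marker_genes(X, labels, cluster, n_top=20):
    """Rank genes for one cluster against all other cells using the
    Wilcoxon rank-sum statistic (large positive = up-regulated)."""
    in_cluster = labels == cluster
    stats = np.array([ranksums(X[in_cluster, g], X[~in_cluster, g]).statistic
                      for g in range(X.shape[1])])
    return np.argsort(stats)[::-1][:n_top]  # gene indices, strongest first
```

The resulting per-cluster gene lists are what get submitted to GSEA and ToppCluster for cell-type and GO-term annotation.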

Results and discussion
We first applied the Scanpy pipeline, including its clustering method (Leiden clustering), on the PBMC dataset. The corresponding results are presented in Supplementary Fig. S3. Then, in order to identify the most suitable clustering and dimensionality reduction methods with higher performance, we employ Scanpy only for the pre-processing step. Since different tools share the same pre-processing approach, this makes it easier to integrate new methods emerging in scRNA-seq analysis.
We developed a well-constructed pipeline that can be applied to scRNA-seq data to discover individual cell types. Considering dimensionality reduction and clustering as the two major steps in the pipeline, we explored many ways of untangling the data in two and three dimensions. We found optimal parameters for both dimensionality reduction and clustering that achieve a meaningful separation of cell types and compact clusters. To demonstrate the applicability of our pipeline, we tested it on thirteen datasets of different sizes. Finally, we evaluated our method from both computational and biological perspectives. As k-means and, in general, all distance-based methods are known not to work well with non-linear methods such as t-SNE, we employed DBSCAN clustering to investigate this behavior on the datasets used in this work. The resulting Silhouette score for the Calu3 dataset is 0.871, which is lower than that of k-means, whose score is 0.924 for the same dataset. These results, along with further discussion, are provided in Supplementary Fig. S2.

Clustering and cell type discovery. To achieve optimized results, we experimented with all possible combinations of parameters as discussed in the Materials and Methods section. The best parameters that we obtained for each dataset are depicted in Table 2. In a few datasets, to achieve the best clustering score in the proposed approach, the data is first reduced to an intermediate number of dimensions, such as 5, 6, or 7, and afterward reduced to three dimensions to visualize and obtain better results. The results of k-means clustering combined with each dimensionality reduction method using the best parameters are listed in Table 3. The last column shows the result after applying ICA to the output of clustering combined with MLLE. The clustering scores range from 0 to 1.
A score close to 1 represents good quality clustering, with 1 being the best, while a score near zero indicates that the clusters are not well defined.
When testing other widely-used techniques such as t-SNE and PCA, we noticed that both methods were not efficient in separating the data into well-defined clusters. On the other hand, the results of Isomap and Laplacian eigenmaps show slightly better performance comparatively. To demonstrate this graphically, we visualize the two-dimensional projections of cells resulting from the different dimensionality reduction methods, colored by k-means clustering, on the H1299 scRNA-seq dataset in Figs. 5, 6, 7, 8, 9 and 10. Moreover, three-dimensional results on the same dataset are presented in Supplementary Fig. S5, Supplementary Material. Finally, we investigated MLLE and found the most insightful cluster separation in most of the datasets. This outcome demonstrates the power of MLLE in exploring the data's dense and complex relations, creating better embeddings in lower-dimensional spaces. We performed an additional dimensionality reduction step that uses ICA to enhance the visualization of the clusters. The last column of Table 3 shows that MLLE combined with ICA improves the overall results, except for a few datasets in which the differences are negligible and can be ignored: 0.004 (Baron_human1), 0.001 (Baron_human2), 0.014 (Baron_human3), 0.004 (Segerstolpe), and 0.011 (Xin). To achieve a better view of the impact of ICA on the MLLE transformation, we show a visual comparison of the clusters in Figs. 11 and 12. The two-dimensional ICA projection of the cells applied to the three-dimensional MLLE data shows the best visualization and clustering scores (Fig. 12). When applied alone, ICA performs very poorly, with significantly inseparable clusters (Fig. 10). This result is because ICA is limited to linear transformations. Manifold learning techniques, on the other hand, consider the data locally and can thus reveal complex relationships among the data points in higher-dimensional spaces.
We instead applied ICA on the lower-dimensional data because we observed well-marked "lines" or "axes" in the three-dimensional data, which led us to conclude that we could apply ICA to learn the linearly independent, not necessarily orthogonal, components. Applying ICA reveals some hidden, complex relationships among the cells in the clusters that are not noticeable in three dimensions.

Table 2. Parameters used for the experiments; t is the number of neighbors and K is the number of clusters. These parameters are generated considering both dimensionality reduction and clustering together.

Moreover, we observed some reported marker genes of the PBMC dataset in some clusters, which are shown in Table 5 as well.
Additionally, the visualizations of the networks of GO terms and pathways associated with the corresponding marker genes of the H1299 scRNA-seq dataset are depicted in Figs. 13 and 14, respectively. For each cluster, we identified a set of biological process or pathway terms connected to a term that is significantly associated with the top-20 gene list in that cluster. As observed in Fig. 14, some significant pathways are found to be enriched in immunity functions and signaling, including SARS-CoV-2 innate Immunity Evasion, Host-pathogen interaction of human coronaviruses, SARS coronavirus and innate immunity, Type II interferon signaling (IFNG), and the human immune response to tuberculosis. Also, Fig. 13 shows that most of the enriched biological processes are associated with immunity functions, including response to interferon-alpha, protection from natural killer cells, type III interferon production, regulation by virus of viral protein levels in a host cell, and detection of virus, among others. In addition, we obtained a list of overlapping marker genes involved in Herpes simplex virus 1 (HSV-1) infection and the Influenza A pathway. These findings suggest potential markers for subsequent medical treatment or drug discovery by comparing similar diseases in terms of functionality. Moreover, although numerous findings suggest potential links between HSV-1 and Alzheimer's disease, a causal relationship has not been demonstrated yet 41 .

Conclusion and future work
This work focuses on the identification of different cell types using manifold learning combined with clustering techniques on scRNA-seq data. Identifying similarities that result from structural, functional, or evolutionary relationships among the genes is the primary goal of clustering the cells. Our proposed two-step representation learning approach demonstrated that the k-means clustering technique combined with Modified LLE leads to improved clustering output and a meaningful organization of cell clusters by "untangling" the complex, hidden relationships in the higher-dimensional space. Non-linear dimensionality reduction methods have been shown to be very powerful, as they preserve the locality of the data from higher to lower dimensions. UMAP is one of the most commonly used non-linear dimensionality reduction techniques and has been shown to perform well on large-scale scRNA-seq data. However, for dimensionality reduction, UMAP is not as efficient as MLLE on high-dimensional cytometry data, especially when combined with clustering to enhance the visualization of the clustering results. This behavior of MLLE has been observed in our experiments; a comparative analysis with UMAP in the Supplementary Material, Supplementary Fig. S4, confirms this observation.
Moreover, performing ICA on the transformed data after applying manifold learning techniques provides an enhanced view of the data in a reduced space. Evaluating ICA as a visualization scheme and further reduction step after applying MLLE shows better clustering and enhanced visualization simultaneously.

Figure 13. A set of biological processes that are enriched by marker genes in the H1299 scRNA-seq dataset. The numbers show the clusters, and edges show the links between a cluster and a biological process term.
Scientific Reports | (2022) 12:120 | https://doi.org/10.1038/s41598-021-03613-0

This trend leads to a research avenue that involves a combination of non-linear manifold learning techniques followed by linear methods, which has been shown to be more powerful than conventional methods such as PCA or ICA applied alone. Using multiple benchmark datasets shows the effectiveness of our proposed method. Performing gene set enrichment analysis to annotate the set of HVGs obtained from each cluster reveals biomarker genes involved in different gene ontology terms.
There are other potential applications for investigating scRNA-seq data, even beyond cell type identification. Extending the proposed method by employing other manifold learning or deep learning techniques on further epigenetic challenges in scRNA-seq data analysis, such as trajectory analysis, is our next step.

Figure 14. Pathways that are enriched by marker genes in the H1299 scRNA-seq dataset. The numbers show the clusters, and edges show the links between a cluster and a pathway. The nodes highlighted in yellow show the SARS-CoV-2 cell-specific pathway. Most of the other green nodes reveal the shared and cluster-specific functional pathways in the immune system.