Introduction

Single-cell sequencing is an emerging technology used to capture cell information at a single-nucleotide resolution, by which individual cells can be analyzed separately1. To date, single-cell RNA-seq (scRNA-seq) datasets have been generated for many different purposes2. However, these high-dimensional and sparse data pose several analytical challenges. While many computational methods have been successfully proposed for analyzing scRNA-seq data, there are still open problems in this research area. One of the main challenges is the sparsity of the data and the curse of dimensionality inherent in scRNA-seq data. Moreover, well-defined pre-processing steps enhance data quality and enable new biological insights. Analysis of scRNA-seq data can be divided into two main categories: the cell level and the gene level. Finding cell sub-types or highly differentially expressed, tissue-specific gene sets is one of the common challenges at the cell level3. Arranging cells into clusters to uncover the heterogeneity of the data is arguably the most significant step of any scRNA-seq downstream analysis. This step can be used to distinguish tissue-specific sub-types based on identified gene sets. Indeed, cell clustering aims to identify cell types based on the patterns embedded in gene expression without prior knowledge at the cell level. Since the number of genes profiled in scRNA-seq data is typically large, cells are not related by simple metric distances but rather by complex relationships in high-dimensional spaces4. Therefore, traditional dimensionality reduction and clustering algorithms are not efficient in these scenarios and cannot effectively separate individual cell types. Several algorithms have been proposed to address this and alleviate the curse of dimensionality.

Dimensionality reduction techniques have been widely used in large-scale scRNA-seq data processing5. Most previous studies use principal component analysis (PCA). However, one of the main drawbacks of PCA is that it cannot deal with sparse matrices and non-metric relationships among high-dimensional data points. Other works have employed PCA as a pre-processing step to remove cell outliers before dimensionality reduction and visualization. Non-linear dimensionality reduction has also been proposed, including t-distributed Stochastic Neighbor Embedding (t-SNE)6. This method preserves the local structure of the data, although it is not efficient when applied to very large datasets7.

Moreover, various studies have used unsupervised clustering models to identify rare, novel cell types. For instance, hierarchical clustering either divides large clusters into smaller ones or progressively merges data points into larger clusters. This algorithm has been employed to analyze scRNA-seq data by BackSPIN8 and pcaReduce9, which perform dimensionality reduction after each division or merge in an iterative manner. k-means, one of the most common clustering algorithms, has been employed in Monocle, specifically for analyzing scRNA-seq data10. Also, the authors of11 used the Louvain algorithm, which is based on community detection techniques for analyzing complex networks12.

However, to achieve acceptable clustering performance on scRNA-seq data, other comprehensive studies have indicated that hybrid models, designed as a combination of clustering and dimensionality reduction techniques, tend to improve the clustering results13. The authors learned 20 different models using four dimensionality reduction methods: PCA, non-negative matrix factorization (NMF), filter-based feature selection (FBFS), and independent component analysis (ICA). They also used five clustering algorithms: k-means, density-based spatial clustering of applications with noise (DBSCAN), fuzzy c-means, Louvain, and hierarchical clustering. Their experiments highlighted the positive effect of hybrid models and showed that feature-extraction methods can be a sound way to improve clustering performance. Their experimental results indicate that Louvain combined with ICA performs well in small feature spaces.

This paper proposes a model to obtain well-separated and meaningful clusters of cells from large-scale scRNA-seq data. We focus on the combination of unsupervised dimensionality reduction followed by conventional clustering. We devised a hybrid model consisting of a non-linear dimensionality reduction technique, modified locally linear embedding (MLLE), and a linear method, ICA, for visualization. We used PCA, t-SNE, Isomap, standard locally linear embedding (LLE), and Laplacian eigenmaps to perform a comparative analysis across diverse methods. ICA is employed to enhance the visualization of the clustered data. Parameter tuning, that is, choosing the best parameters for dimensionality reduction and clustering, is one of the critical challenges, and it is addressed thoroughly in our work. Experimental results on thirteen benchmark scRNA-seq datasets show the power of modified LLE and ICA in terms of the representation quality of the clustered data, providing very high accuracy and enhanced visualization. Confirmatory biological annotations were observed in the clusters using corresponding marker genes found by our method.

Materials and methods

The block diagram of the proposed pipeline is depicted in Fig. 1. First, the scRNA-seq data is pre-processed based on the number of cells and the number of genes. Highly variable genes are extracted as part of the feature selection step after normalization and scaling of the filtered data. Linear regression is one of the most widely-used methods to regress out potential sources of variation present in the data, based on the total counts per cell and the mitochondrial percentage, as discussed in Refs.11,14. The data obtained at this point is then processed to reduce the feature space to two or three dimensions; afterward, k-means clustering is applied. In addition, we performed ICA on the lower-dimensional data followed by k-means clustering to achieve improved visualization and meaningful clusters.
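A minimal sketch of the regression and scaling step with Scanpy is shown below; it assumes an AnnData object named adata whose obs table already contains the quality-control columns total_counts and pct_counts_mt described in the next subsections, and is an illustration rather than the exact configuration used in our experiments.

```python
import scanpy as sc

# Regress out per-cell total counts and mitochondrial percentage with linear
# regression, then scale each gene to zero mean and unit variance.
# Assumes adata.obs already contains "total_counts" and "pct_counts_mt"
# (computed during quality control, see below).
sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
sc.pp.scale(adata)
```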

Datasets

To evaluate the performance of the proposed method, a total of thirteen benchmark datasets were used, which include single-cell gene expression profiles. The details of all datasets used in this work are given in Table 1. They vary across size, tissue (pancreas, lung, peripheral blood), sequencing protocol (three different protocols), and species (human and mouse). The peripheral blood dataset, 3k PBMC from a healthy donor, was downloaded from the 10\(X\) Genomics portal15. The pancreas datasets include Baron (GSE84133)16, Muraro (GSE85241)17, Segerstolpe (EMTAB-5061)18, Xin (GSE81608)19, and Wang20. In the lung datasets, H1299 scRNA-seq (GSE148729)21 and Calu3 scRNA-seq (GSE148729)21, different cell lines were infected with SARS-CoV-1 and SARS-CoV-2 and sequenced at different time points. These datasets are unlabeled, with no background knowledge provided with the data; in this case, we analyzed the data and provided useful information about the unknown data. On the other hand, cell-type annotations for PBMC, Baron, Segerstolpe, and Wang were provided with the data. We used these annotations as “background knowledge” during the evaluation of the clustering method. To make a fair assessment of the clustering methods, the cell annotations were removed before clustering was performed; the labels were then used to compare the biological results.

Figure 1

Block diagram of the proposed approach for discovering cell types in scRNA-seq data.

Table 1 Datasets used in this work.

Data pre-processing and quality control

A common practice for generating RNA-seq raw data is to use next-generation sequencing technologies to create read count matrices. The read count matrix contains gene names and their expression levels across individual cells. Before analyzing scRNA-seq data, one needs to ensure that gene expression measurements and cells are of standard quality. We followed a typical scRNA-seq analysis workflow, including quality control, as described in14,22. A Python package, Scanpy, is used to perform the pre-processing and quality control steps. Based on the expression levels, we filtered out weakly expressed genes and low-quality cells to which fewer reads are mapped, as shown in Fig. 1, the first step of pre-processing. Low-quality cells that are dead, degraded, or damaged during sequencing are characterized by a low number of expressed genes. Genes expressed in fewer than three cells and cells with fewer than 200 expressed genes are removed.
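A minimal sketch of this filtering step with Scanpy follows; the input file name is a placeholder for any cells-by-genes count matrix, and the thresholds are the ones stated above.

```python
import scanpy as sc

# Placeholder input: an AnnData object with cells as rows and genes as columns.
adata = sc.read_h5ad("raw_counts.h5ad")

# Remove genes detected in fewer than 3 cells and cells with fewer than
# 200 expressed genes.
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.filter_cells(adata, min_genes=200)
```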

We also investigated the distribution of the data (Fig. 2) as a data-specific quality-control step and filtered out low-quality cells and genes. We removed cells with a high percentage of mitochondrial gene counts, since mitochondrial genes do not contribute significant information to the downstream analysis22,23. Also, some cells show large total counts compared with other cells, indicating potential sources of variation. To reduce this effect, we scaled the data to unit variance. Since genes in scRNA-seq data are expressed at different levels, normalization is a must. Normalization translates numeric values in a dataset to a common scale without distorting differences in their ranges. We normalize the data using Counts Per Million (CPM) normalization combined with logarithmic scaling:

$$\begin{aligned} CPM= \text {readsMappedToGene} \times \frac{1}{\text {totalReads}} \times 10^6, \end{aligned}$$
(1)

where totalReads is the total number of mapped reads of a sample, and readsMappedToGene is the number of reads mapped to a selected gene.
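In Scanpy, the mitochondrial filtering and the CPM/log normalization described above can be sketched as follows; the "MT-" prefix assumes human gene naming and the 5% cut-off is an illustrative value, not a threshold prescribed in this work.

```python
import scanpy as sc

# Flag mitochondrial genes (human naming convention) and compute per-cell QC
# metrics, which also adds "total_counts" and "pct_counts_mt" to adata.obs.
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=None,
                           log1p=False, inplace=True)

# Remove cells with a high mitochondrial fraction (illustrative 5% threshold).
adata = adata[adata.obs["pct_counts_mt"] < 5].copy()

# CPM normalization (Eq. 1) followed by logarithmic scaling.
sc.pp.normalize_total(adata, target_sum=1e6)
sc.pp.log1p(adata)
```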

Figure 2

Investigating the distribution of the data to filter out weakly expressed genes and low-quality cells; (a) number of expressed genes, (b) total counts per cell, and (c) percentage of mitochondrial genes for the H1299 scRNA-seq dataset.

At this point, we extracted highly variable genes (HVGs) as part of the feature selection step, aiming to shrink the search space; only these genes are examined in further evaluation. HVGs are genes that are expressed significantly more or less in some cells compared to others. This quality-control step helps ensure that the observed differences are due to biological variation rather than technical noise. The simplest approach to compute such variation is to quantify the variance of the expression values of each gene across all samples. A good trade-off between mean and variance helps select the subset of genes that retains useful biological knowledge while removing noise. We use the log-normalized data so that clustering and dimensionality reduction operate on the same log-values and the analysis remains consistent through all steps. There are conventional approaches to find the best threshold. The normalized dispersion is obtained by scaling the mean and standard deviation of the dispersion for genes falling into a given bin of mean expression (Fig. 3); that is, HVGs are selected within each bin.
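A sketch of this dispersion-based selection with Scanpy's binned, Seurat-style flavor is given below; the cut-off values shown are common defaults that stand in for the data-specific thresholds used in practice.

```python
import scanpy as sc

# Bin genes by mean expression, z-score the dispersion within each bin (Fig. 3),
# and keep genes whose normalized dispersion exceeds the threshold.
sc.pp.highly_variable_genes(adata, flavor="seurat", n_bins=20,
                            min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var["highly_variable"]].copy()
```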

The top genes in the dataset are visualized in Fig. 4, before and after normalization.

Figure 3

Comparison of the dispersion of normalized and non-normalized genes used to extract highly variable genes.

Figure 4

Comparison of the top 20 highly-variable genes before and after normalization.

Dimensionality reduction

Most real-life data is multidimensional, and most high-dimensional data is complex and sparse. Understanding the data in such dimensions is tricky, and visualization is not possible. Dimensionality reduction is the process of transforming data from a high-dimensional space to a low-dimensional space while retaining some of the original data’s meaningful properties. Working in high-dimensional spaces may be inconvenient for various reasons; for example, data analysis is typically computationally intractable. Single-cell gene expression data is complex and should be well explored. Each gene is characterized as a dimension in a single-cell expression profile. As such, dimensionality reduction is very effective at summarizing biological attributes in fewer dimensions. Dimensionality reduction techniques are divided into linear and non-linear ones, which we discuss in detail in the following subsections.

Modified locally linear embedding

Modified LLE (MLLE) is a non-linear dimensionality reduction technique and an enhanced version of LLE. To understand MLLE, we first need to understand the LLE algorithm. When used for dimensionality reduction, LLE attempts to reveal the underlying manifold structure based on simple geometric intuitions. LLE preserves the locality of the data in lower dimensions because it reconstructs each sample point from its neighbors. In other words, LLE focuses on finding the lower-dimensional representation of high-dimensional data that most accurately preserves the locally linear structure of neighboring point patterns.

In its simplest formulation, LLE first identifies the t nearest neighbors of each data point, as measured by Euclidean distance24. One can choose the number of neighbors, t, based on rules, metrics, or simply a random number. Consider n sample points \(\mathbf {X}=\{\mathbf {x}_1,\mathbf {x}_2, \ldots ,\mathbf {x}_n\}\) in a high-dimensional space \(R^d\). For each sample point \(\mathbf {x}_i\) and its neighborhood set \(N_i = \{\mathbf {x}_t,t \in t_i\}\), one can form the t-NN graph and construct the locally linear structure at that point using a combination of reconstruction weights \(\mathbf {W}=\{w_{ti},t \in t_i\}\), \(i= {1, \ldots , n}\). On this basis, each data point is viewed as a small linear patch of the sub-manifold.

To compute the weights \(w_{ti}\) for the linear reconstruction of each point, we minimize the cost function with respect to two constraints: (1) each data point \(\mathbf {x}_i\) is reconstructed only from its neighbors, imposing \(w_{ti}=0\) if \(\mathbf {x}_t\) does not belong to that set; (2) each row of the weight matrix sums to one, that is, \(\sum {w_{ti}}=1\). The optimal weights are calculated by solving Eq. (2), a constrained least squares problem:

$$\begin{aligned} \min \ \Vert {\mathbf {x}_i - \sum _{t\in t_i}w_{ti}\mathbf {x}_t} \Vert \quad \text {s.t.} \quad \sum _{t \in t_i}w_{ti}=1. \end{aligned}$$
(2)

The matrix \(\mathbf {G}_i=[\ldots , \mathbf {x}_t - \mathbf {x}_i, \ldots ]_{t \in t_i}\) helps formulate the local weight vector \(w_{ti}\); hence, (2) can be reformulated as:

$$\begin{aligned} \min \ \Vert {\mathbf {G}_i w} \Vert , \quad \text {s.t.} \quad \sum _{t \in t_i}w_{ti}=1, \quad \text {and} \quad w^T 1_{ti}=1, \end{aligned}$$
(3)

where \(1_{ti}\) denotes the vector of all ones whose dimension equals the size of the neighborhood \(t_i\).

Using this formulation, the optimal weight vector can be computed via the singular value decomposition of \(G_i\), as the solution of the linear system \(G_i^T {G_i}w = 1_{ti}\), rescaled so that the weights sum to one.

Finally, using the same weights computed in the input space, each high-dimensional input sample \(\mathbf {x}_i\) is mapped to a point \(\mathbf {y}_i\) of the lower-dimensional set \(\mathbf {Y}=\{\mathbf {y}_1,\mathbf {y}_2, \ldots ,\mathbf {y}_n\}\) in \(R^m\) (\(m \ll d\)), representing the manifold’s global internal coordinates.

Equation (4) reflects the locality preservation property by solving a minimization problem over the output manifold:

$$\begin{aligned} \min _{Y=\left[ y_1, \ldots ,y_n\right] } \sum _{i=1}^n \Vert {\mathbf {y}_i - \sum _{t\in t_i}w_{ti}\mathbf {y}_t}\Vert ^2 \quad \text {s.t.}\quad YY^T=I. \end{aligned}$$
(4)

Regularization is a well-known problem in LLE, which manifests itself in the embedding that distorts the underlying geometry of the manifold. Standard LLE uses an arbitrary regularization parameter concerning the weight matrix’s local trace25. MLLE overcomes this problem using multiple weight vectors in each neighborhood, discovering a more stable and enhanced embedding space. MLLE modifies or adjusts the reconstruction weights, which modifies the embedding cost function as follows:

$$\begin{aligned} \min _{Y=\left[ y_1, \ldots ,y_n\right] } \sum _{i=1}^n \sum _{l=1}^{s_i} \Vert {\mathbf {y}_i - \sum _{t\in t_i} {w}_{ti}^{l} \mathbf {y}_t}\Vert ^2 \quad \text {s.t.}\quad YY^T=I, \end{aligned}$$
(5)

where \(s_i\) is the number of weight vectors used for \(\mathbf {x}_i\), determined by the number of smallest right singular values of \(\mathbf {G_i}\).

This aims to take advantage of the dense relations that exist in the embedding space26.
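A minimal MLLE sketch with scikit-learn is given below; the neighbor count and target dimension are placeholders that are tuned per dataset (see Parameter Optimization), and X stands for the pre-processed cells-by-HVGs matrix.

```python
from sklearn.manifold import LocallyLinearEmbedding

# method="modified" selects MLLE, which uses multiple weight vectors per
# neighborhood; n_neighbors must be at least n_components for this method.
mlle = LocallyLinearEmbedding(n_neighbors=12, n_components=3,
                              method="modified", random_state=0)
Y = mlle.fit_transform(X)   # X: pre-processed expression matrix (cells x HVGs)
```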

Independent component analysis

ICA is a linear dimensionality reduction method based on statistical independence. Using simple assumptions about the statistical properties of the data, ICA learns an efficient linear transformation and attempts to find the underlying structures present in the data27. Based on the definitions given in28, ICA is considered a special case of projection pursuit; it is a technique for finding relevant projections of multi-dimensional data. Such projections can then be used for enhanced visualization of the clustered data. When ICA is used for visualizing the data, dimensionality reduction becomes its secondary objective28. Unlike other approaches, the underlying vectors of the transformation are presumed to be independent of one another. It relies on a non-Gaussian data structure, which is crucial for retrieving the transformed underlying data components. ICA aims to find projections of the data that provide estimates of the independent components28. When dealing with noise-free data, if no assumptions are made on the data, ICA can be considered a high-performance method of exploratory data analysis28. Consider \(\mathbf {r}\) a random vector whose elements are \(\{\mathbf {r}_1,\mathbf {r}_2, \ldots ,\mathbf {r}_n\}\), similarly a random vector \(\mathbf {s}\) with elements \(\{\mathbf {s}_1,\mathbf {s}_2, \ldots ,\mathbf {s}_n\}\), and \(\mathbf {A}\) a matrix with elements \(a_{ij}\). ICA is a generative model, which describes how the observed data are generated by mixing the components \(s_i\) (Eq. 6). The independent components \(\mathbf {B}\) are latent variables, which means they are unknown. The mixing matrix \(\mathbf {A}\) is also assumed to be unknown, and \(\mathbf {V}\) is the observed matrix.

$$\begin{aligned} \begin{aligned} \mathbf {r}=\mathbf {A}\mathbf {s} \\ \mathbf {V}=\mathbf {A}\mathbf {B}. \end{aligned} \end{aligned}$$
(6)

The rows of these matrices are assumed to be statistically independent rather than merely orthogonal; as such, ICA yields more independent components than PCA. ICA untangles the complex relationships hidden in the observed data and translates them into meaningful measurements without prior knowledge of how the sources were mixed. For this reason, ICA is referred to as blind source separation29.
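A hedged sketch of ICA with scikit-learn's FastICA is shown below, applied here to the low-dimensional MLLE coordinates Y as in our pipeline; the parameter values are illustrative.

```python
from sklearn.decomposition import FastICA

# Estimate statistically independent (not necessarily orthogonal) components of
# the embedding; mixing_ is the estimated mixing matrix A of Eq. (6).
ica = FastICA(n_components=2, random_state=0)
Y_ica = ica.fit_transform(Y)   # observations -> independent components
A_hat = ica.mixing_
```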

Other dimensionality reduction methods

We compared our proposed method with other dimensionality reduction techniques, namely standard LLE, Isomap, Laplacian eigenmaps, PCA, and t-SNE. Isomap stands for isometric mapping; it is a non-linear dimensionality reduction method based on spectral theory that aims to preserve geodesic distances in the lower dimension. Isomap starts by creating a neighborhood network; it then uses the graph distance to approximate the geodesic distance among all pairs of points. The eigenvalue decomposition of the geodesic distance matrix yields the lower-dimensional embedding of the data30. The Laplacian eigenmap is another non-linear technique. It is computationally efficient and maps nearby input patterns to nearby outputs by computing a lower-dimensional representation of a high-dimensional data set, focusing on preserving local proximity relations among input data points31. t-SNE is also a non-linear dimensionality reduction technique that is commonly used for visualization and has been extensively applied in genomic data analysis and speech processing6. On the other hand, PCA is a popular linear technique used for feature extraction or dimensionality reduction. Given a dataset of d-dimensional points, PCA maps the data linearly onto a lower-dimensional subspace so that the dispersion of the data is maximized. It does so via eigendecomposition of the covariance matrix. The principal components (the eigenvectors corresponding to the largest eigenvalues) are used to recreate a substantial portion of the variance of the original data30.
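For the comparative analysis, the corresponding scikit-learn estimators can be applied to the same pre-processed matrix X; the sketch below uses illustrative parameter values rather than the tuned settings reported later.

```python
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding, SpectralEmbedding, TSNE

# Two-dimensional embeddings of the same input matrix with each technique.
embeddings = {
    "PCA": PCA(n_components=2).fit_transform(X),
    "Isomap": Isomap(n_neighbors=12, n_components=2).fit_transform(X),
    "Standard LLE": LocallyLinearEmbedding(n_neighbors=12,
                                           n_components=2).fit_transform(X),
    "Laplacian eigenmap": SpectralEmbedding(n_neighbors=12,
                                            n_components=2).fit_transform(X),
    "t-SNE": TSNE(n_components=2, perplexity=30,
                  random_state=0).fit_transform(X),
}
```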

Clustering

Clustering is one of the critical tasks in single-cell data analysis. Clusters are formed by grouping cells based on the similarity of their gene expression profiles. Distance functions, which take the dimensionality-reduced representations as input, are used to describe expression profile similarity. We used the popular k-means clustering technique, an iterative algorithm that partitions the data into k separate groups \(\mathbf {C}=\{\mathbf {c}_1,\mathbf {c}_2,\ldots ,\mathbf {c}_k\}\) by minimizing the within-cluster dispersion while maximizing the inter-cluster distances. The number of clusters to be formed from the data needs to be specified as an input parameter to the algorithm.

\(\textit{k}\)-means works in three key steps. The first step is to choose the initial centroids; the simplest method is to choose \(\textit{k}\) samples from the dataset \(\mathbf {X}=\{\mathbf {x}_1,\mathbf {x}_2, \ldots ,\mathbf {x}_n\}\). Then, each point in the dataset is assigned to its nearest centroid. The next step updates the centroids by taking the mean value of all samples assigned to each of them. The algorithm calculates the difference between the old and new centroids and repeats the last two steps until this difference falls below a certain threshold; in other words, it keeps iterating until almost no change in the centroids is observed. The points thus gravitate toward their centroids, which leads to a high degree of cluster compactness, that is, a minimum sum of squared errors (SSE), as shown in Eq. (7); wherein n is the number of samples in the data, C is the set of cluster centroids, \(\mu _j\) is the mean (centroid) of the jth cluster \(C_j\), and \(\mathbf {x}_i\) is the corresponding sample. How to choose the best number of clusters is explained in the next subsection, Parameter Optimization; a brief sketch of the clustering step follows Eq. (7).

$$\begin{aligned} SSE= \sum _{i=1}^n \min _{\mu _j \in C} \Vert \mathbf {x}_i - \mu _j\Vert ^2. \end{aligned}$$
(7)
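A brief k-means sketch on the reduced data; the number of clusters is a placeholder selected per dataset as described in the next subsection, and Y is the low-dimensional embedding from MLLE.

```python
from sklearn.cluster import KMeans

# Cluster the low-dimensional embedding Y; inertia_ is the within-cluster SSE
# of Eq. (7).
km = KMeans(n_clusters=8, n_init=10, random_state=0)
labels = km.fit_predict(Y)
sse = km.inertia_
```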

Parameter optimization

When applying MLLE, a neighborhood graph (t-NN) is created by connecting points that are close to each other. Different measures are used for this purpose, including the number of neighbors and the distance from each point to its neighbors, among others. A common measure to determine the sparsity of the neighbor graph is the tolerance factor, which makes the graph sparser or denser. In this regard, we tested different tolerance values on each dataset and selected those that yielded the best validity index scores, as explained in the Performance Evaluation subsection. With the aim of preserving locality, the number of nearest neighbors, t, is a crucial parameter for constructing the neighborhood graph. Another critical parameter is the number of clusters, k, in the clustering algorithm. The number of nearest neighbors, t, is examined within the range [8, 24], and for each value of t the number of clusters k is assessed in the range 4 to 14, with the validity indices calculated for each clustering.

We followed the “elbow” method to select the combination of t and k with the highest score, considering all three validity indices, as explained in the Performance Evaluation subsection. When all scores are plotted against the number of clusters on the x-axis, an elbow-shaped point is observed at a certain value of k: the scores decrease as k increases, and beyond the best number of clusters the curve flattens into a line almost parallel to the x-axis. The corresponding plots are provided in Supplementary Fig. S1. A simplified sketch of this grid search is shown below.
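The grid search below is a simplified illustration of this procedure: it scores each (t, k) pair with the Silhouette coefficient only and leaves the elbow inspection (Supplementary Fig. S1) to the analyst; the ranges follow those stated above, and X is the pre-processed matrix.

```python
from sklearn.cluster import KMeans
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.metrics import silhouette_score

scores = {}
for t in range(8, 25):                               # neighbors t in [8, 24]
    Y_t = LocallyLinearEmbedding(n_neighbors=t, n_components=3,
                                 method="modified").fit_transform(X)
    for k in range(4, 15):                           # clusters k in [4, 14]
        labels_tk = KMeans(n_clusters=k, n_init=10,
                           random_state=0).fit_predict(Y_t)
        scores[(t, k)] = silhouette_score(Y_t, labels_tk)

# The score-versus-k curves (one per t) are then inspected for an elbow;
# the highest-scoring pair is a convenient starting point.
best_t, best_k = max(scores, key=scores.get)
```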

Performance evaluation

Generally speaking, the best clustering is the one that maintains large inter-cluster distances and gives the most compact clusters. In this work, we used the Silhouette coefficient32, an evaluation metric that compares, for each sample point, the mean distance to all other points in the same cluster with the mean distance to the points in the nearest neighboring cluster. Consider a set of clusters \(\mathbf {C}\) = \(\left\{ \mathbf {C}_1, \mathbf {C}_2,\ldots ,\mathbf {C}_k \right\}\) output by a clustering algorithm, \(\textit{k}\)-means in our case. The Silhouette coefficient, SH, for the \(i^{th}\) sample point in cluster \(\mathbf {C}_j\), where \(j = 1, \ldots , k\), can be defined as follows:

$$\begin{aligned} \begin{aligned} SH\left( \mathbf {x}_i\right) =\frac{b_c\left( \mathbf {x}_i\right) -w_c\left( \mathbf {x}_i\right) }{max\left( w_c\left( \mathbf {x}_i\right) ,b_c\left( \mathbf {x}_i\right) \right) }, \end{aligned} \end{aligned}$$
(8)

where \(w_c\) is the mean distance between point \(\mathbf {x}_i\) and all other points inside its own cluster \(\mathbf {C}_j\) (within-cluster distance), and \(b_c\) is the minimum mean distance between the sample point \(\mathbf {x}_i\) and the points of the nearest neighboring cluster (between-cluster distance); these are calculated as:

$$\begin{aligned} \begin{aligned} w_c\left( \mathbf {x}_i\right) =\frac{1}{|\mathbf {C}_j|-1}\sum _{\mathbf {x}_l\in \mathbf {C}_j,l\ne i}d\left( \mathbf {x}_i,\mathbf {x}_l\right) \\ b_c\left( \mathbf {x}_i\right) =\min _{k\ne j}\frac{1}{|\mathbf {C}_k|}\sum _{\mathbf {x}_l\in \mathbf {C}_k} d\left( \mathbf {x}_i,\mathbf {x}_l\right) . \end{aligned} \end{aligned}$$
(9)

We also used the Calinski–Harabasz (CH) and Davies–Bouldin (DB) validity indices to assess the clustering performance. The CH score is used to evaluate the model, where a higher score indicates better-defined clusters33. CH is the ratio of the sum of between-cluster dispersion to the sum of within-cluster dispersion over all clusters, calculated as follows:

$$\begin{aligned} CH=\frac{tr\left( \mathbf {S}_B\right) }{tr\left( \mathbf {S}_W\right) } \times \frac{n - k}{k - 1}, \end{aligned}$$
(10)

where n is the number of input samples, \(tr(\mathbf {S}_B)\) is the trace of the between-group dispersion matrix, and \(tr(\mathbf {S}_W)\) is the trace of the within-cluster dispersion matrix.

The Davies–Bouldin (DB) index34 is another validity measure, defined as the average similarity of each cluster with its most similar cluster. DB is computed as follows:

$$\begin{aligned} DB=\frac{1}{k}\sum _{i=1}^k max_{i\ne j}s_{ij}, \end{aligned}$$
(11)

where \(s_{ij}\) is the ratio between the within-cluster distances and the between-cluster distance, calculated as \(s_{ij}=\frac{w_i+w_j}{d_{ij}}\). The smaller the DB value, the better the clustering; as such, we aim to minimize Eq. (11). Here, \(d_{ij}\) is the Euclidean distance between cluster centroids \(\mu _i\) and \(\mu _j\), and \(w_i\) is the within-cluster distance of cluster \(\mathbf {C}_i\).

Overall, we used the Silhouette score to evaluate the clustering performance, whereas CH and DB indices were used to verify and find the optimal parameters, namely the best number of clusters for our experiments.
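All three indices are available in scikit-learn; a minimal sketch for one clustering result, where Y is the embedding and labels is the k-means assignment from the previous steps.

```python
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

sh = silhouette_score(Y, labels)          # Eq. (8): higher is better
ch = calinski_harabasz_score(Y, labels)   # Eq. (10): higher is better
db = davies_bouldin_score(Y, labels)      # Eq. (11): lower is better
```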

Cluster annotation

To validate the obtained clusters, we first identified the top 20 differentially expressed genes in each cluster based on the Wilcoxon test and considered them as marker genes that drive the separation among clusters. Marker genes are up- or down-regulated in different individual cells, pathways, or GO terms. We used Gene Set Enrichment Analysis (GSEA)35,36 to annotate the clusters with the cell types corresponding to each group of marker genes. GSEA37 is a computational tool that determines whether a predefined set of genes shows a statistically significant level of expression in a specific cell type, biological process, cellular component, molecular function, or biological pathway. GSEA uses MSigDB, the Molecular Signatures Database, to provide gene sets for the enrichment analysis. We also employed ToppCluster38 for Gene Ontology (GO) analysis. ToppCluster is an online multi-gene-list functional enrichment analysis tool used to identify the GO terms and pathways associated with the top gene lists extracted from each cluster. Pathways were extracted from the MSigDB C2 BIOCARTA (V7.3) database39. The corresponding networks are visualized using Cytoscape40. We decreased the minimum number of genes required in the corresponding annotations to achieve a better visualization. A sketch of the marker-gene extraction step is shown below.
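A hedged sketch of the marker-gene step with Scanpy's Wilcoxon rank-sum test; the cluster labels are assumed to come from the k-means step, and the resulting per-cluster gene lists would then be submitted to GSEA and ToppCluster.

```python
import pandas as pd
import scanpy as sc

# Attach the k-means labels to the AnnData object as a categorical column.
adata.obs["cluster"] = pd.Categorical(labels.astype(str))

# Rank genes per cluster with the Wilcoxon test and keep the top 20 as markers.
sc.tl.rank_genes_groups(adata, groupby="cluster", method="wilcoxon", n_genes=20)
markers = {g: sc.get.rank_genes_groups_df(adata, group=g)["names"].tolist()
           for g in adata.obs["cluster"].cat.categories}
```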

Results and discussion

We first applied the Scanpy pipeline, including its clustering method (Leiden clustering), on the PBMC dataset; the corresponding results are presented in Supplementary Fig. S3. Then, in order to identify the most suitable clustering and dimensionality reduction methods with higher performance, we employed Scanpy only for the pre-processing step. Since different tools share the same pre-processing approach, this facilitates new innovations emerging in scRNA-seq analysis.

We developed a well-constructed pipeline that can be applied to scRNA-seq data to discover individual cell types. Considering dimensionality reduction and clustering as the two most significant steps in the pipeline, we explored many ways of untangling the data in two and three dimensions. We found optimal parameters for both dimensionality reduction and clustering that achieve a meaningful separation of cell types and compact clusters. To demonstrate the applicability of our pipeline, we tested it on thirteen datasets of different sizes. Finally, we evaluated our method from both a computational and a biological perspective. As k-means and, generally, all distance-based methods are known to not work well with non-linear methods such as t-SNE, we employed DBSCAN clustering to investigate this behavior on the datasets used in this work. The resulting Silhouette score for the Calu3 dataset is 0.871, which is lower than that of k-means (0.924) on the same dataset. These results, along with further discussion, are provided in Supplementary Fig. S2.

Clustering and cell type discovery

To achieve optimized results, we experimented with all possible combinations of parameters, as discussed in the Materials and Methods section. The best parameters obtained for each dataset are listed in Table 2. For a few datasets, to achieve the best clustering score with the proposed approach, the data is first reduced to an intermediate number of dimensions, such as 5, 6, or 7, and afterward reduced to three dimensions for visualization and better results. The results of k-means clustering combined with each dimensionality reduction method using the best parameters are listed in Table 3. The last column shows the result of applying ICA to the output of clustering combined with MLLE. The clustering scores range from 0 to 1; a score close to 1 represents good-quality clustering, with 1 being the best, while a score near zero indicates that the clusters are not well defined.

Table 2 Parameters used for the experiments; t is the number of neighbors and k is the number of clusters. These parameters are selected considering dimensionality reduction and clustering together.
Table 3 Silhouette score comparison of the proposed method with other dimensionality reduction techniques.

When testing other widely-used techniques such as t-SNE and PCA, we noticed that neither method was efficient in separating the data into well-defined clusters. On the other hand, the results of Isomap and Laplacian eigenmaps show slightly better performance comparatively. To demonstrate this graphically, we visualize the two-dimensional projections of cells produced by the different dimensionality reduction methods, colored by k-means clustering, on the H1299 scRNA-seq dataset in Figs. 5, 6, 7, 8, 9 and 10. Moreover, three-dimensional results on the same dataset are presented in Supplementary Fig. S5, Supplementary Material.

Finally, we investigated MLLE and found the most insightful cluster separation in most of the datasets. This outcome demonstrates the power of MLLE in exploring the data’s dense and complex relations, creating better embeddings in lower-dimensional spaces. We performed an additional dimensionality reduction step using ICA to enhance the visualization of the clusters. The last column of Table 3 shows that MLLE combined with ICA improves the overall results, except for a few datasets in which we do not notice much difference; the very negligible differences of 0.004 (Baron_human1), 0.001 (Baron_human2), 0.014 (Baron_human3), 0.004 (Segerstolpe), and 0.011 (Xin) can be ignored. To achieve a better view of the impact of ICA on the MLLE transformation, we show a visual comparison of the clusters in Figs. 11 and 12. The two-dimensional ICA projection of the cells applied to the three-dimensional MLLE data shows the best visualization and clustering scores (Fig. 12). When applied alone, ICA performs very poorly, with largely inseparable clusters (Fig. 10). This is because ICA is limited to linear transformations, whereas manifold learning techniques consider the data locally and can thus reveal complex relationships among the data points in higher-dimensional spaces. We instead applied ICA on the lower-dimensional data because we observed well-marked “lines” or “axes” in the three-dimensional data, which led us to conclude that we could apply ICA to learn the linearly independent, not necessarily orthogonal, components. Applying ICA reveals some hidden, complex relationships among the cells in the clusters that are not noticeable in three dimensions.

Figure 5

Two-dimensional t-SNE projection of cells colored by k-means clustering applied on high-dimensional original data (H1299 scRNA-seq); outliers have been removed to enhance visualization.

Figure 6

Two-dimensional PCA projection of cells colored by k-means clustering applied on high-dimensional original data (H1299 scRNA-seq).

Figure 7

Two-dimensional Laplacian eigenmap projection of cells colored by k-means clustering applied on high-dimensional original data (H1299 scRNA-seq); outliers have been removed to enhance visualization.

Figure 8

Two-dimensional Isomaps projection of cells colored by k-means clustering applied on high-dimensional original data (H1299 scRNA-seq).

Figure 9

Two-dimensional Standard LLE projection of cells colored by k-means clustering applied on high-dimensional original data (H1299 scRNA-seq); outliers have been removed to enhance visualization.

Figure 10

Two-dimensional ICA projection of cells colored by k-means clustering applied on high-dimensional original data (H1299 scRNA-seq).

Figure 11

Three-dimensional MLLE projection of cells colored by k-means clustering applied on high-dimensional original data (H1299 scRNA-seq).

Figure 12

Two-dimensional ICA projection of cells colored by k-means clustering applied to the three-dimensional points output by MLLE on the H1299 scRNA-seq dataset.

Biological assessment

As shown in Table 4, some of the pancreatic cell types for the pancreas datasets, such as the Baron human dataset, are found within well-defined gene sets in the C8 collection of MSigDB, which includes cell-type signature gene sets. They include ’MURARO PANCREAS ALPHA CELL’, ’MURARO PANCREAS ENDOTHELIAL CELL’, ’MURARO PANCREAS MESENCHYMAL STROMAL CELL’, ’MURARO PANCREAS DUCTAL CELL’, and ’MURARO PANCREAS ACINAR CELL’. Other cell types, such as HB2, correspond to a cell line originating from epithelial cells.

Table 4 Identified cell types for Baron_human1 dataset.

Regarding the PBMC dataset, we identified gene sets from MSigDB based on the ranked gene sets, including TRAVAGLINI LUNG CD8 NAIVE T CELL, TRAVAGLINI LUNG PLATELET MEGAKARYOCYTE CELL, AIZARANI LIVER C18 NK NKT CELLS 5, DURANTE ADULT OLFACTORY NEUROEPITHELIUM DENDRITIC CELLS, TRAVAGLINI LUNG OLR1 CLASSICAL MONOCYTE CELL, FAN OVARY CL12 T LYMPHOCYTE NK CELL 2, and AIZARANI LIVER C34 MHC II POS B CELLS. The corresponding cell types are presented in Table 5. Moreover, we observed some reported marker genes of the PBMC dataset in some clusters, which are shown in the same table as well.

Table 5 Identified cell types for PBMC dataset.

Additionally, the networks of GO terms and pathways associated with the corresponding marker genes of the H1299 scRNA-seq dataset are depicted in Figs. 13 and 14, respectively. For each cluster, we identified a set of biological process or pathway terms that connect with a term significantly associated with the top-20 gene list of that cluster. As observed in Fig. 14, some significant pathways are found to be enriched in immunity functions and signaling, including SARS-CoV-2 innate immunity evasion, host–pathogen interaction of human coronaviruses, SARS coronavirus and innate immunity, type II interferon signaling (IFNG), and the human immune response to tuberculosis. Also, Fig. 13 shows that most biological processes are associated with immunity functions, including response to interferon-alpha, protection from natural killer cells, type III interferon production, regulation by virus of viral protein levels in a host cell, and detection of virus, among others. In addition, we obtained a list of overlapping marker genes involved in Herpes simplex virus 1 (HSV-1) infection and the Influenza A pathway. These findings suggest potential markers for subsequent medical treatment or drug discovery by comparing similar diseases in terms of functionality. Moreover, although numerous findings suggest potential links between HSV-1 and Alzheimer’s disease, a causal relationship has not yet been demonstrated41.

Figure 13

A set of biological processes that are enriched by marker genes in the H1299 scRNA-seq dataset. The numbers denote the clusters, and the edges show the links between a cluster and a biological process term.

Figure 14

Pathways that are enriched by marker genes in the H1299 scRNA-seq dataset. The numbers denote the clusters, and the edges show the links between a cluster and a pathway. The nodes highlighted in yellow show the SARS-CoV-2 cell-specific pathways. Most of the other green nodes reveal the shared and cluster-specific functional pathways in the immune system.

Conclusion and future work

This work focuses on the identification of different cell types using manifold learning combined with clustering techniques on scRNA-seq data. Identifying similarities that result from structural, functional, or evolutionary relationships among the genes is the primary goal of clustering the cells. Our proposed two-step representation learning approach demonstrated that the k-means clustering technique combined with Modified LLE leads to improved clustering output and a meaningful organization of cell clusters by “untangling” the complex, hidden relationships in the higher-dimensional space.

Non-linear dimensionality reduction methods have been shown to be very powerful, as they preserve the locality of the data from higher to lower dimensions. UMAP is one of the most commonly-used non-linear dimensionality reduction techniques and has been shown to perform well on large-scale scRNA-seq data. However, for dimensionality reduction, UMAP is not as efficient as MLLE on high-dimensional cytometry data, especially when combined with clustering to enhance the visualization of the clustering results. This behavior of MLLE has been observed in our experiments; a comparative analysis with UMAP in the Supplementary Material, Supplementary Fig. S4, confirms this observation.

Moreover, performing ICA on the transformed data, after applying manifold learning techniques, provides an enhanced view of the data in a reduced space. Evaluating ICA as a visualization scheme and further reduction step after MLLE shows better clustering and enhanced visualization simultaneously. This points to a research avenue that combines non-linear manifold learning techniques followed by linear methods, which has been shown to be more powerful than conventional methods such as PCA or ICA applied alone.

Using multiple benchmark datasets shows the effectiveness of our proposed method. Performing gene set enrichment analysis to annotate the set of HVGs obtained from each cluster reveals biomarker genes involved in different gene ontology terms.

There are other potential applications for investigating scRNA-seq data, even beyond cell type identification. Extending the proposed method by employing other manifold learning or deep learning techniques to address further epigenetic challenges in scRNA-seq data analysis, such as trajectory analysis, is our next step.