Abstract
Identifying relevant disease modules such as target cell types is a significant step in studying diseases. High-throughput single-cell RNA-Seq (scRNA-seq) technologies have advanced in recent years, enabling researchers to investigate cells individually and understand their biological mechanisms. Computational techniques such as clustering are the most suitable approach for scRNA-seq data analysis when the cell types have not been well characterized. These techniques can be used to identify a group of genes that belong to a specific cell type based on their similar gene expression patterns. However, due to the sparsity and high dimensionality of scRNA-seq data, classical clustering methods are not efficient. Therefore, the use of nonlinear dimensionality reduction techniques to improve clustering results is crucial. We introduce a method that identifies representative clusters of different cell types by combining nonlinear dimensionality reduction techniques and clustering algorithms. We assess the impact of different dimensionality reduction techniques combined with clustering on thirteen publicly available scRNA-seq datasets of different tissues, sizes, and technologies. We further performed gene set enrichment analysis to evaluate the proposed method's performance. Our results show that modified locally linear embedding combined with independent component analysis yields overall the best performance relative to existing unsupervised methods across the different datasets.
Introduction
Single-cell sequencing is an emerging technology used to capture cell information at single-nucleotide resolution, by which individual cells can be analyzed separately^{1}. To date, single-cell RNA-seq (scRNA-seq) datasets have been generated for different purposes^{2}. However, these high-dimensional and sparse data lead to several analytical challenges. While many computational methods have been successfully proposed for analyzing scRNA-seq data, there are still open problems in this research area. One of the main challenges is the sparsity of the data and the curse of dimensionality present in scRNA-seq data. Also, performing well-defined preprocessing steps enhances the quality of the data and enables new biological insights. Analyzing scRNA-seq data can be divided into two main categories: the cell level and the gene level. Finding cell subtypes or highly differentially expressed tissue-specific gene sets is one of the common challenges at the cell level^{3}. Arranging cells into clusters to uncover the data's heterogeneity is arguably the most significant step of any scRNA-seq downstream analysis. This step can be used to distinguish tissue-specific subtypes based on identified gene sets. Indeed, cell clustering aims to identify cell types based on the patterns embedded in gene expression without prior knowledge at the cell level. Since the number of genes profiled in scRNA-seq data is typically large, cells are not related by simple metric distances but rather by complex relationships in high-dimensional spaces^{4}. Therefore, traditional dimensionality reduction and clustering algorithms are not efficient in these scenarios, and hence they cannot efficiently separate individual cell types. Several algorithms have been proposed to this aim, alleviating the problem of the curse of dimensionality.
Dimensionality reduction techniques have been widely used in large-scale scRNA-seq data processing^{5}. Most previous studies use principal component analysis (PCA). However, one of the main drawbacks of PCA is that it cannot deal with sparse matrices and nonmetric relationships among high-dimensional data points. Other works have employed PCA as a preprocessing step to remove cell outliers before dimensionality reduction and visualization. Nonlinear dimensionality reduction methods have also been proposed, including t-distributed Stochastic Neighbor Embedding (t-SNE)^{6}. This method is able to preserve the local structure of the data, although it is not efficient when applied to very large datasets^{7}.
Moreover, various studies have used unsupervised clustering models to identify rare or novel cell types. For instance, hierarchical clustering either divides large clusters into smaller ones or progressively merges data points into larger clusters. This algorithm has been employed to analyze scRNA-seq data by BackSPIN^{8} and pcaReduce^{9}, which perform dimensionality reduction after each division or merge in an iterative manner. k-means, one of the most common clustering algorithms, has been employed in Monocle specifically for analyzing scRNA-seq data^{10}. Also, the authors of^{11} used the Louvain algorithm, which is based on community detection techniques for analyzing complex networks^{12}.
However, to achieve acceptable clustering performance on scRNA-seq data, other comprehensive studies indicated that hybrid models, designed as a combination of clustering and dimensionality reduction techniques, tend to improve clustering results^{13}. They learned 20 different models using four dimensionality reduction methods: PCA, non-negative matrix factorization (NMF), filter-based feature selection (FBFS), and Independent Component Analysis (ICA). They also used five clustering algorithms: k-means, density-based spatial clustering of applications with noise (DBSCAN), fuzzy c-means, Louvain, and hierarchical clustering. Their experiments highlighted the positive effect of hybrid models and showed that feature-extraction methods can be a decent way to improve clustering performance. Their experimental results indicate that Louvain combined with ICA performs well in small feature spaces.
This paper proposes a model to obtain well-separated and meaningful clusters of cells from large-scale scRNA-seq data. We focus on the combination of unsupervised dimensionality reduction followed by conventional clustering. We propose a hybrid model consisting of a nonlinear dimensionality reduction technique, modified locally linear embedding (MLLE), and a linear method, ICA, for visualization. We used PCA, t-SNE, Isomap, standard Locally Linear Embedding (LLE), and Laplacian eigenmaps to perform a comparative analysis across diverse methods. ICA is employed to enhance the visualization of the clustered data. Parameter tuning, i.e., choosing the best parameters for dimensionality reduction and clustering, has been one of the critical challenges, and it is well addressed in our work. Experimental results on thirteen different benchmark scRNA-seq datasets show the power of modified LLE and ICA in terms of the representation quality of the clustered data, providing very high accuracy and enhanced visualization. Confirmatory biological annotations were observed in the clusters using the corresponding marker genes found by our method.
Materials and methods
The block diagram of the proposed pipeline is depicted in Fig. 1. First, the scRNA-seq data is preprocessed based on the number of cells and the number of genes. Highly variable genes are extracted as part of the feature selection step after normalization and scaling of the filtered data. Linear regression is one of the most widely-used methods to regress out potential sources of variation present in the data, based on the total counts per cell and the mitochondrial percentage, as discussed in Refs.^{11,14}. The data obtained at this point is then processed to reduce the feature space into two or three dimensions; afterward, k-means clustering is applied. In addition, we performed ICA on the lower-dimensional data followed by k-means clustering to achieve improved visualization and meaningful clusters.
Datasets
To evaluate the performance of the proposed method, a total of thirteen benchmark datasets were used, which include single-cell gene expression profiles. The details of all datasets used in this work are given in Table 1. They vary in size, tissue (pancreas, lung, peripheral blood), sequencing protocol (three different protocols), and species (human and mouse). The peripheral blood dataset, 3k PBMC from a healthy donor, was downloaded from the 10\(X\) Genomics portal^{15}. Pancreas datasets include Baron (GSE84133)^{16}, Muraro (GSE85241)^{17}, Segerstolpe (E-MTAB-5061)^{18}, Xin (GSE81608)^{19}, and Wang^{20}. In the lung datasets, H1299 scRNA-seq (GSE148729)^{21} and Calu-3 scRNA-seq (GSE148729)^{21}, different cell lines were infected with SARS-CoV-1 and SARS-CoV-2 and sequenced at different time points. These datasets are unlabeled and come without any background knowledge of the data; in this case, we analyzed the data and provided useful information about the unknown data. On the other hand, cell-type annotations for PBMC, Baron, Segerstolpe, and Wang were provided with the data. We used these annotations as "background knowledge" during the evaluation of the clustering method. To make a fair assessment of the clustering methods, the cell annotations were removed before clustering; the labels were then used to compare the biological results.
Data preprocessing and quality control
A common practice for generating RNA-seq raw data is to use next-generation sequencing technologies to create read count matrices. The read count matrix contains gene names and their expression levels across individual cells. Before analyzing scRNA-seq data, one needs to ensure that gene expression measurements and cells are of standard quality. We followed a typical scRNA-seq analysis workflow, including quality control, as described in^{14,22}. A Python package, Scanpy, is used to perform the preprocessing and quality control steps. Based on the expression levels, we filtered out weakly expressed genes and low-quality cells to which few reads are mapped, as shown in Fig. 1, the first step of preprocessing. Low-quality cells that are dead, degraded, or damaged during sequencing are characterized by a low number of expressed genes. Genes expressed in fewer than three cells and cells with fewer than 200 expressed genes are removed.
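In the Scanpy workflow these two thresholds correspond to `sc.pp.filter_genes(adata, min_cells=3)` and `sc.pp.filter_cells(adata, min_genes=200)`. The same filtering can be sketched with plain NumPy on a cells × genes count matrix; the function and variable names below are illustrative, not the paper's code:

```python
import numpy as np

def filter_counts(counts, min_cells=3, min_genes=200):
    """Remove genes expressed in fewer than `min_cells` cells and
    cells with fewer than `min_genes` expressed genes.

    `counts` is a (cells x genes) raw read-count matrix."""
    gene_mask = (counts > 0).sum(axis=0) >= min_cells   # genes seen in enough cells
    counts = counts[:, gene_mask]
    cell_mask = (counts > 0).sum(axis=1) >= min_genes   # cells with enough genes
    return counts[cell_mask, :], gene_mask, cell_mask

# Tiny toy matrix (5 cells x 4 genes) with deliberately low thresholds.
toy = np.array([[0, 5, 1, 2],
                [0, 3, 0, 1],
                [1, 2, 2, 0],
                [0, 4, 1, 3],
                [0, 0, 0, 0]])
filtered, genes_kept, cells_kept = filter_counts(toy, min_cells=3, min_genes=2)
```

Gene filtering is applied first so that a cell's expressed-gene count is computed over the retained genes only, mirroring the order of the preprocessing step in Fig. 1.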
We also investigated the distribution of the data (Fig. 2) as a data-specific quality-control step and filtered out low-quality cells and genes. We removed cells with a high percentage of mitochondrial gene counts, as mitochondrial genes do not contribute significant information to the downstream analysis^{22,23}. Also, some cells present large total counts compared with other cells, indicating potential sources of variation. To reduce this effect, we scaled the data to unit variance. Since scRNA-seq data are expressed at different levels, normalization is a must. Normalization is the process of translating the numeric values of a dataset to a common scale without distorting the ranges of values. We normalize the data using Counts Per Million (CPM) normalization combined with logarithmic scaling:
\(\text {logCPM} = \log \left( \frac{\text {readsMappedToGene}}{\text {totalReads}} \times 10^{6} + 1 \right) \quad (1)\)

where totalReads is the total number of mapped reads of a sample, and readsMappedToGene is the number of reads mapped to a selected gene.
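The CPM normalization with log scaling can be sketched in a few lines of NumPy (an illustrative sketch, not the paper's code; in Scanpy the equivalent calls are `sc.pp.normalize_total` followed by `sc.pp.log1p`):

```python
import numpy as np

def log_cpm(counts):
    """Counts Per Million normalization followed by log scaling.

    counts: (cells x genes) matrix; totalReads is the per-cell sum,
    readsMappedToGene is each individual entry."""
    total_reads = counts.sum(axis=1, keepdims=True)   # totalReads per cell
    cpm = counts / total_reads * 1e6                  # readsMappedToGene / totalReads * 10^6
    return np.log1p(cpm)                              # log(CPM + 1) keeps zeros finite

counts = np.array([[10.0, 90.0],
                   [ 1.0,  9.0]])
norm = log_cpm(counts)
```

Note that after CPM scaling both toy cells have identical profiles: the normalization removes differences in sequencing depth while preserving each cell's relative gene proportions.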
At this point, we extracted highly variable genes (HVGs) as part of the feature selection step, aiming to minimize the search space; only these genes are examined in further evaluation. HVGs are genes that are expressed significantly more or less in some cells compared to others. This quality-control step helps ensure that the observed differences arise from biological variation rather than technical noise. The simplest approach to compute such variation is to quantify the variance of the expression values for each gene across all samples. A good tradeoff between mean and variance helps select the subset of genes that retains useful biological signal while removing noise. We use log-normalized data so that clustering and dimensionality reduction operate on the same log-values, keeping the analysis consistent through all steps. There are conventional approaches to find the best threshold. The normalized dispersion is obtained by scaling with the mean and standard deviation of the dispersions of genes falling into a given bin of mean expression (Fig. 3). This means that HVGs are selected within each bin.
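The binned normalized-dispersion selection is what Scanpy's `sc.pp.highly_variable_genes` implements; a minimal NumPy sketch of the idea follows (function name, bin count, and cutoff are illustrative assumptions):

```python
import numpy as np

def highly_variable_genes(logX, n_bins=3, z_cutoff=1.0):
    """Flag HVGs by normalized dispersion: genes are binned by mean
    expression, the dispersion (variance / mean) is z-scored within each
    bin, and genes above `z_cutoff` are flagged as highly variable."""
    mean = logX.mean(axis=0)
    var = logX.var(axis=0)
    dispersion = var / np.maximum(mean, 1e-12)
    # Bin edges at quantiles of the per-gene means.
    edges = np.quantile(mean, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(mean, edges)
    norm_disp = np.zeros_like(dispersion)
    for b in np.unique(bins):
        idx = bins == b
        mu, sd = dispersion[idx].mean(), dispersion[idx].std()
        norm_disp[idx] = (dispersion[idx] - mu) / (sd if sd > 0 else 1.0)
    return norm_disp > z_cutoff

# Toy data: 100 cells x 10 genes; gene 3 carries extra variance.
rng = np.random.default_rng(0)
X = rng.normal(1.0, 0.1, size=(100, 10))
X[:, 3] += rng.normal(0.0, 2.0, size=100)
hvg = highly_variable_genes(X, n_bins=2)
```

Binning by mean expression matters because raw dispersion grows with expression level; z-scoring within each bin compares a gene only against genes of similar abundance.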
The top genes in the dataset are visualized in Fig. 4, before and after normalization.
Dimensionality reduction
Most real-life data is multidimensional, and most high-dimensional data is complex and sparse. Understanding the data in such dimensions is tricky, and visualization is not possible. Dimensionality reduction is the process of transforming data from a high-dimensional space to a low-dimensional space while retaining the original data's meaningful properties. Working in high-dimensional spaces can be inconvenient for various reasons; for example, data analysis is typically computationally intractable. Single-cell gene expression data is complex and should be well explored. Each gene is characterized as a dimension in a single-cell expression profile. As such, dimensionality reduction is very productive in summarizing biological attributes in fewer dimensions. Dimensionality reduction techniques are divided into linear and nonlinear ones; we discuss both in detail in the following subsections.
Modified locally linear embedding
Modified LLE (MLLE) is a nonlinear dimensionality reduction technique and an enhanced version of LLE. To understand MLLE, we first need to know the LLE algorithm. When used for dimensionality reduction, LLE attempts to reveal the underlying structure of the manifold based on simple geometric intuitions. LLE preserves data locality in lower dimensions because it reconstructs each sample point from its neighbors. In other words, LLE focuses on finding the lower-dimensional representation of high-dimensional data that most accurately preserves the locally linear structure of neighboring point patterns.
In its simplest formulation, LLE first identifies the t nearest neighbors of each data point, as measured by Euclidean distance^{24}. One can choose the number of neighbors, t, based on rules, metrics, or simply at random. Consider n sample points \(\mathbf {X}=\{\mathbf {x}_1,\mathbf {x}_2, \ldots ,\mathbf {x}_n\}\) in a high-dimensional space \(R^d\). For each sample point \(\mathbf {x}_i\) and its neighborhood set \(N_i = \{\mathbf {x}_t,t \in t_i\}\), one can form a t-NN graph to construct a locally linear structure at that point using a combination of reconstruction weights \(\mathbf {W}=\{w_{ti},t \in t_i\}\), \(i= 1, \ldots , n\). On this basis, each data point is viewed as a small linear patch of the submanifold.
To compute the weights \(w_{ti}\) for the linear reconstruction of each point, we minimize a cost function subject to two constraints: (1) each data point \(\mathbf {x}_i\) is reconstructed only from its neighbors, imposing \(w_{ti}=0\) if \(\mathbf {x}_t\) does not belong to the neighborhood of \(\mathbf {x}_i\); (2) the rows of the weight matrix sum to one, that is, \(\sum _t{w_{ti}}=1\). The optimal weights are calculated by solving Eq. (2), a constrained least squares problem:

\(\varepsilon (\mathbf {W})=\sum _{i=1}^{n}\Big \Vert \mathbf {x}_i-\sum _{t\in t_i}w_{ti}\,\mathbf {x}_t\Big \Vert ^2 \quad (2)\)
The matrix \(\mathbf {G_i}=[\ldots , \mathbf {x}_t - \mathbf {x}_i, \ldots ]_{t \in t_i}\), whose columns are the centered neighbors, helps formulate the local weight vector \(\mathbf {w}_i\); hence, (2) can be reformulated as:

\(\min _{\mathbf {w}_i}\ \Vert \mathbf {G_i}\,\mathbf {w}_i\Vert ^2 \quad \text {subject to}\quad \mathbf {1}_{t_i}^{T}\mathbf {w}_i=1 \quad (3)\)
where \(\mathbf {1}_{t_i}\) denotes the all-ones vector of dimension \(t_i\).
Using this formulation, the optimal weight vector can be computed in closed form: one solves the linear system \(\mathbf {G_i}^T {\mathbf {G_i}}\mathbf {w}_i = \mathbf {1}_{t_i}\) and rescales \(\mathbf {w}_i\) so that its entries sum to one; the singular value decomposition of \(\mathbf {G_i}\) is also reused later by MLLE.
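The closed-form weight computation can be sketched in NumPy (an illustrative sketch; the function name and the small regularization constant, used by standard LLE to keep the local Gram matrix invertible, are our own choices):

```python
import numpy as np

def lle_weights(X, i, neighbors, reg=1e-3):
    """Reconstruction weights of sample X[i] from its listed neighbors.

    Solves: minimize ||x_i - sum_t w_t x_t||^2 subject to sum_t w_t = 1."""
    G = X[neighbors] - X[i]                  # rows are x_t - x_i
    C = G @ G.T                              # local Gram matrix (G_i^T G_i in the text)
    C += reg * np.trace(C) * np.eye(len(neighbors))  # regularize, as in standard LLE
    w = np.linalg.solve(C, np.ones(len(neighbors)))  # solve C w = 1
    return w / w.sum()                       # enforce the sum-to-one constraint

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
w = lle_weights(X, 0, [1, 2, 3, 4])
```

For a point lying in the affine hull of its neighbors, the weighted combination reproduces it exactly; e.g., a point surrounded symmetrically by four planar neighbors gets equal weights of 0.25.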
Finally, using the same weights computed in the input space, each high-dimensional input sample \(\mathbf {x}_i\) is mapped to a lower-dimensional point in the set \(\mathbf {Y}=\{\mathbf {y}_1,\mathbf {y}_2, \ldots ,\mathbf {y}_n\}\) in \(R^m\) (\(m \ll d\)), representing the manifold's global internal coordinates.
\(\Phi (\mathbf {Y})=\sum _{i=1}^{n}\Big \Vert \mathbf {y}_i-\sum _{t\in t_i}w_{ti}\,\mathbf {y}_t\Big \Vert ^2 \quad (4)\)

Equation (4) reflects the locality preservation property by solving a minimization problem over the output manifold.
Regularization is a well-known problem in LLE, which manifests itself in embeddings that distort the underlying geometry of the manifold. Standard LLE uses an arbitrary regularization parameter relative to the local trace of the weight matrix^{25}. MLLE overcomes this problem by using multiple weight vectors in each neighborhood, discovering a more stable and enhanced embedding space. MLLE adjusts the reconstruction weights, which modifies the embedding cost function as follows:

\(\Phi (\mathbf {Y})=\sum _{i=1}^{n}\sum _{l=1}^{s_i}\Big \Vert \mathbf {y}_i-\sum _{t\in t_i}w_{ti}^{(l)}\,\mathbf {y}_t\Big \Vert ^2 \quad (5)\)
where \(s_i\) is the number of smallest right singular values of \(\mathbf {G_i}\) retained for the ith neighborhood, and \(w_{ti}^{(l)}\) denotes the lth weight vector of that neighborhood.
This aims to take advantage of the dense relations that exist in the embedding space^{26}.
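In practice MLLE need not be implemented from scratch: scikit-learn's `LocallyLinearEmbedding` with `method='modified'` implements the algorithm above. A minimal sketch on a synthetic nonlinear manifold (the swiss roll and all parameter values are illustrative stand-ins, not the tuned settings used in this work):

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# A 3-D swiss roll standing in for a nonlinear expression manifold.
X, _ = make_swiss_roll(n_samples=500, random_state=0)

# method='modified' activates MLLE, i.e. multiple weight vectors per
# neighborhood; it requires n_neighbors >= n_components.
mlle = LocallyLinearEmbedding(n_neighbors=12, n_components=2,
                              method='modified', random_state=0)
Y = mlle.fit_transform(X)   # Y.shape == (500, 2)
```

The same estimator with `method='standard'` gives ordinary LLE, which makes it easy to compare the two embeddings under identical neighborhood parameters.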
Independent component analysis
ICA is a linear dimensionality reduction method based on statistical independence. Using simple assumptions on the statistical properties of the data, ICA learns an efficient linear transformation of the data and attempts to find the underlying structures present in the data^{27}. Based on the definitions given in^{28}, ICA is considered a special case of projection pursuit; it is a technique for finding relevant projections of multidimensional data. Such projections can then be used for enhanced visualization of the clustered data. When ICA is used for visualizing the data, dimensionality reduction becomes its secondary objective^{28}. Unlike other approaches, the underlying vectors of the transformation are presumed to be independent of one another. It exploits the non-Gaussian structure of the data, which is crucial for retrieving the transformed underlying components. ICA aims to find projections of the data that provide estimates of the independent components^{28}. When dealing with noise-free data, and if no further assumptions are made on the data, ICA can be considered a high-performance method of exploratory data analysis^{28}. Consider a random vector \(\mathbf {r}\) with elements \(\{\mathbf {r}_1,\mathbf {r}_2, \ldots ,\mathbf {r}_n\}\), a random vector \(\mathbf {s}\) with elements \(\{\mathbf {s}_1,\mathbf {s}_2, \ldots ,\mathbf {s}_n\}\), and a matrix \(\mathbf {A}\) with elements \(a_{ij}\). ICA is a generative model, which captures how the observed data are generated by mixing the components \(s_i\) (Eq. 6):

\(\mathbf {r}=\mathbf {A}\mathbf {s}, \qquad r_i=\sum _{j}a_{ij}\,s_j \quad (6)\)

The independent components are latent variables, meaning they are unknown; the mixing matrix \(\mathbf {A}\) is also assumed to be unknown. ICA therefore estimates an unmixing matrix \(\mathbf {B}\) that recovers the components from the observed matrix \(\mathbf {V}\).
After whitening the data, the rows of the unmixing matrix are orthogonal to each other; however, unlike PCA, the recovered components are required to be statistically independent rather than merely uncorrelated, so ICA leads to more independent components than PCA. The structure of the sources is hidden while the data are being analyzed, and ICA untangles their complex relationships and translates them into meaningful measurements without knowledge of the mixing process; this feature of ICA is referred to as blind source separation^{29}.
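Blind source separation can be demonstrated with scikit-learn's `FastICA` on two synthetic sources (signal shapes, the mixing matrix, and variable names are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                        # source 1: sinusoid
s2 = np.sign(np.sin(3 * t))               # source 2: square wave
S = np.c_[s1, s2]                         # latent independent components s
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])                # "unknown" mixing matrix A
V = S @ A.T                               # observed mixtures, V = A s per sample

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(V)              # estimated independent components
```

The recovered components match the true sources only up to sign, scale, and permutation, which is the usual ambiguity of ICA; for visualization purposes this ambiguity is harmless.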
Other dimensionality reduction methods
To compare against our proposed method, we used other dimensionality reduction techniques: standard LLE, Isomap, Laplacian eigenmaps, PCA, and t-SNE. Isomap stands for isometric mapping; it is a nonlinear dimensionality reduction method based on spectral theory that aims to preserve geodesic distances in the lower dimension. Isomap starts by creating a neighborhood network; it then uses graph distances to estimate the geodesic distance between all pairs of points. The eigenvalue decomposition of the geodesic distance matrix yields the lower-dimensional embedding of the data^{30}. The Laplacian eigenmap is another nonlinear technique. It is computationally efficient and maps nearby input patterns to nearby outputs by computing a lower-dimensional representation of a high-dimensional data set, focusing on preserving local proximity relations among input data points^{31}. t-SNE is also a nonlinear dimensionality reduction technique that is commonly used for visualization and has been extensively applied in genomic data analysis and speech processing^{6}. On the other hand, PCA is a popular linear technique used for feature extraction and dimensionality reduction. Given a dataset composed of d-dimensional points, PCA maps the data linearly to a subspace of lower dimension so that the dispersion of the data is maximized. It does so via the eigendecomposition of the covariance matrix. The principal components (the eigenvectors corresponding to the largest eigenvalues) retain a substantial portion of the original data's variance^{30}.
Clustering
Performing clustering is one of the critical tasks in single-cell data analysis. Clusters are formed by grouping cells based on the similarity of their gene expression profiles. Distance functions are used to describe expression profile similarity, employing the dimensionality-reduced representations as input. We used the popular k-means clustering technique, an iterative algorithm that partitions the data into k separate groups \(\mathbf {C}=\{\mathbf {c}_1,\mathbf {c}_2,\ldots ,\mathbf {c}_k\}\) by minimizing the within-cluster dispersion while maximizing the inter-cluster distances. The number of clusters to be formed from the data needs to be specified as an input parameter of the algorithm.
\(\textit{k}\)-means works in three key steps. The first step is to choose the initial centroids; the simplest method is to choose \(\textit{k}\) samples from the dataset \(\mathbf {X}=\{\mathbf {x}_1,\mathbf {x}_2, \ldots ,\mathbf {x}_n\}\). Then, each point in the dataset is allocated to its nearest centroid. The next step updates each centroid by taking the mean value of all samples allocated to it. The algorithm calculates the difference between the old and new centroids and repeats the last two steps until this value falls below a certain threshold; in other words, it keeps iterating until almost no change in the centroids is observed. Each point is thereby drawn to its closest centroid, which leads to a high degree of cluster compactness, i.e., a minimum sum of squared errors (SSE), as shown in Eq. (7):

\(SSE=\sum _{j=1}^{k}\sum _{\mathbf {x}\in C_j}\Vert \mathbf {x}-\mu _j\Vert ^2 \quad (7)\)

wherein the k clusters jointly cover all n samples in the data, \(C_j\) is the jth cluster, \(\mu _j\) is the mean of the samples in \(C_j\), and \(\mathbf {x}\) is the corresponding sample. How to choose the best number of clusters is explained in the next subsection, Parameter Optimization.
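The three steps above can be sketched directly in NumPy (a plain illustrative implementation with random initialization; production use would rely on an optimized library such as scikit-learn's `KMeans`):

```python
import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    """Plain k-means following the three steps described above."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1: init
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.linalg.norm(new_centroids - centroids) < tol:    # stop when stable
            break
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()                 # Eq. (7)
    return labels, centroids, sse

# Two tight, well-separated blobs of 50 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centroids, sse = kmeans(X, k=2)
```

Like any k-means variant, this sketch can settle in a local optimum depending on the initialization, which is why libraries rerun it with several random starts (`n_init` in scikit-learn).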
Parameter optimization
When applying MLLE, a neighborhood graph, t-NN, is created by connecting points that are close to each other. Different measures are used for this purpose, including the number of neighbors and the distance from each point to its neighbors, among others. A common measure to control the sparsity of the neighbor graph is the tolerance factor, which makes the graph sparser or denser. In this regard, we tested different tolerance values on each dataset and selected those that yielded the best validity index scores, as explained in the Performance Evaluation subsection. With the aim of preserving locality, the number of nearest neighbors, t, is a crucial parameter for constructing the neighborhood graph. Another critical parameter is the number of clusters, k, in the clustering algorithm. The number of nearest neighbors, t, is examined within the range [8,24], and the number of clusters, k, is assessed for each value of t, where k ranges from 4 to 14; the validity indices are calculated for each clustering.
We followed the "elbow" method to select the combination of t and k with the highest score, considering all three validity indices, as explained in the Performance Evaluation subsection. Plotting the scores with the number of clusters on the x-axis, a decreasing trend can be observed as k increases, and an elbow-shaped point appears at a certain value of k. At this point, which is the best number of clusters, the slope changes rapidly, and beyond it the curve flattens toward a line almost parallel to the x-axis. The corresponding plots are provided in the Supplementary Fig. S1.
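The scan over candidate cluster numbers can be sketched as follows; for brevity the sketch scans only k on synthetic blob data scored by the Silhouette index, whereas the actual experiments scan t in [8,24] and k in [4,14] on the MLLE embeddings using all three indices:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy stand-in for the reduced data: three clearly separated groups.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [3, 6]],
                  cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 8):                      # scan candidate numbers of clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)       # k with the highest validity score
```

On data this clean the score peaks exactly at the true number of groups; on real scRNA-seq embeddings the peak is flatter, which is why the elbow of the score curve, rather than its raw maximum, guides the final choice.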
Performance evaluation
Generally speaking, the best clustering is the one that maintains high inter-cluster distances and gives the most compact clusters. In this work, we used the Silhouette coefficient^{32}, an evaluation metric that compares, for each sample point, the mean distance to all other points in the same cluster with the mean distance to the points of the nearest neighboring cluster. Consider a set of clusters \(\mathbf {C}\) = \(\left\{ \mathbf {C}_1, \mathbf {C}_2,\ldots ,\mathbf {C}_k \right\}\) output by a clustering algorithm, \(\textit{k}\)-means in our case. The Silhouette coefficient, SH, for the \(i^{th}\) sample point in cluster \(\mathbf {C}_j\), where \(j = 1, \ldots , k\), can be defined as follows:

\(SH_i=\frac{b_c-w_c}{\max (w_c,\,b_c)} \quad (8)\)
where \(w_c\) is the within-cluster distance, the mean distance between point \(\mathbf {x}_i\) and all other points inside its cluster, and \(b_c\) is the between-cluster distance, the minimum mean distance between \(\mathbf {x}_i\) and the points of a neighboring cluster, calculated as:

\(w_c=\frac{1}{|\mathbf {C}_j|-1}\sum _{\mathbf {x}\in \mathbf {C}_j,\,\mathbf {x}\ne \mathbf {x}_i}d(\mathbf {x}_i,\mathbf {x}), \qquad b_c=\min _{l\ne j}\frac{1}{|\mathbf {C}_l|}\sum _{\mathbf {x}\in \mathbf {C}_l}d(\mathbf {x}_i,\mathbf {x}) \quad (9)\)
We also used the Calinski–Harabasz (CH) and Davies–Bouldin (DB) validity indices to assess the clustering performance. The CH score is used to evaluate the model, where a higher score indicates better-defined clusters^{33}. CH is the ratio of the between-cluster dispersion to the within-cluster dispersion over all clusters, normalized by the numbers of clusters and samples, calculated as follows:

\(CH=\frac{tr(\mathbf {S}_B)}{tr(\mathbf {S}_W)}\times \frac{n-k}{k-1} \quad (10)\)
where n is the number of input samples, \(tr(\mathbf {S}_B)\) is the trace of the between-group dispersion matrix, and \(tr(\mathbf {S}_W)\) is the trace of the within-cluster dispersion matrix.
The Davies–Bouldin (DB) index^{34} is another validity measure, defined as the average over clusters of the maximum similarity between each cluster and any other. DB is computed as follows:

\(DB=\frac{1}{k}\sum _{i=1}^{k}\max _{j\ne i}\ s_{ij} \quad (11)\)
where \(s_{ij}\) is the ratio between within-cluster distances and between-cluster distances, calculated as \(s_{ij}=\frac{w_i+w_j}{d_{ij}}\). The smaller the DB value, the better the clustering, and as such, we aim to minimize Eq. (11). Here, \(d_{ij}\) is the Euclidean distance between cluster centroids \(\mu _i\) and \(\mu _j\), and \(w_i\) is the within-cluster distance of cluster \(\mathbf {C}_i\).
Overall, we used the Silhouette score to evaluate the clustering performance, whereas CH and DB indices were used to verify and find the optimal parameters, namely the best number of clusters for our experiments.
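All three indices are available in `sklearn.metrics`; a minimal sketch on synthetic data (the blob layout and thresholds below are illustrative, not values from our experiments):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

# Four compact, well-separated groups.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
                  cluster_std=0.4, random_state=1)
labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)

sh = silhouette_score(X, labels)           # in [-1, 1], higher is better
ch = calinski_harabasz_score(X, labels)    # unbounded, higher is better
db = davies_bouldin_score(X, labels)       # >= 0, lower is better
```

Because SH and DB are bounded or normalized while CH is not, SH serves naturally as the primary score and CH/DB as corroborating evidence, matching the roles described above.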
Cluster annotation
To validate the obtained clusters, we first identified the top 20 differentially expressed genes in each cluster based on the Wilcoxon test and considered them marker genes that drive the separation among clusters. Marker genes are up- or down-regulated in different individual cells, pathways, or GO terms. We used Gene Set Enrichment Analysis, GSEA^{35,36}, to annotate the clusters with the cell types corresponding to each group of marker genes. GSEA^{37} is a computational tool that determines whether a predefined set of genes shows a statistically significant level of expression in a specific cell type, biological process, cellular component, molecular function, or biological pathway. GSEA uses MSigDB, the Molecular Signatures Database, to provide gene sets for the enrichment analysis. Also, we employed ToppCluster^{38} for Gene Ontology (GO) analysis. ToppCluster is an online multi-gene-list functional enrichment analysis tool that identifies the GO terms and pathways associated with the top gene lists extracted from each cluster. Pathways were extracted from the MSigDB C2 BIOCARTA (V7.3) database^{39}. The corresponding networks are visualized using Cytoscape^{40}. We decreased the minimum number of genes present in the corresponding annotations to achieve a better visualization.
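In Scanpy this ranking is provided by `sc.tl.rank_genes_groups(adata, groupby, method='wilcoxon')`; the underlying one-cluster-versus-rest test can be sketched with SciPy (function and variable names are illustrative):

```python
import numpy as np
from scipy.stats import ranksums

def rank_marker_genes(X, labels, cluster, top_n=20):
    """Rank genes of one cluster vs. the rest by the Wilcoxon rank-sum test.

    Returns gene indices ordered by ascending p-value (top markers first)."""
    in_c = X[labels == cluster]
    rest = X[labels != cluster]
    pvals = np.array([ranksums(in_c[:, g], rest[:, g]).pvalue
                      for g in range(X.shape[1])])
    return np.argsort(pvals)[:top_n]

# Toy data: 60 cells x 30 genes; gene 5 is shifted up in cluster 1.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(60, 30))
labels = np.repeat([0, 1], 30)
X[labels == 1, 5] += 3.0
top = rank_marker_genes(X, labels, cluster=1, top_n=5)
```

Being rank-based, the Wilcoxon test is robust to the heavy-tailed, zero-inflated distributions typical of scRNA-seq expression values, which is why it is a common default for marker detection.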
Results and discussion
We first applied the Scanpy pipeline, including its clustering method (Leiden clustering), to the PBMC dataset. The corresponding results are presented in Supplementary Fig. S3. Then, in order to identify the most suitable clustering and dimensionality reduction methods, we employed Scanpy only for the preprocessing step. Since different tools can share the same preprocessing approach, this setup readily accommodates new methods emerging in scRNA-seq analysis.
We developed a well-constructed pipeline that can be applied to scRNA-seq data to discover individual cell types. With dimensionality reduction and clustering as the two most significant steps of the pipeline, we explored many ways of untangling the data in two and three dimensions. We found optimal parameters for both dimensionality reduction and clustering that achieve meaningful separation of cell types and compact clusters. To demonstrate the applicability of our pipeline, we tested it on thirteen datasets of different sizes. Finally, we evaluated our method from both computational and biological perspectives. As k-means and, more generally, all distance-based methods are known not to work well with nonlinear methods such as t-SNE, we also employed DBSCAN clustering to investigate this behavior on the datasets used in this work. The resulting Silhouette score for the Calu-3 dataset is 0.871, which is lower than that of k-means, whose score is 0.924 on the same dataset. These results, along with further discussion, are provided in Supplementary Fig. S2.
Clustering and cell type discovery
To achieve optimized results, we experimented with all possible combinations of parameters, as discussed in the Materials and Methods section. The best parameters obtained for each dataset are reported in Table 2. For a few datasets, to achieve the best clustering score with the proposed approach, the data is first reduced to an intermediate number of dimensions, such as 5, 6, or 7, and afterward reduced to three dimensions for visualization and improved results. The results of k-means clustering combined with each dimensionality reduction method using the best parameters are listed in Table 3. The last column shows the result of applying ICA to the output of clustering combined with MLLE. The clustering scores range from 0 to 1: a score close to 1 represents good-quality clustering, with 1 being the best, while a score near zero indicates that the clusters are not well defined.
When testing other widely-used techniques such as t-SNE and PCA, we noticed that neither method was efficient in separating the data into well-defined clusters. On the other hand, Isomap and Laplacian eigenmaps show slightly better performance comparatively. To demonstrate this graphically, we visualize the two-dimensional projections of cells resulting from the different dimensionality reduction methods, colored by k-means clustering, on the H1299 scRNA-seq dataset in Figs. 5, 6, 7, 8, 9 and 10. Moreover, three-dimensional results on the same dataset are presented in Supplementary Fig. S5, Supplementary Material.
Finally, we investigated MLLE and found that it yields the most insightful cluster separation on most of the datasets. This outcome demonstrates the power of MLLE in exploring the data's dense and complex relations, creating better embeddings in lower-dimensional spaces. We performed an additional dimensionality reduction step using ICA to enhance the visualization of the clusters. The last column of Table 3 shows that MLLE combined with ICA improves the overall results, except for a few datasets in which the differences are negligible: 0.004 (Baron_human1), 0.001 (Baron_human2), 0.014 (Baron_human3), 0.004 (Segerstolpe), and 0.011 (Xin). To get a better view of the impact of ICA on the MLLE transformation, we show a visual comparison of the clusters in Figs. 11 and 12. The two-dimensional ICA projection of the cells applied to the three-dimensional MLLE data shows the best visualization and clustering scores (Fig. 12). When applied alone, ICA performs very poorly, with largely inseparable clusters (Fig. 10), because ICA is limited to linear transformations. Manifold learning techniques, on the other hand, consider the data locally and can thus reveal complex relationships among the data points in higher-dimensional spaces. We instead applied ICA to the lower-dimensional data because we observed well-marked "lines" or "axes" in the three-dimensional data, which led us to conclude that we could apply ICA to learn the linearly independent, not necessarily orthogonal, components. Applying ICA reveals hidden, complex relationships among the cells in the clusters that are not noticeable in three dimensions.
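The core of the pipeline, MLLE embedding, ICA projection for visualization, then k-means, can be sketched end to end on synthetic data; all parameter values here are illustrative, not the tuned values from Table 2:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import FastICA
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.metrics import silhouette_score

# Synthetic nonlinear data standing in for HVG-filtered expression profiles.
X, _ = make_swiss_roll(n_samples=600, random_state=0)

# Step 1: nonlinear reduction with MLLE to three dimensions.
Y = LocallyLinearEmbedding(n_neighbors=14, n_components=3,
                           method='modified', random_state=0).fit_transform(X)

# Step 2: ICA on the low-dimensional embedding for a 2-D visualization.
Z = FastICA(n_components=2, random_state=0).fit_transform(Y)

# Step 3: k-means on the projected data, scored with the Silhouette index.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(Z)
score = silhouette_score(Z, labels)
```

Running ICA after MLLE, rather than on the raw data, mirrors the observation above: the linear unmixing is only effective once the manifold step has straightened the data into near-linear "axes".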
Biological assessment
As shown in Table 4, several pancreatic cell types for the pancreas datasets, such as the Baron human dataset, are found within well-defined gene sets of the C8 collection of MSigDB, which contains cell type signature gene sets. They include ’MURARO PANCREAS ALPHA CELL’, ’MURARO PANCREAS ENDOTHELIAL CELL’, ’MURARO PANCREAS MESENCHYMAL STROMAL CELL’, ’MURARO PANCREAS DUCTAL CELL’, and ’MURARO PANCREAS ACINAR CELL’. Other cell types were identified as well; for instance, HB2 is a cell line derived from epithelial cells.
Regarding the PBMC dataset, we identified gene sets from MSigDB based on the ranked gene lists, including TRAVAGLINI LUNG CD8 NAIVE T CELL, TRAVAGLINI LUNG PLATELET MEGAKARYOCYTE CELL, AIZARANI LIVER C18 NK NKT CELLS 5, DURANTE ADULT OLFACTORY NEUROEPITHELIUM DENDRITIC CELLS, TRAVAGLINI LUNG OLR1 CLASSICAL MONOCYTE CELL, FAN OVARY CL12 T LYMPHOCYTE NK CELL 2, and AIZARANI LIVER C34 MHC II POS B CELLS. The corresponding cell types are presented in Table 5. Moreover, we observed some reported marker genes of the PBMC dataset in some clusters, which are also shown in the same table.
Additionally, the networks of GO terms and pathways associated with the corresponding marker genes of the H1299 scRNA-seq dataset are depicted in Figs. 13 and 14, respectively. For each cluster, we identified a set of biological process or pathway terms connected to a term that is significantly associated with the top-20 gene list of that cluster. As observed in Fig. 14, several significant pathways are enriched in immunity functions and signaling, including SARS-CoV-2 innate immunity evasion, host–pathogen interaction of human coronaviruses, SARS coronavirus and innate immunity, type II interferon signaling (IFNG), and the human immune response to tuberculosis. Also, Fig. 13 shows that most biological processes are associated with immunity functions, including response to interferon-alpha, protection from natural killer cell, type III interferon production, regulation by virus of viral protein levels in a host cell, and detection of virus, among others. In addition, we obtained a list of overlapping marker genes involved in the Herpes simplex virus 1 (HSV-1) infection and Influenza A pathways. These findings suggest potential markers for subsequent medical treatment or drug discovery by comparing similar diseases in terms of functionality. Moreover, although numerous findings suggest potential links between HSV-1 and Alzheimer’s disease, a causal relationship has not been demonstrated yet^{41}.
Conclusion and future work
This work focuses on the identification of different cell types using manifold learning combined with clustering techniques on scRNA-seq data. Identifying similarities that result from structural, functional, or evolutionary relationships among the genes is the primary goal of clustering the cells. Our proposed two-step representation learning approach demonstrated that the k-means clustering technique combined with Modified LLE leads to improved clustering output and a meaningful organization of cell clusters by “untangling” the complex, hidden relationships in a higher-dimensional space.
Nonlinear dimensionality reduction methods have been shown to be very powerful, as they preserve the locality of the data from higher to lower dimensions. UMAP is one of the most commonly used nonlinear dimensionality reduction techniques and has been shown to perform well on large-scale scRNA-seq data. However, for dimensionality reduction, UMAP is not as efficient as MLLE on high-dimensional cytometry data, especially when combined with clustering to enhance the visualization of the clustering results. This behavior of MLLE has been observed in our experiments. A comparative analysis with UMAP in the Supplementary Material, Supplementary Fig. S4, confirms this observation.
Moreover, performing ICA on the transformed data after applying manifold learning techniques provides an enhanced view of the data in a reduced space. Evaluating ICA as a visualization scheme and further reduction step after applying MLLE shows better clustering and enhanced visualization simultaneously. This trend opens a research avenue that involves combining nonlinear manifold learning techniques with subsequent linear methods, which has been shown to be more powerful than conventional methods such as PCA or ICA applied alone.
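The clustering quality referred to here can be quantified with the internal validity indices used in this work (silhouette, Calinski-Harabasz, and Davies-Bouldin), all available in scikit-learn. The sketch below compares the MLLE-only and MLLE+ICA embeddings on synthetic stand-in data; the data and all parameter choices are illustrative assumptions, not the paper's experimental setup.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.decomposition import FastICA
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

# Synthetic stand-in for a preprocessed expression matrix.
X, _ = make_blobs(n_samples=300, n_features=50, centers=4,
                  cluster_std=4.0, random_state=0)

# Three-dimensional MLLE embedding, then a two-dimensional ICA projection of it.
emb3 = LocallyLinearEmbedding(n_neighbors=15, n_components=3,
                              method="modified", eigen_solver="dense",
                              random_state=0).fit_transform(X)
emb2 = FastICA(n_components=2, random_state=0).fit_transform(emb3)

# Cluster each embedding and score it with three internal validity indices
# (higher silhouette/Calinski-Harabasz and lower Davies-Bouldin are better).
scores = {}
for name, emb in [("MLLE-3D", emb3), ("MLLE+ICA-2D", emb2)]:
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(emb)
    scores[name] = (silhouette_score(emb, labels),
                    calinski_harabasz_score(emb, labels),
                    davies_bouldin_score(emb, labels))
    print(name, scores[name])
```

Because these indices need no ground-truth labels, the same comparison can be run on datasets whose cell types have not been characterized.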
Experiments on multiple benchmark datasets show the effectiveness of our proposed method. Performing gene set enrichment analysis to annotate the set of HVGs obtained from each cluster reveals biomarker genes involved in different gene ontology terms.
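The over-representation statistic underlying gene set enrichment tools of this kind can be illustrated with a hypergeometric test: given a background of genes, a gene set, and a cluster's marker list, the enrichment p-value is the probability of observing at least the seen overlap by chance. All counts below are hypothetical examples, not values from this study.

```python
from scipy.stats import hypergeom

bg = 20000       # background genes in the annotation universe
set_size = 150   # genes annotated to one pathway / gene set
n_markers = 200  # top-ranked marker genes from one cluster
overlap = 12     # markers that fall inside the gene set

# P(X >= overlap) under Hypergeom(bg, set_size, n_markers); the survival
# function at overlap-1 gives the upper-tail enrichment p-value.
p_value = hypergeom.sf(overlap - 1, bg, set_size, n_markers)
print(f"enrichment p-value: {p_value:.3e}")
```

Here the expected overlap by chance is only 200 × 150 / 20000 = 1.5 genes, so observing 12 yields a very small p-value; enrichment tools apply this kind of test across thousands of gene sets and correct for multiple testing.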
There are other potential applications for investigating scRNA-seq data beyond cell type identification. Extending the proposed method by employing other manifold learning or deep learning techniques to address further challenges in scRNA-seq data analysis, such as trajectory analysis, is our next step.
Code availability
The source code is available in the GitHub repository at https://github.com/saitejadanda/Nonlinearandlineartechniquesfordimensionalityreductionvisualizationonsinglecelldata.git and is released under the MIT license.
References
Grun, D. et al. Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature 525(7568), 251–255 (2015).
Hwang, B., Lee, J. H. & Bang, D. Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp. Mol. Med. 50(8), 1–14 (2018).
Sandberg, R. Entering the era of single-cell transcriptomics in biology and medicine. Nat. Methods 11(1), 22–24 (2014).
Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20(5), 273–282 (2019).
Dong, C. et al. Comprehensive review of the identification of essential genes using computational methods: Focusing on feature implementation and assessment. Brief. Bioinform. 21(1), 171–181 (2020).
Van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11), 2579–2605 (2008).
Becht, E., McInnes, L., Healy, J. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44. https://doi.org/10.1038/nbt.4314 (2019).
Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347(6226), 1138–1142 (2015).
Yau, C. et al. pcaReduce: Hierarchical clustering of single cell transcriptional profiles. BMC Bioinform. 17(1), 1–11 (2016).
Qiu, X. et al. Single-cell mRNA quantification and differential analysis with Census. Nat. Methods 14(3), 309–315 (2017).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: Large-scale single-cell gene expression data analysis. Genome Biol. 19(1), 1–5 (2018).
Guerrero, M. et al. Adaptive community detection in complex networks using genetic algorithms. Neurocomputing 266, 101–113 (2017).
Feng, C. et al. Dimension reduction and clustering models for single-cell RNA sequencing data: A comparative study. Int. J. Mol. Sci. 21(6), 2181 (2020).
Luecken, M. D. & Theis, F. J. Current best practices in single-cell RNA-seq analysis: A tutorial. Mol. Syst. Biol. 15(6), e8746 (2019).
10X Genomics. Single Cell Gene Expression Dataset by Cell Ranger 1.1.0. (2016).
Baron, M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 3(4), 346–360 (2016).
Muraro, M. J. et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst. 3(4), 385–394 (2016).
Segerstolpe, A. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 24(4), 593–607 (2016).
Xin, Y. et al. RNA sequencing of single human islet cells reveals type 2 diabetes genes. Cell Metab. 24(4), 608–615 (2016).
Wang, Y. J. et al. Single-cell transcriptomics of the human endocrine pancreas. Diabetes 65(10), 3028–3038 (2016).
Wyler, E. et al. Transcriptomic profiling of SARS-CoV-2 infected human cell lines identifies HSP90 as target for COVID-19 therapy. iScience 24, 102151 (2021).
Ilicic, T. et al. Classification of low quality cells from single-cell RNA-seq data. Genome Biol. 17(1), 1–15 (2016).
Islam, S. et al. Quantitative single-cell RNA-seq with unique molecular identifiers. Nat. Methods 11(2), 163 (2014).
Roweis, S. T. & Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326 (2000).
Zhang, Z. & Wang, J. MLLE: Modified locally linear embedding using multiple weights. Adv. Neural Inf. Process. Syst. 2007, 1593–1600 (2007).
Wang, J. Laplacian eigenmaps. In Geometric Structure of High-Dimensional Data and Dimensionality Reduction 235–247 (Springer, 2012).
Hyvärinen, A. Independent component analysis: Recent advances. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 371(1984), 20110534 (2013).
Hyvärinen, A. Survey on independent component analysis. Neural Comput. Surv. 2, 94–128 (1999).
Hyvärinen, A. & Oja, E. Independent component analysis: Algorithms and applications. Neural Netw. 13(4–5), 411–430 (2000).
Ghodsi, A. Dimensionality reduction: A short tutorial. Department of Statistics and Actuarial Science, Univ. of Waterloo (2006).
Belkin, M. & Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15(6), 1373–1396 (2003).
Rousseeuw, P. J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).
Calinski, T. & Harabasz, J. A dendrite method for cluster analysis. Commun. Stat. Theory Methods 3(1), 1–27 (1974).
Davies, D. L. & Bouldin, D. W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 2, 224–227 (1979).
Mootha, V. K. et al. PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet. 34(3), 267–273 (2003).
Subramanian, A. et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. 102(43), 15545–15550 (2005).
Subramanian, A. et al. GSEA-P: A desktop application for Gene Set Enrichment Analysis. Bioinformatics. https://doi.org/10.1093/bioinformatics/btm369 (2007).
Chen, J. et al. ToppGene Suite for gene list enrichment analysis and candidate gene prioritization. Nucleic Acids Res. 37(suppl 2), W305–W311 (2009).
Liberzon, A. et al. Molecular signatures database (MSigDB) 3.0. Bioinformatics 27(12), 1739–1740 (2011).
Shannon, P. et al. Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res. 13(11), 2498–2504 (2003).
De Chiara, G. et al. Recurrent herpes simplex virus-1 infection induces hallmarks of neurodegeneration and cognitive deficits in mice. PLoS Pathog. 15(3), e1007617 (2019).
Acknowledgements
This research work has been partially supported by the Natural Sciences and Engineering Research Council of Canada, NSERC. The work has also been partially supported by the Mitacs Internship Program (www.mitacs.ca). The authors would like to thank the University of Windsor, Office of Research Services and Innovation.
Author information
Authors and Affiliations
Contributions
A.V. contributed the literature review and data collection, and conducted the biological analysis of the results. S.D. implemented the method, performed the data preprocessing, and designed the experiments. L.R. supervised the project, contributed the initial thoughts and main ideas, and assisted in elaborating the novel approach implemented in this work. All authors contributed to writing, finalizing the idea, and follow-up discussions.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Vasighizaker, A., Danda, S. & Rueda, L. Discovering cell types using manifold learning and enhanced visualization of single-cell RNA-Seq data. Sci. Rep. 12, 120 (2022). https://doi.org/10.1038/s41598-021-03613-0