Ensemble dimensionality reduction and feature gene extraction for single-cell RNA-seq data

Single-cell RNA sequencing (scRNA-seq) technologies allow researchers to uncover the biological state of individual cells at high resolution. For computational efficiency and ease of visualization, dimensionality reduction is necessary to capture gene expression patterns in a low-dimensional space. Here we propose an ensemble method for simultaneous dimensionality reduction and feature gene extraction (EDGE) of scRNA-seq data. Unlike existing dimensionality reduction techniques, the proposed method implements an ensemble learning scheme that aggregates a large number of weak learners for accurate similarity search. Based on the similarity matrix constructed by these weak learners, the low-dimensional embedding of the data is estimated and optimized through spectral embedding and stochastic gradient descent. Comprehensive simulation and empirical studies show that EDGE is well suited for searching for meaningful organization of cells, detecting rare cell types, and identifying essential feature genes associated with certain cell types.


Supplementary notes
Additional simulation studies. In the following two scenarios, we set the numbers of cells and genes to 10,000 and 500, respectively, and the proportion of differentially expressed (DE) genes to 35%. We generated 100 simulated datasets using Splatter. The performance of EDGE, together with t-SNE and UMAP, was measured by the prediction accuracy for the rare population, computed from out-of-bag (OOB) prediction errors of random forests over the 100 simulation replicates.

Supplementary Scenario 1. We designed this scenario to study the minimal population percentage that EDGE can detect. The total number of cell types was 5, one of which was a rare cell type whose percentage ranged from 0.1% to 5%. The other four major cell types had equal proportions. Across the settings with different rare-cell percentages, EDGE achieved the highest prediction accuracy on average (Supplementary Figure 2). When the rare population percentage exceeded 0.5%, EDGE obtained an average prediction accuracy above 80%. To explore the performance of EDGE in detecting multiple rare populations, we further increased the number of rare populations to three, keeping all other settings the same. EDGE ranked first in prediction accuracy for all three rare populations (Supplementary Figure 3).

Supplementary Scenario 2. In this scenario, we investigated the prediction accuracy for the rare cell type when the total number of cell types varied from 2 to 10. Among all cell types, one was the rare population with a percentage of 1%; the major cell types had equal proportions. Across all settings, EDGE maintained the highest prediction accuracy on average (Supplementary Figure 4). We further increased the number of rare populations to three, with the total number of cell types varying from 4 to 10 and all other settings unchanged. EDGE outperformed the other methods for all three rare populations (Supplementary Figure 5).
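To make the evaluation protocol concrete, the following Python sketch shows how the OOB prediction accuracy for a rare population can be computed with a random forest trained on a learnt embedding. The embedding, labels, and parameter values below are illustrative placeholders, not the actual simulation outputs.

```python
# Hypothetical sketch of the OOB evaluation used in the simulation scenarios.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stand-in for a learnt 2-D embedding of 10,000 cells (EDGE/t-SNE/UMAP output).
n_cells = 10_000
embedding = rng.normal(size=(n_cells, 2))

# Stand-in labels: one rare population at 1%, four equally sized major populations.
labels = rng.choice(["rare", "A", "B", "C", "D"],
                    size=n_cells, p=[0.01, 0.2475, 0.2475, 0.2475, 0.2475])

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(embedding, labels)

# OOB class predictions per cell; accuracy restricted to the rare population.
oob_pred = rf.classes_[np.argmax(rf.oob_decision_function_, axis=1)]
rare_mask = labels == "rare"
rare_accuracy = np.mean(oob_pred[rare_mask] == "rare")
print(f"OOB accuracy on the rare population: {rare_accuracy:.3f}")
```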
Supplementary Figure 2 The accuracy of random forest classifiers in predicting the labels of a rare cell type using the learnt embeddings as input. The percentage of the rare population varies from 0.1% to 5%, and the results are summarized over 100 simulation runs. In the boxplots, we show the median (central line), the first and third quartiles (box limits), and whiskers extending to the lowest and highest points within 1.5 times the interquartile range of the first and third quartiles, respectively.
Supplementary Figure 3 The accuracy of random forest classifiers in predicting the labels of three rare cell groups using the learnt embeddings as input. The percentage of each rare cell group varies from 0.1% to 5%, and the results are summarized over 100 simulation runs. In the boxplots, we show the median (central line), the first and third quartiles (box limits), and whiskers extending to the lowest and highest points within 1.5 times the interquartile range of the first and third quartiles, respectively.
Supplementary Figure 4 The accuracy of random forest classifiers in predicting the labels of a rare cell type using the learnt embeddings as input. The total number of cell types varies from 2 to 10, and the results are summarized over 100 simulation runs. In the boxplots, we show the median (central line), the first and third quartiles (box limits), and whiskers extending to the lowest and highest points within 1.5 times the interquartile range of the first and third quartiles, respectively.
Supplementary Figure 5 The accuracy of random forest classifiers in predicting the labels of three rare cell groups using the learnt embeddings as input. The total number of cell types varies from 4 to 10, and the results are summarized over 100 simulation runs. In the boxplots, we show the median (central line), the first and third quartiles (box limits), and whiskers extending to the lowest and highest points within 1.5 times the interquartile range of the first and third quartiles, respectively.
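For reference, a minimal matplotlib sketch of the boxplot convention used in these figures (median, quartile box, whiskers at 1.5 times the interquartile range) is given below; the accuracy values are placeholders, not the reported results.

```python
# Minimal sketch of the boxplot style in Supplementary Figures 2-5.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
accuracy_by_method = {
    "EDGE": rng.uniform(0.85, 0.95, size=100),   # placeholder: 100 simulation runs
    "t-SNE": rng.uniform(0.60, 0.85, size=100),
    "UMAP": rng.uniform(0.65, 0.88, size=100),
}

fig, ax = plt.subplots()
ax.boxplot(list(accuracy_by_method.values()),
           labels=list(accuracy_by_method.keys()),
           whis=1.5)  # whiskers extend to 1.5 x IQR beyond the quartiles
ax.set_ylabel("Prediction accuracy of the rare population")
plt.show()
```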
Feature gene identification in real datasets. The violin plots of the feature genes identified from the Jurkat and mouse brain scRNA-seq datasets are displayed in Supplementary Figures 6 and 7. In the Jurkat dataset, EDGE identified 17 feature genes. Each cell type was found to have a group of genes with up-regulated expression patterns (Supplementary Figure 6). For example, CD1E and CD3D were highly expressed in Jurkat cells, while HAND1, GADD45B, and BAMBI were up-regulated in 293T cells. Furthermore, some top-ranked genes are marker genes for T cells; for instance, the CD3D gene is involved in T-cell development and signal transduction [1]. In the mouse brain dataset, EDGE detected 43 feature genes (Supplementary Figure 7). The genes CCL2 and CCL7, known as two of the most common activators of microglia during the development of neuropathic pain [2], were ranked as top feature genes and were up-regulated only in microglia. Another three genes, FCGR1, FCRLS, and C1QC, are among the top enriched genes in microglia [3]. Among the feature genes for interneurons, GAD1 and HTR3A are known marker genes for interneurons and their subclasses [4]. Astrocytes-ependymal cells were characterized by PRDX6, CNP, and ACSBG1: PRDX6 is found primarily in astrocytes [5,6], and ACSBG1 is a known marker for astrocytes [7].
Supplementary Figure 6 Normalized expression levels of the 17 top-ranked feature genes detected by EDGE for 293T and Jurkat cells. Genes are ordered by their importance scores, with HAND1 having the highest score.
Supplementary Figure 7 Normalized expression levels of the 43 top-ranked feature genes detected by EDGE for the mouse brain dataset. Genes are ordered by their importance scores, with CCL2 having the highest score.
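As a rough illustration (not the authors' plotting code), per-cell-type violin plots such as those in Supplementary Figures 6 and 7 can be drawn as follows; the expression values and the up-regulation offset are simulated placeholders.

```python
# Illustrative violin plot of one feature gene (CD3D) stratified by cell type.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "cell_type": rng.choice(["Jurkat", "293T"], size=1000),
    "CD3D": rng.gamma(shape=2.0, scale=1.0, size=1000),  # placeholder expression
})
# Mimic the up-regulation of CD3D in Jurkat cells described above.
df.loc[df["cell_type"] == "Jurkat", "CD3D"] += 3.0

sns.violinplot(data=df, x="cell_type", y="CD3D")
plt.ylabel("Normalized expression of CD3D")
plt.show()
```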
Feature gene identification in simulated datasets. The simulated datasets covered two scenarios, each containing three cell types with a 30:30:40 ratio in group size. Since each cell type was associated with 15 genes, the number of true feature genes was 45 in the three-group scenarios. The results are reported in Supplementary Table 1; the standard deviation of the number of identified true feature genes is shown in parentheses.
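The summary statistic in Supplementary Table 1 can be computed along the following lines; the gene sets below are hypothetical placeholders, whereas the real analysis would use the Splatter-generated truth and each method's reported feature genes.

```python
# Sketch: mean and standard deviation of the number of identified true feature genes.
import numpy as np

def count_true_hits(identified_genes, true_feature_genes):
    """Number of identified genes that are truly differentially expressed."""
    return len(set(identified_genes) & set(true_feature_genes))

# Placeholder gene sets for three replicates.
true_genes = {f"Gene{i}" for i in range(45)}
replicate_calls = [
    [f"Gene{i}" for i in range(40)] + ["GeneX1", "GeneX2"],
    [f"Gene{i}" for i in range(43)],
    [f"Gene{i}" for i in range(38)] + ["GeneX3"],
]

hits = np.array([count_true_hits(calls, true_genes) for calls in replicate_calls])
print(f"identified true feature genes: {hits.mean():.1f} ({hits.std(ddof=1):.1f})")
```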

Supplementary discussion
From t-SNE and UMAP to EDGE. Dimensionality reduction methods convert a high-dimensional dataset $X = \{x_1, \cdots, x_C\}$ into a low-dimensional dataset $Y = \{y_1, \cdots, y_C\}$ [8]. In short, EDGE, t-SNE, and UMAP share the following two-step procedure.
1. Calculate the similarity probability $p_{ij}$ between the $i$th and $j$th cells, $i = 1, \cdots, C$, $j = 1, \cdots, C$, in the high-dimensional space based on the gene expression matrix $X$.
2. Estimate the similarity probability $q_{ij}$, $i = 1, \cdots, C$, $j = 1, \cdots, C$, in the low-dimensional space. The probability $q_{ij}$, constructed from the embedding matrix $Y$, is made to match $p_{ij}$ as closely as possible by minimizing a loss function (a conceptual sketch of this two-step recipe is given below).
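The following runnable sketch illustrates the two-step recipe. The Gaussian kernel and squared-error loss are simple stand-ins chosen for brevity; they are not the kernels or losses actually used by EDGE, t-SNE, or UMAP.

```python
# Conceptual two-step dimensionality reduction: build p_ij, then fit Y so q_ij matches.
import numpy as np
from scipy.optimize import minimize

def similarity_matrix(Z, sigma=1.0):
    """Symmetric similarity probabilities with zero diagonal, summing to one."""
    sq = np.sum(Z**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * Z @ Z.T
    P = np.exp(-d2 / (2 * sigma**2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum()

def embed(X, n_components=2, seed=0):
    # Step 1: similarity probabilities p_ij in the high-dimensional space.
    P = similarity_matrix(X)
    n = X.shape[0]
    y0 = np.random.default_rng(seed).normal(scale=1e-2, size=n * n_components)

    # Step 2: adjust the embedding Y so that q_ij matches p_ij as closely as
    # possible, here by minimizing a simple squared-error loss.
    def loss(y_flat):
        Q = similarity_matrix(y_flat.reshape(n, n_components))
        return np.sum((P - Q) ** 2)

    res = minimize(loss, y0, method="Nelder-Mead", options={"maxiter": 2000})
    return res.x.reshape(n, n_components)

# Tiny toy example: two well-separated groups of 5 "cells" with 20 "genes".
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (5, 20)), rng.normal(5.0, 1.0, (5, 20))])
Y = embed(X)
print(Y.shape)  # (10, 2)
```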
The first step is critical in dimensionality reduction. If the similarity probabilities $p_{ij}$ do not preserve the similarity structure between cells faithfully, it is impossible to obtain accurate similarity probabilities $q_{ij}$. One of our major contributions is a novel method that accurately preserves the similarity structure of the high-dimensional space.
EDGE, t-SNE, and UMAP calculate the $p_{ij}$'s in the first step in three different ways. To compare these methods, we now present the details of the similarity probability calculations. In t-SNE, the conditional probability $p_{j|i}$ represents the similarity of cell $j$ to cell $i$ and is defined as
$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / (2\sigma_i^2)\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / (2\sigma_i^2)\right)}, \qquad p_{i|i} = 0, \tag{1}$$
where $\sigma_i$ is determined by the perplexity parameter [9]. The conditional probabilities in (1) are symmetrized by defining the pairwise similarity $p_{ij} = (p_{j|i} + p_{i|j})/(2C)$. In the UMAP algorithm, the conditional probability is defined as
$$p_{j|i} = \exp\left(-\frac{d(x_i, x_j) - \rho_i}{\sigma_i}\right), \tag{2}$$
where $d(x_i, x_j)$ is a distance measure, e.g., the Euclidean distance, $\rho_i$ is the distance from the $i$th cell to its first nearest neighbor, and $\sigma_i$ is determined by the number of nearest neighbors $k$. UMAP uses a different symmetrization of the conditional probabilities in (2), namely $p_{ij} = p_{j|i} + p_{i|j} - p_{j|i} p_{i|j}$ [10].

Our similarity learning strategy is entirely different from the ones implemented in t-SNE and UMAP: we combine ensemble learning with the sketching technique to calculate the similarity probabilities in the high-dimensional space. We designed a simulation study using Splatter to demonstrate the contribution of the first step in EDGE. To better present the results graphically, we generated two cell types with a ratio of 98:2 in a simulated dataset containing 500 cells and 200 genes, using the default parameters in Splatter. The similarity probabilities for EDGE, UMAP, and t-SNE were obtained in the high-dimensional space. We then applied the same embedding method (spectral embedding) to map the data into a two-dimensional space based on the three similarity matrices. For EDGE, the rare and major cell types were well separated; that is, the computed similarity matrix preserved the within- and between-cluster distances (Supplementary Figure 8). In contrast, for the embeddings constructed from the UMAP and t-SNE similarities, some rare and major cells were mixed (Supplementary Figure 8). To further verify this, we also replaced the similarity probabilities of UMAP and t-SNE with EDGE's similarity probabilities. When EDGE's similarity probabilities were used in UMAP and t-SNE, the rare and major cells were mapped to two distinct, compact locations (Supplementary Figure 9). With the original similarity probabilities from UMAP and t-SNE, however, three to five rare and major cells were misclassified in the low-dimensional space. Therefore, our similarity learning algorithm better preserves within- and between-cluster similarities.

Supplementary Figure 8 Comparison of the similarity learning strategies of EDGE, t-SNE, and UMAP. The two types of cells are represented by red triangles and blue dots.
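The following sketch illustrates the two high-dimensional similarity constructions in equations (1) and (2). For simplicity, $\sigma_i$ is held fixed rather than calibrated by the perplexity (t-SNE) or the number of neighbors (UMAP), and all cell pairs are used rather than only nearest neighbors, so this is an illustration rather than either package's implementation.

```python
# Illustrative t-SNE-style and UMAP-style similarity probabilities.
import numpy as np

def pairwise_distances(X):
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    return np.sqrt(d2)

def tsne_similarities(X, sigma=1.0):
    """Equation (1) plus t-SNE's symmetrization p_ij = (p_{j|i} + p_{i|j}) / (2C)."""
    D = pairwise_distances(X)
    P_cond = np.exp(-D**2 / (2.0 * sigma**2))
    np.fill_diagonal(P_cond, 0.0)                   # p_{i|i} = 0
    P_cond /= P_cond.sum(axis=1, keepdims=True)     # row i holds p_{j|i}
    C = X.shape[0]
    return (P_cond + P_cond.T) / (2.0 * C)

def umap_similarities(X, sigma=1.0):
    """Equation (2) plus UMAP's symmetrization p_ij = p_{j|i} + p_{i|j} - p_{j|i} p_{i|j}."""
    D = pairwise_distances(X)
    # Distance to the first nearest neighbor (excluding the zero self-distance).
    rho = np.where(D > 0, D, np.inf).min(axis=1, keepdims=True)
    P_cond = np.exp(-np.maximum(D - rho, 0.0) / sigma)
    np.fill_diagonal(P_cond, 0.0)
    return P_cond + P_cond.T - P_cond * P_cond.T

X = np.random.default_rng(0).normal(size=(50, 200))  # 50 cells, 200 genes
print(tsne_similarities(X).sum())   # sums to 1 by construction
print(umap_similarities(X).max())   # entries stay within [0, 1]
```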
In the second step, t-SNE uses a Student's t-distribution to model the distances (similarity probabilities) between pairs of points $y_i$ and $y_j$, $i, j = 1, \cdots, C$, in the low-dimensional space. UMAP employs a similar distance distribution for the low-dimensional embedding; see details in Methods. We choose the distance distribution used in UMAP for the low-dimensional embedding and the symmetrization method implemented in t-SNE, as they are computationally efficient. The loss function is critical in preserving the global structure.
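The two low-dimensional kernels can be sketched as follows. The UMAP-style parameters a and b are fixed here at roughly the values corresponding to its default min_dist and spread settings; in practice they are fitted, so this block is illustrative only.

```python
# Illustrative low-dimensional similarity kernels for t-SNE and UMAP.
import numpy as np

def low_dim_sq_dists(Y):
    sq = np.sum(Y**2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2 * Y @ Y.T, 0.0)

def q_student_t(Y):
    """t-SNE: q_ij proportional to (1 + ||y_i - y_j||^2)^(-1), normalized over all pairs."""
    W = 1.0 / (1.0 + low_dim_sq_dists(Y))
    np.fill_diagonal(W, 0.0)
    return W / W.sum()

def q_umap(Y, a=1.577, b=0.895):
    """UMAP-style kernel: q_ij = 1 / (1 + a * ||y_i - y_j||^(2b)), unnormalized."""
    d2 = low_dim_sq_dists(Y)
    Q = 1.0 / (1.0 + a * d2**b)
    np.fill_diagonal(Q, 0.0)
    return Q

Y = np.random.default_rng(0).normal(size=(50, 2))
print(q_student_t(Y).sum())      # 1.0 after normalization
print(q_umap(Y).max() <= 1.0)    # entries stay within [0, 1]
```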

Supplementary Figure 9 Embedding results of UMAP and t-SNE by using the original (left panel) and EDGE's (right panel) similarity measures. Rare and major cells are represented by red triangles and blue dots.
Our goal is to project the high-dimensional probabilities onto the low-dimensional ones faithfully by minimizing a loss function that measures the distance between the similarity distributions of the low- and high-dimensional spaces. The t-SNE method uses the Kullback-Leibler (KL) divergence as its loss function, defined as
$$\mathrm{KL} = \sum_{i \neq j} p_{ij} \log\frac{p_{ij}}{q_{ij}}. \tag{3}$$
The UMAP method uses the binary cross-entropy (CE) as its loss function. Compared to (3), an additional term, $(1 - p_{ij}) \log\left((1 - p_{ij})/(1 - q_{ij})\right)$, is present in the CE function. In the ideal case, we would have $p_{ij} = q_{ij}$ for all $i$ and $j$, and the values of the KL and CE functions would be zero. In practice, such a perfect match is unlikely. However, the probability $p_{ij}$ in the high-dimensional space should be positively correlated with the probability $q_{ij}$ in the low-dimensional space for all $i$ and $j$. Thus, the value of the loss function should be large whenever $p_{ij}$ is large while $q_{ij}$ is small, or vice versa. The additional term in the CE function is critical for preserving global distances. For instance, when $p_{ij} = 0.01$ and $q_{ij} = 0.99$, the value of the KL term in (3) is about zero, whereas the value of the corresponding CE term is about 4.5. The CE function thus penalizes a small $p_{ij}$ paired with a large $q_{ij}$, and we implement the CE function as the loss function in EDGE to preserve global distances.

In summary, in the first step, our novel similarity learning strategy accurately preserves within- and between-cluster distances of the high-dimensional space. In the second step, we adopt computationally efficient techniques, i.e., the symmetrization method from t-SNE and the stochastic gradient descent algorithm, and employ the CE loss function to preserve global structure in the low-dimensional space.
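The numerical example above is easy to verify:

```python
# Check: p_ij = 0.01, q_ij = 0.99 gives a near-zero KL term but a CE term of about 4.5,
# which is why the extra (1 - p) log((1 - p)/(1 - q)) term matters for global structure.
import numpy as np

def kl_term(p, q):
    return p * np.log(p / q)

def ce_term(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p, q = 0.01, 0.99
print(f"KL term: {kl_term(p, q):.3f}")   # about -0.05, i.e. near zero
print(f"CE term: {ce_term(p, q):.3f}")   # about 4.50, a heavy penalty
```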
varied from 10 to 50 (Supplementary Figure 11). In simulated datasets (Scenario 4 with DE gene proportion being 25%), EDGE maintained a high prediction accuracy with varying L, k, and B (Supplementary Figure 12). The computational time was also reported in the rightmost panel of Supplementary Figure 12. As L grew, the runtime increased but was no more than 15 seconds.