GapClust is a light-weight approach distinguishing rare cells from voluminous single cell expression profiles

Single cell RNA sequencing (scRNA-seq) is a powerful tool in detailing the cellular landscape within complex tissues. Large-scale single cell transcriptomics provide both opportunities and challenges for identifying rare cells playing crucial roles in development and disease. Here, we develop GapClust, a light-weight algorithm to detect rare cell types from ultra-large scRNA-seq datasets with state-of-the-art speed and memory efficiency. Benchmarking on diverse experimental datasets demonstrates the superior performance of GapClust compared to other recently proposed methods. When applying our algorithm to an intestine and 68 k PBMC datasets, GapClust identifies the tuft cells and a previously unrecognised subtype of monocyte, respectively.


GapClust is a light-weight approach distinguishing rare cells from voluminous single cell expression profiles Supplementary Material
Fa et al.
Supplementary Figure 1 Three scenarios of the gap between two clusters. a There is only one cell in Cn. b There are a few cells in rare cell type Cn. c There are many cells in the abundant cell type Cn. D<m, n> denotes the distance between each cell m in Cn and its nth neighbour, which is the nearest neighbour of cell m in Cneighbour. The clusters in different colours denote different cell types. In scenario a where there is only 1 cell (singleton) in Cn, the gap can be directly obtained by calculating the distance D<m, n> of the single cell m in Cn to its nearest neighbour cell in Cneighbour; In scenario b, the rare cell type Cn comprises a few cells, although the gap cannot be directly obtained, we can still approximate the gap by calculating the distance D<m, n> of any given cell m in Cn to its nearest neighbour cell in Cneighbour; In scenario c, there are many cells in the abundant cell type Cn, we cannot approximate the gap like the above two scenarios again, as D<m, n> is far larger than the gap for most of cells in Cn. Therefore, in the scenario b of rare cell type, the gap can be approximated by calculating the distance D<m, n> of any given cell in Cn to its nearest neighbour cell in Cneighbour. Figure 3 Sensitivity comparison of GapClust to RaceID, GiniClust, CellSIUS and FiRE for varying incidence of the rare cell type in the simulated datasets by Splatter. Here we generate datasets comprising two abundant cell types of 500 cells and from 2 up to 100 rare cells. The five subplots (a-e) represent the results from the datasets with different levels of differential expression between the rare cell type and the major cell type by varying the de.prob parameter of Splatter: a de.prob=0.1, b de.prob=0.2, c de.prob=0.4, d de.prob=0.6, e de.prob=0.8. Figure 4 Specificity comparison of GapClust to RaceID, GiniClust, CellSIUS and FiRE for varying incidence of the rare cell type in the simulated datasets by Splatter. Here we generate datasets comprising two abundant cell types of 500 cells and from 2 up to 100 rare cells. The five subplots (a-e) represent the results from the datasets with different levels of differential expression between the rare cell type and the major cell type by varying the de.prob parameter of Splatter: a de.prob=0.1, b de.prob=0.2, c de.prob=0.4, d de.prob=0.6, e de.prob=0.8. Figure 5 Performance comparison of GapClust to FiRE for rare cell detection in simulated datasets subsampled from the 68 k PBMC dataset, quantified by F1 score. FiRE method is applied with three different IQR-thresholding-criteria including q3 + 1.5 × IQR, q3 + 1.0 × IQR and q3 + 0.5 × IQR, where q3 and IQR denote the third quartile and the interquartile range (75th percentile -25th percentile), respectively, of the number of FiRE scores across all cells. Figure 6 Comparison of GapClust to RaceID, GiniClust, CellSIUS and FiRE for varying incidence of the rare cell type in the 99 subsampled datasets from the 68 k PBMC dataset. Here we use 500 CD19+ B cells as type 1, 500 CD56+ NK cells as type 2, and between 2 and 100 CD14+ monocytes as the rare cell type. a Sensitivity comparison among the five approaches, b Specificity comparison among the five approaches.

Supplementary Figure 7
Performance comparison of GapClust to RaceID, GiniClust and FiRE in detection of rare Jurkat cells in the ten simulated datasets subsampled from the 293T-Jurkat dataset with the rare cell type proportions ranging from 0.5% to 5.0%, quantified by F1 score. Figure 8 Performance comparison of GapClust to RaceID, GiniClust, CellSIUS and FiRE in detection of monocyte doublets in 140 simulated datasets subsampled from the 68 k PBMC dataset with the rare cell type proportions ranging from 0.08% to 5.0%, quantified by F1 score.

Supplementary Note 1: Comparison with EDGE in rare cell type identification
Firstly, we compare GapClust with EDGE 6 using simulated datasets with balanced and unbalanced major cell types. We generated the simulated data using Splatter 5 under four scenarios (Supplementary Table 1). Scenario 1: One thousand cells in three groups, with the ratio of 49.5:49.5:1 in group size (number of cells) and a low dropout rate in the expression of 5,000 genes; Scenario 2: Five thousand cells in three groups, with the ratio of 49.9:49.9:0.2 in group size (number of cells) and a low dropout rate in the expression of 5,000 genes; Scenario 3: One thousand cells in three groups, with the ratio of 66:33:1 in group size (number of cells) and a low dropout rate in the expression of 5,000 genes; Scenario 4: Five thousand cells in three groups, with the ratio of 66.4:33.4:0.2 in group size (number of cells) and a low dropout rate in the expression of 5,000 genes. The proportions of differentially expressed (DE) genes in Scenario 1 to 4 were fixed at 40%. Of note, in Scenario 1 and Scenario 2, the two dominant cell types are balanced in cell numbers, while in Scenario 3 and Scenario 4, the two dominant cell types are unbalanced in cell numbers (Supplementary Table 1). Of note, EDGE algorithm provides only the two-dimensional embeddings of the cells in the input single cell RNA sequencing (scRNA-seq) dataset, without reporting the group information of the rare cell types directly. User has to learn the rare cell types visually by the minor clusters in the two-dimensional embeddings, which can be biased. We use DBSCAN algorithm to identify the clusters in the two-dimensional embeddings learned by EDGE with tuned parameters (eps = 0.5, minPts = 5). Notably, DBSCAN can capture minor clusters through density-based spatial clustering with tuned parameters, while distance-based clustering algorithms such as k-means always neglect minor clusters.

Supplementary
The Supplementary Figure 9 displays the two-dimensional embeddings of simulated datasets learnt by EDGE in four scenarios, demonstrating that EDGE successfully separates all rare cells (the rare cell type, in red) from the abundant cells (the dominant type, in gray). We observe that rare cells identified by GapClust is perfectly consistent with the true rare cell type in all four scenarios (Supplementary Table 2). Meanwhile, EDGE can also identify the rare cell type perfectly in combination with DBSCAN method (Supplementary Table 2). The results show that both GapClust and EDGE are capable of accurate identification of rare cells in simulated datasets. Figure 9 The two-dimensional embeddings learnt by EDGE in Scenario 1, Scenario 2, Scenario 3, and Scenario 4. In each panel, the dots in red colour represent the rare cells, and the gray dots denote the abundant cell type.

Supplementary
Secondly, we compare GapClust with EDGE using simulated datasets with different differential expression levels between the rare cell type and the abundant cell type. Here, we generate eight simulation datasets consisting of 3 rare cells and 500 abundant cells by Splatter with varying de.prob parameters (the probability that a gene is differentially expressed in the rare cell group) including extremely low: 0.01, 0.02, 0.05, 0.1, moderate: 0.2, 0.4 and high: 0.6, 0.8 to mimic the different differential expression levels. Then EDGE and GapClust are both applied to the datasets. DBSCAN is used with the tuned parameters (eps = 0.5, minPts = 3). Here, the minPts is set as 3 rather than the default value 5, as the rare cell type contains only 3 cells. In the two-dimensional embeddings, we find that EDGE clearly differentiates the three rare cells (red) from the remaining cells (gray) in the datasets with high differential expression levels (de.prob = 0.4, 0.6 and 0.8) (Supplementary Figure 10). In the scenarios with lower differential expression levels, the two-dimensional embeddings learnt by EDGE cannot distinguish the three rare cells from the remaining cells (Supplementary Figure 10). GapClust cannot identify the rare cell type when de.prob parameter is 0.01 and 0.02, but obtain improved performance with higher differential expression level (de.prob = 0.05) (Supplementary Figure 11). When de.prob is higher than 0.05, GapClust can detect the rare cell type perfectly (Supplementary Figure 11). In contrast, EDGE can detect the rare cell type perfectly only when de.prob is higher than 0.1, but fails when de.prob is 0.05 and 0.1(Supplementary Figure 11). Taken together, GapClust outperforms EDGE in scenarios with low differential expression level (de.prob = 0.05, 0.1) (Supplementary Figure 11), implying higher sensitivity of GapClust to differential expressed genes between the rare cell type and abundant cell type.

Supplementary Figure 10
The two-dimensional embeddings learnt by EDGE in the eight simulation datasets. The subplots represent different differential expression levels between the rare cell type and the abundant cell type. In each panel, the dots in red colour represent the rare cells, and the gray dots denote the abundant cell type. Figure 11 The performance of GapClust and EDGE in the eight simulation datasets. The horizontal axis represents the differential expression level (de.prob parameter in Splatter) between the rare cell type and abundant cell type. The vertical axis represents the performance in the F1 score metrics.

Supplementary
Thirdly, we compare GapClust with EDGE using simulated datasets with different rates of dropout events. Here, we generate nine simulation datasets consisting of 3 rare cells and 500 abundant cells by Splatter with different rates of dropout events, including 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.9. Then EDGE and GapClust are both applied to the datasets. Similarly, DBSCAN is used with the tuned parameters (eps = 0.5, minPts = 3). Here, the minPts is set as 3 rather than the default value 5, as the rare cell type contains only 3 cells.
In the two-dimensional embeddings, we find that EDGE clearly differentiates the three rare cells (red) from the abundant cell type (gray) in the datasets with low rates of dropout events (0.1, 0.2, 0.3) (Supplementary Figure 12). In the scenarios with higher rates of dropout events, the two-dimensional embeddings learnt by EDGE cannot distinguish the three rare cells from the remaining cells (Supplementary Figure  12). GapClust cannot identify the rare cell type only when the rate of dropout events is extremely high (0.9), but detect the rare cell type perfectly in other datasets (Supplementary Figure 13). In contrast, EDGE can only detect the rare cell type perfectly in datasets with low and moderate rates of dropout events (0.1, 0.2, 0.3, 0.5) (Supplementary Figure 13). Taken together, GapClust performs better than EDGE in scenarios with high rates of dropout events (0.6, 0.7, 0.8).

Supplementary Figure 12
The two-dimensional embeddings learnt by EDGE in the nine simulation datasets with different rates of dropout events by Splatter. The subplots represent different rates of dropout events ranging from 0.1 to 0.9. In each panel, the dots in red colour represent the rare cells, and the gray dots denote the abundant cell type. Figure 13 The performance of GapClust and EDGE in the nine simulation datasets. The horizontal axis represents the different rates of dropout events ranging from 0.1 to 0.9. The vertical axis represents the performance in the F1 score metrics. Table 3 GapClust is fast. Execution time (in minutes) recorded for the five methods while varying the number of cells from 1k to ~68k. RaceID is extremely slow for datasets with over 5 k cells, and thus applied to datasets with no larger than 35 k cells. GiniClust reported a runtime error, when the input expression profiles increased beyond ~45 k. CellSIUS failed when the input expression profiles increased beyond ~60 k. We apply GapClust to an existing scRNA-seq data containing ~30 k cells profiled from the human hippocampus (region of the brain that is associated primarily with memory) at gestational weeks 16-27 7 . Besides the doublets, GapClust identifies 390 rare cells yielding 3 cell subpopulations. We labeled these cell groups as R1-3 (Supplementary Figure 14a). Cell counts in these clusters are 94, 107 and 189, respectively.

Supplementary
The cell-type identity of the newly detected clusters was discovered by referring to the expression of known cell-type markers reported by Zhong et al 7 . (Supplementary Figure 14b). The markers ASCL1, NEUROD2, GAD1, OLIG2, MBP, AQP4, SPARC and PTPRC represent progenitor cells, excitatory neurons, inhibitory neurons, oligodendrocyte progenitor cells, oligodendrocytes, astrocytes, endothelial cells, and microglia, respectively 7 . Rare clusters R1 and R2 originate from the endothelial cells with SPARC gene highly expressed, and rare cluster R3 emerged from the cluster containing astrocytes with positive expression of the AQP4 gene.
We perform differential expression analysis to find cell-type specific genes for the rare cell populations. Up-regulated genes are found in each cluster, clearly distinguishing it from the remaining cell types (Supplementary Figure 14c). Cells from the cluster R1 show high expression of CLDN5 8 and MFSD2A 9 , which encode proteins involving the establishment of the blood-brain barrier, implying the crucial role of this cell population in the central nervous system. The rare cluster R2 emerges as a sub-type of endothelial cells by expressing growth factor encoding gene IGF2 and PTN, a secreted growth factor that induces neurite outgrowth 10 and is mitogenic for fibroblasts, epithelial, and endothelial cells 11 , demonstrating the growth-promoting capability of the cluster R2. Also, the high-level expressions of decorin encoding gene DCN and collagen type genes including COL1A2 and COL3A1 imply that these cells are closely associated with the setup of extracellular matrix (ECM) 12 . We find the rare cluster R3 with a wide spectrum of astrocyte markers including CKB 13 , PHDGH 14 and FOXJ1 15 . Figure 14 Identification of rare cell types from the hippocampus dataset. a The tSNE plot of the full hippocampus dataset with the three rare cell clusters labeled in different colours. b The tSNE plots of the full hippocampus dataset displaying the expression distribution of eight cell-type specific markers across the whole cell population. c Heatmap of top differentially expressed genes between the three rare cell types.

Supplementary Note 3: The Drosophila wing disc dataset (GSE155543)
To evaluate the influence of batch effect, we have applied GapClust to a single-cell transcriptomics of a temporal cell atlas of the Drosophila wing disc from two developmental time points (96h and 120h) 16 . Two biological replicates were obtained at each time point that, generated data from 9,181 and 9,255 cells in the 96h samples, 10,401 and 7,160 cells in the 120h samples. The authors recovered two major cell types including adult muscle precursors and wing disc epithelial cells, and small numbers of tracheal cells and hemocytes in the combined dataset. Meanwhile, the authors reported the marker genes of hemocytes including He and regucalcin, and the marker genes of tracheal cells including wat and tpr 16 . Taken together, these datasets with replicates are ideal to benchmark GapClust on batch effect.
Besides the doublets identified, GapClust identifies 1, 1, 3 and 1 rare cell type in the four datasets, respectively. For the third dataset (120h), the rare cell types R2 and R3 express none of the marker genes reported by the authors, we thus focus on the rare cell type R1 of the four datasets. First, we use the marker genes mentioned above to determine the cell type of the rare cells in R1 identified in each dataset. High expressions of marker genes He and regucalcin in the cluster R1 of all datasets confirm that the cells of the cluster R1 are hemocytes across the datasets (Supplementary Figure 15a-c). However, the rare cell type of tracheal cells reported by the authors is not detected by GapClust in all datasets. To demonstrate whether small numbers of tracheal cells exist in these datasets, we explore the expression distribution of the reported marker genes including wat and tpr, and observe that these two genes are highly expressed in different patterns (Supplementary Figure 15d, e): the wat gene is highly expressed in only a few cells (Supplementary Figure 15d), whereas the expression of tpr can be observed in most of the cells (Supplementary Figure 15e). Therefore, the wat gene is more reliable for the tracheal cell type identification.
Intuitively, we can observe the transcription of wat gene only in very few cells in the first (96h) and fourth (120h replicate) datasets (Supplementary Figure 15d). As none of the rare cell types identified express tracheal cell marker genes wat and tpr, we have thus examined the doublets in each dataset identified by GapClust and found that marker gene wat is highly expressed in the doublets (R2) of the fourth datasets (120h replicate) (Supplementary Figure 16a, b). In addition, we have explored the first dataset (96h), and confirmed that only one cell expressing high level of wat gene (normalized count > 10) (Supplementary Figure 16c), making a singleton rather than a cluster. Compared with the hemocyte marker gene He, the expression level of wat gene is considerably lower (Supplementary Figure 16c), making it difficult to identify the rare cell type of tracheal cells.
Except for the rare cell type R1, GapClust identifies other two rare cell types R2 and R3 expressing none of the marker genes above in the third dataset (120h) (Supplementary Figure 16d), which are not reported by the authors. We apply differential expression analysis between the three rare cell types (Supplementary Figure 16e): high expression of regucalcin gene is observed in the rare cell type R1 of hemocytes, which has been validated (Supplementary Figure 15c); the transcription of the enhancer of yellow genes Dmel/yellow-e3 and Dmel/yellow-e2 in the rare cell type R2 implies the effect of these cells on Drosophila pigmentation, which is crucial in male mating success 17 ; the cells in the rare cell type R3 display high expression of multiple functionally unknown genes including Dmel/CG13044 and Dmel/CG15353.
Taken together, we have successfully benchmarked GapClust on the Drosophila wing disc datasets with replicates, and identified the reported rare hemocytes in all four datasets from different batches, demonstrating the reliability of GapClust in regard to batch effect. Figure 15 Rare cell type identification with GapClust in Drosophila wing disc datasets. a The two-dimensional embeddings learnt by EDGE in the four Drosophila wing disc datasets. The dots in red colour represent the rare cell type (R1), while the dots in gray colour represent the abundant cell types. The two-dimensional embeddings learnt by EDGE of the four full datasets showing expression levels for the following genes: b hemocyte markers Hemese (He), c regucalcin, d tracheal cell markers waterproof (wat) and e tracheal-prostasin (tpr). Colour scales correspond to normalized (by total UMI) counts on a log2 scale.

Supplementary
Supplementary Figure 16 Rare cell type identification with GapClust in Drosophila wing disc datasets. a The two-dimensional embeddings learnt by EDGE in the fourth Drosophila wing disc dataset (120h replicate). The dots in red represent the rare cell type (R2). b The two-dimensional embeddings learnt by EDGE of the fourth dataset (120h replicate) showing expression levels for the tracheal cell marker waterproof (wat). c The expression levels of marker genes He and wat in all four datasets. d The two-dimensional embeddings by EDGE in the third dataset (120h). The dots in red, green and blue colours represent the three rare cell types R1, R2 and R3. e The heatmap of top differentially expressed genes between the three rare cell types in the third dataset (120h).

Supplementary Note 4: The mouse ESC (embryonic stem cells) dataset (GSE65525)
The mouse embryonic stem cells (ESCs) were profiled at three time points: Day 0, Day 2, and Day 4 after leukemia inhibitory factor (LIF) removal induced differentiation 18 . We focused on a subset of 2,509 cells obtained from the Day 0 stage, where the cells remained undifferentiated. GapClust identified one rare cluster (R1) comprising 3 cells with high expression of genes from the Zscan4 gene family (Supplementary Figure 17), which is consistent with the findings by the authors of GiniClust 2 and FiRE 4 . Expression of these genes has previously been observed in the two-cell embryo stage 19 , although their potential function in pluripotency remains unknown.
Supplementary Figure 17 Rare cell type identification in the mouse ESC (embryonic stem cells) dataset. a The two-dimensional embeddings learnt by EDGE in the mouse ESC dataset. The dots in red colour represent the rare cell type (R1), and the dots in gray represent the abundant cell type. b Violin plots of top differentially expressed genes between the rare cell type R1 and abundant cell type. In each subplot, the gray group on the left represents the abundant cell type, and the red group on the right side represents the rare cell type R1. The genes in Zscan4 gene family are labeled in blue colour. The same colour-coding scheme is used in all panels

Supplementary Note 5: The mouse somatosensory cortex dataset (GSE60361)
The 3,005 single cells were obtained from the mouse somatosensory cortex and hippocampus CA1 region 20 . GapClust identified six rare clusters (R1-6), comprising 3, 4, 6, 15, 26 and 38 cells, respectively (Supplementary Figure 18a). We have applied differential expression analysis between the rare cell types (Supplementary Figure  18b). Although the 3 (100%) cells in R1, 3 (50%) cells in R3 and 12 (80%) cells of R4 were annotated as astrocytes ependymal type by the authors (Supplementary Table 5), their expression profiles were distinct from each other. High expression of Ttr, Igf2, Igfbp2 and Enpp2 in R3 indicates the functional role in brain regions such as hippocampus [21][22][23] . We compare the 25 cells of R3 with the remaining oligodendrocytes and find that these cells display high expression of chemokine genes such as Ccl2, Ccl3 and Ccl7 (Supplementary Figure 18c), implying inflammatory roles in brain regions. Additionally, the 31 microglia cells of R6 are distinct from the remaining microglia cells in expression of genes such as Ccl7, Pf4 and Lyz2 (Supplementary Figure 18d). Figure 18 GapClust identifies six rare cell types in the mouse somatosensory cortex dataset. a The two-dimensional embeddings learnt by EDGE in the mouse somatosensory cortex dataset. b The heatmap of top differentially expressed genes between the six rare cell types. c The heatmap of top differentially expressed genes between the rare cell type R5 and remaining oligodendrocytes in the mouse somatosensory cortex dataset. d The heatmap of top differentially expressed genes between the rare cell type R6 and remaining microglia cells in the mouse somatosensory cortex dataset.

Supplementary
Supplementary Table 5 The cell types of the six rare cell clusters predefined by the  authors 20 .  R1  R2  R3  R4  R5 R6  astrocytes_ependymal  3  0  3

Supplementary Note 6: The Jurkat dataset
We evaluate GapClust using the dataset which comprises 293T and Jurkat cells at a rare cell (Jurkat cell) concentration of 2.5% 4 . GapClust identifies three rare clusters (R1, R2 and R3) in this dataset, which comprise 2, 13 and 45 cells, respectively (Supplementary Figure 19a-c). The expression profiles of the three rare types and the abundant cell type are distinct from each other (Supplementary Figure 19a, b). The R1 cluster includes only two cells, making itself a doublet. The rare cell type R2 comprising 13 cells was not identified by the authors of FiRE 4 . According to the rare cell concentration mentioned by the authors 4 , the cells of R3 cluster belong to Jurkat cell type. The contributor mentioned that this data was mixed with only two cell types 4 . However, the three rare cell types detected by GapClust together with the abundant cell type suggest undiscovered heterogeneity within the dataset. Figure 19 Detection of rare cell types in subsampled 293T-Jurkat dataset of 1,580 cells. a The heatmap of top variable genes for the subsampled 293T-Jurkat dataset using the variance of standardized values. b The heatmap of top variable genes for the rare cell types in the subsampled 293T-Jurkat dataset using the variance of standardized values. c The two-dimensional embeddings learnt by EDGE in the subsampled 293T-Jurkat dataset. In each panel, the dots in red, blue, and magenta colours represent the three rare cell types (R1, R2 and R3), respectively.

Supplementary Note 7: The Bacillus subtilis cells dataset (GSE151940)
We apply GapClust to scRNA-seq data from the Bacillus subtilis cells sampled at different growth stages, creating a detailed atlas of changes in metabolism and lifestyle 24 . A total of four minor clusters comprising 2, 3, 4 and 189 cells are identified, and the first two rare cell types are similar to each other. We have thus combined these two minor clusters of cells into one single cluster of 5 cells for convenience. Here we denote the three clusters as R1, R2 and R3 (Supplementary Figure 20a). The OD (optical density) values are used to represent the growth curve. The five cells of R1 were sampled at OD2.0 to 2.7, and the four cells of R2 were profiled at early stages (OD0.4 to 1.0). The majority of the 189 cells in R3 were sampled at OD1.7.
We apply differential expression analysis among the three rare cell types (Supplementary Figure 20b). Transcription of genes in the inositol metabolism (iolA-G, J) in cluster R1 indicates the heterogeneous fashion in a subpopulation of cells between logarithmic and early stationary phases 24 . Most of the genes specifically expressed in cluster R2 represent known PBSX prophage genes with functions in phage replication (xtmA and xtmB), as well as PBSX associated genes of unknown function. Thus, we identified a rare subpopulation of cells in the state of prophage induction, which is consistent with the findings in the original research 24 . High expression of genes in the oligo-β-glucoside metabolism (licA, licC and licH) distinguishes the cells in the cluster R3 from the remaining cells, implying the cellular heterogeneity in logarithmic phase.
Furthermore, we investigate the change in cellular composition after filtering cells expressing at least 100 genes. For the collected OD points 0.4, 1.0, 1.7, 2.0, 2.7 and 3.2, the total numbers of cells sampled at each OD point are 5,202, 5,180, 4,699, 4,241, 3,437 and 2,698, while the numbers of the cells passing the criteria are 2,275, 1,920, 199, 37, 19 and 8, respectively, implying that the cells grown at the late stage of growth curve tend to express less genes.

Supplementary Figure 20
GapClust identifies three rare cell types in the Bacillus subtilis cells dataset. a The two-dimensional embeddings learnt by EDGE in the Bacillus subtilis cells dataset. b The heatmap of top differentially expressed genes between the three rare cell types.

Supplementary Note 8: The mouse tracheal epithelial cells dataset (GSE103354)
We explore the heterogeneity of mouse tracheal epithelial cells by analyzing the scRNA-seq dataset (GSE103354) 25 comprising 7,194 cells using GapClust. In total, we have identified six rare cell clusters (R1-6) comprising 3, 7, 13, 22, 54 and 113 cells, respectively (Supplementary Figure 21a). The cluster R1 and R6 were annotated as tuft cells by the authors, and consolidated by the transcription of marker gene Trpm5 26 (Supplementary Figure 21b). However, differential expression of another tuft cell marker gene Dclk1 as well as other genes including Atp1a2, Vav1 and Nkd1 demonstrates that these two rare cell types represent distinct tuft cell populations. The cells in cluster R2 were labeled as ciliated cells, which distinguished them from the remaining ciliated cells by high expression of cell cycle related genes such as Cdk1 27 , Smc4 28 , Cdc20 27 , Cks1b 29 and H2afx 30 (Supplementary Figure 21c). Although the cells in the two clusters R3 and R4 were annotated as goblet cell type by the authors, the differential expression of Dcpp3, Sbp1, Tff2, BC048546 and Muc5 genes implied different sub-types. The cells in cluster R5 were grouped into pulmonary neuroendocrine cells. These rare cell types were reported by the authors 25 , implying the reliability of GapClust in real dataset application.

Supplementary Figure 21
GapClust identifies six rare cell types in the mouse tracheal epithelial cells dataset. a The two-dimensional embeddings learnt by EDGE in the mouse tracheal epithelial cells dataset. b The heatmap of top differentially expressed genes between the six rare cell types. c The heatmap of top differentially expressed genes between the rare cell type R2 and remaining ciliated cells in the mouse tracheal epithelial cells dataset. Figure 22 The performance of GapClust using Euclidean and Manhattan distance. The five columns represent the results from the datasets with different levels of differential expression between the rare cell type and the abundant cell type by varying the de.prob parameter (0.1, 0.2, 0.4, 0.6 and 0.8) of Splatter.