Introduction

Heterogeneity in single-cell RNA sequencing (scRNA-seq) datasets is frequently characterized by identifying cell clusters in gene expression space, wherein each cluster represents a distinct cell type or cell state. In particular, numerous studies have used unsupervised clustering to discover novel cell populations in heterogeneous samples1. The steps involved in unsupervised clustering of scRNA-seq data have been well documented2. (i) Low-quality cells are first discarded in a quality control step3. (ii) Reads obtained from the remaining cells are then normalized to remove the influence of technical effects, while preserving true biological variation4. (iii) After normalization, feature selection is performed to select the subset of genes that are informative for clustering, (iv) which are then typically reduced to a small number of dimensions using principal component analysis (PCA)5. (v) In the reduced principal component (PC) space, cells are clustered based on their distance from one another (typically, Euclidean distance), and (vi) the corresponding clusters are assigned a cell type or state label based on the known functions of their differentially expressed (DE) genes6.

Although feature selection is a critical step in the canonical clustering workflow described above, only a few different approaches have been developed in this space. Moreover, there have been only a handful of systematic benchmarking studies of scRNA-seq feature selection methods7,8,9. A good feature selection algorithm is one that selects cell-type-specific (DE) genes as features, and rejects the remaining genes. More importantly, the algorithm should select features that optimize the separation between biologically distinct cell clusters. A comprehensive benchmarking study of feature selection methods would ideally use both of these metrics.

The most widely used approach for feature selection is mean-variance modeling: genes whose variation across cells exceeds a data-derived null model are selected as features10,11. Such genes are described as highly variable genes (HVGs)12. Some earlier single-cell studies instead selected genes with high loading on the top principal components of the gene expression matrix (high loading genes, or HLGs) as features13. M3Drop, a more recent method, selects genes whose dropout rate (number of cells in which the gene is undetected) exceeds that of other genes with the same mean expression9. As an alternative approach to detect rare cell types, GiniClust uses a modified Gini index to identify genes whose expression is concentrated in a relatively small number of cells14. All of the above feature selection methods test genes individually, without considering expression relationships between genes. Another drawback is that existing methods for determining the size of the feature set do not bear direct relation to the separation of cells in the resulting space.

Here, we present Determining the Underlying Basis using Stepwise Regression (DUBStepR), an algorithm for feature selection based on gene–gene correlations. A key feature of DUBStepR is the use of a stepwise approach to identify an initial core set of genes that most strongly represent coherent expression variation in the dataset. Uniquely, DUBStepR defines a novel graph-based measure of cell aggregation in the feature space (termed density index (DI)), and uses this measure to optimize the number of features. The complete DUBStepR workflow is shown in Fig. 1.

Fig. 1: Overview of DUBStepR workflow.
figure 1

After filtering out mitochondrial, ribosomal, spike-in, and pseudogenes, DUBStepR constructs a GGC matrix and bins genes by expression to compute their correlation range z-scores, which are used to select well-correlated genes. DUBStepR then performs stepwise regression on the GGC matrix to identify a minimally redundant subset of seed features, which are then expanded by adding correlated features (guilt-by-association). The optimal feature set size is determined using the Density Index metric.

We benchmarked DUBStepR against 7 commonly used feature selection algorithms on datasets from four different scRNA-seq protocols (10x Genomics, Drop-Seq, CEL-Seq2, and Smart-Seq2) and found that it substantially outperformed other methods on quantitative measures of cluster separation and marker gene detection. Further, DUBStepR uniquely deconvolved T and NK cell heterogeneity by identifying disease-pertinent clusters (of both rare and common abundances) in PBMCs from rheumatoid arthritis patients. Finally, we show that DUBStepR could potentially be applied even to single-cell ATAC sequencing data.

Results

Gene–gene correlations predict cell-type-specific DE genes

The first step in DUBStepR is to select an initial set of candidate features based on known properties of cell-type-specific DE genes (marker genes). DE genes specific to the same cell types would tend to be highly correlated with each other, whereas those specific to distinct cell types are likely to be anti-correlated (Fig. 2a, b; see the “Methods” section). In contrast, non-DE genes are likely to be only weakly correlated (Fig. 2c). We therefore hypothesized that a correlation range score derived from the difference between the strongest positive and strongest negative correlation coefficients of a gene (“Methods”), would be substantially elevated among DE genes. Indeed, we found that the correlation range score was significantly higher for DE genes relative to non-DE genes (Fig. 2d). Moreover, the correlation range score of a gene was highly predictive of its greatest fold change between cell types, and also its most significant differential expression q-value (Fig. 2e, f). Due to the strong association between correlation range and marker gene status, DUBStepR selects genes with high correlation range score as the initial set of candidate feature genes (“Methods”).

Fig. 2: Expression correlations of DE genes: scRNA-seq data from five lung adenocarcinoma cell lines28.
figure 2

a Average expression of top 10 DE genes for each cell line. b Gene–gene correlations of the same genes. c Gene–gene correlations for non-DE genes. d Boxplot showing correlation range scores for non-DE and DE genes. DE gene computation is detailed in Supp. Note 1A. The middle line is the median, the lower and upper box limits correspond to the first and third quartiles, the upper and lower whiskers extend upto 1.5*IQR from the top and bottom of the box, respectively (where IQR is the inter-quartile range), and data beyond the ends of the whiskers are outlying points that are plotted individually. P-value was calculated using two-sided Wilcoxon test (n = 8602 expressed genes). e, f Scatter plot of genes showing correlation between e log2(fold change) of cell-type-specific expression and f −log10(q-value) of cell-type-specific expression with correlation range score. ρ: Spearman correlation.

Stepwise regression identifies a minimally redundant feature subset

We observed that candidate feature genes formed correlated blocks of varying size in the gene–gene correlation (GGC) matrix (Fig. 3a), with each block presumably representing a distinct pattern of expression variation across the cells. To ensure a more even representation of the diverse cell-type-specific expression signatures within the candidate feature set, we sought to identify a representative minimally redundant subset, which we termed “seed” genes. For this purpose, DUBStepR performs stepwise regression on the GGC matrix, regressing out, at each step, the gene explaining the largest amount of variance in the residual from the previous step (Fig. 3b–d). We devised an efficient implementation of this procedure that requires only a single matrix multiplication at each step (“Methods”).

Fig. 3: Stepwise regression to identify a minimally redundant feature subset.
figure 3

a Gene–gene correlation matrix of candidate feature genes (high correlation range score). bd Residuals from stepwise regression on the gene–gene correlation matrix. e UMAP visualization of cells in an optimal feature space, colored by cell line. fh Same UMAP, colored by expression of genes regressed out in the first three steps. i Scree plot: variance in GGC matrix explained by the gene regressed out at each step. j Standardized average expression of the final seed gene set in each of the 5 cell lines. r: Pearson correlation.

This approach selects seed genes with diverse patterns of cell-type-specificity (Fig. 3e–h). DUBStepR then uses the elbow point of the stepwise regression scree plot to determine the optimal number of steps (“Methods”), i.e., the size of the seed gene set (Fig. 3i, j).

Guilt-by-association expands the feature set

Although the seed genes in principle span the major expression signatures in the dataset, each individual signature (set of correlated genes) is now represented by only a handful of genes (2–5 genes, in most cases). Given the high level of noise in scRNA-seq data, it is likely that this is insufficient to fully capture coherent variation across cells. DUBStepR therefore expands the seed gene set by iteratively adding correlated genes from the candidate feature set using a guilt-by-association approach. Guilt-by-association has previously been employed for feature selection on mass spectrometry data15, and provides a robust solution to order candidate feature genes by their association to the seed gene set (Supp. Fig. S2; “Methods”). This approach allows DUBStepR to prioritize genes that more strongly represent an expression signature (i.e., are better features for clustering). Candidate genes are added until DUBStepR reaches the optimal number of feature genes (see below).

Benchmarking

To benchmark the performance of DUBStepR, we compared it against 6 other algorithms for feature selection in scRNA-seq data: three variants of the HVG approach (HVGDisp, HVGVST, trendVar), deviance-based feature selection (devianceFS), HLG, and M3Drop/DANB (Table 1). For completeness, we also benchmarked GiniClust, though it was designed only for identifying markers of rare cell types. Each algorithm was benchmarked on 7 datasets spanning 4 scRNA-seq protocols: 10x Genomics, Drop-Seq, CEL-Seq2, and Smart-Seq on the Fluidigm C1 (Supp. Note 2A). These datasets were selected because the true cell type could be independently ascertained based on cell line identity or FACS gate. Our benchmarking approach thus avoids the circularity of using algorithmically defined cell type labels as ground truth.

Table 1 Feature selection methods used for the benchmarking comparison.

To evaluate the quality of the selected features, we used the well-established Silhouette index (SI), which quantifies cluster separation, i.e., closeness between cells belonging to the same cluster, relative to the distance to cells from other clusters16 (Supp. Note 3A). In addition to being a well-established measure of single-cell cluster separation17,18,19, the SI has the advantage of being independent of any downstream clustering algorithm. We evaluated the SI of each algorithm across a range of feature set sizes (50–4000), scaled the SI values to a maximum of 1 for each dataset, and then averaged the scaled SIs across the 7 datasets (Fig. 4a; Supp. Fig. S3). Remarkably, HLG, an elementary PCA-based method that predates scRNA-seq technology, achieved greater average cell type separation than existing single-cell algorithms at most feature set sizes. In contrast to DUBStepR, which showed maximal performance at 200–300 features, the other methods remained close to their respective performance peaks over a broad range from 300 to 2000 features and dropped off on either side of this range. DUBStepR substantially outperformed all other methods across the entire range of feature set size (Fig. 4a). Moreover, DUBStepR was the top-ranked algorithm on 5 of the 7 datasets (Fig. 4b).

Fig. 4: Benchmarking feature selection methods.
figure 4

a Mean scaled silhouette index of feature sets ranging from 50 to 4000 features. b Rank distribution of feature selection methods. For each dataset, the methods are ranked from 1 to 9 by their best SI across all feature set sizes. c AUROC of DE gene detection. For all boxes, the middle line represents the median, the lower and upper box limits correspond to the first and third quartiles, the upper and lower whiskers extend upto 1.5*IQR from the top and bottom of the box, respectively (where IQR is the inter-quartile range). Points beyond the whiskers are outliers and are plotted individually. All p-values were calculated using two-sided Student’s T-tests (n = 7). Refer to Supp. Table S6 for details regarding the datasets.

For optimal cell type clustering, a feature selection algorithm should ideally select only DE genes, i.e., genes specific to cell types or subtypes, as features. As an independent benchmark, we therefore quantified the ability of feature selection algorithms to discriminate between DE and non-DE genes. To minimize the effect of ambiguously classified genes, we designated the top 500 most differentially expressed genes in each dataset as DE, and the bottom 500 as non-DE (“Methods”), and then quantified performance using the area under the receiver operating characteristic (AUROC; Supp. Note 3B). Remarkably, DUBStepR achieved an AUROC in excess of 0.97 on all 7 datasets, indicating near-perfect separation of DE and non-DE genes (Fig. 4c). devianceFS was able to exceed the same performance threshold on 4 of the 7 datasets and HLG on only one. All other methods demonstrated significantly lower performance (Fig. 4c). Thus, DUBStepR greatly improves our ability to select cell type/subtype-specific marker genes (DE genes) for clustering scRNA-seq data.

With the exponential increase in the size of single-cell datasets, any new computational approach in the field must be able to scale to over a million cells. To improve DUBStepR’s ability to efficiently process large datasets, we identified a technique to reduce a key step in stepwise regression to a single matrix multiplication, sped up calculation of the elbow point, and implemented the entire workflow on sparse matrices (“Methods”). To benchmark scalability, we profiled execution time and memory consumption of DUBStepR, as well as the other aforementioned feature selection methods, on a recent mouse organogenesis dataset of over 1 million cells20. This dataset was downsampled to produce two additional datasets of 10k and 100k cells, respectively, while maintaining cell-type diversity (Supp. Note 2B). DUBStepR, HVGDisp, HVGVST, trendVar, devianceFS, and M3DropDANB were able to process the entire 1 million cell dataset, while GiniClust and HLG could not scale to 100k cells (Supp. Fig. S4). On the largest dataset, DUBStepR ranked fourth out of the eight tested methods in memory consumption and compute time. In terms of memory scalability, DUBStepR used 6.4x more memory to process the 1M cell dataset as compared to the 100k dataset. In contrast, HVGDisp, HVGVST, trendVar, devianceFS, and M3DropDANB all increased their memory consumption by 12.5x. Thus, DUBStepR is scalable to over a million cells and shows promise for even larger datasets.

Density index predicts the optimal feature set

As shown above, selecting too few or too many feature genes can result in sub-optimal clustering (Fig. 4a). Ideally, we would want to select the feature set size that maximized cell type separation (i.e., the SI) in the feature space. However, since the feature selection algorithm by definition does not know the true cell type labels, it is not possible to calculate the SI for any given feature set size. We therefore endeavored to define a proxy metric that would approximately model the SI without requiring knowledge of cell-type labels. To this end, we defined a measure of the inhomogeneity or “clumpiness” of the distribution of cells in feature space. If each cell clump represented a distinct cell type, then this measure would tend to correlate with the SI. The measure, which we termed the density index (DI), equals the root-mean squared distance between all cell pairs, divided by the mean distance between a cell and its k nearest neighbors (“Methods”). Intuitively, when cells are well clustered and therefore inhomogeneously distributed in feature space, the distance to nearest neighbors should be minimal relative to the distance between random pairs of cells, and thus DI should be maximal (Fig. 5a, b). Empirically, we found that DI and SI were indeed positively correlated and tended to reach their maxima at approximately the same feature set size (Fig. 5c). Further, for 5 out of the 7 benchmarking datasets, the feature set with the highest DI also maximized SI (Fig. 5d). Since our earlier analysis only tested a discrete number of feature set sizes (Fig. 4a; Supp. Note 3A, B), the DI-guided approach even improved on the maximum SI in 2 cases (Fig. 5d). One additional advantage of the DI is that it is relatively straightforward to compute, since the numerator is proportional to the square of the Frobenius norm of the gene expression matrix (“Methods”). By default, DUBStepR therefore selects the feature set size that maximizes DI.

Fig. 5: The density index.
figure 5

a, b UMAP visualizations of the 5 cell lines comparing the local neighborhood density of feature spaces in (a) good feature selection versus (b) poor feature selection. c Comparison of optimal density index and silhouette index over feature set size range. d Fraction of the maximum benchmarking SI achieved using feature set sizes determined by the density index as compared to fixing feature set sizes at 500 genes and 2000 genes (refer to Supp. Table S6 for details regarding the datasets).

DUBStepR robustly detects rare cell types and cryptic cell states in rheumatoid arthritis samples

The above quantitative benchmarking analyses were largely based on detection of common cell types (>10% of all cells) in cell lines or FACS-purified cell populations from healthy donors. To demonstrate the ability of DUBStepR to cluster cells from a complex primary sample, we generated scRNA-seq data from 8312 PBMCs from four rheumatoid arthritis (RA) patients (“Methods”). In this case, since the “true” cell type labels were unknown, our objective was to qualitatively compare results from the various feature selection methods. We used SingleR21 to select the T and NK cell subset (5329 cells; “Methods”) since this cell population is challenging to sub-cluster by conventional methods, despite its relevance to inflammatory phenotypes. DUBStepR (with DI optimization) identified 10 discrete subtypes in this dataset, with sharply distinct gene expression signatures (Fig. 6a; Supp. Fig. S5). These included four rare cell clusters that were undetected or only partially detected by the other feature selection methods: red blood cells (RBCs, 1.8%), proliferating cells (2%), platelet-T doublets (3.4%), and platelet-NK doublets (3%) (Fig. 6b; Supp. Figs. S5, S6). While RBCs reflect contamination during PBMC isolation, platelet-lymphocyte complexes and proliferating T cells regulated by KIAA0101 are thought to play a role in the pathophysiology of RA22,23,24 (Supp. Fig. S5).

Fig. 6: Analysis of lymphocyte population from rheumatoid arthritis patient PBMCs.
figure 6

a UMAP visualization of clusters identified by DUBStepR. bh UMAP visualization of cell clusters identified by DUBStepR using features selected by the 7 other feature selection methods; b HVGDisp, c HVGVST, d trendVar, e devianceFS, f M3DropDANB, g GiniClust, and h HLG.

In addition to detecting multiple rare cell types, DUBStepR identified a dichotomy in CD4+ T, CD8+ T, and NK cells, defined by coordinated differential expression of SET, C1orf56, C16orf54, CDC42SE1, and HNRNPH1 (Supp. Fig. S5), all of which have been previously identified as markers of a latently infected T cell subtype in HIV25. Once again, DUBStepR was the only feature selection method to clearly distinguish these cell states (Supp. Figs. S5, S6). In summary, DUBStepR was the only feature selection algorithm that robustly detected common and rare cell types and subtypes in this complex primary lymphocyte population.

DUBStepR generalizes to scATAC-seq data

Feature selection is typically not performed on scATAC-seq data, since their almost binary nature (most genomic bins have zero or one count) renders them refractory to conventional single-cell feature selection techniques based on variance-mean comparison26. However, since the logic of feature correlations applies even to binary or almost binary data, we hypothesized that DUBStepR could also improve the quality of cell type inferences from this data type. To test this hypothesis, we applied DUBStepR to scATAC-seq data from eight FACS-purified subpopulations of human bone marrow cells27 (Supp. Note 2D). In contrast to the common approach of using all scATAC-seq peaks, we found that peaks selected by DUBStepR more clearly revealed the emergence of the three major lineages from hematopoietic stem cells: lymphoid, myeloid, and megakarocyte/erythroid (Fig. 7). Specifically, trajectory analysis using Monocle 320 yielded a topology that matched the known hematopoietic differentiation hierarchy27 (Fig. 7g) only in the case of DUBStepR (Fig. 7d–f).

Fig. 7: Comparison of feature selection on scATAC-seq data from Buenrostro et al.27.
figure 7

ac UMAP visualizations of cells colored by FACS-sorted cell type labels, with trajectories constructed using Monocle 3. a All peaks (no feature selection), b peaks selected by highly variable features (HVF), c peaks selected by DUBStepR. df Monocle-generated trajectories on the corresponding UMAP visualizations. g Hematopoietic differentiation schematic. CLP common lymphoid progenitor, CMP common myeloid progenitor, GMP granulocyte/macrophage progenitor, HSC hematopoietic stem cell, LMPP lymphoid-primed multipotent progenitor, MEP megakaryocyte/erythroid progenitor, Mono monocyte, MPP multipotent progenitor.

Discussion

DUBStepR is based on the intuition that cell-type-specific marker genes tend to be well correlated with each other, i.e., they typically have strong positive and negative correlations with other marker genes. After filtering genes based on a correlation range score, DUBStepR exploits structure in the gene–gene correlation matrix to prioritize genes as features for clustering. To benchmark this feature selection strategy, we used a stringently defined collection of single-cell datasets for which cell type annotations could be independently ascertained28. Note that this avoids the circularity of defining the ground truth based on the output of one of the algorithms being tested. Results from our benchmarking analyses indicate that, regardless of feature set size, DUBStepR separates cell types more clearly other methods (Fig. 4a, b). This observation is further corroborated by the fact that DUBStepR predicts cell-type-specific marker genes substantially more accurately than other methods (Fig. 4c). Thus, our results demonstrate that gene–gene correlations, which are ignored by conventional feature selection algorithms, provide a powerful basis for feature selection.

The plummeting cost of sequencing, coupled with rapid progress in single-cell technologies, has made scalability an essential feature of novel single-cell algorithms. DUBStepR scales effectively to datasets of over a million cells without sharp increases in time or memory consumption (Supp. Fig. S4). Thus, the method is likely to scale well beyond a million cells. A major contributor to the algorithm’s scalability is the fact that, once the gene–gene correlation matrix is constructed, the time and memory complexity of downstream steps is constant with respect to the number of cells.

Intriguingly, DUBStepR approaches its maximum silhouette index value at 200–500 feature genes (Supp. Fig. S3), which is well below the default feature set size of 2000 used in most single-cell studies10,12. Thus, our results suggest that, if feature selection is optimized, it may not be necessary to select a larger number of feature genes. Note, however, that the optimum feature set size can vary across datasets (Supp. Fig. S3). Selecting a fixed number of feature genes for all datasets could therefore result in sub-optimal clustering (Fig. 5d).

From the perspective of cell clustering, the optimal feature set size is that which maximizes cell type separation in feature space, which can be quantified using the SI. As an indirect correlate of cell type separation, we have defined a measure of the inhomogeneity or “clumpiness” of cells in feature space, which we term the density index (DI). To our knowledge, DI is the only metric for scoring feature gene sets based on the distribution of cells in feature space. Our results suggest that the DI correlates with the SI, and that cluster separation is improved in most cases when the feature set is chosen to maximize DI. Another important advantage of the DI is that it is computationally straightforward to calculate from the Frobenius norm of the data matrix. It is possible that the DI measure could also be applied to other stages of the clustering pipeline, including dimensionality reduction (selecting the optimal number of PCs) and evaluation of normalization strategies.

Interestingly, although DUBStepR was not specifically designed to detect rare cell types, it nevertheless substantially outperformed all other methods at detecting multiple cell populations present at low frequency (<4%) in a complex primary PBMC sample (Fig. 6). Notably, the rare populations identified by DUBStepR-included RBCs and platelet-containing doublets, which should not have been present in the T/NK population. It is likely that these cells were mis-classified as T/NK by SingleR due to the absence of platelet and RBC transcriptomes in the reference panel. In addition, DUBStepR greatly outperformed all other feature selection methods at detecting T and NK cell subpopulations over a range of higher frequencies (4.9−9.5%) in the same dataset. Given that this dataset posed the greatest challenge in terms of clustering difficulty, it is remarkable that DUBStepR provided a major, qualitative improvement over all other feature selection methods.

Algorithmic pipelines for single-cell epigenomic data, for example scATAC-seq, typically do not incorporate a formal feature selection step26,29. In most cases, such pipelines merely discard genomic bins at the extremes of high and low sequence coverage. This is because the sparsity and near-binary nature of single-cell epigenomic data reduces the efficacy of conventional feature selection based on mean-variance analysis. Since DUBStepR uses an orthogonal strategy based on correlations between features, it is less vulnerable to the limitations of single-cell epigenomics data (Fig. 7). Thus, DUBStepR opens up the possibility of incorporating a feature selection step in single-cell epigenomic pipelines, including scATAC-seq, scChIP-seq, and single-cell methylome sequencing.

Methods

DUBStepR methodology

Gene filtering

By default, DUBStepR filters out genes that are not expressed in at least 5% of cells. We allow the user to adjust this parameter if they are interested in genes that are more sparsely expressed in their dataset. In addition, mitochondrial genes, ribosomal genes, and pseudogenes are identified using gene symbols or Ensembl IDs for human, mouse, and rat datasets using the latest Ensembl references downloaded from BioMart30.

Correlation range

Correlation range ci for gene i can be defined for a gene–gene correlation matrix G as

$${c}_{i}={{{{{{\mathrm{max}}}}}}}_{3}({G}_{i})-0.75\cdot {{{{{\mathrm{min}}}}}}({G}_{i}),$$
(1)

where maxh(Gi) refers to the hth-largest correlation value in column i of G. Correlation range uses the second-largest non-self correlation value (3rd largest value in GGC column) to calculate the range, so as to protect against genes with overlapping 5′ or 3′ exons31.

The minimum correlation value has been down-weighted to 0.75 to give greater importance to stronger positive correlations over negative correlations.

We first binned genes based on their mean expression level, as mean expression tends to correlate with technical noise32. In each bin, we compute a z-score of the correlation range of gene i as

$${z}_{i}=\frac{{c}_{i}-{\mu }_{c}}{{\sigma }_{c}},$$
(2)

where μc is the mean correlation range of the gene and σc refers to variance in the correlation range scores of a gene. Genes with a z-score ≤ 0.7 are filtered out at this step.

Stepwise regression

We define the stepwise regression equation as

$$G=g{{{{{{{{\bf{w}}}}}}}}}^{T}+\epsilon ,$$
(3)

where G is the column-wise zero-centered gene–gene correlation matrix, g is the column of the matrix G to be regressed out, ϵ is the matrix of residuals and w is a vector of regression coefficients. The squared error (ϵTϵ) is minimized when

$${{{{{{{{\bf{w}}}}}}}}}^{T}={\hskip 1pt}\frac{{{{{{{{{\bf{g}}}}}}}}}^{T}G}{{{{{{{{{\bf{g}}}}}}}}}^{T}{{{{{{{\bf{g}}}}}}}}}$$
(4)

Thus,

$$G=\frac{{{{{{{{\bf{g}}}}}}}}{{{{{{{{\bf{g}}}}}}}}}^{T}G}{{{{{{{{{\bf{g}}}}}}}}}^{T}{{{{{{{\bf{g}}}}}}}}}+\epsilon .$$
(5)

We calculate variance explained by the regression step as \(V={\hskip -3pt}\parallel G-\epsilon {\parallel }_{F}^{2}\), where F indicates the Frobenius norm. To efficiently compute V for all genes, we define X = GTG and x as the row of X corresponding to gene g. Thus, x = gTG. We can simplify V as

$$V ={\hskip -3pt}\parallel G-\epsilon {\parallel }_{F}^{2}\\ ={\left\Vert \frac{{{{{{\bf{g}}}}}}{{{{{{\bf{g}}}}}}}^{T}G}{({{{{{{\bf{g}}}}}}}^{T}{{{{{\bf{g}}}}}})}\right\Vert }_{F}^{2}\\ =\frac{{\parallel {{{{{\bf{g}}}}}}{{{{{\bf{x}}}}}}\parallel }_{F}^{2}}{{({{{{{{\bf{g}}}}}}}^{T}{{{{{\bf{g}}}}}})}^{2}}\\ =\frac{{{{{{\rm{Tr}}}}}}({({{{{{\bf{g}}}}}}{{{{{\bf{x}}}}}})}^{T}({{{{{\bf{g}}}}}}{{{{{\bf{x}}}}}}))}{{({{{{{{\bf{g}}}}}}}^{T}{{{{{\bf{g}}}}}})}^{2}}\\ =\frac{{{{{{\rm{Tr}}}}}}({{{{{{\bf{x}}}}}}}^{T}{{{{{\bf{x}}}}}})}{{{{{{{\bf{g}}}}}}}^{T}{{{{{\bf{g}}}}}}}\\ =\frac{{{{{{\bf{x}}}}}}{{{{{{\bf{x}}}}}}}^{T}}{{{{{{{\bf{g}}}}}}}^{T}{{{{{\bf{g}}}}}}}.$$
(6)

Thus, we can use a single matrix multiplication GTG to efficiently calculate variance explained by each gene in the gene–gene correlation matrix, and then regress out the gene explaining the greatest variance. The residual from each step k is then used as the gene–gene correlation matrix for the next step. In other words,

$$ {G}_{k}={{{{{{\bf{g}}}}}}}_{k}{{{{{{\bf{w}}}}}}}_{k}^{T}+{\epsilon }_{k}\\ {G}_{k+1}={\epsilon }_{k}.$$
(7)

For computational efficiency, we repeat this regression step 30 times and then assume that the next 70 steps explain the amount of variance as the 30th step, giving a total of 100 steps. We observed that this shorter procedure had little or no impact on the results, since the variance explained changed only marginally beyond the 30th step.

To select the genes contributing to the major directions in G, we use the elbow point on a scree plot. Elbow point computation is described in Supp. Note 1B. The genes that are regressed out upto the elbow point form the “seed” gene set.

Guilt-by-association

Guilt-by-association, also known as label propagation through a network, allows DUBStepR to determine a robust ordering of features in an iterative manner. Once the seed genes have been determined, the gene with the strongest Pearson correlation to any of the seed genes is first identified. This gene is then added to the seed genes, thereby expanding the feature set. This feature set (now consisting of the seed genes and the newly added feature gene) is then used to identify the next most strongly correlated gene, which is again added to the feature set. By iteratively repeating the latter step, DUBStepR propagates through the gene–gene correlation network until the feature set has reached its final size.

We have developed a custom implementation of this guilt-by-association approach as part of the DUBStepR package in R, the source code for which is available on our GitHub repository (see “Code availability” section).

Density index

For a given feature set, PCA5 is performed on the gene expression matrix and the top D principal components (PCs) are selected, where D is a user-specified parameter with a default value of 20. Let M be the matrix of embeddings of the gene expression vectors of N cells in D principal components. The root-mean-squared distance drms between pairs of cells i and j can be calculated as

$${d}_{rms}=\sqrt{ < {d}_{i,j}^{2} > }=\sqrt{ < {{{\Sigma }}}_{p = 1}^{D}{({M}_{i,p}-{M}_{j,p})}^{2} > },$$
(8)

where < > denotes the average over all pairs {(i, j)i [1, N], j [1, N]}. Note that, for simplicity of the final result, we include pairs in which i = j. This can be further simplified as follows:

$${d}_{rms} =\sqrt{ < {{{\Sigma }}}_{p = 1}^{D}({M}_{i,p}^{2}+{M}_{j,p}^{2}-2{M}_{i,p}{M}_{j,p}) > }\\ =\sqrt{{{{\Sigma }}}_{p = 1}^{D}( < {M}_{i,p}^{2} > + < {M}_{j,p}^{2} > -2 < {M}_{i,p}{M}_{j,p} > )}\\ =\sqrt{{{{\Sigma }}}_{p = 1}^{D}( < {M}_{i,p}^{2} > + < {M}_{j,p}^{2} > )}\\ =\sqrt{2{{{\Sigma }}}_{p = 1}^{D} < {M}_{i,p}^{2} > }\\ =\sqrt{\frac{2}{N}}\parallel M{\parallel }_{F}.$$
(9)

In the above derivation, the mean product term <Mi,pMj,p> is zero because Mi,p and Mj,p have zero mean across i and j, respectively. Let ki denote the average distance of cell i from its k nearest neighbors, and km denote the mean of ki across all cells. We define the DI as

$${{{{{\mathrm{DI}}}}}}=\frac{{d}_{rms}}{{k}_{m}}=\sqrt{\frac{2}{N}}\frac{\parallel M{\parallel }_{F}}{{k}_{m}}.$$
(10)

Rheumatoid arthritis dataset

Patient sample collection

Fresh blood samples of patients diagnosed with rheumatoid arthritis were collected at Tan Tock Seng Hospital, Department of Rheumatology, Allergy & Immunology, Singapore, and were transferred to Genome Institute of Singapore for further processing. All technical procedures and protocols for the recruitment, blood collection, and PBMCs isolation were reviewed and approved by the Institutional Review Board (IRB) at the National Healthcare Group Domain Specific Review Board (NHG DSRB), Singapore (Reg. no. 2016/00899).

Single-cell RNA sequencing

For each sample, fresh PBMCs were isolated from 5 ml Sodium Heparin tubes using standard Ficoll-Hypaque density gradient centrifugation33. Briefly, blood samples were diluted by an equal volume of Phosphate-buffered saline (PBS) containing 2% fetal bovine serum (FBS) (STEMCELL, catalog #07905) and were gently added on top of LymphoprebTM density gradient medium (STEMCELL, catalog #07801) in the SepMateTM (STEMCELL, catalog #15420) tubes. The tubes were centrifuged at 1200 × g for 15 min and the upper layer containing the enriched PBMCs and plasma was collected into a new falcon tube to wash with PBS + 2% FBS and centrifuge at 300 × g for 8 min. After counting the cells, isolated PBMCs were divided into two or three vials and frozen down in FBS containing 10% dimethylsulfoxide (DMSO) for downstream experiments. Immediately after isolation, fresh PBMCs were washed, filtered (using cell strainer), and re-suspended in PBS containing 0.04 bovine serum albumin. Single-cell suspensions from four samples were mixed at the final concentration of 106 cells/ml and the mixed suspension was loaded into a 10x Chromium Instrument to target total number of 3000 cells from each patient (n = 4). GEM-RT reactions, cDNA synthesis, and library preparations were performed using Single Cell \(3^{\prime}\) v2 10x ReagentTM Kit according to manufacturer protocols. The single-cell libraries were run onto the Illumina HiSeq® 4000 platform as prescribed by 10x Genomics. The raw base call (BCL) files from the sequencer were processed through 10x Genomics Cell Ranger 2.1.1 analysis pipelines. First, the mkfastq pipeline was run to generate FASTQ files, followed by read alignment to the hg19 genome reference using the count pipeline. The raw counts matrix generated by the count pipeline was loaded into R for further analysis.

Demultiplexing pooled samples

To facilitate demultiplexing, the patients were genotyped using the Illumina Infinium® HTS Assay following the manufacturer’s protocol. Reads from the sequencer were attributed to each of the four patients using Demuxlet34 at default settings. For each library, the corresponding genotyping data (VCF file) and scRNA-seq data (BAM file) were imported into Demuxlet in order to infer the sample source for each cell barcode.

Data preprocessing

The raw data files were used to generate a Seurat object, including only features that were detected in a minimum of 3 cells and cells having at least 200 uniquely detected features. Additional filtering was performed to remove cells having >10% mitochondrial rate (calculated using the PercentageFeatureSet function in Seurat35). Further, only cells with unique feature counts between 200 and 2500 were retained, resulting in 8350 cells. The data was then log-normalized using the NormalizeData function in Seurat35 at default settings.

Isolating T & NK cells

The normalized data were then annotated using SingleR21, with the Monaco et al. immune dataset36 as the reference. The SingleR function was run at default settings, using the log-normalized counts as input. Thirty-eight cells were pruned by SingleR, leaving a total of 8312 annotated cells. Of these, cells annotated as CD4+ T cells, CD8+ T cells, or NK cells were isolated (Fig. 8; Table 2).

Fig. 8: Heatmap showing SingleR scores for cells in the RA PBMC dataset computed against the Monaco immune reference36.
figure 8

Higher score indicates greater likelihood of the cell being of that corresponding cell type. Cells pruned by SingleR are indicated in orange.

Table 2 Number of cells of each cell type annotated by SingleR that were obtained from each patient in the RA PBMC dataset.

DUBStepR analysis

Due to the lower expression of genes in the T and NK cell population, we modified DUBStepR’s gene filtering threshold to filter out genes expressed in less than 1% of the cells. The Seurat package was used for downstream processing. First, the feature genes were zero-centered and scaled, and PCA was performed. The top 12 PCs were selected using the elbow plot of variance explained by the PCs. Clustering was performed using the Louvain graph-based approach—using the FindNeighbors function with 12 PCs and FindClusters at default parameter settings. UMAP coordinates were computed using the cell embeddings in 12 PCs.

Other feature selection methods

All other feature selection methods (HVGDisp, HVGVST, trendVar, devianceFS, M3DropDANB, GiniClust, and HLG) were run at default settings. To showcase the result of HVGVST, we used the Seurat package to cluster cells, as described above. Similar to the DUBStepR result, 12 PCs were used for both clustering and UMAP visualization.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.