A clustering-independent method for finding differentially expressed genes in single-cell transcriptome data

A common analysis of single-cell sequencing data includes clustering of cells and identifying differentially expressed genes (DEGs). How cell clusters are defined has important consequences for downstream analyses and the interpretation of results, but is often not straightforward. To address this difficulty, we present singleCellHaystack, a method that enables the prediction of DEGs without relying on explicit clustering of cells. Our method uses Kullback–Leibler divergence to find genes that are expressed in subsets of cells that are non-randomly positioned in a multidimensional space. Comparisons with existing DEG prediction approaches on artificial datasets show that singleCellHaystack has higher accuracy. We illustrate the usage of singleCellHaystack through applications on 136 real transcriptome datasets and a spatial transcriptomics dataset. We demonstrate that our method is a fast and accurate approach for DEG prediction in single-cell data. singleCellHaystack is implemented as an R package and is available from CRAN and GitHub.


Supplementary Methods singleCellHaystack methodology
The two main functions in singleCellHaystack are haystack_2D and haystack_highD, for 2D and multidimensional (≥2D) input spaces, respectively. The input parameters are 1) the coordinates of cells in a multidimensional (≥2D) space (e.g. t-SNE, UMAP, PC coordinates, spatial coordinates, etc.), and 2) data indicating for each gene in which cells it was detected and not detected. We refer to Supplementary Figure 1 for an overview of the workflow. Here, we first focus on haystack_highD. A description of haystack_2D follows.
Step 0: Optional standardization of the input space In this optional step, each dimension of the input coordinates is rescaled to mean 0 and standard deviation 1.
Step 1: Setting parameters Steps 2-4 depend on using grid points for estimating the local density of cells using a Gaussian kernel, in function of their distance to the grid points. In the first step, haystack_highD decides the grid points and bandwidths for doing these calculations.
By default, 100 grid points (option grid.points) are decided in a way that results in grid points being roughly uniformly spread over the subspace in which the cells are located. In other words, grid points should not be located close to each other, but should be proximal to the cells, and no cells should be distal from all grid points. In haystack_highD, the default way of deciding grid points is by running k-means clustering of the input cell coordinates, and then use the resulting 100 centroids as grid points. Note that the clustering of cells itself is not important. Another approach is to use seeding, as used in k-means++ clustering, in which initial centers are picked iteratively so that they are distal from other centers 1 .
A bandwidth ℎ is decided as follows: for each cell, the distance to the closest grid point is calculated, and ℎ is defined as the median of those distances. Normalized distances between cells and grid points are subsequently defined as the Euclidean distances divided by the bandwidth ℎ. The density contribution of each cell to each grid point is calculated as: where , is the normalized distance between and grid point .
Step 2: Estimating reference distribution The density contributions of all cells in the multidimensional space are used to estimate a reference distribution, . The density of cells at each grid point in the space is calculated as: (1) After this, is normalized to sum to unity.
Step 3: Estimating distributions ( = ) and ( = ) The distribution of the cells in which gene is detected, ( = ), and those in which gene is not detected, ( = ), is estimated using the same bandwidth and grid points, as follows: and where ( = ) and ( = ) represent the subsets of cells in which is detected and not detected, respectively. Subsequently ( = ) and ( = ) are normalized to sum to unity.
Step 4: Estimating the Kullback-Leibler divergence of gene , ( ) The divergence of the expression pattern of gene , ( ), is calculated as follows: ∈{ , } If the cells in which is detected (and not detected) do not show a biased distribution, and approximately follow the reference distribution , then ( ) is close to 0. As the discrepancy to the reference distribution increases, the value of ( ) also increases.
Step 5: Estimating the significance of ( ) Finally, haystack_highD evaluates the statistical significance of the ( ) values. We cannot naively regard high ( ) values as significant, because there is a tendency for genes expressed in very few cells, and for genes expressed in almost all cells, to have a high ( ).
Instead, haystack_highD evaluates the significance of observed ( ) values by comparing them to randomized data. First, let represent the number of cells in which each gene is detected in the scRNA-seq data. In practice, 50 sets of randomized genes are made that are expressed in cells; being 200 values taken from the actual values of the data, evenly spread across the range of values. The ( ) of these randomized genes are recorded. If there are less than 200 unique values in the data, then all are used. (3) For each value, the randomized 2 ( ( )) values follow an approximately normal distribution. We can therefore use the mean and standard deviation of randomized 2 ( ( )) values to estimate a p-value of the actually observed ( ) values. The mean and standard deviation of randomized 2 ( ( )) values are modeled as a function of using B-splines (using the splines R package). By default, 10 degrees of freedom are used for the splines, although this is lowered if there are few values.
Using the B-splines, expected means and standard deviations are predicted for each gene, in function of its . P-values are then calculated using the pnorm function in R.

The singleCellHaystack advanced mode
In practice, in some cells more genes are detected than in others (for examples, see Figure 3A, and Figure 6, left). The default reference distribution does not take this into account, which can lead to genes being judged to be DEGs, even if they are merely detected in cells that have many detected genes.
haystack_highD can take this into account by weighting the density contributions of cells by the number of genes detected in each cell: × where is the number of genes detected in . This assigns a larger influence to cells with many detected genes in the calculation of . Estimation of ( = ) and ( = ) is done as in the default mode.
A second difference is in the construction of the randomized genes (see Step 5 above). In the default mode randomized genes are simulated by randomly picking cells from all cells in the input data according to a uniform probability (i.e. each cell has the same chance of being selected). In the advanced mode, the probabilities reflect the number of genes detected in each cell (i.e. cells in which many genes are detected have a higher chance of being selected). As a result, randomized genes tend to follow the detection levels of the actual cells more closely.
The advanced mode can be activated by giving the vector of values as input to the use.advanced.sampling parameter of the haystack_highD and haystack_2D functions.

The haystack_2D function for 2-dimensional input
In addition to haystack_highD, which is aimed at >2D input, singleCellHaystack includes the function haystack_2D, which was designed for 2D input coordinates. The general concept is the same. Below, we discuss differences in steps 1-4.
Step 5 is the same as introduced above.
Step 1: Setting parameters For haystack_2D, steps 2-4 depend on the mapping of 2D coordinates onto a grid, and estimating the local density of cells using a Gaussian kernel, in function of their (6) distance to the grid points. In the first step, haystack_2D decides the parameters for doing these calculations. Bandwidths ℎ and ℎ for the X and Y axes are decided using the rule-of-thumb of the bandwidth.nrd function (MASS package in R). The number of grid points is decided so that there are 25 grid points between the 10% and 90% percentile of coordinates along both axes. The grid is then further extended until it covers all points. The goal of this strategy is to reduce the influence of outliers on the definition of the grid.
The density contributions of each cell to each grid point along both axes is calculated as: for the X-axis grid points, and ) for the Y-axis grid points, where and represent the coordinates of each cell , and the coordinates of grid points of the X-axis and Y-axis, respectively.
Step 2: Estimating reference distribution The density distribution of all cells in the 2D space is used as a reference distribution, . The density at each grid point in the 2D space is calculated as: To each ( , ) value, a small pseudo count is added, defined as the 1% percentile value of non-zero ( , ) values. After this, is normalized to sum to unity.
Step where ( = ) and ( = ) represent the subsets of cells in which is detected and not detected, respectively. The same pseudo count is added as for , and subsequently ( = ) and ( = ) are normalized to sum to unity.
Step 4: Estimating the Kullback-Leibler divergence of gene , ( ) The divergence of the expression pattern of gene , ( ), is calculated as follows: , ∈{ , } Step 5: Estimating the significance of ( ) This step is the same as for haystack_highD (see above).

scRNA-seq datasets and processing
Datasets were processed as described below. In many steps, we followed the recommendations described by Kobak and Berens 2 .
Step 1: Data sources - Tabula  Step 2: Filtering of cells and genes Library sizes of each cell were calculated by summing the number of reads or UMI (Universal Molecular Identifier) counts over all genes. Counts were then converted to counts per million counts (CPM). Genes were defined to be detected in a cell if their CPM was above a threshold CPM. We used the median CPM of each gene in the dataset as threshold, except for the spatial transcriptomics analysis where we defined a gene detected if the counts where > 0.
To focus on dataset sizes that are representative of typical current single-cell datasets we randomly selected 20,000 cells in datasets with more than 20,000 cells. For the Tabula Muris microfluidic droplet data, we selected the 20,000 cells with the most detected genes.
We filtered out cells with fewer than 100 detected genes, and genes detected in fewer than 10 cells.
Step 3: Principal Component Analysis, t-SNE and UMAP We selected 1,000 genes with large variance given their mean using dropout rates and mean CPM across non-zero counts, as described by Kobak and Berens 2 . The log2 CPM values of these 1,000 genes (adding pseudocount of 1) were used as input for PCA, without scaling. Subsequently, t-SNE and UMAP were run on the first 50 PCs. For t-SNE, we used the Rtsne package (version 0.15) using perplexity 30, for a maximum of 500 iterations 6 . For UMAP, we used the umap package (version 0.2.0.0) 7 .
Step 4: singleCellHaystack analysis singleCellHaystack was applied on the datasets using the haystack function, with inputs: 1) the coordinates of cells in a ≥2D space (2D t-SNE coordinates, 2D UMAP coordinates, or 5, 10, 15, 25, and 50 PCs) and 2) the detection data of each gene in each cell. This includes all genes that passed the filtering step (i.e. not only the 1,000 genes used as input for PCA). haystack was run both using the default mode and the advanced mode, which takes into account the general detection levels of genes (see main manuscript and explanation above). Runtimes were recorded for each run.
Step 5: Visualization of results For the results returned by haystack by the default mode and the advanced mode, the following was done, separately: -Top-scoring DEGs were selected, using the function show_result_haystack. For this, the p-value threshold 1e-6 was used. -Significant DEGs were grouped into clusters by similarity of their expression pattern in the input space by hierarchical clustering. This was done using the function hclust_haystack. For the example application on the Mouse Cell Atlas Testis 1 dataset (Supplementary Figure 6), the function kmeans_haystack for k-means clustering was used. In all cases the number of clusters was arbitrarily set to 5. -The average distribution of the genes in each of the clusters was visualized using function plot_gene_set_haystack (see for example Supplementary Figure 4). -Within each cluster, the most significant DEG was plotted using plot_gene_haystack (see for example Figure 3 and Supplementary  Figures 5 and 6).

Supplementary Notes
Dependency of singleCellHaystack on input space As expected, different input spaces result in different results (Supplementary Figure  7A). Similar differences were observed for cluster-based DEG prediction approaches when they were given clusters of cells obtained from different input spaces (e.g. clustering done on the first 5 PCs, first 10 PCs, etc.). Generally, similar input spaces (t-SNE vs UMAP, or 5 PCs vs 10 PCs) returned more consistent results, while different input spaces (5 PCs vs 50 PCs) resulted in more discrepancies. Although generally there was a correlation in the p-values obtained from different input spaces, there was a notable deviation seen especially in the 50 PC input for the Testis 1 dataset. The 11 th to 50 th PC of this dataset contains variation which is obviously not included in the first 5 or 10 PCs and which is not completely captured by t-SNE and UMAP.

Dependency of singleCellHaystack on bandwidth ℎ
The estimation of the gene expression distributions depends on a bandwidth ℎ (see Supplementary Methods). Our method uses a heuristic to pick a suitable (neither too large nor too small) bandwidth value that is in the order of magnitude of the typical distance between cells and the nearest grid point. Supplementary Figure 7B shows how results change when the bandwidth is increased or decreased. In general, the default bandwidth setting returns consistent results with larger or smaller bandwidths (i.e. top scoring genes are the same). However, when the bandwidth is very narrow more discrepancies appear. With very narrow bandwidths, distances between cells and grid points increase to a point where some cells have no proximal grid points, resulting in some expression patterns being less well captured.

Dependency of singleCellHaystack on the number of grid points
For high-dimensional input spaces (ex: 50 PCs) it is impossible to use an extensive grid that covers the entire input space. Instead, singleCellHaystack uses a limited set of grid points (see Supplementary Methods) covering the parts of the input space that contain cells. By default, singleCellHaystack uses 100 such grid points, but different values can be specified by the user. Supplementary Figure 7C compares results when using 25, 50, 100, 150, and 200 grid point. In general, top scoring DEGs are consistent regardless of the number of grid points, but some discrepancies can be seen especially between runs using 25 vs 200 grid point. Using too few grid points could result in smaller groups of cells being not well covered. For very large and heterogeneous dataset, increasing the number of grid points is advised. On the other hand, using too many grid points could result in longer runtimes.

Dependency of singleCellHaystack on the coordinates of grid points
The grid points used by haystack_highD are not deterministic (see Supplementary  Methods). Unless a seed value is set for R's random number generator, each run will use different grid point coordinates and will return different results. To evaluate the dependency of results on grid point coordinates, we ran our method using 5 different seed values (Supplementary Figure 7D). We observed that results were stable.

Dependency of cluster-based approaches on the number of clusters
For comparison, we evaluated how a typical cluster-based method (Seurat's FindAllMarkers function using the default Wilcoxon Rank Sum test) depends on the number of predicted clusters. For each dataset, we applied Seurat's FindAllMarkers function on the default number of predicted clusters, as well as on the default reduced by 2 and 1, or increased by 1 and 2 (Supplementary Figure 7E). For each dataset, we found that increasing or decreasing the number of clusters resulted in considerable differences in the top-scoring DEGs.
Note that the number of clusters returned by Seurat's FindAllMarkers function is not deterministic, and the number of clusters and cell-to-cluster assignments change when different random number seed values are used. The correct number of clusters in a real single-cell dataset is therefore hard to determine, illustrating the importance of this comparison: different clustering results can result in considerably different DEGs being predicted.   Figure 4A in the main manuscript shows the scatterplot of p-values on which these rankings are based. Red: 176 genes with p-value of 0 by FindAllMarkers; Green: top 100 genes with highest significance according to singleCellHaystack. Cyan: three genes in the intersect of the above two sets of genes.