Main

Single-cell RNA-sequencing (RNA-seq) count tables are heteroskedastic. In particular, counts for highly expressed genes vary more than for lowly expressed genes. Accordingly, a change in a gene’s counts from 0 to 100 between different cells is more relevant than, say, a change from 1,000 to 1,100. Analyzing heteroskedastic data is challenging because standard statistical methods typically perform best for data with uniform variance.

One approach to handle such heteroskedasticity is to explicitly model the sampling distributions. For data derived from unique molecular identifiers (UMIs), a theoretically and empirically well-supported model is the gamma-Poisson distribution (also referred to as the negative binomial distribution)1,2,3, but parameter inference can be fiddly and computationally expensive4,5. An alternative choice is to use variance-stabilizing transformations as a preprocessing step and subsequently use the many existing statistical methods that implicitly or explicitly assume uniform variance for best performance3,6.

Variance-stabilizing transformations based on the delta method7 promise an easy fix for heteroskedasticity if the variance predominantly depends on the mean. Instead of working with the raw counts Y, we apply a non-linear function g(Y) designed to make the variances (and possibly, higher moments) more similar across the dynamic range of the data8. The gamma-Poisson distribution with mean μ and overdispersion α implies a quadratic mean–variance relationship \({\mathbb{V}}{{{\rm{ar}}}}[Y]=\mu +\alpha {\mu }^{2}\). Here, the Poisson distribution is the special case where α = 0 and α can be considered a measure of additional variation on top of the Poisson. Given such a mean–variance relationship, applying the delta method produces the variance-stabilizing transformation

$$g(y)=\frac{1}{\sqrt{\alpha }}{{{\rm{acosh}}}}\left(2\alpha y+1\right).$$
(1)

Supplementary Information A1 shows the derivation. Practitioners often use a more familiar functional form, the shifted logarithm

$$g(y)=\log \left(y+{y}_{0}\right).$$
(2)

This approximates equation (1), in particular if the pseudo-count is y0 = 1 / (4α) (Supplementary Information A2).

An additional requirement is posed by experimental variations in sampling efficiency and different cell sizes9, which manifest themselves in varying total numbers of UMIs per cell. Commonly, a so-called size factor s is determined for each cell and the counts are divided by it before applying the variance-stabilizing transformation: for example, \(\log (y/s+{y}_{0})\)6,10,11. There is a variety of approaches to estimate size factors from the data. Conventionally, they are scaled to be close to 1 (for example, by dividing them by their mean), such that the range of the adjusted counts is about the same as that of the raw counts. The simplest estimate of the size factor for cell c is

$${s}_{c}=\frac{{\sum }_{g}{y}_{gc}}{L},$$
(3)

where the numerator is the total number of UMIs for cell c, g indexes the genes and \(L={\left({{{\rm{no}}.\, {\rm{cells}}}}\right)}^{-1}{\sum }_{gc}{y}_{gc}\) is the average across all cells of these numerators.

Sometimes, a fixed value is used instead for L. For instance, Seurat uses L = 10,000, others12 have used L = 106 calling the resulting values ygc/sc counts per million (CPM). Even though the choice of L may seem arbitrary, it matters greatly. For example, for typical droplet-based single-cell data with sequencing depth of ∑gygc ≈ 5,000, using L = 106 and then transforming to \(\log ({y}_{gc}/{s}_{c}+1)\) is equivalent to setting the pseudo-count to y0 = 0.005 in equation (2). This amounts to assuming an overdispersion of α = 50, based on the relation between pseudo-count and overdispersion explained in Supplementary Information A2. That is two orders of magnitude larger than the overdispersions seen in typical single-cell datasets. In contrast, using the same calculation, Seurat’s L = 10,000 implies a pseudo-count of y0 = 0.5 and an overdispersion of α = 0.5, which is closer to overdispersions observed in real data. Yet, choosing L or y0 is unintuitive. Instead, we recommend parameterizing the shifted logarithm transformation in terms of the typical overdispersion, using the relation y0 = 1 / (4α) motivated above.

Hafemeister and Satija13 suggested a different approach to variance stabilization based on Pearson residuals

$${r}_{gc}=\frac{{y}_{gc}-{\hat{\mu }}_{gc}}{\sqrt{{\hat{\mu }}_{gc}+{\hat{\alpha }}_{g}\,{\hat{\mu }}_{gc}^{2}}},$$
(4)

where \({\hat{\mu }}_{gc}\) and \({\hat{\alpha }}_{g}\) come from fitting a gamma-Poisson generalized linear model (GLM),

$$\begin{array}{rcl}{Y}_{gc}& \sim &\,{{\mbox{gamma-Poisson}}}\,\left({\mu }_{gc},{\alpha }_{g}\right)\\ \log ({\mu }_{gc})&=&{\beta }_{g,{{{\rm{intercept}}}}}+{\beta }_{g,{{{\rm{slope}}}}}\log ({s}_{c}).\end{array}$$
(5)

Here, sc is again the size factor for cell c, and βg,intercept and βg,slope are intercept and slope parameters for gene g. Note that the denominator in equation (4) is the s.d. of a gamma-Poisson random variable with parameters \({\hat{\mu }}_{gc}\) and \({\hat{\alpha }}_{g}\).

A third set of transformations infers the parameters of a postulated generative model, aiming to estimate so-called latent gene expression values based on the observed counts. A prominent instance of this approach is Sanity, a fully Bayesian model for gene expression14. It infers latent gene expression using a method that resembles a variational mean-field approximation for a log-normal Poisson mixture model. Sanity comes in two flavors: Sanity Distance calculates the mean and s.d. of the posterior distribution of the logarithmic gene expression; based on these, it calculates all cell-by-cell distances, from which it can find the k-nearest neighbors (k-NN) of each cell. Sanity MAP (maximum a posteriori) ignores the inferred uncertainty and returns the maximum of the posterior as the transformed value. A related tool is Dino, which fits mixtures of gamma-Poisson distributions and returns random samples from the posterior15. Normalisr is a tool primarily designed for frequentist hypothesis testing16, but as it infers logarithmic latent gene expression, it might also serve as a generic preprocessing method. Normalisr returns the minimum mean square error estimate for each count assuming a binomial generative model.

In this work, we analyze transformations for preprocessing UMI-based single-cell RNA-seq data based on each of these approaches. We will first contrast the conceptual differences between them. In a second part, we benchmark the empirical performance of all approaches and provide guidelines for practitioners to choose among the methods. In the benchmarks, we also include a fourth preprocessing approach that is not transformation-based and directly produces a low-dimensional latent space representation of the cells: factor analysis for count data based on the (gamma-)Poisson sampling distribution. An early instance of this approach, called GLM PCA, was presented by Townes4 and applied to biological data by Townes et al.17. Recently, Agostinis et al.18 presented an optimized implementation called NewWave.

Results

There are multiple formats for each of the four approaches:

  • Among the delta method-based variance-stabilizing transformations, we considered the \({{{\rm{acosh}}}}\) transformation equation (1), the shifted logarithm equation (2) with pseudo-count y0 = 1 or y0 = 1 / (4α) and the shifted logarithm with CPM. In addition, we tested the shifted log transformation with highly variable gene selection (HVG), z scoring (Z) and rescaling the output as suggested by Booeshaghi et al.19.

  • Among the residuals-based variance-stabilizing transformations, we considered the clipped and unclipped Pearson residuals (implemented by sctransform and transformGamPoi) and randomized quantile residuals. In addition, we tested the clipped Pearson residuals with HVG selection, z scoring and an analytical approximation to the Pearson residuals suggested by Lause et al.20.

  • Among the latent gene expression-based transformations (Lat Expr), we considered Sanity Distance and Sanity MAP, Dino and Normalisr.

  • Among the count-based factor analysis models (Count), we considered GLM PCA and NewWave.

Last, we include two methods as negative (Neg) controls in our benchmarks, for which we expect poor performance: the raw untransformed counts (y) and the raw counts scaled by the size factor (y / s).

Conceptual differences

A known problem for variance-stabilizing transformations based on the delta method derives from the size factors. Figure 1a shows the first two principal components of a homogeneous solution of droplets encapsulating aliquots from the same RNA21 for representative instances of the delta method-, residuals- and latent expression-based transformation approaches. Extended Data Fig. 1 shows the results for all transformations. Despite the size factor scaling, after the delta method-based transformation, the size factor remained a strong variance component in the data (Extended Data Fig. 1b). In contrast, the other transformations better mixed droplets with different size factors. Intuitively, the trouble for the delta method-based transformation stems from the fact that the division of the raw counts by the size factors scales large counts from droplets with large size factors and small counts from droplets with small size factors to the same value. This violates the assumption of a common mean–variance relationship. In Supplementary Information A3, we dissect this phenomenon more formally.

Fig. 1: Conceptual differences between variance-stabilizing transformations.
figure 1

The four columns of this figure correspond to raw counts and transformation by shifted logarithm, clipped Pearson residuals and Sanity MAP. a, Scatter-plots of the first two principal components of data from droplets encapsulating a homogeneous RNA solution. Each point corresponds to a droplet and is colored by its size factor. b, Scatter-plots of the mean–variance relationship, where each point is a gene from a human hematopoietic cell dataset. Note that the y axis range differs between transformations and outliers are plotted on the edge of the plot. c, Histogram of the transformed values for Sftpc, a marker for type II pneumocytes that has a bimodal gene expression in mouse lung epithelium. Source data are provided.

Source data

One of the motivations stated by Hafemeister and Satija13 for the Pearson residuals-based variance-stabilizing transformation is that the delta method-based transformations fail to stabilize the variance of lowly expressed genes. Warton22 provided a theoretical explanation for this fact. Indeed, Fig. 1b shows that the variance after transformation with a delta method-based variance-stabilizing transformation was practically zero for genes with a mean expression of <0.1. In contrast, after residuals-based transformation, the variance showed a weaker dependence on mean expression, except for very lowly expressed genes whose variance is limited by the clipping step (compare Pearson and Pearson (no clip) in Extended Data Fig. 2). The results of the latent expression-based transformations were diverse, reflecting that these methods are not directly concerned with stabilizing the variance. Individual patterns ranged from higher variance for lowly expressed genes (Sanity Distance and Normalisr) to the opposite trend for Dino (Extended Data Fig. 2).

A peculiarity of the Pearson residuals is their behavior if a gene’s expression strongly differs between cell subpopulations. Figure 1c shows a bimodal expression pattern of Sftpc, a marker for type II pneumocytes. Unlike the transformations based on the delta method or latent expression models, the Pearson residuals are an affine-linear transformation per gene (equation (4)) and thus cannot shrink the variance of the high-expression subpopulation more than that of the low-expression subpopulation (compare the Pearson residuals with y / s in Extended Data Fig. 3). This can affect visualizations of such genes and, in principle, other analysis tasks such as detection of marker genes or clustering and classification of cells.

An alternative is to combine the idea of delta method-based variance-stabilizing transformations with the generalized linear model residuals approach by using non-linear residuals. We considered randomized quantile residuals23 (Extended Data Fig. 4 shows how they are constructed). Like Pearson residuals, randomized quantile residuals stabilized the variance for small counts (Extended Data Fig. 2), but in addition, they also stabilized the within-group variance if a gene’s expression strongly differed across cells (Extended Data Fig. 3).

Such conceptual differences of the transformation approaches are important to understand when applying them to new data types or when developing new transformations; but for most practitioners, empirical performance will be of primary interest. We look at this in the next section.

Benchmarks

There is no context-free measure of success for a preprocessing method, as it is contingent on the objectives of the subsequent analysis. For instance, if interest lies in identification of cell type-specific marker genes, one could assess the shape of distributions, such as in Fig. 1c, or the performance of a supervised classification method. Here, we considered the objective that arguably has been the main driver of single-cell RNA-seq development and applications so far: understanding the variety of cell types and states in terms of a lower-dimensional mathematical structure, such as a planar embedding, a clustering, trajectories, branches or combinations thereof. For all of these, one can consider the k-nearest neighbor (k-NN) graph as a fundamental data structure that encodes essential information. The next challenge is then the definition of ‘ground truth’. We designed our benchmarks upon reviewing previous benchmarking approaches. For instance, Breda et al.14 and Lause et al.20 employed synthetic or semi-synthetic data. This is operationally attractive, but it is difficult to be certain about biological relevance. Hafemeister and Satija13 and Lause et al.20 used qualitative inspection of non-linear dimension reduction plots. This can be informative, but is difficult to scale up and make objective. Germain et al.24 compared how well the transformations recovered a priori assigned populations, defined either through FACS or by mixing different cell lines. This is conceptually clean, but restricts analysis to a limited range of datasets that also may only offer a caricature view of cell diversity.

For all our benchmarks, we applied the transformations to the raw counts of each dataset listed below, computed a lower-dimensional representation of the cells using principal-component analysis (PCA), identified the k-NNs of each cell as measured by Euclidean distance and, finally, computed the overlap of the thus obtained k-NN graph with a reference k-NN graph (Methods). We performed these three benchmarks:

  • Consistency. We downloaded ten 10x datasets from the Gene Expression Omnibus (GEO) database. As there was no formal ground truth, we measured the consistency of the results (a necessary, although not sufficient, condition for their goodness) by splitting the genes of each dataset into two disjoint subsets.

  • Simulation. We used four different previously published simulation frameworks and one adapted by us to generate a diverse collection of datasets for which we had full access to the true k-NN graph.

  • Downsampling. We used five deeply sequenced datasets based on mcSCRB and Smart-seq3, which we downsampled to sequencing depths typical for the 10x technology. We postulated that a proxy for ground truth could be constructed from the k-NN graph inferred from the deeply sequenced data intersected across all transformations which we call reliable nearest neighbors. To our knowledge, this work presents the first instance of such an approach.

Extended Data Fig. 5 and the Supplementary Information give an overview of the datasets.

We tested 22 transformations—where applicable with an overdispersion fixed to 0, 0.05 and a gene-specific estimate from the data—across four to eight settings for the number of dimensions of the PCA and measured the overlap with k = 10, 50 and 100 nearest neighbors. In total, we collected more than 61,000 data points. In addition to the results highlighted in the following, we provide an interactive website with all results for all tested parameter combinations.

Figure 2 shows the aggregated results for the three benchmarks for k = 50. Similar results were obtained for k = 10 and k = 100, shown in Extended Data Fig. 6.

Fig. 2: Benchmark results.
figure 2

a, Overlap between the k-NN inferred separately on two halves of the data. The colored points show the averages across ten datasets, each with five replicate random data splits (small, gray points). b, Overlap between k-NN inferred from simulated data and ground truth, using five simulation frameworks and five replicates per framework. c, Overlap between a reference k-NN graph (inferred using all transformations on deeply sequenced data and taking the intersection) and the k-NN inferred on data downsampled to match typical 10x data (5,000 counts per cell) for five datasets with five replicates each. To compare and aggregate results across the different datasets, Relative overlap (ac), which was computed by dividing, for each dataset, the overlap by its average across all transformations, fixing k = 50 and using a dataset-specific number of PCA dimensions (Extended Data Fig. 7 shows the underlying, unaggregated data). d, Overlap (y axis) as a function of PCA dimensions (x axis); the different transformation types are indicated by the colors, using the same palette as in ac. The performance of Sanity Distance is shown as a dashed line. Source data are provided.

Source data

In the consistency benchmark, the delta method-based transformations performed better than the other transformations (Fig. 2a).

On the simulated data, the differences between the transformations looked less pronounced in Fig. 2b than for the other two benchmarks; however, this is a result of the aggregated view. For each particular simulation framework, large differences between the transformations appeared, but the results varied from simulation to simulation framework (Extended Data Fig. 7b) and averaged out in the aggregated view.

The results of the downsampling benchmark (Fig. 2c) agreed well with the trends observed in the simulation and the consistency benchmark. This benchmark is particularly informative because the data had realistic latent structures that were reliably detectable through the high sequencing depth. The downsampling produced data that resembles the more common 10x data in many characteristics: for example, UMIs per cell, proportion of zeros in the data and mean–variance relationship (Supplementary Table 1 and Supplementary Fig. 1). The main difference was that the suitable (high sequencing depth per cell) datasets we could access mostly consisted of only a few hundred cells, except for the 4,298-cell short-interfering RNA KD dataset (Extended Data Fig. 5).

The results in Fig. 2 are on a relative scale, which hides the magnitude of the differences. In Extended Data Fig. 7, we show the underlying results for each dataset on an absolute scale. The range of k-NN overlaps was dataset dependent, ranging from 34 of 50 for the best performing transformation versus 9 of 50 for the negative control for the SUM149PT cell line consistency benchmark, to 2.9 of 50 versus 1.5 of 50 for the HEK downsampling benchmark. For the latter, the overall small overlaps were due to small sets of reliable nearest neighbors (Extended Data Fig. 8a,b). We also ran a version of the downsampling benchmark that only used the top two transformations per approach (Extended Data Fig. 8c,d), which increased the number of reliable nearest neighbors and confirmed the trends we saw in the full version.

In addition to the k-NN overlap with the ground truth, we also calculated the adjusted Rand index (ARI) and the adjusted mutual information (AMI) for the five simulation frameworks. Extended Data Fig. 9a,b shows the aggregated results, which were similar to the results for the k-NN overlap (Fig. 2b). Extended Data Fig. 9c,d show that the ARI and AMI had a larger dynamic range than the k-NN overlap for datasets with a small number of distinct clusters; however, for datasets with a complex latent structure, the k-NN overlap was more informative, which may reflect limitations of ARI and AMI to assess the recovery of gradual changes typical for many biological tissues.

The Random Walk simulation reproduced the benchmark based on which Breda et al.14 argued that Sanity was the best method for identifying the k-NN of a cell (Fig. 5a of their paper). We found that the delta method-based and residuals-based variance-stabilizing transformations performed as well in this benchmark if we projected the cells to a lower-dimensional representation before constructing the k-NN graph. In fact, Fig. 2d shows for four example datasets that the number of dimensions for the PCA was an important determinant of performance. This is because the dimension reduction acts as a smoothener, whose smoothing effect needs to be strong enough to average out uncorrelated noise (small enough target space dimension), but flexible enough to maintain interesting variation (large enough target space dimension).

The latent expression-based transformations (except Normalisr) and the count-based factor analysis models were computationally more expensive than the delta method- and residuals-based transformations. Figure 3a shows the CPU times for calculating the transformation and finding the k-NN on the 10x human helper T-cell dataset with 10,064 cells. Sanity Distance took particularly long because its distance calculation, which takes into account the uncertainty for the nearest neighbor search, scaled quadratically with the number of cells (Fig. 3b). Across all benchmarks, the computations took 24 years of CPU time, of which the latent expression-based transformations accounted for over 90%. The delta method-based transformations were the fastest, especially if the overdispersion was not estimated from the data. The residuals-based transformations took somewhat more time, except for the analytic approximation of the Pearson residuals, which could be calculated almost as fast as the shifted logarithm. In terms of memory consumption, the delta method-based transformations were most attractive because they retained the sparsity of the data.

Fig. 3: Computational expense.
figure 3

a, CPU time needed to calculate the transformation and identify the k-NNs for the 10x human helper T-cell dataset. The secondary axis shows the duration relative to the shifted logarithm. b, Dependence of run time on the number of cells, across datasets, shown on a double-logarithmic scale, with a linear fit. Most transformations have a slope of approximately 1 (scale linearly), whereas Sanity Distance and GLM PCA have a slope > 1.5, which indicates quadratic scaling. Source data are provided.

Source data

In terms of uncovering the latent structure of the datasets, none of the other transformations consistently outperformed the shifted logarithm (Fig. 4a), one of the simplest and oldest approaches. In fact, when followed by PCA dimension reduction to a suitable target dimension, the shifted logarithm performed better than the more complex latent expression-based transformations across all three benchmarks.

Fig. 4: Comparison of selected transformations.
figure 4

ac, 95% confidence intervals (based on n = 5 replicates over five (simulation, downsampling) / ten (consistently) independent datasets) of the differences of the relative k-NN overlap between selected transformations as shown in Fig. 2. Shifted logarithm and Pearson residuals against a selection of the best performing transformations from the other preprocessing approaches (a). Effect of applying various post-processing methods after applying the shifted logarithm or Pearson residuals transformation (b). Effect of fixing the overdispersion to 0 or 0.05, or estimating a gene-specific overdispersion from the data (c). d, Smoothed line plots of the k-NN overlap (y axis) as a function of the UMIs per cell (x axis) for the shifted logarithm transformation, the Pearson residuals, Sanity MAP and GLM PCA are colored by the respective transformation approach. Source data are provided.

Source data

We found no evidence that additional post-processing steps (rescaling the output of the shifted logarithm, selecting HVGs or equalizing the variance of all genes using z scoring) improved the results for identifying nearest neighbors (Fig. 4b). Lause et al.20 and Choudhary and Satija25 debated on how to best choose the overdispersion parameter. We found empirically that, for Pearson residuals and the \({{{\rm{acosh}}}}\) transformation, it is beneficial to have α > 0, but saw no clear benefits from estimating this parameter from the input data versus using a generic, fixed value such as 0.05 (Fig. 4c).

Last, we found that with increasing sequencing depth per cell, all methods generally had a better k-NN overlap with the ground truth (Fig. 4d). This makes intuitive sense; with higher sequencing depth, the relative size of the sampling noise is reduced. Based on Fig. 1a, we might assume that delta method-based transformations would perform particularly poorly at identifying the neighbors of cells with extreme sequencing depths; yet on three datasets, the shifted logarithm did not perform worse than other transformations for cells with particularly large or small size factors (Fig. 4d). We also considered the performance of the transformations as a function of cluster size (Extended Data Figs. 10); while we saw some interesting variation, we did not find that a single transformation performed consistently better or worse for small clusters.

Discussion

We compared 22 transformations, conceptually grouped into four basic approaches, for their ability to recover latent structure among the cells. We found that one of the simplest approaches, the shifted logarithm transformation \(\log (y/s+{y}_{0})\) with y0 = 1 followed by PCA, performed surprisingly well. We presented theoretical arguments for using the related \({{{\rm{acosh}}}}\) transformation or an adaptive pseudo-count y0 = 1 / (4α), but our benchmarks showed limited performance benefits for these.

We recommend against using CPM as input for the shifted logarithm. We pointed out that for typical datasets, this amounts to assuming an unrealistically large overdispersion and in our benchmarks this approach performed poorly compared to applying the shifted logarithm to size factor-scaled counts. We also advise against scaling the results of the shifted logarithm by the sum of the transformed values per cell as, for example, suggested by Booeshaghi et al.19. In our hands (Extended Data Fig. 1), this additional operation failed to remove the confounding effect of the sequencing depth (the authors’ stated motivation for it) and did not improve the k-NN recall performance.

The Pearson residuals-based transformation has attractive theoretical properties and, in our benchmarks, performed similarly well as the shifted logarithm transformation. It stabilizes the variance across all genes and is less sensitive to variations of the size factor (Extended Data Fig. 1b). The analytic approximation suggested by Lause et al.20 is appealing because it worked as well as the exact Pearson residuals but could be calculated faster. However, as seen in equation (4), the Pearson residuals-based transformation is affine linear when considered as a function per gene and this may be unsatisfactory for genes with a large dynamic range across cells. As an alternative, we considered randomized quantile residuals as a non-linear transformation, but found no performance improvement. This result exemplifies that choosing a transformation for conceptual reasons does not necessarily translate into better downstream analysis results.

The use of the inferred latent expression state as a transformation and count-based latent factor models are appealing because of their biological interpretability and mathematical common sense. In particular, Sanity Distance is appealing because it does not have any tunable parameters; however, all these transformations performed worse than the shifted logarithm with a reasonable range of PCA dimensions in our benchmarks and some of the transformations were exceptionally computationally expensive (for example, the median CPU time of Sanity Distance was 4,500-times longer than for the shifted logarithm).

Our results partially agree and disagree with previous studies. Germain et al.24 benchmarked many steps of a typical single-cell RNA-seq analysis pipeline, including a comparison of clustering results obtained after different transformations against a priori assigned populations. In line with our findings, they reported that dimension reduction was of great importance. They went on to recommended sctransform (Pearson residuals) based on its good performance on the Zhengmix4eq dataset, which is a mixture of peripheral blood mononuclear cells sorted by surface markers using flow cytometry; however, it is not clear how generalizable this result is and our benchmarks do not support such a singling out of that method. Lause et al.20 considered the related Zhengmix8eq dataset, into which they implanted a synthetic rare cell type by copying 50 B cells and appending ten genes exclusively expressed in the synthetic cell type. They used k-NN classification accuracy of the cell type averaged per cell type (macro F1 score; Fig. 5c of their paper) and averaged over all cells (online version of Fig. 5c). They found a performance benefit for the Pearson residuals over the shifted logarithm with the macro F1 score, but similar performance with regard to overall accuracy. The macro F1 score emphasizes the performance difference for the synthetic cell type, which seems somewhat construed and might not be a good model for most biologically relevant cell type and state differences. Instead of comparing clustering results to discrete cell type assignments, we have focused on the inference of the k-NN of each cell, with the expectation that this enables consideration of more subtle latent structures than well-separated, discrete cell types.

Pearson residuals- and delta method-based transformations weight genes differently; for example, Pearson residuals put more weight on lowly expressed genes than the delta method (Fig. 1b). This can lead to different downstream results, but our benchmarks did not indicate that any particular weighting is generally better; only that the delta method-based transformation produced more consistent results on the 10x datasets.

We did not evaluate the impact of alternative size factor estimators. We also did not consider how suitable a transformation is for marker gene selection, because we are not aware of a suitable metric to determine success, as the utility of a marker gene hinges on its biological interpretability. For a recent effort to compare different marker gene selection methods, see Pullin and McCarthy26.

Considerable research effort has been invested in the area of preprocessing methods for single-cell RNA-seq data. To our surprise, the shifted logarithm still performs among the best. Our bottom-line performance benchmark highlights current limitations of theoretical analysis of preprocessing methods, but also the utility of lower-dimensional embeddings of the transformed count matrix to reduce noise and increase fidelity. Interesting open questions include choosing among the many possible embedding methods and number of latent dimensions.

Methods

The delta method

The delta method is a way to find the s.d. of a transformed random variable.

If we apply a differentiable function g to a random variable X with mean μ, the s.d. of the transformed random variable g(X) can be approximated by

$${\mathbb{S}}{{{\rm{d}}}}[g(X\,)]\approx a\,{\mathbb{S}}{{{\rm{d}}}}[X\,],$$

where \(a=| {g}^{{\prime} }(\mu )|\) is the slope of g at μ.

Now consider a set of random variables X1, X2, … whose variances and means are related through some function v, that is, \({\mathbb{V}}{{{\rm{ar}}}}[{X}_{i}]=v(\,{\mu }_{i})\), or equivalently \({\mathbb{S}}{{{\rm{d}}}}[{X}_{i}]=\sqrt{v(\,{\mu }_{i})}\). Then we can find a variance-stabilizing transformation g by requiring constant s.d., \({\mathbb{S}}{{{\rm{d}}}}[g({X}_{i})]={{{\rm{const.}}}}\), which using the above approximation becomes

$${g}^{{\prime} }(\,\mu )=\frac{{{{\rm{const.}}}}}{\sqrt{v(\mu )}},$$

and can be solved by integration.

Transformations

We compared 22 transformations that can be grouped into four approaches.

The delta method-based transformations were: the shifted logarithm (\(\log (y/s+1)\)); the acosh transformation (\({{\mathrm{acosh}}}\,(2\alpha y/s+1)\)); the shifted logarithm with pseudo-count dependent on the overdispersion (\(\log (\,y/s+1/(4\alpha ))\)); the shifted logarithm with CPM (\(\log ({{{\rm{CPM}}}}+1)\)); the shifted logarithm with subsequent size normalization as suggested by Booeshaghi et al.19 (xgc/uc, where \({x}_{gc}=\log ({y}_{gc}/{s}_{c}+1)\) and uc = ∑gxgc); the shifted logarithm with subsequent HVG selection (\(\log (y/s+1)\to {{{\rm{HVG}}}}\)); the shifted logarithm with subsequent z scoring per gene (\(\log (y/s+1)\to {{{\rm{Z}}}}\)); and the shifted logarithm with subsequent highly variable gene selection and z scoring per gene (\(\log (y/s+1)\to {{{\rm{HVG}}}}\to {{{\rm{Z}}}}\)). For all composite transformations, we first calculated the variance-stabilizing transformation, then chose the HVGs and used the results without recalculating the variance-stabilizing transformation.

To retain the sparsity of the output also if the pseudo-count y0 ≠ 1, transformGamPoi uses the relation

$$\log \left(\frac{y}{s}+{y}_{0}\right)=\log \left(\frac{y}{{y}_{0}\,s}+1\right)+\log {y}_{0}.$$
(6)

Subtracting the constant \(\log {y}_{0}\) from this expression does not affect its variance-stabilizing properties, but has the desirable effect that data points with y = 0 are mapped to 0.

The residuals-based transformations were: Pearson residuals implemented with the transformGamPoi package where each residual is clipped to be within \(\pm \sqrt{{{{\rm{no}}.\,{\rm{cells}}}}}\), as suggested by Hafemeister and Satija13 (Pearson); Pearson residuals with clipping and additional heuristics implemented by sctransform v.2, an analytic approximation to the Pearson residuals with clipping suggested by Lause et al.20 (Analytic Pearson); randomized quantile residuals implemented by transformGamPoi (Random Quantile); Pearson residuals without clipping implemented by transformGamPoi (Pearson (no clip)); Pearson residuals with clipping and subsequent HVG selection (Pearson → HVG); Pearson residuals with clipping and subsequent z scoring per gene (Pearson → Z); and Pearson residuals with clipping and subsequent HVG selection and z scoring per gene (Pearson → HVG → Z). For each composite Pearson residual transformation (with HVG and/or z scoring), we used the transformGamPoi implementation.

The latent expression-based transformations were: Sanity with point estimates for the latent expression (Sanity MAP) and with calculation of all cell-by-cell distances taking into account uncertainty provided by the posteriors (Sanity Distance); Dino as provided in the corresponding R package; and Normalisr with variance normalization, implemented in Python, which we called from R using the reticulate package.

The count-based factor analysis models were: GLM PCA using the Poisson model and the gamma-Poisson model with α = 0.05. In the figures, we show the results for the Poisson model unless otherwise indicated. We used the avagrad optimizer. We ran NewWave with 100 genes for the mini-batch overdispersion estimation.

For the delta method-based transformations and the residuals-based transformations calculated with the transformGamPoi package, we calculated the size factor s using equation (3).

We defined HVGs as the 1,000 most variable genes based on the variance of the transformed data.

For z scoring, we took the transformed values xgc = g(ygc) and computed \({z}_{gc}=\frac{{x}_{gc}-{{\mathrm{mean}}}\,({{{{\boldsymbol{x}}}}}_{g})}{\sqrt{{{\mathrm{var}}}\,({{{{\boldsymbol{x}}}}}_{g})}}\), where mean and variance are the empirical mean and variance taken across cells.

In the overview figures (Figs. 24), we use a gene-specific overdispersion estimate for all residuals-based transformations and for the delta method-based transformations, which can handle a custom overdispersion; for GLM PCA, we use α = 0, because these settings worked best for the respective transformations. The latent expression-based transformations and NewWave do not support custom overdispersion settings.

Conceptual differences

For the visualization of the residual structure after adjusting for the varying size factors, we chose a control dataset of a homogeneous RNA solution encapsulated in droplets21. We filtered out RNAs that were all zero and plotted the first two principal components. Where applicable, we used gene-specific overdispersion estimates. For visualizing the results of Sanity Distance, instead of the PCA, we used multidimensional scaling of the cell-by-cell distance matrix using R’s cmdscale function. We calculated the canonical correlation using R’s cancor function on the size factors and the first ten dimensions from PCA and multidimensional scaling.

The plots of the mean–variance relationship are based on the 10x human hematopoietic cell dataset27. Where applicable, we used the gene-specific overdispersion estimates. The panel of Sanity Distance shows the variance of samples drawn from a normal distribution using the inferred mean and s.d.

For the mouse lung dataset28, we filtered out cells with extreme size factors (0.1smedian < sc < 10smedian, where smedian is the median size factor). We also removed cells that did not pass the scran quality control criterion regarding the fraction of reads assigned to mitochondrial genes. To account for the fact that some transformations share information across genes, we applied all transformations to the 100 most highly expressed genes and three genes (Sftpc, Scgb1a1 and Ear2) known to be differentially expressed in some cell types according to the assignment from the original publication.

Benchmarks

The benchmarks were executed using a custom work scheduler for slurm written in R on CentOS7 and R 4.1.2 with Bioconductor v.3.14. The set of R packages used in the benchmark with exact version information was stored using the renv package and is available from the GitHub repository.

k-NN identification and dimensionality reduction

To calculate the PCA, we used the irlba package. To infer the k-NN, we used annoy, which implements an approximate nearest neighbor search algorithm. To calculate t-distributed stochastic neighbor embeddings (tSNEs), which we only used for visualization, we used the Rtsne package on data normalized with the shifted logarithm with a pseudo-count of 1.

Consistency benchmark

We downloaded ten single-cell datasets listed in GEO browser after searching for the term mtx on 14 October 2021. All datasets are listed in the Data Availability section. To measure the consistency of the transformations, we randomly assigned each gene to one of two groups and processed the two resulting data subsets separately. We calculated the consistency as the mean overlap of the k-NN for all cells.

Simulation benchmark

We used five frameworks to simulate single-cell counts in R: we ran dyngen29 using a consecutive bifurcating mode and the default parameters otherwise. We ran muscat30 with four clusters, a default of 30% differentially expressed genes with an average log fold change of 2 and a decreasing relative fraction of log fold changes per cluster. We ran scDesign2 (ref. 31) with the 10x human hematopoietic cell dataset as the reference input with a copula model and a gamma-Poisson marginal distribution. We simulated the Random Walk by translating the MATLAB code of Breda et al.14 to R and using the data by Baron et al.32 as a reference. For the Linear Walk, we adapted the Random Walk simulation and, instead of following a Random Walk for each branch, we interpolated the cells linearly between a random start and end point. For both benchmarks, we used a small non-zero overdispersion of α = 0.01 to mimic real data.

With each simulation framework, we knew which cells were the k-NNs to each other. We calculated the overlap as the mean overlap of this ground truth with the inferred nearest neighbors on the simulated counts for all cells. Furthermore, we calculated the ARI and AMI by clustering the ground truth and the transformed values with the graph-based walktrap clustering algorithm from the igraph package.

Downsampling benchmark

We searched the literature for single-cell datasets with high sequencing depth and found five (one from mcSCRB, four from Smart-seq3) that had a sequencing depth of more than 50,000 UMIs per cell on average. We defined reliable nearest neighbors as the set of k-NNs of a cell that were identified with all 22 transformations on the deeply sequenced data (excluding the two negative controls). We used the downsampleMatrix function from the scuttle package to reduce the number of counts per cell to approximately 5,000, a typical value for 10x data. We considered only one setting for the overdispersion per transformation (instead of allowing multiple overdispersion settings for some transformations as in the other benchmarks). We ran all transformations that supported the setting, with a gene-specific overdispersion estimate (except GLM PCA, which performed better with an overdispersion fixed to 0). Finally, we computed the mean overlap between the k-NNs identified on the downsampled data with the set of reliable nearest neighbors for all cells with more than one reliable nearest neighbor.

k-NN overlap

For all three benchmarks, we calculated overlaps between pairs of k-NN graphs. Denoting their no. cell × no. cell adjacency matrices (a matrix of zeros and ones, where an entry is is one if a cell d is among the k-NNs of cell c) by N1 and N2, we defined their overlap as

$$\frac{1}{{{{\rm{No}}.\, {\rm{cells}}}}}\mathop{\sum }\limits_{c,d=1}^{{{{\rm{No}}.\,{\rm{cells}}}}}{N}_{cd}^{1}{N}_{cd}^{2}.$$
(7)

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.