A balanced measure shows superior performance of pseudobulk methods in single-cell RNA-sequencing analysis

Arising From Zimmerman, K. D., Espeland, M. A. & Langefeld, C. D. Nature Communications (2021). https://doi.org/10.1038/s41467-021-21038-1 Summary Recently, Zimmerman et al.,1 proposed the use of mixed models over pseudobulk aggregation approaches, reporting improved performance on a novel simulation approach of hierarchical single-cell expression data. However, their reported results could not prove the superiority of mixed models as they are based on separate calculations of type 1 (performance of the models on non-differentially expressed genes) and type 2 error (performance on differentially expressed genes). To correctly benchmark the models, a reanalysis using a balanced measure of performance, considering both the type 1 and type 2 errors (both the differentially and non-differentially expressed genes), is necessary. Contact Alan Murphy: a.murphy@imperial.ac.uk, Nathan Skene: n.skene@imperial.ac.uk Code availability The modified version of hierarchicell which returns the Matthews correlation coefficient performance metric as well as the type 1 error rates, uses the same simulated data across approaches and has checkpointing capabilities (so runs can continue from where they left off if aborted or crashed) is available at: https://github.com/neurogenomics/hierarchicell. The benchmarking script along with the results are available at: https://github.com/Al-Murphy/reanalysis_scRNA_seq_benchmark.


Introduction
The simulation approach of hierarchical single-cell expression data developed by Zimmerman et al., 1 (hierarchicell) is used to generate non-differentially expressed genes upon which performance is evaluated using the type 1 error rate; the proportion of non-differentially expressed genes indicated as differentially expressed by a model. The authors determined that pseudobulk methods are "overly conservative" relative to mixed models however this can not be determined based on this analysis tool. In single-cell expression analysis, a conservative model does a poor job of capturing truly differentially expressed genes. Thus, such a model would have a high type 2 error rate. Certain methods' 1-type 2 error or power were calculated on specific simulated dataset sizes in Figure 3 and Supplementary Figures 5-11 of the authors' analysis. Importantly though, a systematic analysis of the models' type 2 error rates was not reported, therefore, the authors' statement cannot be concluded.
Considering the systematic analysis of the type 1 error results reported by Zimmerman et al., 1 across the 20,000 iterations of 5 to 40 individuals and 50 to 500 cells at a p-value cut-off of 0.05, it was observed that pseudobulk approaches, in fact, have the lowest type 1 error at every iteration (Supplementary Figure 1). However, as previously outlined, we need a balanced measure of performance that considers both type 1 and type 2 error rate to correctly benchmark the models.

Main
Here, we modified Zimmerman et al.'s hierachicell approach to simulate both differentially expressed and non-differentially expressed genes. The differentially expressed genes were randomly simulated with a fold change between 1.1 and 10. We tested the models using the Matthews Correlation Coefficient (MCC) giving a balanced measure of performance as well as the type 1 error. MCC is a well-known and frequently adopted metric in the machine learning field which offers a more informative and reliable score on binary classification problems 2 .
MCC produces scores in [-1,1] and will only assign a high score if a model performs well on both non-differentially and differentially expressed genes. Moreover, MCC scores are proportional to both the size of the differentially and non-differentially expressed genes, so it is robust to imbalanced datasets. Furthermore, hierarchicell uses R's pseudo-random number functionality when generating the single-cell expression data, meaning each iteration will result in a different simulated dataset. However, Zimmerman et al.'s approach did not account for this in their benchmarks thus, their comparisons were not based on the same data. We further modified hierachicell to use the same simulated data for all models, enabling a fair comparison.
Our analysis demonstrates that pseudobulk approaches are the best performing across all number of individuals and cells variations (Figure 1). There is one exception for sum pseudobulk which performs worse than Tobit at 5 individuals and 10 cells.  In real datasets there would never be equal numbers of cells in each sample. To mirror this, we simulated data with an imbalanced number of cells between case and controls. Pseudobulk mean outperformed all other approaches on this analysis (Supplementary Figure 2). The pseudobulk approach which aggregated by averaging rather than taking the sum appears to be the top performing overall, however, it is worth noting that hierarchicell does not normalise the simulated datasets before passing to the pseudobulk approaches. This is a standard step in such analysis to account for differences in sequencing depth and library sizes 5  Therefore, our results on sum and mean pseudobulk extend their findings and indicate that mean aggregation may be the best performing. However, the reader should be cognisant that the lack of a normalisation step based on the flaw in the simulation software strategy likely causes the increased performance of mean over sum aggregation. Further, the use of simulated datasets in our analysis may not accurately reflect the differences between individuals that are present in biological datasets. Thus, despite both our results and those reported by Squair et al., there is still room for further analysis, benchmarking more models, including different combinations of pseudobulk aggregation methods and models, on more representative simulated datasets and biological datasets to identify the optimal approach. Specifically, we would expect pseudobulk sum with a normalisation step to outperform pseudobulk mean since it can account for the intra-individual variance which is otherwise lost with pseudobulk mean but this should be tested, including on imbalanced datasets.

Conclusion
In conclusion, our results demonstrate that pseudobulk approaches are far from being too conservative and are, in fact, the best performing models based on this simulated dataset for the analysis of single-cell expression data.