Abstract
Cells from the same individual share common genetic and environmental backgrounds and are not statistically independent; therefore, they are subsamples or pseudoreplicates. Thus, singlecell data have a hierarchical structure that many current singlecell methods do not address, leading to biased inference, highly inflated type 1 error rates, and reduced robustness and reproducibility. This includes methods that use a batch effect correction for individual as a means of accounting for withinsample correlation. Here, we document this dependence across a range of cell types and show that pseudobulk aggregation methods are conservative and underpowered relative to mixed models. To compute differential expression within a specific cell type across treatment groups, we propose applying generalized linear mixed models with a random effect for individual, to properly account for both zero inflation and the correlation structure among measures from cells within an individual. Finally, we provide power estimates across a range of experimental conditions to assist researchers in designing appropriately powered studies.
Introduction
The rapid evolution of singlecell technologies will enable novel interrogation of fundamental questions in biology, accelerating discoveries across many biological disciplines. Common fields of application for singlecell technologies include cancer, neurological disease, developmental biology, diabetes, and autoimmune disease. Thus, researchers are developing methods that leverage or account for the unique properties of singlecell RNA sequencing (scRNAseq) data, particularly their increased sparseness and heterogeneity compared to bulk sequencing counterparts^{1,2,3}. An important characteristic of singlecell experiments is that they use many cells from the same individual, and therefore the same genetic and environmental background. Here we empirically document correlation among measures from cells within an individual, and demonstrate how testing for differential expression analysis in scRNAseq data within a cell type across conditions without considering this correlation — the current common practice — violates fundamental assumptions and leads to false conclusions. While differential expression analysis can be computed across all cell types, throughout this manuscript, differential expression analysis is generally considered to be computed within a specific cell type of interest (i.e., after cell clustering and cell type identification).
Proper identification of the experimental unit (i.e., the smallest observation for which independence can be assumed) for the hypothesis is critical for proper inference. Observations nested within an experimental unit are referred to as subsamples, technical replicates, or pseudoreplicates. Pseudoreplication, or subsampling, is formally defined as “the use of inferential statistics where replicates are not statistically independent”^{4}. There are two types of pseudoreplication commonly occurring in singlecell experiments: simple and sacrificial. Simple pseudoreplication occurs when “samples from a single experimental unit are treated as replicates representing multiple experimental units”^{4,5,6}. Sacrificial pseudoreplication occurs when “samples taken from each experimental unit are treated as independent replicates”^{4,5,6}. Pseudoreplication has been addressed repeatedly in ecology, agriculture, psychology, and neuroscience and acknowledged as one of the most common statistical mistakes in scientific literature^{4,5,6,7,8,9}. New technologies are particularly prone to this error. Thus, it is not surprising that, when performing a literature review prior to conducting these analyses, we found pseudoreplication to be pervasive in the singlecell literature. Properly identifying the right experimental unit, and analyzing the data accordingly, need to be urgently addressed in singlecell studies before a lack of reproducibility tarnishes the singlecell technology itself as potentially unreliable.
In this study, we simulate hierarchical singlecell expression data and evaluate the type 1 error rates and power of mixed models relative to some of the most frequently applied differential expression methods. We assess a number of commonly applied differential expression methods, but we primarily focus on the computation of a twopart hurdle model. This model explicitly accounts for the common problem of zero inflation in scRNAseq data by simultaneously modeling the rate of expression and the positive expression mean^{10}. Using the twopart hurdle model, we compute differential expression as it is most typically applied in the literature: without a random effect for individual. We then reevaluate the twopart hurdle model’s performance when computing differential expression with a random effect for individual, and after applying a batch effect correction for individual. Additionally, we examine the type 1 error control and power of aggregation (i.e., pseudobulk) methods, where gene expression values are averaged across all cells within an individual and the test statistic is computed on the individual means^{11,12,13}. Aggregation methods are implemented to control for both zeroinflation and withinsample correlation, but are conservative in unbalanced situations, which are common in singlecell data. Overall, these simulations indicate how properly accounting for the correlation structure among measures from cells within an individual will greatly increase both robustness and reproducibility, thereby leveraging the very features that make singlecell methods powerful.
Results
Intraindividual correlation
We hypothesize that measures from cells from the same individual should be more (positively) correlated with each other than cells from unrelated individuals. Empirically, this appears true across a range of cell types (Fig. 1). We demonstrate this effect by estimating the pairwise correlation of cells within an individual and across any two different individuals. For a given cell type, the correlation of cells within an individual (intraindividual correlation) is always higher than the correlation of cells across individuals (interindividual correlation). Thus, singlecell data have a hierarchical structure in which the single cells may not be mutually independent and have a studyspecific correlation (e.g., exchangeable correlation within an individual). Within a cell type, cells appear to also exhibit some correlation across individuals (Fig. 1). We hypothesize this is due to stability in functional gene expression that is needed for a cell to be classified as a specific type (e.g., Tcells need to have some consistent signals of gene expression related to their function as Tcells).
Simulation
We completed a simulation study that reproduced both the inter and intraindividual variance structures estimated from real data and documented the effect of intraindividual correlation on the type 1 error rates of the most frequently used singlecell analysis tools (Fig. 2 and Supplementary Figs. 1–4). Our simulation captures some of the most important aspects of singlecell data (Supplementary Figs. 1–4) and was used to compare methods that do and do not account for the repeated observations within an experimental unit (see Methods). We varied the number of individuals and cells within an individual, and all methods considered use asymptotic approximations and admit covariates.
Type 1 error evaluations
The generalized linear mixed model (GLMM), employing either a tweedie distribution or a twopart hurdle model with a random effect (RE) for the individual, outperformed other methods across a variety of conditions (Table 1 and Supplementary Tables 1–4).
Among the methods that explicitly model the correlation structure, GLMM consistently better controlled for type 1 error rate than generalized estimating equation (GEE1) models. The latter performed poorly for all numbers of subsamples until the number of independent experimental units approached 30. However, all models that explicitly model the correlation structure have more appropriate type 1 error rates than methods that do not account for lack of independence among experimental units (Table 1 and Supplementary Tables 1–4). As the number of correlated cells rose, performance of all methods that treat observations independently grew increasingly worse (Table 1 and Supplementary Tables 1–4).
One of the most heavily cited singlecell analysis tools, modelbased analysis of singlecell transcriptomics (MAST), is a twopart hurdle model built to handle sparse and bimodally distributed singlecell data^{10}. Although to our knowledge no publications have employed MAST to account for pseudoreplication as discussed here, Finak et al. note that MAST “can easily be extended to accommodate random effects”^{10}. When implementing MAST with a random effect for individual (i.e., MAST with RE), the type 1 error rate is wellcontrolled. However, its type 1 error rate is just as inflated as other tools when it is not implemented with a random effect for individual. However, one suggested approach to account for withinindividual correlation is the aggregation of celltypespecific expression values within an individual by using either a sum or a mean^{11,12,13}. Such analysis methods, as would be expected, do control for the type 1 error rate, but are conservative (Table 1 and Supplementary Tables 1–4).
Another method that could be used to account for withinsample correlations is to apply a batch effect correction method prior to differential expression, for which the batches are different individuals. Here, when we applied batch effect correction via ComBat^{14} prior to differential expression analysis within a cell type, type 1 error rates markedly increased (Table 1 and Supplementary Tables 1–4).
In addition to evaluating type 1 error rates, we examined the preservation of the rankorder of results from these methods (Supplementary Table 5). We also evaluated the sensitivity (the proportion of correctly identified true positives) at varying fold changes of the twopart hurdle model when ignoring the withinindividual correlation (MAST), correcting it for “batch effect” prior to differential expression (MAST ComBat), or correcting it with a random effect for individual (MAST RE) (Supplementary Fig. 5). We did not explicitly evaluate specificity (the proportion of correctly identified true negatives), which is simply computed as 1type error. Thus, when the type 1 error rate for a method is inflated, the specificity is small. We found the highest correlations between the absolute value of the simulatedlog(foldchange) and the methods that properly account for withinperson correlation. The methods that do not do so maintained some semblance of rankorder, except for batch effectcorrected results (Supplementary Table 5). As expected, not properly accounting for withinperson correlation leads to extremely high sensitivity with very low specificity (Supplementary Fig. 5).
Power analyses
We computed an extensive simulationbased power analysis to provide estimates across a wide range of experimental conditions. We used a twopart hurdle model with random effects for individuals as implemented in MAST^{10}. We also computed power when expression values are averaged across cells within an individual. Increasing the number of independent experimental units (e.g., individuals) in a study is the best way to increase power to detect true differences (Fig. 3a). Empirically, when sample sizes become greater than 20, there are only marginal gains in power when more than 100 cells per individual are sampled for a particular analysis unit (i.e., computing the analysis within a single cell type of interest or across all cell types). Methods that aggregate information across cells within an individual by averaging or summing (i.e., “pseudobulk” methods) are only slightly underpowered relative to mixedeffects models when there is balance in the number of cells per individual. However, these are not as well powered as mixedeffects models when the number of cells per individual grow increasingly imbalanced (Supplementary Figs. 6–8).
Discussion
Singlecell studies designed to identify differentially expressed genes rarely mention or address the correlation among cells from the same individual or experimental unit. Excellent reviews of the field and methodologic work have largely focused on challenges presented by properly classifying cell types, multimodality, dropout, and higher noise derived from biological and technical factors. However, they fail to highlight the effects of pseudoreplication. Furthermore, papers evaluating the performance of singlecellspecific tools all compute the simulations as if cells were independent^{15,16,17,18,19,20,21}. The result is reduced reproducibility with real data, leading to the conclusion that tools built specifically to handle singlecell data do not appear to perform better than tools created for bulk data analysis^{22,23,24}.
Here, we have empirically documented the correlation among measures from cells within an individual for a few independent datasets and different cell types (Fig. 1). These findings imply that the current practice — testing for differential expression analysis across conditions in scRNAseq data within a cell type without considering this correlation — leads to pseudoreplication. Pseudoreplication, formally defined as “the use of inferential statistics where replicates are not statistically independent”, has been addressed repeatedly in both new and wellestablished scientific fields^{4,5,6,7,8}. Recently, it was acknowledged as one of the most common statistical mistakes in scientific literature^{9}. Here we hope to address pseudoreplication in singlecell analyses by demonstrating what a large and longstanding body of statistical literature already confirms: applying statistical inference to replicates that are not statistically independent without properly accounting for their correlation structure will inflate type 1 error rates and lead to spurious results^{4,5,6,9,25,26,27,28}.
In our results, models that explicitly parameterized the correlation structure all showed improved type 1 error control compared to methods that did not account for the lack of independence among experimental units (Table 1 and Supplementary Tables 1–4). Furthermore, as the number of correlated cells increased, performance of all methods that treated observations independently increasingly worsened (Table 1 and Supplementary Tables 1–4).
Both the twopart hurdle mixed model and the tweedie mixed model showed type 1 error control when adjusting for individual as a random effect (i.e., MAST with RE/Tweedie GLMM), but their type 1 error rates were highly inflated when not doing so. These specific evaluations of models with and without a random effect for individual illustrate why accounting for pseudoreplication is so important. As the denominator of most statistical tests (e.g., Wald test) is a function of the variance, not accounting for the positive correlation among sampling units underestimates the true standard error and leads to false positives^{27,28}. In addition, treating each cell as independent inflates the test degrees of freedom, making it easier to falsely reject the null hypothesis (type 1 error). This approach leads to apparently greater sensitivity among methods that do not properly account for pseudoreplication (Supplementary Fig. 5). However, the type 1 error rates (and, thereby, specificity) indicate that, while these methods capture most results at the lower foldchanges, they also capture a large proportion of false positives. Too many false positive results can mask true associations, especially when multiple comparison procedures such as false discovery rate are applied. In combination, this will adversely affect downstream analyses (pathway analysis), robustness, and reproducibility — all increasing the cost of science.
One potential method to remove interindividual differences prior to analysis is to apply batch effect correction prior to differential expression analysis, where the batches are individuals. Using these techniques to correct for intraindividual correlation should, in fact, be used more often before celltype clustering, to more accurately identify cell types upstream of testing for differentially expressed celltype marker genes with mixed models. However, when using these methods prior to differential expression analysis within a cell type, they show markedly increased type 1 error rates (Table 1 and Supplementary Tables 1–4). In these analyses, we used the batch effect correction tool, ComBat^{14}. Many other batch effect correction tools are available. However, the underlying concept common to all is that they should not be applied to account for withinperson correlation when estimating the variance used in testing the hypothesis of differential expression. This is primarily because regressing out the personspecific effect as a batch effect and subsequently analyzing each cell as an independent observation will underestimate the overall variance by removing interindividual differences while maintaining an inappropriately large number of degrees of freedom when treating cells as if they are independent. In addition, applying a batch effect correction in this manner will greatly disturb the rankorder of results (Supplementary Table 5). However, batch correction methods are valuable when used as designed.
Another method recommended before analysis is to aggregate individual gene expression values within a person by summing or averaging them^{11,12,13}. Such approaches, labeled “pseudobulk” techniques, can be appropriate ways of handling the correlation structure, but averaging withinsubject measurements reduces the number of data points and loses information. The fewer data points and decreased confidence in the estimated population mean make this a conservative approach when the number of cells per individual are imbalanced. Averaging withinsubject measurements ignores withinsubject variability. Furthermore, when withinsubject variability is large with respect to betweensubject variability, averaging withinsubject measurements fails to estimate the true data variability^{26}.
Overall, mixedeffects models lead to the most accurate results when analyzing data with a hierarchical structure^{6,25,26}. As we demonstrate here, aggregating values across cells from the same experimental unit will actually lead to an increased type 2 error rate and decreased power (Table 1 and Supplementary Figs. 6–8). This is due to an overestimation of the meansquare error relative to mixedeffects models, particularly when imbalance exists and the intraindividual variance is larger than the interindividual variance, as appears to be typical with scRNAseq data^{25,26}. In imbalanced situations, “pseudobulk” methods also cause cells from individuals with fewer cells to be more heavily weighted, where mixed models have consistent estimators and do not require balanced data^{29}. Taking the mean across all cells within an individual will also cause problems when used together with tools that require integers for the input. In addition, taking the mean in this situation generates heterogeneity in certainty of the estimates of the mean — that is, the data are not independent and identically distributed. Errorsinvariables regression should be explored as a means of accounting for this heterogeneity^{30}.
The twopart hurdle model also has the ability to test for differences in both the proportion of zeros and the magnitude of effect across treatment groups separately. In some scenarios, combining the data will cause a significant difference in the magnitude of effect to be washed out by a significant difference in the proportion of zeros or vice versa (i.e., Simpson’s paradox). Such scenarios occur infrequently in our simulated data, which holds the probability of zeros constant across treatment groups, but such scenarios may be more common in real data. “Pseudobulk” techniques may control the type 1 error rates and may help account for zeroinflation; however, we recommend mixedeffects models based on longstanding statistical justifications for the analysis of subsamples, including increased power. This is particularly so in scenarios with heavy imbalance and where withinsubject variability is large compared to betweensubject variability.
Among the methods that explicitly model the correlation structure, GLMMs consistently better controlled for type 1 error rate than GEE1 models. Here, GEE1 exhibited elevated type 1 error rates for any number of subsamples until the number of independent experimental units approached 30. When the number of experimental units was small, the GEE1 sandwich estimator of the variance provided standard errors that were too small and therefore inflated the type 1 error rate^{31,32}.
Here, we emphasize the twopart hurdle mixed model, implementable in MAST, as an already wellestablished tool in the field. We demonstrate that this implementation performs exceptionally well when adjusting for individual as a random effect^{10}. MAST with RE is testing a twopart hypothesis that the other tools are not directly testing. The discrete and continuous components being tested fit together — meaning that higher mean expression will generally correlate with a higher proportion of expressing cells, but assuming that the two will always relate is incorrect. There will be specific instances in the simulated data when the interindividual means are not significantly different, but the proportion of cell dropout is significantly different (even though the probability of dropout for any one gene across cells is held constant) and is driving a significant result. This will be particularly true with smaller sample sizes, and may contribute to the slightly elevated type 1 error rates with smaller sample sizes and cell counts.
While we recommend computing differential expression analysis using MAST with RE, alternative methods include using MAST with fixedeffects for individual, a tweedie GLMM, or permutation testing. Accounting for the withinsample correlation with a fixedeffect term for individual, will have a slight difference in interpretation, but should be considered an alternative option to random effects — particularly when the number of independent experimental units is modest^{33}. To not violate the exchangeability assumption, permutation methods must randomize at the independent experimental unit (e.g., individual) and properly account for covariates (i.e., conditional permutation). The tweedie GLMM method, which was selected for its distributional flexibility that can account for zero inflation, could be implemented using the “glmmTMB” Rpackage^{34}, but other mixed models could also be applied using a more appropriate distribution if, for example, the observed data do not exhibit zero inflation. However, none of these alternative approaches explicitly incorporates some singlecellspecific concepts implemented in MAST (e.g., cellular detection rate).
We computed power analyses for a variety of sample sizes and cell amounts. Empirically, we demonstrated that increasing the amount of cells captured per sample returns very little gain in power after 100 cells per individual in most scenarios, particularly after sample sizes increase (Fig. 3 and Supplementary Figs. 9–11). Instead, we suggest that increasing sample size is the most efficient way to improve power (Fig. 3a). Increasing the number of cells per individual does provide more precision in the estimate for an individual. However, it has limited effects on the power for detecting differences across individuals, such as differences among treatments applied to individuals (i.e., case/control studies). Estimating power with more than 1000 cells per individual is computationally expensive; run times for the random effects models are significantly higher than the other tools (Supplementary Table 6). Because using thousands of cells per individual is not atypical for singlecell experiments, tools that account for the correlation structure when analyzing these data need to be further developed to increase computational efficiency (e.g., parallel code, use of GPU). In addition, while mixed models will compute for only two individuals per treatment group, many genes will fail due to complete separation when the sample size is so small, especially when the number of cells per individual is also small (<50).
Most papers compare cells across very few individuals, sometimes even a single case and control (simple pseudoreplication). In the former case, the estimate of the interindividual variance is possible, but has wide bounds on parameter confidence intervals; in the latter case the variance is not estimable from the data. Simulations indicate that most published studies are underpowered (Fig. 3 and Supplementary Figs. 9–11).
Most singlecell papers show a deep understanding of the underlying biology and conduct otherwise very informative experiments, appropriately published in highvisibility journals. However, our type 1 error and power simulations document that many such studies are missing important true effects while reporting too many false positive effects generated via pseudoreplication.
As singlecell technology continues to evolve and costs decrease, scientists need to be aware of this issue to improve study design and avoid proliferation of irreproducible results. Our results encourage the use of mixed models, such as the twopart hurdle model with a random effect (e.g., as implemented in MAST with RE), as a way to account for repeated observations from an individual while being able to adjust for covariates at the individual level and, if appropriate, at the individual cell level. Additional random effects, such as sampling time, may also be included^{35}. Our extensive simulation study provides valuable information for understanding the power of specific designs and can be used in grant reviews as one justification of the design and analyses employed. A limitation of our simulation engine here is that it was modeled using platebased data. However, we demonstrate that dropletbased singlecell data also contain a hierarchical correlation structure (Fig. 1). Although our focus here is on hypothesis testing for finding differentially expressed genes within a cell type across conditions, the concept is applicable when comparing expression patterns between cell types and is broadly appropriate for all singlecell sequencing technologies such as proteomics, metabolomics, and epigenetics. In addition, there are numerous fields such as cancer, neurological disease, developmental biology, diabetes, and autoimmune disease, in which these results apply.
Methods
Literature review
A PubMed search in January 2019 for the keywords “singlecell differential expression” returned 251 articles published in the last 3 years; these were subsequently sorted and filtered by each of their abstracts. Many of the returned articles were associated with bulk RNA sequencing or completely irrelevant to differential expression analyses in single cells and were therefore eliminated. Of the 251 original hits, 76 were deemed appropriate for further consideration. Of those, 10 were reviews, 36 were methods papers, and 30 were implementation papers. Each of the methods and implementation articles was thoroughly reviewed along with its number of citations, date of publication, and any other pertinent information, such as number of independent samples, tools used, or number of cells captured.
Intra and intercorrelation analyses
We made pairwise comparisons between all cells of interest to compute intra and interindividual correlations. Genes were removed if the average transcriptpermillion (TPM) value was equal to zero. To control for the correlation structure between genes, genes were sampled one at a time, and any genes with a Spearman’s correlation coefficient >0.25 relative to the gene that was drawn were subsequently trimmed from the dataset. This step was repeated until either no more uncorrelated genes remained or a total of 500 uncorrelated genes were obtained, whichever happened first. For intraindividual correlations, Spearman’s correlation was computed for all possible pairs of cells within an individual. For interindividual correlations, Spearman’s correlation was computed for all possible pairs of cells from a random draw of one cell from each individual. We computed 1000 draws. To compute the correlation structure across multiple cell types, intraindividual correlations were assessed by repeatedly drawing one cell per cell type within an individual and computing all pairwise correlations. Interindividual correlations were assessed by dividing the data into a balanced set of observations, with 10 cells of each of the three main cell types retained for each individual. The intra and interindividual correlations and their means were examined for differences (Fig. 1). The measures were compared in ten different cell types across four different singlecell studies. These studies are all publicly available (accession numbers GSE81861, GSE72056, EMTAB5061, and EGAS00001004082). The GSE81861 dataset contains 161 normal mucosal cells from 6 individuals and were sequenced on Fluidigm’s C1. All individuals were patients with colorectal cancer, but tissues were taken from healthy mucosa. The GSE72056 data were also sequenced on Fluidigm’s C1. The dataset used here contains 337 Bcells from tumor tissues of 11 individuals (all with melanoma) and 1186 Tcells from tumor tissues of 17 individuals (also all with melanoma). EMTAB5061 data were sequenced using the SmartSeq2 protocol and contains pancreatic cells taken from 10 individuals, 6 healthy controls, and 4 with type 2 diabetes. Here, only data from 886 alpha cells, 270 beta cells, and 336 ductal cells were used. The EGAS00001004082 dataset contains 77,969 cells sampled from the respiratory tract of 10 healthy donors. The singlecell capture of these data was carried out using the 10X Genomics Chromium device (3′ V2). For these analyses, we used 24,138 basal cells, 2722 endothelial cells, 2417 macrophages, and 8322 multiciliated cells. Cell type designations were as given by the authors of these studies. More details are provided in their respective papers^{36,37,38,39}.
Simulation engine
A simulation engine was designed to simulate independent genes to approximate the hierarchical structure of real data by empirically estimating the range of parameters (i.e., grand mean of the TPM values, withinsample variance, betweensample variance, relationship between the grand mean and dispersion, and dropout) that define the observed distribution of TPM values for a gene. To estimate these parameters, genes were pruned to a set of uncorrelated genes that captured the most representative patterns of detectable TPM values, without the resulting parameter estimates being primarily driven by dropout. Specifically, genes were sequentially sampled one at a time; any other genes with transcript abundances that correlated (Spearman’s correlation coefficient >0.25) with the gene were removed. To estimate the grand means independently from the hierarchical correlation structure, we estimated the grand means by sampling one cell from each individual and computing the mean TPM value 1000 times. The mean of each of those means was used to approximate the grand mean.
To approximate the variance of the withinsample means (interindividual variance), the means of all nonzero TPM values were computed across all cells within each individual and the variance between those values was subsequently computed. To estimate the withinsample dispersion values, the nonzero TPM values were first used to compute a withinsample variance and withinsample mean. Consistent with the classical definition of the negative binomial distribution’s dispersion parameter, the withinsample dispersion parameter was then computed as:
where α_{ij} represents the dispersion parameter, μ_{ij} represents the withinsample mean, and \(\sigma _{ij}^2\) represents the withinsample variance for gene i and individual j.
The grand means and variances were computed empirically from the TPM values previously reported in six different cell types across three different singlecell studies^{36,37,38}. Once consistent patterns were identified across cell types, alpha cells from the pancreatic cell dataset were used as the model data for our simulation. A gamma distribution was fit to the global mean of the TPM values of each gene using maximumlikelihood estimation. For each independently simulated gene i, a random value was sampled from this gamma distribution to obtain a grand mean, μ_{i}. The variance of the withinsample means (interindividual variance) was modeled as a linear function of the grand means, f_{1}(μ_{i}) and the withinsample dispersion (intraindividual variance) was estimated as a logarithmic function of the withinsample means, f_{2}(μ_{i}). The probability of dropout was estimated independently as a gamma distribution (Fig. 2). Using a normal distribution with an expected value of zero and a variance computed by the first linear relationship, f_{1}(μ_{i}), a difference in means was drawn for each of the individuals j in the simulation. This difference was summed with the grand mean to obtain an individual mean, μ_{ij}.
Three different methods were used to simulate the number of cells per individual. To simulate scenarios where each individual had an identical number of cells, the number of cells desired for each individual was fixed at a constant value. To simulate scenarios where the number of cells per individual demonstrated slight imbalance, a Poisson distribution with a λ equal to the expected number of cells desired for each individual was then used to obtain the count of cells for each individual. To simulate scenarios where the number of cells per individual demonstrated greater imbalance, the number of cells per individual was modeled as a negative binomial random variable with a mean equal to the expected number of cells and a dispersion parameter of one. For each gene i and cell k assigned to an individual j, a read count value, Y_{ijk}, was drawn from a negative binomial distribution with an expected value equal to the individual’s assigned read count value, μ_{ij}, and a dispersion parameter, α_{ij}, computed by the logarithmic function of the grand mean f_{2}(μ_{i}). Along with the distributions of the primary parameters of interest, we made tSNE plots of the simulated data to assess how realistic the simulated data appeared (Supplementary Figs. 1–4).
Evaluation of type 1 error
To estimate type 1 error rate, we simulated TPM values for an individual gene 250,000 times for each simulation condition. We varied the simulation conditions by the number of individuals per treatment group and the number of cells per individual. For each iteration, the number of individuals per treatment group was fixed at either 5, 10, 20, 30, or 40. For each iteration at a fixed number of individuals per treatment group, the number of cells per individual was drawn from a Poisson distribution with either a λ of 50, 100, 250, or 500.
The type 1 error rate for differential expression testing was computed for each of the ten methods evaluated in this manuscript. For each, the number of results that met our significance threshold were counted and the type 1 error computed as the proportion of significant results. Type 1 error was computed for a tweedie mixedeffects model^{34}, MAST^{10}, MAST with random effects^{10}, MAST with a batch effect correction using ComBat^{10,14}, DESeq2 with aggregation methods (“Pseudobulk” summing and averaging)^{11,12,13,40}, Monocle^{41}, ROTS^{42}, Tweedie GLM^{34}, and a GEE1 with a Gaussian link and exchangeable correlation^{43}. MAST was implemented with and without the use of a random effect for individual, and the remaining singlecell tools were implemented exactly as their vignettes instruct. GEE1 with exchangeable correlation was implemented to compare its performance to the mixedeffects model, particularly where the numbers of donors were low. Except for DESeq2 and ROTS, which both require raw counts, all methods were computed on the log(x + 1) transformed gene expression matrix. To control for any differences in library size that DESeq2 might be falsely estimating, we fixed all of DESeq2’s size factors to one. Type 1 errors were computed using significance thresholds of 0.05, 0.01, 0.001, and 0.0001 (Table 1 and Supplementary Tables 1–4).
MAST models a log(x + 1) transformed gene expression matrix as a twopart generalized regression model^{10}. As in Finak et al.^{10}, the addition of random effects for differences among persons is:
where Y_{ig} is the expression level for gene i and cell k, Z_{ki} is an indicator for whether gene i is expressed in cell k, X_{k} contains the predictor variables for each cell k, and W_{k} is the design matrix for the random effects of each cell k belonging to each individual j (i.e., the random complement to the fixed X_{k}). β_{i} represents the vector of fixedeffects regression coefficients and γ_{j} represents the vector of random effects (i.e., the random complement to the fixed β_{i}). γ_{j} is distributed normally with a mean of zero and variance \(\sigma _{\gamma _k}^2\). To obtain a single result for each gene, the likelihood ratio or Wald test results from each of the two components are summed and the corresponding degrees of freedom for each component are added^{10}. These tests have asymptotic χ^{2} null distributions; they can be summed and remain asymptotically χ^{2} because Z_{i} and Y_{i} are defined conditionally independent for each gene^{10}.
Evaluation of rankorder preservation
To approximate how well rankorder was preserved across each of the ten methods evaluated, we simulated TPM values for an individual gene 2000 times with random foldchanges between 0 and 4. The number of individuals per treatment group was fixed at 30 and the number of cells per individual was fixed at 100. Foldchange was drawn from a uniform distribution with a minimum equal to 0 and a maximum equal to 4. Genes were retained along with their foldchange information to evaluate rankorder correlation and to complete the sensitivity analysis. The genes simulated under the null for the type 1 error rate calculations were used to estimate specificity (1type error) for each method. With each of the different methods, pvalues were computed and were ranked alongside the absolute value of the simulatedlog(foldchange) values. Spearman’s rankcorrelation coefficients were computed between each of the different methods.
Evaluation of power
To estimate the power of the respective tests, TPM values were simulated for an individual gene 1000 times for each incremental foldchange of 0.01 between 1 and 5. Here, foldchange is a constant that was multiplied by the global mean gene expression values to spike the expression of those genes in the desired treatment group. The direction of the foldchange was simulated with a Bernoulli distribution with a probability of 0.5 to allow the direction of effect to vary equally between treatment groups. We varied the simulation conditions by the number of individuals per treatment group and the number of cells per individual. For the datasets used to compare the performance of mixed models with pseudobulk methods dependent on the degree of cellular imbalance between individuals, the number of cells per individual were either fixed at an exact number, allowed to vary under a Poisson distribution with a λ equal to the expected number of cells, or allowed to vary under a negative binomial distribution with an overdispersion parameter equal to 1.
Using the twopart hurdle model with a random effect for individual, we computed power curves to estimate how well this method functions with varying numbers and ratios of cells and individuals. When evaluating the power of the pseudobulk methods, the size factors for DESeq2 were forced to one to keep DESeq2’s normalization from normalizing out the simulated effects. Power was computed for fold changes between 1 and 5 (Supplementary Figs. 9–11).
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
All data are publicly available. Two of the datasets are available on NCBI’s Gene Expression Omnibus under the accession numbers GSE81861^{36} and GSE72056^{37}. A third dataset is hosted on EMBLEBI’s ArrayExpress under the accession number EMTAB5061^{38}. The fourth dataset is hosted on EMBLEBI’s European Genomephenome Archive under the accession number EGAS00001004082^{39}.
Code availability
Data were simulated in R3.5.1. All of the code for the simulations and the evaluation of intra and interindividual correlation structure is available on GitHub at https://github.com/kdzimm/PseudoreplicationPaper. This base code was modified to run type 1 error and power analyses in parallel on Wake Forest’s High Performance Computing Cluster, DEMON.
References
Stuart, T. & Satija, R. Integrative singlecell analysis. Nat. Rev. Genet. 20, 257–272 (2019).
Grün, D. & van Oudenaarden, A. Design and analysis of singlecell sequencing experiments. Cell 163, 799–810 (2015).
Saliba, A.E., Westermann, A. J., Gorski, S. A. & Vogel, J. Singlecell RNAseq: advances and future challenges. Nucleic Acids Res. 42, 8845–8860 (2014).
Hurlbert, S. H. Pseudoreplication and the design of ecological field experiments. Ecol. Monogr. 54, 187–211 (1984).
Heffner, R. A., Butler, M. J. & Reilly, C. K. Pseudoreplication revisited. Ecology 77, 2558–2562 (1996).
Millar, R. B. & Anderson, M. J. Remedies for pseudoreplication. Fish. Res. 70, 397–407 (2004).
Freeberg, T. & Lucas, J. Pseudoreplication is (still) a problem. J. Comp. Psychol. 123, 450–451 (2009).
Lazic, S.E. The problem of pseudoreplication in neuroscientific studies: is it affecting your analysis?. BMC Neurosci. 11, 5 (2010).
Makin, T. R. & Orban de Xivry, J.J. Ten common statistical mistakes to watch out for when writing or reviewing a manuscript. eLife 8, e48175 (2019).
Finak, G. et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in singlecell RNA sequencing data. Genome Biol. 16, 278 (2015).
L. Lun, A. T., Bach, K. & Marioni, J. C. Pooling across cells to normalize singlecell RNA sequencing data with many zero counts. Genome Biol. 17, 75 (2016).
Lun, A. T. L. & Marioni, J. C. Overcoming confounding plate effects in differential expression analyses of singlecell RNAseq data. Biostatistics 18, 451–464 (2017).
Crowell, H. L. et al. Muscat detects subpopulationspecific state transitions from multisample multicondition singlecell transcriptomics data. Nat. Commun. 11, 6077 (2020).
Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
Vieth, B., Parekh, S., Ziegenhain, C., Enard, W. & Hellmann, I. A systematic evaluation of single cell RNAseq analysis pipelines. Nat. Commun. 10, 1–11 (2019).
Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of singlecell RNA sequencing data. Genome Biol. 18, 174 (2017).
Kharchenko, P. V., Silberstein, L. & Scadden, D. T. Bayesian approach to singlecell differential expression analysis. Nat. Methods 11, 740–742 (2014).
Korthauer, K. D. et al. A statistical approach for identifying differential distributions in singlecell RNAseq experiments. Genome Biol. 17, 222 (2016).
Van den Berge, K. et al. Observation weights unlock bulk RNAseq tools for zero inflation and singlecell applications. Genome Biol. 19, 24 (2018).
Vu, T. N. et al. BetaPoisson model for singlecell RNAseq data analyses. Bioinformatics 32, 2128–2135 (2016).
Vallejos, C. A., Marioni, J. C. & Richardson, S. BASiCS: Bayesian analysis of singlecell sequencing data. PLoS Comput. Biol. 11, e1004333 (2015).
Soneson, C. & Robinson, M. D. Bias, robustness and scalability in singlecell differential expression analysis. Nat. Methods 15, 255–261 (2018).
Dal Molin, A., Baruzzo, G. & Di Camillo, B. Singlecell RNAsequencing: assessment of differential expression analysis methods. Front. Genet. 8, 62 (2017).
Jaakkola, M. K., Seyednasrollah, F., Mehmood, A. & Elo, L. L. Comparison of methods to detect differentially expressed genes between singlecell populations. Brief. Bioinform. 18, 735–743 (2017).
G. W. Snedecor & W. G. Cochran. Statistical Methods (Oxford & IBH Publishing Co., 1994).
Tirrell, T. F., Rademaker, A. W. & Lieber, R. L. Analysis of hierarchical biomechanical data structures using mixedeffects models. J. Biomech. 69, 34–39 (2018).
Maas, C. J. M. & Hox, J. J. Sufficient sample sizes for multilevel modeling. Methodology 1, 86–92 (2005).
McNeish, D. Analyzing clustered data with OLS regression: the effect of a hierarchical data structure. Mult. Linear Regres. Viewp. 40, 11–16 (2014).
Jiang, J. Consistent estimators in generalized linear mixed models. J. Am. Stat. Assoc. 93, 720–729 (1998).
Lockwood, J. R. & McCaffrey, D. F. Correcting for test score measurement error in ANCOVA models for estimating treatment effects. J. Educ. Behav. Stat. 39, 22–52 (2014).
Ziegler, A. & Vens, M. Generalized estimating equations. Methods Inf. Med. 49, 421–425 (2010).
Draper, N. R. Analysis of messy data, volume 1: designed experiments, second edition by George A. Milliken, Dallas E. Johnson. Int. Stat. Rev. 77, 321–322 (2009).
Stroup, W. W. Generalized Linear Mixed Models: Modern Concepts, Methods and Applications (CRC Press, 2016).
Brooks, M. E. et al. glmmTMB balances speed and flexibility among packages for zeroinflated generalized linear mixed modeling. R. J. 9, 378–400 (2017).
MassoniBadosa, R. et al. Sampling timedependent artifacts in singlecell genomics studies. Genome Biol. 21, 112 (2020).
Li, H. et al. Reference component analysis of singlecell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat. Genet. 49, 708–718 (2017).
Tirosh, I. et al. Dissecting the multicellular ecosystem of metastatic melanoma by singlecell RNAseq. Science 352, 189–196 (2016).
Segerstolpe, Å. et al. Singlecell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 24, 593–607 (2016).
Sungnak, W. et al. SARSCoV2 entry factors are highly expressed in nasal epithelial cells together with innate immune genes. Nat. Med. 26, 681–687 (2020).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNAseq data with DESeq2. Genome Biol. 15, 550 (2014).
Trapnell, C. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 11 (2014).
Suomi, T., Seyednasrollah, F., Jaakkola, M. K., Faux, T. & Elo, L. L. ROTS: an R package for reproducibilityoptimized statistical testing.PLoS Comput. Biol. 13, e1005562 (2017).
Højsgaard, S., Halekoh, U. & Yan, J. The R package geepack for generalized estimating equations. J. Stat. Softw. 15, 1–11 (2005).
Acknowledgements
We thank T. D. Howard, L. D. Miller, and M. Olivier (all from Wake Forest School of Medicine) for critical review of this content. This work was supported by the Wake Forest Center for Public Health Genomics and grant U01 NS036695 (CoPI Langefeld) from NIH and by the Cancer Center Support Grant from the National Cancer Institute to the Comprehensive Cancer Center of Wake Forest Baptist Medical Center (P30CA012197).
Author information
Authors and Affiliations
Contributions
C.D.L. and K.D.Z. conceived the study together. K.D.Z. implemented simulations and analyses with guidance from C.D.L and M.A.E.; K.D.Z. wrote the original draft and reviewed and edited it with M.A.E. and C.D.L.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zimmerman, K.D., Espeland, M.A. & Langefeld, C.D. A practical solution to pseudoreplication bias in singlecell studies. Nat Commun 12, 738 (2021). https://doi.org/10.1038/s41467021210381
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41467021210381
This article is cited by

Intermittent hypoxia in a mouse model of apnea of prematurity leads to a retardation of cerebellar development and longterm functional deficits
Cell & Bioscience (2022)

A comparison of methods for multiple degree of freedom testing in repeated measures RNAsequencing experiments
BMC Medical Research Methodology (2022)

Lymphocyte infiltration and thyrocyte destruction are driven by stromal and immune cell components in Hashimoto’s thyroiditis
Nature Communications (2022)

A balanced measure shows superior performance of pseudobulk methods in singlecell RNAsequencing analysis
Nature Communications (2022)

Multimodal single cell sequencing implicates chromatin accessibility and genetic background in diabetic kidney disease progression
Nature Communications (2022)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.