Abstract
Single-cell RNA sequencing (scRNA-seq) has been widely used to characterize cell types based on their average gene expression profiles. However, most studies do not consider cell type-specific variation across donors. Modelling this cell type-specific interindividual variation could help elucidate cell type-specific biology and inform genes and cell types underlying complex traits. We therefore develop a new model to detect and quantify cell type-specific variation across individuals called CTMM (Cell Type-specific linear Mixed Model). We use extensive simulations to show that CTMM is powerful and unbiased in realistic settings. We also derive calibrated tests for cell type-specific interindividual variation, which is challenging given the modest sample sizes in scRNA-seq. We apply CTMM to scRNA-seq data from human induced pluripotent stem cells to characterize the transcriptomic variation across donors as cells differentiate into endoderm. We find that almost 100% of transcriptome-wide variability between donors is differentiation stage-specific. CTMM also identifies individual genes with statistically significant stage-specific variability across samples, including 85 genes that do not have significant stage-specific mean expression. Finally, we extend CTMM to partition interindividual covariance between stages, which recapitulates the overall differentiation trajectory. Overall, CTMM is a powerful tool to illuminate cell type-specific biology in scRNA-seq.
Introduction
Single-cell RNA sequencing (scRNA-seq) profiles gene expression at the resolution of single cells. This resolution may be essential for understanding molecular mechanisms underlying many complex traits because disease gene expression is highly cell type-specific^{1,2,3}. For example, APOE is a risk gene for Alzheimer’s disease that is downregulated in astrocytes but is upregulated in microglia^{2}. One common application of scRNA-seq is to investigate differentially expressed genes (DEG) that exhibit differences in mean expression between cell types, such as diseased vs healthy^{4} or pre- vs post-treatment^{5,6,7}. Furthermore, methods to infer cell type labels in scRNA-seq data primarily rely on differential mean expression between cell types^{8,9}.
Several studies have applied linear mixed models to scRNA-seq data to account for variance across individuals, cell types, or experimental batches^{10,11,12,13,14}, but this variance has not been unbiasedly partitioned across cell types, and the potential for bias and miscalibration has not been evaluated. Understanding this variation could help identify and characterize genes and cell types that cause interindividual variation in complex traits ranging from height to autoimmune disorders. Studies using bulk RNA-seq have shown that gene expression variability informs disease biology and drug development^{15,16,17}. However, bulk transcriptomics has poor resolution on individual cell types, which can cause both false positives and false negatives. In particular, prior signals in bulk RNA-seq could be explained by variation in cell type proportions rather than variation in gene expression within cell types^{18}. Because scRNA-seq data has cell-level resolution, it provides an opportunity to powerfully partition expression variation within and between cell types. This has recently become possible with the proliferation of population-scale scRNA-seq studies that contain hundreds of individuals^{13,19,20,21,22}.
In this paper, we develop CTMM (Cell Type-specific linear Mixed Model) to detect and quantify cell type-specific variation across individuals in scRNA-seq data. We performed a series of simulations to evaluate CTMM’s performance in a broad range of realistic settings. We then applied CTMM to characterize transcriptomic variation across individual donors along the developmental trajectory from human induced pluripotent stem cells (iPSCs) to endoderm. Transcriptome-wide, CTMM found that almost all interindividual variation was specific to each developmental time point, and the Full model found greater correlations between nearby time points. We also identified specific genes with statistically significant time point-specific variation across individuals, including genes with known importance for cell pluripotency and differentiation. Finally, we studied the recent data from the OneK1K cohort and found that CTMM can be applied to this kind of large-scale, low-depth sequencing data.
Results
Overview of CTMM
CTMM is a linear mixed model that partitions single-cell gene expression variation across individuals into two distinct components: variation shared across cell types and variation specific to each cell type. We fit CTMM to cell type-specific pseudobulk (CTP) data, which is the mean expression over cells within each cell type for each individual. For a given gene, the CTP expression for individual \(i\) and cell type \(c\) is:
\[{y}_{ic}=\frac{1}{{n}_{ic}}\sum_{s=1}^{{n}_{ic}}{y}_{ics}\]
where \({y}_{ics}\) is the gene expression level for the \(s\)th cell from cell type \(c\) in individual \(i\) and \({n}_{ic}\) is the number of cells in individual \(i\) from cell type \(c\). CTMM models the CTP expression data by:
\[{y}_{ic}={\boldsymbol{\beta}}_{c}+{\boldsymbol{\alpha}}_{i}+{\boldsymbol{\Gamma}}_{ic}+{\boldsymbol{\delta}}_{ic}\]
Here, \({\boldsymbol{\beta}}_{c}\) is the mean expression level in cell type \(c\), which we model as a fixed effect. \({\boldsymbol{\alpha}}_{i}\) captures differences between individuals that are shared across cell types, which we model as a random effect: \({\boldsymbol{\alpha}}_{i}{\sim}^{iid}N\left(0,{\sigma}_{\alpha}^{2}\right)\). \({\boldsymbol{\Gamma}}_{ic}\) captures the difference between individuals that is specific to cell type \(c\), which we also model as a random effect: \({\boldsymbol{\Gamma}}_{\mathbf{i},}{\sim}^{iid}N\left(0,\,\mathbf{V}\right)\), where \({\boldsymbol{\Gamma}}_{\mathbf{i},}\) is a vector of cell type-specific expression for individual \(i\) across all \(C\) cell types and \(\mathbf{V}\) is a \(C\times C\) matrix describing cell type-specific variances and covariances across cell types. \({\boldsymbol{\delta}}_{ic}\) is the noise due to measurement errors at single cells and/or variation from cell subtypes, which we model by: \({\boldsymbol{\delta}}_{ic}{\sim}^{ind}\,N\left(0,\,\frac{{\sigma}_{ic}^{2}}{{n}_{ic}}\right)\), where \({\sigma}_{ic}^{2}\) is the cell-to-cell variance within individual \(i\) and cell type \(c\). We estimate this quantity from the cell-level data by \({\hat{\sigma}}_{ic}^{2}:=\sum_{s=1}^{{n}_{ic}}{\left({y}_{ics}-{y}_{ic}\right)}^{2}/({n}_{ic}-1)\). In the Methods, we show how this model is derived from a single cell-level model, which also motivates our Gaussian assumption on \({\boldsymbol{\delta}}_{ic}\) based on the central limit theorem. We also developed a version of CTMM that applies to overall pseudobulk (OP) data, which averages over all cells from all cell types for each individual. This is a useful analogy to bulk sequencing data, but we find that it is far less powerful in our setting.
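As a minimal illustration of the two summaries defined above, the following sketch computes the CTP mean \({y}_{ic}\) and the estimated noise variance \({\hat{\sigma}}_{ic}^{2}/{n}_{ic}\) for one gene from cell-level values. The function name `pseudobulk` and its input layout (parallel arrays of expression, individual labels, and cell type labels) are assumptions made for this example and are not taken from the CTMM package.

```python
import numpy as np

def pseudobulk(y, ind, ct):
    """Summarize one gene's cell-level expression per (individual, cell type).

    Hypothetical helper: returns three dicts keyed by (individual, cell type)
    holding the CTP mean, the noise variance sigma^2_ic / n_ic, and n_ic.
    """
    y = np.asarray(y, dtype=float)
    ind = np.asarray(ind)
    ct = np.asarray(ct)
    ctp, nu, n = {}, {}, {}
    for i in np.unique(ind):
        for c in np.unique(ct):
            cells = y[(ind == i) & (ct == c)]
            if cells.size < 2:
                continue  # the sample variance needs at least two cells
            key = (i.item(), c.item())
            n[key] = cells.size
            ctp[key] = cells.mean()                   # y_ic
            nu[key] = cells.var(ddof=1) / cells.size  # sigma^2_ic / n_ic
    return ctp, nu, n
```

Pairs with fewer than two cells are skipped here only because the sample variance is undefined; in practice a stricter minimum number of cells would be applied so that the Gaussian approximation is reasonable.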
The focus of CTMM is on the covariance matrix \(\mathbf{V}\), which captures cell type-specific variation across individuals. We consider three nested models of cell type-shared and -specific variation defined by the structure of \(\mathbf{V}\). In the simplest model where \(\mathbf{V}=0\), all variation is shared homogeneously between cell types (“Hom”), with cell types differing only in mean expression. The next model allows independent variation in each cell type (“Free”), i.e., cell type-specific variation, by allowing \(\mathbf{V}\) to be an arbitrary diagonal matrix. The richest model allows for arbitrary forms of covariance between cell types (“Full”), where \(\mathbf{V}\) can be any positive semidefinite matrix and \({\sigma}_{\alpha}^{2}=0\) for identifiability (Methods).
We explored several statistical methods to fit and test CTMM’s parameters. Achieving calibrated and unbiased estimates in CTMM is challenging because scRNA-seq datasets currently have small to moderate numbers of donors, ranging from one to hundreds, and thus off-the-shelf asymptotic tests may fail. We implemented three approaches to fit CTMM: maximum likelihood (ML), restricted maximum likelihood (REML), and method-of-moments (HE, as it is known as Haseman-Elston regression in genetics). Then, we implemented the likelihood ratio test (LRT) and Wald tests to compare the Free and Hom models, which test whether interindividual variation is cell type-specific or shared uniformly across cell types. Importantly, we develop a novel testing framework based on the jackknife (JK) to address false positives in established tests that arise from the complexity of scRNA-seq data (Methods).
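To make the three nested structures concrete, the sketch below builds the covariance of one individual's CTP vector implied by the model, \({\sigma}_{\alpha}^{2}\mathbf{1}\mathbf{1}^{T}+\mathbf{V}+\mathrm{diag}({\sigma}_{ic}^{2}/{n}_{ic})\). This form assumes that the shared effect, the cell type-specific effect, and the noise are mutually independent; the function name and argument names are hypothetical and chosen only for this example.

```python
import numpy as np

def ctp_covariance(model, sigma_alpha2, V, nu_i):
    """Covariance of one individual's CTP vector under Hom, Free, or Full.

    Illustrative only: nu_i holds the noise variances sigma^2_ic / n_ic for
    the C cell types of a single individual.
    """
    nu_i = np.asarray(nu_i, dtype=float)
    C = nu_i.size
    V = np.asarray(V, dtype=float).reshape(C, C)
    if model == "hom":
        V = np.zeros((C, C))          # no cell type-specific variation
    elif model == "free":
        V = np.diag(np.diag(V))       # independent variance per cell type
    elif model == "full":
        sigma_alpha2 = 0.0            # fixed at zero for identifiability
        if np.linalg.eigvalsh((V + V.T) / 2).min() < -1e-10:
            raise ValueError("V must be positive semidefinite")
    else:
        raise ValueError("model must be 'hom', 'free', or 'full'")
    shared = sigma_alpha2 * np.ones((C, C))
    return shared + V + np.diag(nu_i)
```

In the Hom case every off-diagonal entry equals the shared variance, which is why cell types then differ only in mean expression.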
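The general idea behind a jackknife-based Wald test can be sketched as follows: re-estimate the parameters with each individual left out in turn, use the spread of those leave-one-out estimates as a covariance estimate, and form a Wald statistic from it. This is a generic delete-one jackknife and not necessarily the exact procedure used in CTMM; `estimator` is a placeholder for any function that maps data (individuals along the first axis) to a parameter vector, for example the cell type-specific variances.

```python
import numpy as np

def jackknife_wald(data, estimator):
    """Delete-one jackknife covariance and Wald statistic for H0: theta = 0.

    Generic sketch: `data` has one row (or entry) per individual and
    `estimator` returns a scalar or 1-D parameter estimate.
    """
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    theta_hat = np.atleast_1d(estimator(data))
    # Leave-one-individual-out estimates, one row per held-out individual.
    loo = np.array([np.atleast_1d(estimator(np.delete(data, i, axis=0)))
                    for i in range(n)])
    centered = loo - loo.mean(axis=0)
    cov = (n - 1) / n * centered.T @ centered
    stat = float(theta_hat @ np.linalg.solve(cov, theta_hat))
    return theta_hat, cov, stat
```

Under the usual large-sample approximation the statistic would be compared with a chi-square distribution whose degrees of freedom equal the number of tested parameters; that reference distribution is an assumption of this sketch.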
Simulation
We simulated a series of scenarios to assess the performance of CTMM. We simulated Hom and Free models by varying sample size, level of cell type-specific variance, number of cell types, and cell type proportions. We first evaluated the accuracy of the estimates of cell type-specific variance. Supplementary Fig. 1 shows the estimation of cell type-specific variance in the simulation of the Free model with sample sizes varying from 20 to 1000. As expected, when fitting the Free model to the simulated data, both OP and CTP performed well, as illustrated by the roughly unbiased estimates of cell type-specific variance \(\mathbf{V}\). The performance improved as sample size increased. CTP provided more precise estimates than OP, since CTP uses more information than OP by modeling pseudobulk expression for each cell type. Comparing methods for parameter estimation, the likelihood-based methods ML and REML had similar levels of precision, and both had better precision than HE, since likelihood-based methods utilize more information than HE by assuming a normal distribution for the random effects. Supplementary Fig. 2 shows estimates with varying levels of cell type-specific variance. Our models provided unbiased estimates of cell type-specific variance, even in the simulation of the Hom model where there is no cell type-specific effect. Additionally, Supplementary Fig. 3 shows that CTMM is unbiased across different numbers of cell types. Supplementary Fig. 4 shows estimates with varying cell type proportions. When decreasing the proportion of the main cell type (with the largest cell type-specific variance), all models performed well except for HE with CTP input, which broke down when the main cell type proportion went below 10%. We also simulated under the Full model, which had precise and unbiased estimates of covariance between cell types when the sample size was above 50 with CTP (Supplementary Fig. 5).
We then evaluated the power of our models to detect cell type-specific variance. Figure 1 shows positive rates of REML and HE using OP or CTP data as input for different sample sizes. Under the simulation of the Hom model, where there is no cell type-specific variance, the tests for cell type-specific variance with both OP and CTP input were appropriately null, with a false positive rate of around 5%, except for REML (Wald), that is, the Wald test in REML using the precision matrix inferred from the Fisher information matrix. REML (JK), that is, the jackknife-based Wald test in REML, and HE were inflated in CTP when the sample size was 50 or lower. Under the simulation of the Free model, CTP had much greater power than OP; for example, when the sample size was 50, CTP had a tenfold higher positive rate than OP (100 versus 10% using REML with LRT). REML (LRT) in CTP had the best power. Its true positive rate reached above 80% even when the sample size was only 20 and reached 100% when the sample size was 50. The other three tests in CTP, including REML (Wald), REML (JK), and HE, also had over 70% true positive rates when the sample size reached 100. ML and REML had similar performance when fitting CTP (Supplementary Fig. 6). We also assessed the impact of cell type proportions, number of cell types, and level of cell type-specific variance. As expected, the power increased when the main cell type became more common, when additional cell types were included, or when cell type-specific variance increased (Supplementary Fig. 6).
To examine the impact of uncertain estimates of \({\boldsymbol{\nu}}_{ic}\), we repeated the CTP simulation while incorporating noisy \({\boldsymbol{\nu}}_{ic}\). To be more realistic, this simulation was conducted with parameters estimated from the real iPSC differentiation data. We first evaluated the uncertainty of \({\boldsymbol{\nu}}_{ic}\) by bootstrap resampling cells for all combinations of individual, cell type, and gene. Most of them had a coefficient of variation around 0.2 (Supplementary Fig. 7). To incorporate this uncertainty into the simulation, we added noise to \({\nu}_{ic}\) when fitting models (Methods). We tried five distributions of noise to cover the distribution of the coefficient of variation in real data (Supplementary Fig. 7). In the simulation of the Hom model, when there was no noise in \({\nu}_{ic}\), that is, when fitting the model with the true \({\nu}_{ic}\) used in the simulations, REML (LRT) and REML (JK) were well calibrated, while HE was slightly inflated (Fig. 2a). As the noise increased, the false positive rate of REML (LRT) increased quickly and reached ~80% at a high level of noise (coefficient of variation = 0.45); REML (JK) was fairly resistant to noise and only broke down completely when unrealistically strong noise was added; HE was not affected by noise but remained slightly inflated at all noise levels. Of note, estimates of cell type-specific variance were weakly biased in REML under strong noise (Supplementary Fig. 8). In the simulation of the Free model, we found that REML (JK) had a positive rate of 80% even when the cell type-specific variance was weak, with 0.05 variance for the first cell type; HE also had intermediate power, with a positive rate of about 50% when the first cell type had 0.05 variance (Fig. 2b). Taken together, REML (JK) is the most powerful and robust method and is our primary approach in our iPSC analysis.
Finally, we conducted simulations at the level of single cells to evaluate the impact of sequencing depth and the number of cells (Supplementary Note 1 Section 3.4). To assess whether our simulated count distribution is realistic, we compared it to the real data using countsimQC^{23}. The comparison demonstrated a good fit to the real data in terms of the mean-variance distribution and the fraction of zeros per gene (Supplementary Fig. 10). We found that CTMM was robust across a realistic range for sequencing depth and number of cells (Supplementary Fig. 11A, C), though power improved with greater read depth or number of cells (Supplementary Fig. 11B, D).
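A simulation of CTP data of the kind described above can be sketched by drawing each term of the model directly. The parameter values, the function name `simulate_ctp`, and the use of a fixed noise variance per individual and cell type are illustrative assumptions and do not reproduce the settings of the simulations reported here.

```python
import numpy as np

def simulate_ctp(n_ind, beta, sigma_alpha2, V, nu, seed=0):
    """Draw an (n_ind x C) matrix of CTP expression for one gene.

    Illustrative sketch: beta has length C, V is C x C, and nu is an
    (n_ind x C) array of noise variances sigma^2_ic / n_ic.
    """
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)
    C = beta.size
    # Effect shared across cell types, one draw per individual.
    alpha = rng.normal(0.0, np.sqrt(sigma_alpha2), size=(n_ind, 1))
    # Cell type-specific effects with covariance V (diagonal for Free).
    gamma = rng.multivariate_normal(np.zeros(C), V, size=n_ind)
    # Independent noise with its own variance for each individual and cell type.
    delta = rng.normal(0.0, np.sqrt(np.asarray(nu, dtype=float)))
    return beta + alpha + gamma + delta
```

Setting `V` to a zero matrix gives data under the Hom model, a diagonal `V` gives the Free model, and a general positive semidefinite `V` with `sigma_alpha2=0` gives the Full model.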
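The bootstrap assessment of uncertainty in \({\nu}_{ic}\) can be illustrated for a single individual, cell type, and gene as below. The sketch resamples cells with replacement, recomputes \({\nu}_{ic}\) as the sample variance divided by the number of cells, and reports the coefficient of variation across resamples; the function name and the default number of resamples are assumptions made for this example.

```python
import numpy as np

def bootstrap_cv_nu(cells, n_boot=1000, seed=0):
    """Coefficient of variation of nu_ic across bootstrap resamples of cells.

    Illustrative sketch for one (individual, cell type, gene) combination;
    assumes the cells are not all identical, so that the mean of nu is nonzero.
    """
    rng = np.random.default_rng(seed)
    cells = np.asarray(cells, dtype=float)
    n = cells.size
    # Each row is one bootstrap resample of the n cells.
    boots = rng.choice(cells, size=(n_boot, n), replace=True)
    nu = boots.var(axis=1, ddof=1) / n
    return float(nu.std(ddof=1) / nu.mean())
```

As one would expect, the coefficient of variation shrinks as the number of cells grows, so rarer cell types carry noisier estimates of \({\nu}_{ic}\).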
Application to human induced pluripotent stem cells
We applied our methods to differentiating iPSCs^{13}. Before fitting CTMM, we compared different approaches to imputing the cell type-specific pseudobulk (\({\mathbf{y}}_{ic}\), that is, CTP). We evaluated single-gene imputation with softImpute and MVN and transcriptome-wide imputation with softImpute. We found that transcriptome-wide softImpute performed best (Supplementary Fig. 12A, C), though MVN performed similarly. We also compared approaches to impute the noise variance (\({\boldsymbol{\nu}}_{ic}\)), which is required for CTMM. We observed similar results as for the pseudobulk in terms of mean squared error (Supplementary Fig. 12B, D). Based on these results, we used transcriptome-wide softImpute in practice.
We fit the Free model with both OP and CTP data using ML, REML, and HE (Supplementary Figs. 13, 14). We focus on REML with CTP, which was most powerful and robust in simulations. Transcriptome-wide, we found that the variation across individuals was almost entirely cell type-specific, as the homogeneous variance has a median close to 0 (\(\mathrm{median}=0.4\%\), Fig. 3a). By contrast, cell type-specific variance has a median of 32.8% across cell types. Weighted by cell type proportions, cell type-specific variation explained 14% of interindividual variation on average across the transcriptome (Supplementary Note 1 Eq. 3). Additionally, cell type proportion differences explained 12%, and residual cell-level variation (\(\boldsymbol{\nu}\)) explained 10%. The remaining variation is explained by covariates, especially PCs of pseudobulk gene expression (39%) and batch effects (21%). Note that cell subtype variation within an individual will be captured in the residual variation in \(\boldsymbol{\nu}\), which also captures measurement errors (i.e., RNA transcripts that exist but are not sequenced), while interindividual variation in cell subtype proportions will be captured in the interindividual covariance, \(\mathbf{V}\). This illustrates the importance of modeling cell type-specific expression.
To evaluate the correlation of gene expression between cell types, we next fit the Full model. As expected, the correlations between adjacent development stages (CT1 and CT2, CT2 and CT3, and CT3 and CT4) were larger than the correlations between more distant stages (Fig. 3b). Furthermore, the correlation between CT2 and CT3 (median: 0.051) was smaller than the other adjacent stages (CT1-CT2 median: 0.318; CT3-CT4 median: 0.322). This is consistent with rapid changes in molecular profiles at day 2 (CT3)^{13}. These patterns were also observed when fitting with ML or HE (Supplementary Fig. 13). When fitting OP, as expected, the estimates were far less precise, especially for HE (Supplementary Fig. 14).
Figure 4a shows gene expression differentiation in mean and variance when fitting CTP with REML (JK). We found many genes that were differentiated in variance between cell types, that is, with at least one cell type with nonzero cell type-specific variance. Among them, the top gene POU5F1 (Wald \(p=2.18\times {10}^{-27}\)), also known as OCT4, is one of the three core transcription factors in the pluripotency gene regulatory network^{24}. This signal was also confirmed in HE with CTP, where POU5F1 was the most significant signal in variance differentiation (\(p=1.33\times {10}^{-28}\), Supplementary Fig. 15). Although this gene was also significantly differentiated in mean, it does not stand out in either REML or HE and would be less likely to be selected for further functional analyses. To control for false positives, we identified candidate genes for cell type-specific variance ordered by \(p\) value in REML (JK) while requiring significant signals after Bonferroni correction in both REML (JK) and HE, with the top 10 genes shown in Table 1. Among them, 85 genes were not differentiated in mean (\(p > 0.01\) in REML), with some of those genes involved in processes like cell differentiation and growth (see the top 10 of those genes in Table 2). For example, NDUFB4 showed no differentiation in mean between cell types (\(p=0.014\) in REML and \(p=0.125\) in HE) but significant differentiation in variance in both REML (\(p=1.93\times {10}^{-19}\)) and HE (\(p=9.62\times {10}^{-15}\)) (Fig. 5). We next performed GO enrichment analysis using clusterProfiler^{25}. We tested the top 100 CTMM genes that did not have significant differences in mean (\(p > 0.05\) after Bonferroni correction). We found dozens of significant enrichments, almost all of which reflect cellular metabolic activity (Supplementary Data 1). This finding aligns with the known importance of variation in metabolic state during iPSC differentiation^{26}. Of note, there were three marker genes used in Cuomo et al.^{13} to indicate each stem cell differentiation stage, spanning iPSC (NANOG), mesendoderm (T), and definitive endoderm (GATA6). We successfully detected significant mean differentiation in all three marker genes in both REML and HE; on the other hand, we detected significant variance differentiation in all three genes in HE but only in NANOG in REML, indicating loss of power in REML. We also note that mean differentiation had much stronger signals than variance differentiation in both REML and HE (Supplementary Fig. 16). We compared \(p\) values for variance differentiation from different tests when fitting CTP (Supplementary Fig. 17). Generally, \(p\) values from different tests were largely consistent, except for REML with LRT. Specifically, for REML (JK) and HE, there were 4776 genes that were significant in both; 333 genes were significant only in HE, which are likely false positives; 2017 genes were significant only in REML (JK), partially due to false positives and partially due to higher power in REML (JK) than HE. We also conducted tests with OP data. Consistent with the low power observed in simulations, we identified 23 genes that were significantly differentiated in variance in REML (LRT), and 0 genes were identified in HE (Supplementary Fig. 18).
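The candidate gene selection described above (Bonferroni significance required in both REML (JK) and HE, with the survivors ordered by the REML (JK) \(p\) value) can be written compactly. In this sketch the inputs are plain dictionaries from gene name to \(p\) value, the function name is hypothetical, and the number of tests is assumed to be the number of genes tested by both methods.

```python
def candidate_genes(p_jk, p_he, alpha=0.05):
    """Genes Bonferroni-significant in both tests, ordered by the JK p value.

    Illustrative sketch: p_jk and p_he map gene names to p values from
    REML (JK) and HE, respectively.
    """
    genes = sorted(set(p_jk) & set(p_he))
    if not genes:
        return []
    threshold = alpha / len(genes)  # Bonferroni-corrected threshold
    hits = [g for g in genes if p_jk[g] < threshold and p_he[g] < threshold]
    return sorted(hits, key=lambda g: p_jk[g])
```

Requiring significance in both tests trades some power for protection against the inflation that either test can show on its own.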
Gene features associated with cell type-specific variation
We next evaluated the relationship between CTMM results and four gene features related to genome structure and evolution. We compared CTMM’s measure of cell type-specificity, which is based on interindividual variance, to a standard measure of cell type-specificity based on mean differences (Methods). First, we found that both CTMM and ordinary differential expression signals were enriched in genes with larger enhancer domains (based on the number of enhancers or enhancer domain score [EDS], Fig. 4b and Supplementary Fig. 19). These results align with previous findings that genes with larger enhancer domains were less likely to exhibit ubiquitous expression across tissues^{27}. Second, we examined a measure of gene conservation called LOEUF (loss-of-function observed/expected upper bound fraction). We found that more-constrained genes have lower mean differences across cell types (p = 1.0e-3, Fig. 4c), consistent with previous findings that constrained genes were more frequently ubiquitously expressed across tissues^{28,29}. CTMM’s cell type-specificity measure also correlates with LOEUF, but in the opposite direction (p = 1.2e-8, Fig. 4c). Digging deeper, CTMM shows this primarily results from decreases in cell type-shared variation (Supplementary Fig. 19). In other words, stronger selection implies that cell types and individuals are more constrained toward their averages, so cell type-specific interindividual variation plays a larger role. We found qualitatively similar results using pLI (Supplementary Fig. 19).
Application to peripheral blood mononuclear cells
The iPSC data^{13} we analyzed above used plate-based sequencing. We next sought to confirm that CTMM is applicable to droplet-based sequencing, a different scRNA-seq technology that generally gains a greater number of cells at the cost of lower read depth. We thus applied CTMM to the recent droplet-based data from the OneK1K cohort^{22}. This dataset has many more individuals than the iPSC data (N = 982 vs 125) and more cells per individual (1300 vs 300), but it has far fewer reads per cell (~3 K vs ~0.5 M). The cells themselves are also very different, as OneK1K contains peripheral blood mononuclear cells (PBMCs) from living people rather than differentiating iPSCs from a controlled lab experiment. Another important difference is that the PBMC cell types are computationally inferred, while the iPSC cell types are defined by experimental days. Finally, the PBMC cell type proportions vary substantially (Supplementary Fig. 20). Our primary analysis was restricted to cell types with at least ten cells in at least 90% of individuals, resulting in seven cell types (CD4_{NC}, CD4_{ET}, CD8_{ET}, CD8_{NC}, NK, B_{IN}, and B_{Mem}, Supplementary Note 1 Section 4).
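The cell type inclusion rule stated above (at least ten cells in at least 90% of individuals) amounts to a simple filter on a matrix of cell counts. The sketch below assumes a hypothetical count matrix with one row per individual and one column per cell type; the function name and defaults are chosen for illustration.

```python
import numpy as np

def common_cell_types(counts, names, min_cells=10, min_frac=0.9):
    """Return the cell types with enough cells in enough individuals.

    Illustrative sketch: counts[i, c] is the number of cells of type
    names[c] observed in individual i.
    """
    counts = np.asarray(counts)
    # Fraction of individuals that have at least min_cells cells of each type.
    frac = (counts >= min_cells).mean(axis=0)
    return [name for name, f in zip(names, frac) if f >= min_frac]
```

Lowering `min_frac` admits rarer cell types, at the cost of more missing or noisy pseudobulk values for those types.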
We find that CTMM provides powerful and robust estimates in the OneK1K data. First, CTMM detected significant cell type-specific interindividual variance for 2310 genes out of the 11,526 total genes tested (\(p < 0.05/11526\), Supplementary Fig. 21). The top signal is RPS26 (\(p=1.78\times {10}^{-167}\), Supplementary Fig. 22), which plays a key role in regulating T cells^{30,31}. Further, genetic variation causes interindividual variation in this gene that is cell type-specific^{22} and is linked to complex traits such as eczema and asthma^{32}. As in the iPSCs, we tested GO enrichment in the top CTMM genes. In this large dataset, almost all genes have significant differential mean expression, so we tested the top 100 CTMM genes irrespective of their mean differences. Almost all of the top enrichments relate to immune function, including several that are specific to leukocytes (Supplementary Data 2). Second, CTMM partitions transcriptome-wide interindividual variation into components that are shared across cell types (10.9%) vs cell type-specific (21.5% on average across cell types, Supplementary Fig. 23). This is an interesting contrast with the iPSC results, where shared interindividual variation was near zero. Biologically, this could be explained by differences in cellular environment: individual-level covariates like age, smoking, or BMI may have shared effects across cell types and they are likely to have larger effects on PBMCs in whole blood than iPSCs in a controlled lab. Finally, we evaluated CTMM’s Full model transcriptome-wide to quantify interindividual covariance between cell types. These estimates recapitulated expected relationships between cell types (Supplementary Fig. 24). For example, the most-correlated cell types are CD4_{NC} and CD4_{ET}; intuitively, this means that an individual with above-average expression of a gene in their CD4_{NC} cells will typically also have above-average expression in their CD4_{ET} cells. The second most-correlated cell types are CD4_{NC} and CD8_{NC}, which is consistent with the observation that CD4_{NC} and CD8_{NC} shared the most genetic effects in prior work^{22}.
We next tested the robustness of CTMM to rarer cell types in a secondary analysis that includes two additional cell types, Mono_{C} and CD8_{S100B}. CTMM gave consistent results for the seven larger cell types that are included in both analyses (Supplementary Figs. 25–29). As expected, CTMM’s estimates are noisier for Mono_{C} and CD8_{S100B}, which are rarer cell types. Nonetheless, adding these cell types enables CTMM to discover new differentially variable genes. For example, the top newly significant gene is TMEM176B (\(p=5.45\times {10}^{-63}\) vs \(p=0.18\) in our primary analysis), which makes sense as this gene is primarily expressed in Mono_{C} (Supplementary Fig. 30). We conclude that CTMM’s results are robust to variations in the input cell types, but its estimates are less accurate for rarer cell types.
Discussion
Mean differences in gene expression across cell types are well documented and are the primary focus of most scRNA-seq analyses. Here, we have introduced a new model called CTMM to quantify variance differences across cell types in scRNA-seq data. Bulk expression analyses have established that interindividual variance in expression can be important for characterizing disease biology^{33} and identifying context-dependent genetic effects^{34}. The key innovation in CTMM is adapting Gaussian LMMs to scRNA-seq data, which is challenging because scRNA-seq data are highly noisy and non-Gaussian. The key idea is to summarize the scRNA-seq data into cell type-specific pseudobulk^{21}, which enables approximately unbiased inference with CTMM on as few as 20 individuals. We carefully profile several standard methods to fit LMMs and propose a jackknife-based test using REML as the most powerful and robust method, which we support with extensive simulations and analyses of differentiating iPSCs and PBMCs. We implement and freely release these methods as a user-friendly Python package. We expect that CTMM will be an important step toward robust and rich variance decompositions of scRNA-seq data, which will be increasingly powerful and informative as scRNA-seq sample sizes grow.
In the limiting case with infinite cells, when the measurement error is reduced to 0, CTMM simplifies to a typical LMM on bulk expression. In this case, CTMM with Overall Pseudobulk (OP) data is comparable to decomposing variance in bulk expression data using computationally deconvolved cell type proportions. The significant benefit is that scRNA-seq data provides much better estimates of cell type proportions, which can both reduce false positives and improve power. Likewise, CTMM with Cell Type-specific Pseudobulk (CTP) data becomes comparable to bulk analyses of sorted cells without the need for sorting predefined cell types. In practice, when the number of cells is limited, another significant benefit of CTMM over bulk analyses is the ability to distinguish biological variance across individuals from measurement error, which is especially important when measurement error varies across individuals, cell types, or experimental conditions. However, the disadvantage of CTMM compared to bulk is that it requires larger sample sizes, which is currently expensive.
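The limiting case can be read directly from the noise term of the CTP model. As a short worked note, assuming that the cell-to-cell variance \({\sigma}_{ic}^{2}\) stays bounded as the number of cells grows:

\[\mathrm{Var}\left({\boldsymbol{\delta}}_{ic}\right)=\frac{{\sigma}_{ic}^{2}}{{n}_{ic}}\to 0\quad\text{as}\quad {n}_{ic}\to \infty ,\qquad\text{so that}\qquad \mathrm{Cov}\left({\mathbf{y}}_{i}\right)\to {\sigma}_{\alpha}^{2}\mathbf{1}{\mathbf{1}}^{T}+\mathbf{V}.\]

Here \({\mathbf{y}}_{i}\) denotes the vector of CTP values for individual \(i\) across the \(C\) cell types and \(\mathbf{1}\) is a vector of ones; both are notation introduced for this note. With finite cells, the extra diagonal term \({\sigma}_{ic}^{2}/{n}_{ic}\) is what allows CTMM to separate measurement error from biological variance.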
CTMM has several important limitations. First, as scRNA data in individual cells is highly nonGaussian, CTMM’s Gaussian assumption relies on combining many cells and the central limit theorem. In practice, we require >10 cells per individualcell type pair, which limits CTMM to common cell types. A related concern is that lowlyexpressed genes can be severely nonGaussian, increasing the number of cells needed for the Gaussian approximation. We find that higher overall levels of a gene’s expression increase the power of CTMM (Supplementary Fig. 31), which also holds for most tests of scRNAseq data. Second, CTMM assumes cell types are already known. Our iPSCs data analysis solves this by defining cell types based on experimental days. However, most studies infer cell types directly from the scRNAseq data, such as the cell types in our PMBC data analysis, inducing some circularity; this is typically ignored^{2,7,35} yet will deflate estimates of cell typespecific variance by construction. Third, CTMM assumes discrete cell types, whereas continuous cell types are more appropriate in some cases, e.g., when defined by pseudotime^{13,20,36} or degree of IFN stimulation^{21}. While incorporating continuous cell types is straightforward with overall pseudobulk data, it can only be expressed in cell typespecific pseudobulk data by discretizing the continuous cell types. Fourth, it is wellknown that count data evince a complex meanvariance relationship, and studies have observed that the variance of gene expression across cells is dependent on mean expression^{37}. Nonetheless, simulations show that this problem is unlikely to be important in practice (Supplementary Fig. 32). Moreover, we find biologically plausible genes with significant differential variance but without significant differential mean, showing that modeling variance has utility beyond merely tagging mean signals. 
Fifth, CTMM only considers pseudobulk data, which greatly improves computational efficiency and facilitates its simplifying Gaussian assumptions. Nonetheless, pseudobulk inherently discards celllevel information, which sacrifices statistical power and resolution within cell types. Moreover, cell typespecific pseudobulk necessarily discretized cells into categories, which can be somewhat arbitrary. Finally, despite our use of careful nonparametric tests, our Free test for cell typespecificity remains slightly inflated, emphasizing the importance of biologically validating and replicating results. While this inflation is small for sample sizes around ~100 and vanishes for sample sizes above ~300, CTMM is not reliable for sample sizes below ~50. Nonetheless, CTMM’s estimates remain unbiased (Supplementary Fig. 1), hence it can be used to profile transcriptomewide averages for any sample size. Also, we developed a simplified version of CTMM which remains calibrated for sample sizes ~50, but it assumes that all cell typespecific variances are equal (Supplementary Fig. 33).
CTMM is a step toward translating well-established LMM methodologies to scRNA-seq data. A key extension of CTMM is to quantify cell type-specific heritability of gene expression, which is typically more powerful than single-SNP tests of context-specific genetic regulation^{12,21}. Because CTMM models cell-level noise, it can eliminate downward biases in heritability that are unavoidable in bulk expression data. Another extension is to jointly model covariance across both cell types and genes. For example, this enables identifying cell type-shared and cell type-specific networks. This, too, is necessarily biased in bulk expression data, where covarying measurement errors will confound biologically meaningful networks. The Full model can also be extended to learn structured networks between cell types by leveraging penalized covariance estimates^{38} or by specifically tailoring it to a given application; for example, we could restrict \(\mathbf{V}\) to be banded to capture temporal structure in the differentiating iPSCs. It would also be useful to extend CTMM to test for variance differences between groups of individuals, e.g., disease cases and controls. Nonetheless, this will require careful modeling and robustness tests to account for subtle ascertainment biases. The long-term goal is to combine these features into a comprehensive model of transcriptomic covariation across cells, cell types, individuals, and environments in order to understand genetic and non-genetic drivers of complex diseases. Overall, we consider CTMM an important step on a long path to fully understanding the causes and consequences of variation within and between individuals in scRNA-seq data.
Methods
Models
Overview of cell type-specific linear mixed models for gene expression
We model the expression level of a given gene for individual \(i\), cell type \(c\), and cell \(s\) by:
\[y_{ics}=\beta_{c}+\alpha_{i}+\Gamma_{ic}+\epsilon_{ics} \tag{3}\]
In this model, \(y_{ics}\) is the gene expression level for the \(s\)th cell from cell type \(c\) in individual \(i\); note that the number of measured cells varies across individuals and cell types. \(\beta_{c}\) is the mean expression level in cell type \(c\), which we model as a fixed effect. \(\alpha_{i}\) captures differences between individuals that are shared across cell types, which we model as a random effect: \(\alpha_{i}\overset{iid}{\sim}N\left(0,\sigma_{\alpha}^{2}\right)\). \(\Gamma_{ic}\) captures the difference between individuals that is specific to cell type \(c\), which we also model as a random effect by \(\boldsymbol{\Gamma}_{i,\cdot}\overset{iid}{\sim}N\left(\mathbf{0},\mathbf{V}\right)\). Here \(\boldsymbol{\Gamma}_{i,\cdot}\) is the vector of cell type-specific expression for individual \(i\) (one entry per cell type) and \(\mathbf{V}\) is a \(C\times C\) matrix describing cell type-specific variances and covariances across cell types.
Finally, \(\epsilon_{ics}\) is the residual effect, which we assume to be i.i.d. for each individual-cell type pair with \(E\left(\epsilon_{ics}\right)=0\) and \(\mathrm{Var}\left(\epsilon_{ics}\right)=\sigma_{ic}^{2}\). We directly estimate \(\sigma_{ic}^{2}\) from the single cell-level data by \(\hat{\sigma}_{ic}^{2}:=\sum_{s=1}^{n_{ic}}\left(y_{ics}-y_{ic}\right)^{2}/(n_{ic}-1)\), where \(n_{ic}\) is the number of cells in individual \(i\) and cell type \(c\) and \(y_{ic}\) is the average expression across all \(n_{ic}\) cells (i.e., cell type-specific pseudobulk, defined below). \(\hat{\sigma}_{ic}^{2}\) is unbiased even if \(\epsilon_{ics}\) is non-Gaussian, which is important because expression in single cells is non-Gaussian. Note that this estimate is impossible in bulk expression data, even if sorted into cell types, because bulk only measures average expression. That is, scRNA-seq data make it possible to distinguish true inter-individual variation from measurement noise.
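As an illustration of this estimator, the following minimal Python sketch computes \(\hat{\sigma}_{ic}^{2}\) for one individual-cell type pair. The function name `cell_noise_variance` and the toy expression values are hypothetical and are not taken from the CTMM package.

```python
import numpy as np

def cell_noise_variance(y_cells):
    """Unbiased estimate of sigma_ic^2 from the cells of one
    individual-cell type pair: sum((y - mean)^2) / (n_ic - 1)."""
    y_cells = np.asarray(y_cells, dtype=float)
    n_ic = y_cells.size
    if n_ic < 2:
        raise ValueError("need at least two cells to estimate a variance")
    y_ic = y_cells.mean()  # cell type-specific pseudobulk for this pair
    return float(((y_cells - y_ic) ** 2).sum() / (n_ic - 1))

# Toy (hypothetical) expression values for five cells
sigma2_hat = cell_noise_variance([1.0, 2.0, 3.0, 4.0, 5.0])
```

The same value is returned by `np.var(y_cells, ddof=1)`; the explicit form is written out only to mirror the formula in the text.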
Our focus is the covariance matrix \(\mathbf{V}\), which captures the differences and similarities between cell types (for a given gene). The diagonal terms capture cell type-specific variance. If there is no cell type-specific variation between individuals, then \(\mathbf{V}_{cc}=0\) for all \(c\). The off-diagonal terms capture covariance between specific pairs of cell types; if all cell types are equally similar to each other, then \(\mathbf{V}_{cc'}=0\) for all \(c\ne c'\).
We consider three nested models of inter-individual variation defined by the structure of \(\mathbf{V}\). First, the homogeneous (Hom) model assumes that \(\mathbf{V}=0\), i.e., that all expression variance is shared homogeneously across cell types without any cell type-specificity. Second, the Free model allows arbitrary levels of cell type-specific variance by allowing \(\mathbf{V}\) to be an arbitrary diagonal matrix. Third, the Full model captures arbitrary levels of covariance between specific cell type pairs by allowing \(\mathbf{V}\) to be any positive semidefinite matrix. Intuitively, the Hom model captures variation across individuals, but assumes this variation is identically shared across cell types. The Free model allows cell type-specific variation, e.g., a gene that is largely similar between individuals except in a single cell type. The Full model allows complex relationships among cell types, e.g., hierarchical relationships among immune cell types.
A technical consideration in the Full model is that \(\mathbf{V}\) and \(\sigma_{\alpha}^{2}\) are not jointly identified. Specifically, passing a constant between \(\sigma_{\alpha}^{2}\) and \(\mathbf{V}\) does not change the likelihood (i.e., \(L\left(\sigma_{\alpha}^{2},\mathbf{V}\right)\equiv L\left(\sigma_{\alpha}^{2}-\lambda,\mathbf{V}+\lambda\,\mathbf{J}_{C}\right)\), where \(\mathbf{J}_{C}\) is the \(C\times C\) matrix containing all 1s). Therefore, without loss of generality, we set \(\sigma_{\alpha}^{2}=0\) in the Full model. The Full model is statistically challenging because its number of parameters scales quadratically with the number of cell types, \(C\). In practice, the Full model only has precise estimates with hundreds to thousands of samples or, as below, when aggregating many genes.
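This invariance can be checked numerically. The sketch below assumes that the covariance of one individual's \(C\) cell type-specific pseudobulk values is \(\sigma_{\alpha}^{2}\mathbf{J}_{C}+\mathbf{V}\) plus a diagonal matrix of measurement-noise variances, which is what the model implies when the random effects and noise are independent. All numeric values and the function name `ctp_covariance` are hypothetical.

```python
import numpy as np

def ctp_covariance(sigma_alpha2, V, nu_i):
    """Covariance of one individual's C pseudobulk values: shared variance
    on every entry, V, and measurement-noise variances on the diagonal."""
    C = V.shape[0]
    J = np.ones((C, C))
    return sigma_alpha2 * J + V + np.diag(nu_i)

C = 3
V = np.array([[0.30, 0.10, 0.05],
              [0.10, 0.20, 0.02],
              [0.05, 0.02, 0.40]])    # hypothetical Full-model V
nu_i = np.array([0.05, 0.08, 0.03])   # hypothetical noise variances
sigma_alpha2, lam = 0.25, 0.10

cov_a = ctp_covariance(sigma_alpha2, V, nu_i)
# Move the constant lam from sigma_alpha^2 into V
cov_b = ctp_covariance(sigma_alpha2 - lam, V + lam * np.ones((C, C)), nu_i)
same_covariance = np.allclose(cov_a, cov_b)
```

Because a Gaussian likelihood depends on the variance parameters only through this covariance, the two parameterizations cannot be distinguished by the data, which is why fixing \(\sigma_{\alpha}^{2}=0\) loses no generality.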
Deriving models for overall and cell type-specific pseudobulk
Directly modeling single-cell expression as in Eq. 3 is challenging computationally and statistically. Computationally, modeling individual cells increases the number of observations by orders of magnitude because there can be dozens or hundreds of cells per individual-cell type pair. Statistically, the individual cell’s expression is highly non-Gaussian, requiring additional assumptions and computationally expensive generalized linear mixed models. Instead, we study scRNA-seq data at the level of pseudobulk expression, which averages expression over many cells. We consider both overall pseudobulk (OP), which averages over all measured cells per individual, and cell type-specific pseudobulk (CTP), which averages over cells in each cell type per individual.
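Specifically, the pseudobulk measures that we input to CTMM are:
\[y_{i}:=\frac{1}{n_{i}}\sum_{c=1}^{C}\sum_{s=1}^{n_{ic}}y_{ics},\qquad y_{ic}:=\frac{1}{n_{ic}}\sum_{s=1}^{n_{ic}}y_{ics} \tag{4}\]
where \(y_{i}\) is the OP expression for individual \(i\), \(y_{ic}\) is the CTP expression for individual \(i\) and cell type \(c\), and \(n_{i}=\sum_{c=1}^{C}n_{ic}\) is the total number of cells measured in individual \(i\).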
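The two pseudobulk measures can be computed from a cell-level table as in the following sketch for a single gene. The function `pseudobulk`, its arguments, and the toy data are hypothetical; individuals and cell types are assumed to be coded as integers starting at 0.

```python
import numpy as np

def pseudobulk(expr, individual, cell_type, n_ind, n_ct):
    """Return OP (length n_ind), CTP (n_ind x n_ct), and cell counts
    (n_ind x n_ct) for one gene. Pairs with no cells get NaN in CTP."""
    expr = np.asarray(expr, dtype=float)
    individual = np.asarray(individual)
    cell_type = np.asarray(cell_type)
    sums = np.zeros((n_ind, n_ct))
    counts = np.zeros((n_ind, n_ct))
    # Accumulate expression and cell counts per individual-cell type pair
    np.add.at(sums, (individual, cell_type), expr)
    np.add.at(counts, (individual, cell_type), 1)
    with np.errstate(invalid="ignore", divide="ignore"):
        ctp = sums / counts                          # mean within each pair
        op = sums.sum(axis=1) / counts.sum(axis=1)   # mean over all cells
    return op, ctp, counts

# Toy data: five cells from two individuals and two cell types
op, ctp, counts = pseudobulk(
    expr=[1.0, 3.0, 5.0, 7.0, 9.0],
    individual=[0, 0, 0, 1, 1],
    cell_type=[0, 0, 1, 0, 1],
    n_ind=2, n_ct=2,
)
```

Note that OP is the average over cells, not the average of the CTP values, so cell types with more cells carry more weight.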
Our cell-level model in Eq. 3 implies the following mixed model for the OP expression:
\[y_{i}=\sum_{c=1}^{C}P_{ic}\beta_{c}+\alpha_{i}+\sum_{c=1}^{C}P_{ic}\Gamma_{ic}+\delta_{i} \tag{5}\]
with \(\delta_{i}:=\frac{1}{n_{i}}\sum_{c=1}^{C}\sum_{s=1}^{n_{ic}}\epsilon_{ics}\overset{ind}{\sim}N\left(0,\nu_{i}\right);\ \nu_{i}:=\sum_{c=1}^{C}\frac{n_{ic}}{n_{i}^{2}}\sigma_{ic}^{2}\)
\(\delta_{i}\) is the measurement noise for individual \(i\), with variance \(\nu_{i}\) that we estimate by plugging in our estimate of \(\sigma_{ic}^{2}\). \(\mathbf{P}\) is the matrix of cell type proportions, defined by \(P_{ic}:=\frac{n_{ic}}{n_{i}}\).
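A short sketch of how the proportions and the plug-in OP noise variance could be computed from per-pair cell counts and variance estimates. The function name `op_noise` and the numbers are hypothetical.

```python
import numpy as np

def op_noise(counts, sigma2_hat):
    """counts, sigma2_hat: N x C arrays of n_ic and estimated sigma_ic^2.
    Returns the proportion matrix P (N x C) and nu_i (length N)."""
    counts = np.asarray(counts, dtype=float)
    sigma2_hat = np.asarray(sigma2_hat, dtype=float)
    n_i = counts.sum(axis=1)                       # total cells per individual
    P = counts / n_i[:, None]                      # P_ic = n_ic / n_i
    nu_op = (counts * sigma2_hat).sum(axis=1) / n_i ** 2
    return P, nu_op

counts = [[20, 30], [10, 40]]            # hypothetical cell counts
sigma2_hat = [[1.0, 2.0], [0.5, 1.5]]    # hypothetical variance estimates
P, nu_op = op_noise(counts, sigma2_hat)
```

Each row of `P` sums to 1, which is why \(\alpha_{i}\) enters Eq. 5 with coefficient 1.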
Our cell-level model in Eq. 3 also implies a mixed model on the CTP expression data:
\[y_{ic}=\beta_{c}+\alpha_{i}+\Gamma_{ic}+\delta_{ic} \tag{6}\]
with \(\delta_{ic}:=\frac{1}{n_{ic}}\sum_{s=1}^{n_{ic}}\epsilon_{ics}\overset{ind}{\sim}N\left(0,\nu_{ic}\right);\ \nu_{ic}=\frac{\sigma_{ic}^{2}}{n_{ic}}\)
Here, \(\delta_{ic}\) is the noise for individual \(i\) and cell type \(c\), with variance \(\nu_{ic}\). By the central limit theorem, both \(\delta_{i}\) and \(\delta_{ic}\) are approximately Gaussian when \(n_{ic}\) is not too small, even though \(\epsilon_{ics}\) is very non-Gaussian. Note that Eq. 6 is the same as Eq. 2.
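For reference, the first two moments implied by the OP and CTP models can be written out explicitly. This is a sketch that assumes \(\alpha_{i}\), the cell type-specific effects, and the noise terms are mutually independent, treats the cell type proportions as fixed, and ignores additional covariates. The vectors \(\mathbf{y}_{i,\cdot}\), \(\mathbf{P}_{i,\cdot}\), and \(\boldsymbol{\beta}\) (length \(C\): CTP values, proportions, and cell type means) are notation introduced here for compactness.

```latex
\begin{aligned}
E\left(\mathbf{y}_{i,\cdot}\right) &= \boldsymbol{\beta}, &
\mathrm{Cov}\left(\mathbf{y}_{i,\cdot}\right) &= \sigma_{\alpha}^{2}\,\mathbf{J}_{C} + \mathbf{V}
  + \mathrm{diag}\left(\nu_{i1},\dots,\nu_{iC}\right), \\
E\left(y_{i}\right) &= \mathbf{P}_{i,\cdot}^{\top}\boldsymbol{\beta}, &
\mathrm{Var}\left(y_{i}\right) &= \sigma_{\alpha}^{2}
  + \mathbf{P}_{i,\cdot}^{\top}\,\mathbf{V}\,\mathbf{P}_{i,\cdot} + \nu_{i}.
\end{aligned}
```

Under the Hom model (\(\mathbf{V}=0\)) the OP variance no longer depends on the proportions, whereas under the Free model it becomes \(\sigma_{\alpha}^{2}+\sum_{c}P_{ic}^{2}\mathbf{V}_{cc}+\nu_{i}\), so OP data carry information about \(\mathbf{V}\) only through differences in cell type proportions across individuals.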
Fitting and testing parameters of CTMM
We evaluated three approaches to estimate the parameters in CTMM: ML, REML, and HE. We implemented ML by maximizing the likelihood function using the BFGS algorithm implemented in the R function ‘optim’ (Supplementary Note 1). REML was fit similarly using the restricted likelihood, which residualizes covariates from the full likelihood (Supplementary Note 1). For both REML and ML, we reran the optimization with 10 random restarts if the initial attempt failed (Supplementary Note 1), which is important to mitigate bias from local maxima with modest sample sizes. We allowed negative variance components to reduce bias, though the total expression variance was always positive. Due to the complexity of these likelihood functions, we evaluated refining the BFGS solution with Nelder-Mead iterations; we found that this is not necessary for the analyses considered in the Main text, but it can be important for more challenging analyses, such as fitting the Free model with ML on OP data (Supplementary Fig. 34). We fit HE analytically (Supplementary Note 1).
Because the CTP expression data is a vector of length \(NC\), where \(N\) is the number of individuals, naively fitting CTMM in ML and REML has a computational complexity of \(O(N^{3}C^{3})\). We use several linear algebra identities to simplify the complexity to \(O(NC^{3})\). The relative gains will increase as \(N\) and \(C\) grow, which are both expected in future scRNA-seq datasets.
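The identities used by CTMM are given in Supplementary Note 1 and are not reproduced here. As a hedged illustration of how a saving of this order can arise, the sketch below uses only the fact that, without additional random effects, individuals are independent, so the \(NC\times NC\) covariance is block diagonal with \(N\) blocks of size \(C\times C\). Function names and toy values are hypothetical, and the fixed effects are limited to the cell type means.

```python
import numpy as np

def ctp_loglik_blockwise(Y, beta, sigma_alpha2, V, nu):
    """Gaussian log-likelihood of CTP data Y (N x C), summing N independent
    C x C blocks instead of factorizing one NC x NC covariance."""
    N, C = Y.shape
    J = np.ones((C, C))
    ll = 0.0
    for i in range(N):
        S = sigma_alpha2 * J + V + np.diag(nu[i])   # covariance of individual i
        r = Y[i] - beta
        _, logdet = np.linalg.slogdet(S)
        ll += -0.5 * (C * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(S, r))
    return ll

def ctp_loglik_dense(Y, beta, sigma_alpha2, V, nu):
    """Same likelihood via the full NC x NC covariance (for checking only)."""
    N, C = Y.shape
    big = np.zeros((N * C, N * C))
    for i in range(N):
        block = sigma_alpha2 * np.ones((C, C)) + V + np.diag(nu[i])
        big[i * C:(i + 1) * C, i * C:(i + 1) * C] = block
    r = (Y - beta).ravel()
    _, logdet = np.linalg.slogdet(big)
    return -0.5 * (N * C * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(big, r))

rng = np.random.default_rng(0)
N, C = 20, 4
Y = rng.normal(size=(N, C))                  # toy CTP data
beta = np.zeros(C)
V = np.diag([0.3, 0.1, 0.1, 0.1])            # hypothetical Free-model V
nu = rng.uniform(0.02, 0.1, size=(N, C))     # hypothetical noise variances
ll_block = ctp_loglik_blockwise(Y, beta, 0.2, V, nu)
ll_dense = ctp_loglik_dense(Y, beta, 0.2, V, nu)
```

The blockwise version costs \(N\) solves of size \(C\), i.e., \(O(NC^{3})\), while the dense version factorizes an \(NC\times NC\) matrix.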
Our primary test compared the Hom and Free models, which asks whether inter-individual variation is cell type-specific or shared uniformly across cell types. In other words, this is a test for differential expression variance across cell types. By comparison, standard tests for differential expression ask whether the mean expression levels, \(\beta_{c}\), differ across cell types.
We implemented LRT and Wald tests to compare the Free and Hom models. For the LRT, we used \(C\) degrees of freedom because the Free model adds a variance component for each cell type (but see ref. ^{39} and ref. ^{40} for a more detailed discussion of these tests). For the Wald test, we used an \(F\)-test with \(C\) numerator degrees of freedom and \(N-R\) denominator degrees of freedom, where \(R\) is the number of model parameters in the Free model, that is, \(2C+1\). We evaluated two options to estimate the precision matrix for CTMM’s variance component estimates, which is needed for the Wald test. First, we used the inverse of the Fisher information matrix for REML and ML, which is consistent for large sample sizes. Second, we used a jackknife to nonparametrically estimate the precision matrix by fitting the model after excluding each sample in turn. For large sample sizes, both LRT and Wald tests are valid; however, we are interested in modest sample sizes, and hence, we profile a wide range of approaches.
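The jackknife step can be sketched generically as follows. This is not the CTMM implementation: a stand-in estimator (column means) replaces the model fit, the standard leave-one-out covariance formula with factor \((N-1)/N\) is assumed, and the Wald statistic is scaled by its number of tested parameters, as is conventional for an \(F\)-test. All names are hypothetical.

```python
import numpy as np

def jackknife_cov(estimator, data):
    """Leave-one-individual-out jackknife covariance of estimator(data).
    data: array whose first axis indexes individuals."""
    N = data.shape[0]
    loo = np.array([estimator(np.delete(data, i, axis=0)) for i in range(N)])
    centered = loo - loo.mean(axis=0)
    return (N - 1) / N * centered.T @ centered

def wald_statistic(theta_hat, cov):
    """theta' cov^{-1} theta / q for H0: theta = 0, where q = len(theta).
    In the setting of the text it would be compared to an F(C, N - R) law."""
    theta_hat = np.atleast_1d(theta_hat)
    return float(theta_hat @ np.linalg.solve(cov, theta_hat)) / theta_hat.size

rng = np.random.default_rng(0)
data = rng.normal(loc=0.5, size=(30, 3))      # toy data: 30 individuals

def column_means(x):
    # Stand-in estimator; in CTMM this would be a refit of the variance components
    return x.mean(axis=0)

theta_hat = column_means(data)
cov_jk = jackknife_cov(column_means, data)
wald = wald_statistic(theta_hat, cov_jk)
```

For the sample mean, the jackknife covariance equals the usual sample covariance divided by \(N\), which gives a simple check that the formula is applied correctly.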
We tested for mean expression differentiation by evaluating the null hypothesis that \(\beta_{c}=\beta_{c'}\) for all cell types \(c\) and \(c'\). \(\boldsymbol{\beta}\) is the cell type fixed effect (i.e., cell type-specific mean expression) and is estimated using generalized least squares with variance components fit under the Free model. We used a Wald test for \(\boldsymbol{\beta}\) with numerator degrees of freedom \(C-1\) and denominator degrees of freedom \(N-R\), where \(R\) is the number of parameters in the Free model, that is, \(2C+1\). We estimated the precision matrix for CTMM’s estimates of \(\boldsymbol{\beta}\) using the jackknife. The jackknife includes refitting variance components, which is important because these variance component estimates are noisy. As the covariance matrix estimated by HE can be singular and the fixed effects are not our focus, for HE we tested for mean expression differentiation simply using ordinary least squares.
We have also extended CTMM to accommodate additional random effects (Supplementary Note 1). This can be essential in practice, but it can be computationally infeasible as it requires inverting large matrices. Fortunately, the primary use case involves blocked random effects, such as experimental batch in our iPSC analysis. We derived a different optimization approach designed specifically for this common scenario, which simplified the computational complexity by orders of magnitude. In our iPSC analysis, these manipulations reduced REML computation time per gene from ~40 to ~10 s.
Prior LMM applications to scRNA-seq data
LMMs are a basic statistical framework for partitioning variation, and several prior studies have applied LMMs to scRNA-seq data (Supplementary Table 1). Most prior work has applied generic LMM methods at the level of single cells. They fit variance components for batch effects^{11,13}, experimental context^{10}, and/or some form of inter-individual variation^{11,12,13,14}. Additionally, some of these studies model the non-Gaussianity of cell-level expression data^{10,14}. Despite the strengths of these works, none aim to partition shared vs specific components of inter-individual variation. This is the key novelty in CTMM, and it requires a different variance component model than the ones that have been used in prior work. At a more technical level, CTMM develops a novel testing framework based on the jackknife rather than using off-the-shelf tools, which addresses biases in standard LMM variance component tests due to the complexity of scRNA-seq data.
Simulation
We tested the performance of CTMM with a series of simulations under Hom and Free models. We simulated gene expression for each individual from Eq. 5 (for overall pseudobulk) and Eq. 6 (for cell type-specific pseudobulk). We varied the number of individuals, cell type proportions, number of cell types, and levels of cell type-specific variance. For each simulated dataset, we evaluated all three methods to fit CTMM (ML, REML, and HE) and each applicable test for cell type-specific inter-individual variance (LRT and Wald). For each simulation parameter setting, we ran 1000 replicate simulations to calculate the average CTMM estimates, their sampling distributions, and the test positive rate. We also simulated under the more complex Full model. Further simulation details are provided in Supplementary Note 1, with a list of simulation parameters in Supplementary Table 2.
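A minimal sketch of drawing CTP data from Eq. 6 under the Free model is shown below. The parameter values are hypothetical and are not those in Supplementary Table 2, and the function name `simulate_ctp` is illustrative.

```python
import numpy as np

def simulate_ctp(N, beta, sigma_alpha2, V_diag, nu, rng):
    """Draw an N x C matrix of CTP expression under the Free model:
    y_ic = beta_c + alpha_i + Gamma_ic + delta_ic."""
    beta = np.asarray(beta, dtype=float)
    V_diag = np.asarray(V_diag, dtype=float)
    C = beta.size
    alpha = rng.normal(0.0, np.sqrt(sigma_alpha2), size=(N, 1))  # shared effect
    gamma = rng.normal(0.0, np.sqrt(V_diag), size=(N, C))        # cell type-specific
    delta = rng.normal(0.0, np.sqrt(nu), size=(N, C))            # measurement noise
    return beta + alpha + gamma + delta

rng = np.random.default_rng(0)
N, C = 94, 4
beta = np.array([0.0, 0.5, 1.0, 1.5])
sigma_alpha2 = 0.2
V_diag = np.array([0.3, 0.1, 0.1, 0.1])   # larger variance in cell type 1
nu = np.full((N, C), 0.05)
Y = simulate_ctp(N, beta, sigma_alpha2, V_diag, nu, rng)
```

Setting `V_diag` to zeros gives a draw from the Hom model, so the same function can generate data under both the null and the alternative of the Hom vs Free test.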
We also performed simulations to assess CTMM’s sensitivity to estimation errors in \(\nu_{ic}\), the level of noise due to cell-level variation. This is important because \(\nu_{ic}\) is not known in practice. Specifically, for each \(\nu_{ic}\), we drew \(x_{ic}\) i.i.d. from a \(\mathrm{Beta}(2,b)\) distribution and then added \(+x_{ic}\nu_{ic}\) or \(-x_{ic}\nu_{ic}\) to \(\nu_{ic}\) before inputting it to CTMM (Supplementary Note 1). To span the range of estimation errors in the real iPSC data, we simulated \(b=20,\,10,\,5,\,3,\,2\). To evaluate power under a range of Free models, we varied the cell type-specific variance for cell type 1 (\(\mathbf{V}_{11}\)) from 0.05 to 0.5 and fixed the other cell type-specific variances to 0.1. For simplicity, the Free model simulations always used \(b=5\) (the most realistic value). As this simulation focuses on CTMM’s utility in our real data analysis, we simulated using the parameters we estimated in the iPSC data below (Supplementary Note 1), and we only examined CTP as it is far more powerful. We ran 1000 replicates for each setting of simulation parameters.
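The perturbation of \(\nu_{ic}\) can be sketched as follows. The text does not state how the sign is chosen, so this sketch assumes that the sign is drawn with equal probability; that choice and the function name `perturb_nu` are assumptions, with details in Supplementary Note 1.

```python
import numpy as np

def perturb_nu(nu, b, rng):
    """Add +x*nu or -x*nu to each nu, with x ~ Beta(2, b).
    Assumption: the sign is chosen with probability 1/2 each."""
    nu = np.asarray(nu, dtype=float)
    x = rng.beta(2.0, b, size=nu.shape)              # relative error in (0, 1)
    sign = rng.choice([-1.0, 1.0], size=nu.shape)
    return nu * (1.0 + sign * x)

rng = np.random.default_rng(0)
nu = np.full((94, 4), 0.05)        # hypothetical true noise variances
nu_noisy = perturb_nu(nu, b=5.0, rng=rng)
```

Because \(x_{ic}\) lies strictly between 0 and 1, the perturbed values stay positive. The mean relative error is \(2/(2+b)\), so it grows from about 9% at \(b=20\) to 50% at \(b=2\).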
Differentiating iPSCs analysis
Data and model
Human induced pluripotent stem cells (iPSCs) are derived from somatic cells that have been reprogrammed into an embryonic-like pluripotent state. iPSCs can differentiate into diverse cell types, with a concomitant transcriptomic trajectory across time as the cells differentiate. We studied the transcriptome as iPSCs differentiate into endoderm using scRNA-seq data from 125 individual donors^{13}. Cells were collected on four consecutive days as the iPSCs differentiated, starting from iPSCs, which we used to define four cell types. We used the log-transformed gene expression data provided by Cuomo et al., which has been through a thorough process of quality control and normalization (https://zenodo.org/record/3625024#.Xil0y2cZ0s). The dataset includes 11,231 genes and 36,044 cells.
For the 33 individuals who had technical replicates in the data, we only included the replicate with the largest number of cells. We excluded individuals with fewer than 100 cells to better satisfy the Gaussian approximation of \(\delta_{ic}\), leaving 94 individuals.
For each gene, we standardized OP and CTP by scaling such that OP has a mean of 0 and a variance of 1. We then fit the Hom, Free, and Full models to the scaled OP and CTP expression with ML, REML, and HE. In all models, we adjusted for sex, neonatal diabetes, and the first six principal components calculated on OP expression as fixed effects. We used our extension of CTMM to model the experimental batch as a random effect, which is important because the batch has large effects that cannot be ignored yet has too many degrees of freedom to fit as fixed effects (24 batches vs 94 individuals). We used Bonferroni correction to account for multiple testing across genes.
Impute pseudobulk data
For individual-cell type pairs with no more than ten cells, \(y_{ic}\) and \(\nu_{ic}\) were set to missing. Requiring more than ten cells is our default guidance (in practice, we find that our results are robust to modifying this cutoff from 10 cells to 5 or 20, Supplementary Fig. 35). We then imputed missing entries in \(y_{ic}\) and \(\nu_{ic}\). We compared three approaches to imputation (each applied separately to \(\mathbf{y}\) and \(\boldsymbol{\nu}\)). First, we imputed each gene separately using either softImpute^{41} or MVN-impute (implemented in ref. ^{42}). In brief, the former makes a low-rank approximation, while the latter approximates individuals as independent and leverages correlations among cell types. We also evaluated imputing all genes jointly across the transcriptome using softImpute (in an \(N\times CG\) matrix, where \(G\) is the number of genes); this is computationally impossible with MVN-impute.
To evaluate imputation accuracy, we masked observed entries in \(y_{ic}\) and \(\nu_{ic}\) and compared the imputed values to the masked true values. Of note, if one cell type of an individual has ten or fewer cells, the expression of all genes is missing for that individual-cell type pair. To be realistic, we maintained this structure of missingness by employing a “copy-mask” approach as in our prior study^{42}. We randomly sampled an individual with missing cell types and masked the same cell types in another randomly chosen individual. We repeated this process until 10% of all individual-cell type pairs were masked. We calculated the correlation and mean squared error (MSE) between imputed values and masked true values across individuals for each gene-cell type pair. We conducted ten replicates of the masking and imputation process and used the medians of the correlation and MSE across replicates as final measures of imputation accuracy. For \(\nu_{ic}\), imputation can yield negative values by chance. We treated these negative variances differently in OP and CTP expression data. In OP, we set negative variances to 0, so they had little impact on the estimation of \(\nu_{i}\) while maintaining information from other cell types; in CTP, for each gene and cell type, we set them to the maximum raw \(\nu_{ic}\) for that gene and cell type, so they contributed less to the model likelihood. Note that standard approaches to impute expression in single cells^{43,44} do not impute the pseudobulk data, which has missing entries due to missing cells, not missing expression within observed cells.
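The copy-mask procedure can be sketched as below. This is an illustrative reading of the description, not the code from the prior study: the 10% target is taken relative to all individual-cell type pairs, the recipient is drawn uniformly from all individuals, and the function name `copy_mask` and the toy missingness pattern are hypothetical.

```python
import numpy as np

def copy_mask(missing, target_frac, rng, max_iter=10_000):
    """missing: N x C boolean array, True where a pair is unobserved.
    Returns an N x C boolean array marking observed entries to mask."""
    N, C = missing.shape
    donors = np.flatnonzero(missing.any(axis=1))   # individuals with missing cell types
    if donors.size == 0:
        raise ValueError("no missingness pattern to copy")
    masked = np.zeros((N, C), dtype=bool)
    target = target_frac * N * C
    for _ in range(max_iter):
        if masked.sum() >= target:
            break
        pattern = missing[rng.choice(donors)]      # cell types missing in the donor
        recipient = rng.integers(N)
        # Mask the same cell types in the recipient, where they are observed
        masked[recipient] |= pattern & ~missing[recipient]
    return masked

rng = np.random.default_rng(0)
missing = np.zeros((50, 4), dtype=bool)   # toy pattern for 50 individuals
missing[::5, 1] = True
missing[3::7, 3] = True
mask = copy_mask(missing, target_frac=0.10, rng=rng)
```

Copying whole patterns, rather than masking entries independently, preserves the fact that an individual-cell type pair is missing for all genes at once and that some cell types are missing more often than others.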
Enrichment of gene features related to enhancers and selection
We evaluated four gene-level features: LOEUF, pLI, EDS, and the number of enhancers. LOEUF and pLI were obtained from the Genome Aggregation Database (gnomAD) version 2.1^{29}. LOEUF and pLI measure a gene’s susceptibility to loss-of-function mutations, and they approximately quantify the degree of selection on a gene. EDS and the number of enhancers were obtained from Wang and Goldstein^{27}. The number of enhancers was computed from enhancer-gene links inferred by ref. ^{45} based on chromatin state and correlation of histone modifications with gene expression. EDS is a comprehensive score derived from 108 features associated with enhancer domains, including the number of enhancers. It reflects the size and redundancy of enhancer domains in a gene.
For each feature, genes were stratified into deciles based on their respective feature scores. Subsequently, we computed both the mean and median values of various gene expression properties within each decile, as well as their standard errors. These gene expression properties are:

the total inter-individual variance, which sums the cell type-shared variance with the average cell type-specific variance: \(\sigma_{\alpha}^{2}+\widetilde{V}\), where \(\widetilde{V}=\frac{1}{C}\sum_{c=1}^{C}\mathbf{V}_{cc}\) is the average cell type-specific variance.

the proportion of inter-individual variation that is cell type-specific, defined by \(\frac{\widetilde{V}}{\sigma_{\alpha}^{2}+\widetilde{V}}\).

the amount of mean differences across cell types, quantified by the variance of the mean expression level across cell types: \(\mathrm{var}(\beta)\).

the positive rate for two cell type-specificity tests: CTMM’s test of cell type-specific inter-individual variance and the ordinary test of cell type-specific mean expression.
To robustly examine the broad relationship between CTMM results and gene features, we performed a meta-regression of each decile’s mean and median CTMM results against the decile index using ordinary least squares.
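The variance summaries and the decile meta-regression can be sketched as follows. Function names and toy inputs are hypothetical, standard errors are omitted, and the decile assignment assumes a feature without heavy ties so that every decile is non-empty.

```python
import numpy as np

def specificity_summary(sigma_alpha2, V_diag):
    """Total inter-individual variance and the proportion that is
    cell type-specific, from the Free-model variance components."""
    v_bar = float(np.mean(V_diag))          # average cell type-specific variance
    total = sigma_alpha2 + v_bar
    return total, v_bar / total

def decile_regression(feature, value):
    """Stratify genes into feature deciles, then regress the per-decile mean
    and median of `value` on the decile index with ordinary least squares."""
    feature = np.asarray(feature, dtype=float)
    value = np.asarray(value, dtype=float)
    edges = np.quantile(feature, np.linspace(0.1, 0.9, 9))
    decile = np.searchsorted(edges, feature, side="right")   # 0..9
    idx = np.arange(10)
    means = np.array([value[decile == d].mean() for d in idx])
    medians = np.array([np.median(value[decile == d]) for d in idx])
    slope_mean = np.polyfit(idx, means, 1)[0]       # OLS slope on decile index
    slope_median = np.polyfit(idx, medians, 1)[0]
    return means, medians, slope_mean, slope_median

total, prop = specificity_summary(0.3, [0.1, 0.3])
# Toy example: 100 genes whose property increases linearly with the feature
feature = np.arange(100.0)
means, medians, slope_mean, slope_median = decile_regression(feature, 2 * feature + 1)
```

Regressing decile summaries, rather than individual genes, reduces the influence of outlying genes, which is the sense in which the comparison is robust.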
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
Processed single-cell count data from iPSCs are publicly available from Zenodo: https://zenodo.org/record/3625024#.Xil0y2cZ0s. OneK1K single-cell gene expression data are publicly available via Gene Expression Omnibus (GSE196830 [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE196830]). The simulated datasets and imputed pseudobulk data are fully reproducible using the code provided in the study and the publicly available iPSC and OneK1K data. All data generated during this study are included in this published article and its supplementary information files.
Code availability
The CTMM Python package, along with Python (version 3.11.5) and R (version 4.3.1) code used for all analyses in this paper, is available at: https://github.com/MinhuiChen/CTMM.
References
Travaglini, K. J. et al. A molecular cell atlas of the human lung from singlecell RNA sequencing. Nature 587, 619–625 (2020).
Grubman, A. et al. A singlecell atlas of entorhinal cortex from individuals with Alzheimer’s disease reveals celltypespecific gene expression regulation. Nat. Neurosci. 22, 2087–2097 (2019).
Bageritz, J. et al. Gene expression atlas of a developing tissue by single cell expression correlation analysis. Nat. Methods 16, 750–756 (2019).
Peng, J. et al. Singlecell RNAseq highlights intratumoral heterogeneity and malignant progression in pancreatic ductal adenocarcinoma. Cell Res. 29, 725–738 (2019).
Wang, Z. et al. Singlecell RNA sequencing of peripheral blood mononuclear cells from acute Kawasaki disease patients. Nat. Commun. 12, 1–10 (2021).
Kang, H. M. et al. Multiplexed droplet singlecell RNAsequencing using natural genetic variation. Nat. Biotechnol. 36, 89–94 (2017).
Oelen, R. et al. Singlecell RNAsequencing of peripheral blood mononuclear cells reveals widespread, contextspecific gene expression regulation upon pathogenic exposure. Nat. Commun. 13, 1–15 (2022).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: Largescale singlecell gene expression data analysis. Genome Biol. 19, 1–5 (2018).
Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Spatial reconstruction of singlecell gene expression data. Nat. Biotechnol. 33, 495–502 (2015).
MartinezJimenez, C. P. et al. Aging increases celltocell transcriptional variability upon immune stimulation. Science 355, 1433–1436 (2017).
Tung, P. Y. et al. Batch effects and the effective design of singlecell gene expression studies. Sci. Rep. 7, 1–15 (2017).
Cuomo, A. S. E. et al. CellRegMap: a statistical framework for mapping contextspecific regulatory variants using scRNAseq. Mol. Syst. Biol. 18, e10663 (2022).
Cuomo, A. S. E. et al. Singlecell RNAsequencing of differentiating iPS cells reveals dynamic genetic effects on gene expression. Nat. Commun. 11, 1–14 (2020).
Crowell, H. L. et al. muscat detects subpopulationspecific state transitions from multisample multicondition singlecell transcriptomics data. Nat. Commun. 11, 1–12 (2020).
Ho, J. W. K., Stefani, M., dos Remedios, C. G. & Charleston, M. A. Differential variability analysis of gene expression and its application to human diseases. Bioinformatics 24, i390–i398 (2008).
Simonovsky, E., Schuster, R. & YegerLotem, E. Largescale analysis of human gene expression variability associates highly variable drug targets with lower drug effectiveness and safety. Bioinformatics 35, 3028–3037 (2019).
Li, J., Liu, Y., Kim, T. H., Min, R. & Zhang, Z. Gene expression variability within and between human populations and implications toward disease susceptibility. PLoS Comput. Biol. 6, e1000910 (2010).
Fair, B. J. et al. Gene expression variability in human and chimpanzee populations share common determinants. Elife 9, 1–26 (2020).
Resztak, J. A. et al. Genetic control of the dynamic transcriptional response to immune stimuli and glucocorticoids at singlecell resolution. Genome Res. 33, 839–857 (2023).
Jerber, J. et al. Populationscale singlecell RNAseq profiling across dopaminergic neuron differentiation. Nat. Genet. 53, 304–312 (2021).
Perez, R. K. et al. Singlecell RNAseq reveals cell type–specific molecular and genetic associations to lupus. Science 376, eabf1970 (2022).
Yazar, S. et al. Singlecell eQTL mapping identifies cell type–specific genetic control of autoimmune disease. Science 376, eabf3041 (2022).
Soneson, C. & Robinson, M. D. Towards unified quality verification of synthetic count data with countsimQC. Bioinformatics 34, 691–692 (2018).
Li, M. & Izpisua Belmonte, J. C. Deconstructing the pluripotency gene regulatory network. Nat. Cell Biol. 20, 382–392 (2018).
Wu, T. et al. clusterProfiler 4.0: a universal enrichment tool for interpreting omics data. Innovation 2, 100141 (2021).
Xu, X. et al. Mitochondrial regulation in pluripotent stem cells. Cell Metab. 18, 325–332 (2013).
Wang, X. & Goldstein, D. B. Enhancer domains predict gene pathogenicity and inform gene discovery in complex disease. Am. J. Human Genet. 106, 215–233 (2020).
Lek, M. et al. Analysis of proteincoding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Kasela, S. et al. Pathogenic implications for autoimmune mechanisms derived by comparative eQTL analysis of CD4+ versus CD8+ T cells. PLoS Genet. 13, e1006643 (2017).
Chen, C. et al. Ribosomal protein S26 serves as a checkpoint of Tcell survival and homeostasis in a p53dependent manner. Cell. Mol. Immunol. 18, 1844–1846 (2021).
Richardson, T. G., Hemani, G., Gaunt, T. R., Relton, C. L. & Davey Smith, G. A transcriptomewide Mendelian randomization study to uncover tissuedependent regulatory mechanisms across the human phenome. Nat. Commun. 11, 1–11 (2020).
Lea, A. et al. Genetic and environmental perturbations lead to regulatory decoherence. Elife 8, e40538 (2019).
Brown, A. A. et al. Genetic interactions affecting human gene expression identified by variance association mapping. Elife 3, e01381 (2014).
Olah, M. et al. Single cell RNA sequencing of human microglia uncovers a subset associated with Alzheimer’s disease. Nat. Commun. 11, 1–18 (2020).
Croft, A. P. et al. Distinct fibroblast subsets drive inflammation and damage in arthritis. Nature 570, 246–251 (2019).
Eling, N., Morgan, M. D. & Marioni, J. C. Challenges in measuring and understanding biological noise. Nat. Rev. Genet. 20, 536–548 (2019).
Friedman, J., Hastie, T. & Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441 (2007).
Crainiceanu, C. M. & Ruppert, D. Likelihood ratio tests in linear mixed models with one variance component. J. R. Stat. Soc. Ser. B Stat. Methodol. 66, 165–185 (2004).
Greven, S., Crainiceanu, C. M., Küchenhoff, H. & Peters, A. Restricted likelihood ratio testing for zero variance components in linear mixed models. J. Comput. Graph. Stat. 17, 870–891 (2012).
Mazumder, R. et al. Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11, 2287–2322 (2010).
Dahl, A. et al. A multiplephenotype imputation method for genetic studies. Nat. Genet. 48, 466–472 (2016).
Li, W. V. & Li, J. J. An accurate and robust imputation method scImpute for singlecell RNAseq data. Nat. Commun. 9, 1–9 (2018).
Huang, M. et al. SAVER: gene expression recovery for singlecell RNA sequencing. Nat. Methods 15, 539–542 (2018).
Liu, Y., Sarkar, A., Kheradpour, P., Ernst, J. & Kellis, M. Evidence of reduced recombination rate in human regulatory domains. Genome Biol. 18, 1–11 (2017).
Geng, Y. et al. A chemical biology study of human pluripotent stem cells unveils HSPA8 as a key regulator of pluripotency. Stem Cell Rep. 5, 1143–1154 (2015).
Ochiai, K. et al. Chromatin protein PC4 orchestrates B cell differentiation by collaborating with IKAROS and IRF4. Cell Rep. 33, 108517 (2020).
Zhang, C. et al. HSPC111 governs breast cancer growth by regulating ribosomal biogenesis. Mol. Cancer Res. 12, 583–594 (2014).
Liu, P. Y. et al. The long noncoding RNA lncNB1 promotes tumorigenesis by interacting with ribosomal protein RPL35. Nat. Commun. 10, 1–17 (2019).
Liang, J. R. et al. A genomewide ERphagy screen highlights key roles of mitochondrial metabolism and ERresident UFMylation. Cell 180, 1160–1177.e20 (2020).
Gherzi, R. et al. Akt2mediated phosphorylation of Pitx2 controls Ccnd1 mRNA decay during muscle cell differentiation. Cell Death Differ. 17, 975–983 (2009).
Yang, C. T. et al. Activation of KLF1 enhances the differentiation and maturation of red blood cells from human pluripotent stem cells. Stem Cells 35, 886–897 (2017).
Wu, C. D., Kuo, Y. S., Wu, H. C. & Lin, C. T. MicroRNA-1 induces apoptosis by targeting prothymosin alpha in nasopharyngeal carcinoma cells. J. Biomed. Sci. 18, 1–10 (2011).
al Adhami, H. et al. Systematic identification of factors involved in the silencing of germline genes in mouse embryonic stem cells. Nucleic Acids Res. 1, 13–14 (2013).
Makhnevych, T., Lusk, C. P., Anderson, A. M., Aitchison, J. D. & Wozniak, R. W. Cell cycle regulated transport controlled by alterations in the nuclear pore complex. Cell 115, 813–823 (2003).
Ferrando-May, E. Nucleocytoplasmic transport in apoptosis. Cell Death Differ. 12, 1263–1276 (2005).
Saito, T. T., Mohideen, F., Meyer, K., Harper, J. W. & Colaiácovo, M. P. SLX-1 is required for maintaining genomic integrity and promoting meiotic noncrossovers in the Caenorhabditis elegans germline. PLoS Genet. 8, e1002888 (2012).
Shen, R., Weng, C., Yu, J. & Xie, T. eIF4A controls germline stem cell self-renewal by directly inhibiting BAM function in the Drosophila ovary. Proc. Natl Acad. Sci. USA 106, 11623–11628 (2009).
Wang, G. et al. TFPI-2 suppresses breast cancer cell proliferation and invasion through regulation of ERK signaling and interaction with actinin-4 and myosin-9. Sci. Rep. 8, 1–12 (2018).
Galber, C. et al. The f subunit of human ATP synthase is essential for normal mitochondrial morphology and permeability transition. Cell Rep. 35, 109111 (2021).
Kon, S. et al. Smap1 deficiency perturbs receptor trafficking and predisposes mice to myelodysplasia. J. Clin. Invest. 123, 1123–1137 (2013).
Lee, D. F. et al. Modeling familial cancer with induced pluripotent stem cells. Cell 161, 240–254 (2015).
Acknowledgements
This work was funded by K25HL157603 (to A.D.). We thank Joseph E. Powell for kindly providing a post-quality-control version of the OneK1K data. We thank the Center for Research Informatics and the Research Computing Center for providing computing resources. The Center for Research Informatics is funded by the Biological Sciences Division at the University of Chicago, with additional funding provided by the Institute for Translational Medicine, CTSA grant number 2U54TR00238906 from the National Institutes of Health. We thank Ben Umans for helpful feedback and Xuanyao Liu for feedback and help in defining gene features related to enhancers and selection.
Author information
Contributions
M.C. developed statistical methodology, performed analyses, and wrote the manuscript. A.D. conceived and supervised the project and wrote the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Anna Cuomo and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Chen, M., Dahl, A. A robust model for cell type-specific interindividual variation in single-cell RNA sequencing data. Nat Commun 15, 5229 (2024). https://doi.org/10.1038/s41467-024-49242-9
DOI: https://doi.org/10.1038/s41467-024-49242-9