Nature Biotechnology  Computational Biology  Analysis
Computational analysis of celltocell heterogeneity in singlecell RNAsequencing data reveals hidden subpopulations of cells
 Florian Buettner^{1, 2}^{, n1}
 Kedar N Natarajan^{2, 3}^{, n1}
 F Paolo Casale^{2}^{, }
 Valentina Proserpio^{2, 3}^{, }
 Antonio Scialdone^{2, 3}^{, }
 Fabian J Theis^{1, 4}^{, }
 Sarah A Teichmann^{2, 3}^{, }
 John C Marioni^{2, 3}^{, }
 Oliver Stegle^{2}^{, }
 Journal name:
 Nature Biotechnology
 Volume:
 33,
 Pages:
 155–160
 Year published:
 DOI:
 doi:10.1038/nbt.3102
 Received
 Accepted
 Published online
Abstract
Recent technical developments have enabled the transcriptomes of hundreds of cells to be assayed in an unbiased manner, opening up the possibility that new subpopulations of cells can be found. However, the effects of potential confounding factors, such as the cell cycle, on the heterogeneity of gene expression and therefore on the ability to robustly identify subpopulations remain unclear. We present and validate a computational approach that uses latent variable models to account for such hidden factors. We show that our singlecell latent variable model (scLVM) allows the identification of otherwise undetectable subpopulations of cells that correspond to different stages during the differentiation of naive T cells into T helper 2 cells. Our approach can be used not only to identify cellular subpopulations but also to tease apart different sources of gene expression heterogeneity in singlecell transcriptomes.
Subject terms:
At a glance
Figures
Introduction
Singlecell measurements of gene expression, using imaging techniques such as RNAFiSH (fluorescence in situ hybridization), have provided important insights into the kinetics of transcription and celltocell variation in gene expression^{1, 2, 3}. However, such approaches can examine the expression of only a small number of genes in each experiment, thus restricting our ability to examine coexpression patterns and to robustly identify subpopulations of cells. Protocols have been developed to overcome these limitations by amplifying small quantities of mRNA^{4, 5}, which, in combination with microfluidics approaches for isolating individual cells^{6, 7}, have been used to analyze the coexpression of tens to hundreds of genes in single cells^{8, 9}. These protocols also allow the entire transcriptome of large numbers of single cells to be assayed in an unbiased way. This was initially done using microarrays^{10, 11} but is more often now done using nextgeneration sequencing^{12, 13, 14, 15}. Such approaches have been used to model early embryogenesis in the mouse^{16} and to investigate bimodality in gene expression patterns of differentiating immune cell types^{17}.
After the generation of singlecell RNAsequencing (RNAseq) profiles from hundreds of cells, one goal to identify subpopulations that share a common geneexpression profile. Some of these subpopulations may represent previously unidentified cell types. Additionally, by studying patterns of gene expression in different single cells, insights into the regulatory landscape of each cell population can be obtained.
However, methods for identifying subpopulations of cells and modeling their gene regulatory landscapes are only now beginning to emerge^{18, 19}. To fully exploit singlecell RNAseq data, we have to account for the random noise inherent to such data sets^{20} and, equally important, to account for different hidden factors that might result in gene expression heterogeneity. Although the importance of accounting for unobserved factors is well established in bulk RNAseq studies^{21, 22, 23}, robust approaches to detect and account for confounding factors in singlecell RNAseq studies remain to be developed. Here, we describe a computational approach that uses latent variable models to reconstruct such hidden factors from the observed data. We validate our scLVM using a population of staged mouse embryonic stem cells (mESCs), before applying it to study T helper 2 (T_{H}2) cell differentiation. We show that scLVM facilitates the identification of physiologically meaningful subpopulations of cells, which cannot otherwise be found.
Results
Cell cycle variation affects global gene expression
Singlecell RNAseq is now commonly used to study cell differentiation^{15, 24}. Here, we reanalyzed data from a singlecell RNAseq experiment that was originally designed to study the differentiation of naive T cells into T_{H}2 cells^{25}. Briefly, a population of naive Cd4^{+} T helper cells were activated and polarized with interleukin (IL)4 to induce differentiation toward a T_{H}2 subtype. At 4.5 d poststimulation, cells were sorted into a G4P group (fourth generation, IL13–GFP^{+} cells) and a G2N group (second generation, IL13–GFP^{−} cells). Subsequently, these two groups of cells were pooled in equal proportions. From this pool, a set of 96 asynchronously dividing cells (including both fully and partially differentiated cells) was captured using the Fluidigm C1 system, and sequencing libraries were prepared and processed. After quality control and accounting for technical noise, RNAseq data for 81 cells and 7,073 genes with variation in their expression level above technical noise were considered for analysis (Supplementary Fig. 1).
The cell cycle is known to have wideranging effects on cellular physiology^{26, 27} and can modulate both differentiation and gene expression profiles^{28} (Fig. 1a). Cells that are analyzed during development are likely to be in different stages of the cell cycle^{28}. When we examined sets of genes whose expression is known to be associated with different cellcycle stages, we observed that their expression levels varied considerably among single cells (Supplementary Fig. 1). Although variation in gene expression that is linked to the cell cycle can provide important biological insights, in many contexts such variation might mask other more physiologically important differences in gene expression between cells.
Importantly, variation in gene expression that is linked to the cell cycle is not restricted to wellannotated cellcycle marker genes. When we examined a set of moderately to highly variable genes that have not previously been associated with the cell cycle, we observed that 2,881 genes (44%) showed a significant correlation of gene expression with at least one cellcycle gene (P < 0.05, Bonferroni adjusted; Supplementary Fig. 2). Therefore, merely removing the set of annotated cellcycle genes before performing downstream analyses is likely to be unsuccessful because it would not enable all effects independent of the cell cycle to be detected.
Development of scLVM to account for effects of the cell cycle
We used scLVM to address the confounding effects of the cell cycle. In this type of computational approach, one first reconstructs the cellcycle state (or other unobserved factors) and then uses this information to infer 'corrected' gene expression levels. This twostep approach enables the effect of unobserved factors on gene expression heterogeneity to be accounted for in downstream analyses, thereby allowing us to study variation in gene expression levels that is independent of the cell cycle. Moreover, for each gene whose expression is analyzed, our method allows the relative contribution of any reconstructed factors that affect celltocell variation in expression to be determined. A schematic overview of the approach is shown in Figure 1b.
To validate our method, we generated singlecell RNAseq data from mESCs using the Fluidigm C1 protocol, where the cellcycle status of each cell is known a priori. We assayed the transcriptional profile of 182 ESCs that had been staged for cellcycle phase (G1, S and G2M) based on sorting of the Hoechst 33342stained cell area of a flow cytometry (FACS) distribution. In the fitting stage, scLVM uses the expression profiles of a relatively small set of 892 annotated cellcycle genes (Supplementary Table 1) to recover a covariance matrix that accounts for celltocell heterogeneity due to the cell cycle (Supplementary Fig. 3). Using alternative annotations for cellcycle genes (Supplementary Table 1) yielded very similar results (Supplementary Figs. 3, 4, 5). Subsequently, for all remaining genes, we used scLVM to estimate the proportion of variance in expression across cells that is explained by technical noise, biological variability and cell cycle. This approach can also be used to create a 'corrected' gene expression data set, in which the effect of the identified factor(s) is removed, which can be used as the input for existing analysis methods. scLVM is related to approaches for modeling variability in bulk mRNA expression studies^{21, 22} and to methods used in genomewide association studies in which the relatedness between individuals is inferred from genotype^{29} and/or expression levels^{30} and then accounted for in downstream analyses using linear mixed models.
As the cellcycle stage of each cell is known in our data set, we can compare the scLVM estimates of the proportion of variance explained by the cell cycle with the gold standard values obtained when using the annotation of individual cells based on the Hoechst staining (FACS). We observed a striking correlation (r^{2} = 0.91) between our scLVM estimates and the gold standard values, providing confidence in the efficacy of our approach (Fig. 2a). The model fit and these estimates for the variance explained by the cell cycle were consistent when a much smaller gene set containing only tens of genes was used to train the model (Supplementary Fig. 5a–g) and when alternative metrics were applied to quantify the proportion of variation explained by the cell cycle (Supplementary Fig. 5h). This suggests that scLVM can be used to robustly recover and estimate the variance due to unobserved factors from relatively small gene sets that annotate these factors. Additionally, we examined how many pairs of genes had significantly correlated patterns of expression across cells (i) without cellcycle correction, (ii) with the scLVM correction and (iii) with an ideal correction using the gold standard cellcycle state. The set of significant genegene correlations obtained with the scLVM correction was much more consistent with a gene correlation network based on the experimental staging than the set generated under the nocorrection model (Supplementary Fig. 6), with the number of falsepositive correlations reduced by three orders of magnitude (from 72,117 to 77). Finally, we compared the scLVM correction to a basic removal strategy, in which cell cycle–annotated genes (892 genes, Supplementary Table 1) were omitted from the analysis. A nonlinear principal component analysis (PCA)^{31} on the data set from which cellcycle annotated genes were removed yielded a clear separation of cells according to cellcycle stage (Fig. 2b). In contrast, when repeating the analysis using scLVMcorrected gene expression levels, the same separation of cells was not observed, showing that the cell cycle–related expression signature was effectively removed (Fig. 2c). Further, to show that scLVM is specific in removing the effects of cell cycle–related variation, we considered a noncycling cell type (terminally differentiated neurons) as a negative control. Reassuringly, scLVM attributed more than 30% of variation to the cell cycle for only 27 genes, and the maximum proportion of variation attributed to the cell cycle for any single gene was 37%. In comparison, when we applied scLVM to cycling T cells, for 1,895 genes, more than 30% of variation was attributed to the cell cycle, with the maximum proportion for any single gene being 79% (Fig. 3a and Supplementary Fig. 7). These results give additional confidence that the variance estimates are accurately inferred. Finally, we repeated the validation of scLVM using a second previously published data set of 35 mESCs staged for the cell cycle, but prepared for sequencing with an alternative protocol (QuartzSeq)^{32} and cultured under different media conditions that are known to induce reduced variability in expression of cellcycle genes^{33}. Again, direct comparison of variance estimates from scLVM with the gold standard derived from the staging information of individual cells yielded good agreement (Supplementary Figs. 8 and 9). To assess the consistency of the expression signatures that are used by scLVM to infer the celltocell covariance, we projected the 35 mESCs from this published data set onto the larger mESC validation data set discussed above. This analysis revealed that the expression signatures due to the cell cycle are robust across sequencing protocols, studies and experimental batch (Supplementary Fig. 10). In sum, these analyses provide confidence that our scLVM approach effectively accounts for latent factors such as the cell cycle.
Application of scLVM to identify cell populations in differentiating T_{H}2 cells
We next applied our scLVM approach to study a population of asynchronously differentiating T_{H}2 cells that have previously been profiled using singlecell RNAseq^{20, 25}. We observed that the cell cycle contributed markedly to gene expression variability, in particular for the set of genes with medium to high overall nontechnical variability (Fig. 3a). Genomewide, for 1,895 (27%) of these variable genes, the cell cycle accounted for more than 30% of the variance in expression across cells (Fig. 3a, Supplementary Figs. 11 and 12, and Supplementary Table 2), suggesting that the expression of many genes is affected by the cell cycle. When comparing the expression signatures of cellcycle genes in the T_{H}2 cell type with those found in the mESC validation data set, we found striking agreement of the main axes of variation (PC1, r^{2} = 0.998, Supplementary Fig. 10). This result strongly suggests that scLVM robustly captures cell cycle effects in the T_{H}2 data set.
Turning next to the question of whether pairs of genes show patterns of correlation between cells (genegene correlations), we observed a striking decrease in significant correlations after accounting for the cell cycle (P < 0.05, Bonferroni adjusted; Fig. 3b,c and Supplementary Fig. 11). This suggests that many of the genegene correlations observed in the initial data were driven by cellcycle stage. Notably, the much smaller set of genes with significant correlation patterns after correction was enriched for genes involved in glycolysis^{34} and for genes that mark the cellular response to IL4 stimulation (Supplementary Table 3a), both of which are key processes in T_{H}2 cell differentiation. In contrast, genegene variation obtained using uncorrected data yielded no enrichment for variation in expression of genes involved in glycolysis but instead identified genes that were enriched for cell cycle–related categories (Supplementary Table 3b), again indicating that cell cycle, if not accounted for, is a major confounder of genegene correlations.
Next, we examined whether the cellcycle correction facilitated by the scLVM model enabled a more reliable identification of subpopulations of cells. Without correction, a nonlinear PCA^{31} revealed little structure in the data, with no obvious subgroups of the cells identified (Fig. 3d and Supplementary Fig. 11). A similar lack of structure was observed when other clustering algorithms, including hierarchical and kmeans clustering, were applied (data not shown). However, when applying the same nonlinear PCA approach to the cell cycle–corrected data, two clear subpopulations of cells were identified (Fig. 3e; see Supplementary Data 1 for assignment of cells to clusters).
To investigate whether these two populations correspond to physiologically distinct subsets, we studied the set of 401 genes with significant differences in expression between the clusters (P < 0.05, Bonferroni adjusted; Supplementary Table 4). This set was heavily enriched for genes that have important roles in T_{H}2 cell differentiation—Il4ra^{35}, Gata3 (ref. 36), Stat3 (ref. 37), Klf13 (ref. 38), Batf (ref. 39) (P < 0.0001, Bonferroni adjusted) and Il24 (ref. 40) (P = 0.01) are all upregulated in the righthand cluster (Fig. 3e), suggesting that cells contained in that group represent fully differentiated T_{H}2 cells, whereas the lefthand population of cells correspond to a group that is only partially differentiated (Fig. 3e,f and Supplementary Fig. 13).
Consistent with this observation, an analysis of 122 manually curated 'T_{H}2 signature' genes (Supplementary Table 5) revealed a significant enrichment in the set of 401 genes that were differentially expressed between the identified clusters (P = 0.001, Hypergeometric Test). Further, Gene Ontology (GO) enrichment analysis showed that the differentially expressed genes contained statistically significant enrichments of genes involved in glycolysis, cellular response to IL4 stimulation and positive regulation of Bcell proliferation (Supplementary Table 6). To establish whether the genes distinguishing the two clusters act in a coordinated manner, we studied their interactions using the STRING database^{41}. This yielded a densely connected network with three major hubs, which were highly enriched for glycolysis, translational elongation and Tcell activation, respectively (Supplementary Fig. 14). With glycolysis being a hallmark for Tcell activation^{42} and Tcell activation being linked to increased translational activity^{43}, this provides further evidence that the two clusters contain cells at different positions along the trajectory to becoming fully differentiated T_{H}2 cells.
Importantly, the cellcycle correction afforded by scLVM not only enabled identification of two cell populations, but was also required for characterizing the two clusters. Testing for differential expression between the two identified populations of cells using the uncorrected data yielded only 7 genes whose transcription differed significantly between clusters (compared to 401 with the correction).
Accounting for more than one factor
The scLVM approach can be applied to account for the effects of other factors, provided that an informative gene set is available. As an example, we extended the analysis of the T_{H}2 cells by simultaneously modeling the cellcycle state and the T_{H}2 differentiation process as distinct factors. We used a set of 122 manually curated T_{H}2 signature genes (Supplementary Table 5), introduced earlier, to fit a T_{H}2 differentiation factor after removing the effects of the cell cycle. Although, in general, inference of multiple factors is statistically challenging, the much stronger effect of the cellcycle factor helps to ensure that inference results are robust when considering different approaches (Supplementary Fig. 15; see Online Methods for a discussion of practical challenges). The joint analysis with both factors offered a more finegrained decomposition of expression variability, attributing expression variation of individual genes to cellcycle effects, T_{H}2 differentiation and interactions between both factors (Fig. 4a). The interaction component allows genes that are associated with T_{H}2 differentiation in a cellcyclestagespecific manner to be identified (Fig. 4b). Although the overall variance due to these interaction effects was small, a set of 375 genes with strong interactions (explained variance >5%; Supplementary Table 7) contained prominent candidates for effectors of the interplay between the cell cycle and T_{H}2 differentiation. Several T_{H}2 differentiation markers, including Batf and Il2ra, were among these genes (Fig. 4b), and this set was enriched for positive cell proliferation and negative regulation of apoptosis (Supplementary Tables 8 and 9). This finding is consistent with the known link between differentiation and cell proliferation in T helper cells^{44, 45}.
Additionally, we investigated the relevance of the T_{H}2 factor when testing for genegene correlation networks (Fig. 4c). The number of significant genegene correlations decreased markedly when including the additional T_{H}2related factors (from 17,389 to 2,077), suggesting that the cell cycle, T_{H}2 differentiation and their interactions are the predominant sources of variation in this population of cells.
In summary, accounting for cell cycle–related variation by using scLVM is necessary both for identification and characterization of distinct populations of T cells that are at different stages of differentiation into mature T_{H}2 cells. We also applied scLVM to other singlecell RNAseq data sets, including 34 human embryonic stem cells and a set of 90 cells from human preimplantation embryos^{15}, which confirmed that the cell cycle explains substantial proportions of the variability in other contexts. Moreover, correcting for cell cycle as a confounder revealed otherwise hidden structure that might correlate with different cell populations in these independently generated, singlecell RNAseq data sets (Supplementary Figs. 16 and 17).
Discussion
We have shown how heterogeneity in gene expression in single cells due to factors such as the cell cycle can compromise the interpretation of singlecell RNAseq experiments. To overcome this problem we present a computational approach that effectively accounts for these confounding factors. This method (Fig. 1) builds on existing approaches for modeling gene expression heterogeneity in bulk data^{22, 30}, which we here adapt to singlecell transcriptomics. We have validated our method using a large mESC data set in which the cellcycle stages of individual cells are known a priori (Fig. 2) and demonstrated the utility of our approach by applying it to obtain insights into T_{H}2 cell differentiation (Fig. 3). We treated the cell cycle as a confounding variable in our study, but cyclingrelated processes may be of high interest in other contexts. This is exemplified in the analysis of the interaction between the effects of the T_{H}2 differentiation factor and the cell cycle (Fig. 4). More generally, scLVM allows the user to model and account for latent factors of other predefined sets of genes, enabling the sources of variation in a wide range of singlecell RNAseq experiments to be studied. Our analysis of the T_{H}2 differentiation process uses a nonlinear PCA approach to uncover the differentiation structure. More generally, scLVM can be used to remove variation due to the cell cycle and other confounding factors before applying alternative downstream analytic strategies, such as Monocle^{18}.
One important challenge when multiple confounding factors are considered is to ensure that the model remains statistically identifiable, such that the effect of each individual factor can be robustly estimated. This may be of particular concern if multiple weak and nonindependent factors are present. Finally, we note that there remain open questions regarding the best way to process singlecell RNAseq data^{46}. In particular, our scLVM approach could be refined in several ways. For example, statistics to formally test for the presence of a particular factor might be warranted and scLVM could also be coupled with methods to reconstruct pseudotemporal trajectories^{18}. Also, comprehensive methods to properly normalize RNAseq data within and across multiple independent singlecell transcriptome experiments are an important area of future work.
Methods
Data sets and processing.
Mouse ESC data. A detailed description of cell culture, Hoechst staining, singlecell capture and mRNA sequencing as well as quality control can be found in the Supplementary Notes and Supplementary Figure 18. In brief, Rex1GFP–expressing mESCs (Rex1GFP mESCs) were cultured on gelatincoated dishes using serumfree NDiff 227 medium (Stem Cells Inc.) supplemented with 2i inhibitors. Hoechst staining (Hoechst 33342; Invitrogen) was optimized for Rex1GFP mESC, and cells were sorted using FACS (MOFLO XDP; Beckmann Coulter) for respective cellcycle fractions (G1, S and G2M phase). Singlecell RNAseq was done using the C1 Single Cell Auto Prep System (Fluidigm; 1007000). After normalization and estimation of technical noise using ERCC spikeins (see RNAseq normalization and estimation of technical noise), we retained a set of 9,571 genes for analysis with variation above the technical background level (FDR <0.1; Supplementary Data 1).
To account for errors in the assignment of a cellcycle phase using the Hoechst staining (e.g., due to cells cycling after FACS sorting), we performed an additional filtering step based on the ERCC spikeins. We reasoned that for cells within a cellcycle phase, the ratio of endogenous reads to total mapped reads—which can be interpreted as a proxy for cell size—should follow a narrow distribution. Therefore, we excluded cells where the difference between this ratio and its median within a cellcycle phase exceeded one median absolute deviation (Supplementary Fig. 19). This resulted in a filtered set of 59 cells in G1 phase, 58 cells in S phase and 65 cells in G2M phase. Analysis results for the unfiltered data are shown in Supplementary Figure 20 (see Supplementary Notes for full details), leading to consistent overall conclusions.
Mouse ESC data (QuartzSeq protocol). We used the normalized data and counts from the primary publication^{32}. These data consist of gene expression level estimates, obtained using the QuartzSeq protocol, for 35 mESCs, where the cellcycle state of each cell is known a priori (7 S, 8 G2M and 20 G1 cells). FACS sorting the distribution of the Hoechst 33342stained cell area with gates corresponding to G1, S and G2/M phases was used to establish the cellcycle state before processing. In this particular data set, technical noise cannot be reliably estimated owing to the lack of spikeins. Consequently, we estimated the amount of technical (null) noise expected for genes with variable levels of expression using a loglinear fit between the expression mean and the squared coefficient of variation between cells, approximating the typical fitting procedure when spikeins are available (Supplementary Fig. 1b). This approach yielded a total of 5,546 highly variable genes (FDR< 0.1; see RNAseq normalization below).
Tcell data. Generation of the Tcell data has been described in detail previously^{20, 25}. In brief, untouched Naïve CD4+ cells from spleens of IL13eGFP Balb/c mice were negatively selected and differentiated toward T_{H}2 in antiCD3/CD28 coated plates. CellTrace Violet staining was performed according to manufacturer's instructions. After 4.5 d of activation cells were sorted according to presence/absence of GFP and number of cell division. In particular GFPNegative cells that had undergone 2 cycles of cell division and GFPPositive cells that had divided 4 times were then pooled in 1:1 ratio and loaded on a C1 machine for capturing. Duplets and cells with low yield or poor quality cDNA were removed, yielding 81 cells for analysis. After normalization and estimation of technical noise using ERCC spikeins (see RNAseq normalization and estimation of technical noise), we retained a set of 7,073 genes for analysis with variation above the technical background level (FDR <0.1; Supplementary Data 2).
RNAseq normalization and estimation of technical noise. For the Tcell data, raw read counts were normalized using the approach proposed in DESeq^{47}, deriving size factors for each cell from the ERCC spikeins. Estimates of the technical variability were also derived using the ERCC spikeins, adapting the approach in Brennecke et al.^{20} (Supplementary Fig. 1a). We omitted the normalization for cell size as proposed previously^{20} because the computational correction by scLVM yielded much better results (Supplementary Fig. 21). This is likely explained by noting that cell size and cell cycle are correlated, thus the normalization proposed by Brennecke et al. reduces the amount of information available for inferring cellcell correlations due to cell cycle; see also Supplementary Figure 21 and discussion in Supplementary Notes. To determine genes with high biological variability, we followed Brennecke et al.^{20} and tested against the null hypothesis that the biological coefficient of variation is at most 50% (at 10% FDR, Supplementary Notes). This justifies ignoring Poisson shot noise because of the large proportion of technical noise of genes expressed at low levels (see ref. 20 and details below). For the QuartzSeq mESC data no spikeins were available; we therefore used fragments per kilobase of transcript per million fragments mapped (FPKM) expression estimates as provided by the authors. Because there were no spikeins, we estimated the baseline variability using a loglinear fit to describe the relationship between mean and squared coefficient of variation overall (Supplementary Fig. 1c). All subsequent analyses were carried out on logtransformed normalized count values and logtransformed FPKM estimates for the Tcell and newly generated mESC data and the QuartzSeq mESC data, respectively.
scLVM method.
The scLVM algorithm is a twostep approach. First, one or more covariance structures are inferred from genes that are annotated to hidden factors such as cellcycle progression. Subsequently, these covariance structures can be used to account for the hidden factors as random effects in a mixed model, allowing the variance in expression for each gene to be decomposed into a technical, a biological and a separate component for each hidden factor. Additionally, the hidden factors can be accounted for when performing pairwise genegene correlation analyses, and further allow 'corrected' residual gene expression data sets to be generated. scLVM is closely related to previous approaches that correct for hidden confounding factors in gene expression data^{21, 48} and the inference employed to fit hidden factors builds on the PANAMA model^{30}.
Briefly, the fitting process uses Gaussian Process Latent variable models (GPLVMs)^{31}, a recent development in machine learning and statistics. The approach resembles a PCA on genes annotated to a hidden factor (such as cell cycle). However, instead of explicitly reconstructing PCA loadings and scores, the GPLVM approach fits a lowrank celltocell covariance to the observed gene expression matrix of these genes. Related approaches have been proposed to account for relatedness between individuals in the context of expression Quantitative Trait Loci (eQTL) studies^{30}, where an individualtoindividual covariance is inferred to explain the heterogeneity in gene expression levels between individuals rather than cells.
More specifically, for any gene g that is annotated to the hidden factor under consideration, its expression profile y_{g} across cells is modeled as
where X represents the hidden factor (such as cell cycle), C corresponds to additional observed covariates (if available) and
denotes the residual variance. Because the same distributional assumptions are shared across a large set of genes in the annotated set, the state of the hidden variables X and the remaining covariance parameters can be robustly inferred by means of standard maximum likelihood approaches (Supplementary Notes). Once X is inferred, we calculate the covariance structure between cells, which is induced by the hidden factor as Σ = XX^{T}.
An important choice when fitting the model is the dimensionality of the hidden factor matrix, X, which corresponds to the rank of the celltocell covariance matrix Σ. In the context of distinct factors such as the cell cycle or T_{H}2 differentiation, we found that a onedimensional factor (rank one covariance) is commonly sufficient (see also the scree plots in Supplementary Fig. 22 and Choosing the rank of the cellcycle factor). In general, the Pvalue distribution of a test statistic on the residual data set^{48}, heuristic selection approaches^{21} or hierarchical modeling to regularize the effective dimensionality of the hidden factor^{22, 30} can also be employed (Supplementary Notes).
Alternative fitting approaches, including methods to account for multiplicative effects between covariates and hidden factors, are discussed in the Supplementary Notes. Once fitted, the covariance matrix Σ can be used for a range of analyses, using efficient implementations of linear mixed models^{29, 49} to decompose variance, test for genegene correlations or produce residuals corrected for the latent factors under consideration.
Analysis of variance. To estimate the components of variance, scLVM employs a linear mixed model that is fitted to the expression levels of each gene, decomposing sources of variation. Contributions from hidden factors such as cellcycle effects, technical noise and residual biological variation to the observed expression variability of gene g are modeled as random effects:
with
denoting the variance attributable to H hidden factors (see section below for a discussion of estimating multiple hidden factors), residual biological variability (not related to hidden factors) and technical noise/baseline variability respectively. The hidden factor covariance matrices Σ_{h} are estimated in the GPLVM step and
is estimated from spikeins as described above. The parameters
are then estimated by maximum likelihood. Interactions between pairs of factors can be considered by combining their previously estimated covariance matrices; see section above and Supplementary Notes.
Genegene correlation analysis. To estimate pairwise correlation coefficients while controlling for hidden factors such as the cell cycle, we introduce an additional fixed effect representing the contribution of another gene j
In this linear mixed model, β_{ij} can be interpreted as the pairwise correlation coefficient between genes i and j, and its significance can be assessed by means of a standard likelihood ratio test. Owing to efficient implementations of mixed models in applications to GWAS^{29, 49}, these correlation tests are extremely efficient (Supplementary Notes).
Creating residual expression data sets with the effect of hidden factors removed. To facilitate reuse of existing analyses methods, such as clustering, visualization or dimension reduction approaches, scLVM facilitates generation of a corrected expression data set where the effect of one or multiple hidden factors (e.g., the cell cycle) is removed.
For each gene i, the variance component model (see above) implies a predictive distribution of the cellcycle component with mean ŷ_{i} and predictive variance . Expression levels that are corrected for the effect of hidden factors can then be obtained from the model residuals, that is, y_{i}* = y_{i} − ŷ_{i}. These corrected gene expression values can be used in the full range of existing methods, including clustering or nonlinear PCA^{31}. Cellcycle corrected expression values for the Tcell data and the mESC data are available online (Supplementary Data 1 and 2).
Applying scLVM to multiple annotated gene sets. In some circumstances, scLVM can be used to fit more than a single factor, provided that multiple informative gene sets are available. In general, statistical identifiability is a major concern and careful choice of the inference approach is important (Supplementary Notes). These factors can either be considered independently or learned by conditioning on one of them if prior knowledge exists as to which has a stronger effect (as for the cell cycle). As the cell cycle represents the predominant source of variation in our data, the cellcycle factor can be recovered irrespective of other sources of variation. Therefore, we first learn the cellcycle factor X_{cc} as described above. Then we extend the single factor model by conditioning on the inferred factor X_{cc} and including an interaction term, which we define by a pointwise product (Supplementary Notes). Alternative analysis approaches are discussed below; see Alternative approaches to fit the T_{H}2 factor and Supplementary Figure 15.
Analysis details.
Validation of scLVM using mESC data. We used the annotated cellcycle state in the ESC data sets to validate the accuracy of the modelbased cellcycle reconstruction carried out by scLVM. Briefly, instead of fitting the scLVM covariance from the data, we used the known grouping of cells into G1, S and G2/Mphase as covariates and estimated the proportion of variance explained by the sum of all three covariates. In the case of the unfiltered mESC data (this study used the C1 protocol; see Data set and processing), the variance estimates of scLVM were compared to a model with a cellcycle covariance and an additional factor that explains cell size variation. To estimate cell size we used the ratio of endogenous reads to total mapped reads, thereby capturing large variations in cell size within individual cellcycle phases in the unfiltered data (Supplementary Notes). These estimates were then compared with the variance estimates using scLVM (Fig. 2a). In the same vein, the covariates can be included in pairwise genegene correlation analyses, again comparing the inference results based on the Hoechst staining to estimates obtained using the covariance structure inferred by scLVM. Further details on the estimation procedure using the gold standard are provided in Supplementary Notes.
Assessment of the effect of alternative cell cycle gene annotations. Unless otherwise stated, we considered the union of genes from CycleBase, and GO categories annotated as cell cycle related, resulting in 892 genes. Briefly, we combined all cell cycle–annotated genes (GO:0007049) in the Gene Ontology database along with the 600 topranked genes from CycleBase (Supplementary Notes). To assess to what extent the gene set annotation affects the performance of scLVM, we additionally considered either CycleBase genes or the GO annotated genes alone (Supplementary Figs. 3, 4, 5 and Supplementary Table 1), which yielded very similar results. Furthermore, we carried out a subsampling experiment, where random subsets of the full set of 892 genes were used to fit the cell cycle factor (Supplementary Fig. 5a–g). This showed that a relatively small set of 50 genes is sufficient to robustly identify the cell cycle. Finally, estimates for the variance explained by the cell cycle were consistent when alternative metrics were applied to quantify the proportion of variation explained by the cell cycle (Supplementary Fig. 5h and Supplementary Notes).
Identification of subclusters in the ESC and Tcell data. We considered nonlinear PCA^{31} for the analysis of subclusters in singlecell data sets, which has previously been considered for application to singlecell transcriptomics data^{50}. When correcting for cell cycle, we used the scLVM residual expression data sets (see Choosing the rank of the cellcycle factor) as input, otherwise we used the preprocessed log expression values.
Choosing the rank of the cellcycle factor. As described above, scree plots generated for both the Tcell and the mESC data suggested that the largest proportion of variance was explained by the first principal component (Supplementary Fig. 22). Consequently, we used a K = 1 rank covariance matrix to fit the cellcycle factor in most experiments. When omitting the filtering of cells (quality control, Supplementary Notes), a second component (K = 2) was necessary to fully capture the variation in the data. This second component likely captures intra cellphase differences in cell size (see also Supplementary Figs. 19–20 and Supplementary Notes).
Alternative approaches to fit the T_{H}2 factor. In order to assess the robustness of the conditional fitting for the T_{H}2 factor as described above (Applying scLVM to multiple annotated gene sets), we compared the results with a conceptually simpler 'iterative approach', where we first regressed out the cellcycle effects as described above (Genegene correlation analysis), before fitting the state of the differentiation factor on cell cycle–corrected expression values.
Reassuringly, the T_{H}2 differentiation factor recovered by either of the approaches was strikingly correlated (Pearson r^{2}= 0.82, Supplementary Fig. 15c) and was consistent with the subclusters of cells identified by the unsupervised PCA approach (Fig. 3e). In the variance decomposition, the factor determined by the iterative approach yielded a smaller proportion of variance attributable to the T_{H}2 differentiation factor (2.6% versus 5.3%), which can be attributed to the assumption of a common parameter for all genes in the conditional approach (the iterative approach allows a genespecific contribution of the cellcycle factor). Critically, the set of genes identified in the interaction component and the GO analysis for the set of genes with a strong interaction effect yielded consistent results (Supplementary Tables 8 and 9).
Accession codes.
mESC data have been deposited at ArrayExpress: EMTAB2805. RNAseq data from the T_{H}2 cells have previously been described^{20, 25} and are available under at ArrayExpress: EMTAB2512. Cell cycle–corrected and uncorrected expression values for the Tcell data as well as the mESC data are provided as Supplementary Data 1 and 2. An open source software implementation of scLVM is freely available on GitHub: https://github.com/PMBio/scLVM.
Accession codes
Primary accessions
ArrayExpress
References
 Levsky, J.M., Shenoy, S.M., Pezo, R.C. & Singer, R.H. Singlecell gene expression profiling. Science 297, 836–840 (2002).
 Taniguchi, Y. et al. Quantifying E. coli proteome and transcriptome with singlemolecule sensitivity in single cells. Science 329, 533–538 (2010).
 Raj, A., van den Bogaard, P., Rifkin, S.A., van Oudenaarden, A. & Tyagi, S. Imaging individual mRNA molecules using multiple singly labeled probes. Nat. Methods 5, 877–879 (2008).
 Liu, J., Hansen, C. & Quake, S.R. Solving the “worldtochip” interface problem with a microfluidic matrix. Anal. Chem. 75, 4718–4723 (2003).
 Citri, A., Pang, Z.P., Sudhof, T.C., Wernig, M. & Malenka, R.C. Comprehensive qPCR profiling of gene expression in single neuronal cells. Nat. Protoc. 7, 118–127 (2012).
 Wheeler, A.R. et al. Microfluidic device for singlecell analysis. Anal. Chem. 75, 3581–3586 (2003).
 Marcus, J.S., Anderson, W.F. & Quake, S.R. Microfluidic singlecell mRNA isolation and analysis. Anal. Chem. 78, 3084–3089 (2006).
 Guo, G. et al. Resolution of cell fate decisions revealed by singlecell gene expression analysis from zygote to blastocyst. Dev. Cell 18, 675–685 (2010).
 Burton, A. et al. Singlecell profiling of epigenetic modifiers identifies PRDM14 as an inducer of cell fate in the mammalian embryo. Cell Reports 5, 687–701 (2013).
 Luo, L. et al. Gene expression profiles of lasercaptured adjacent neuronal subtypes. Nat. Med. 5, 117–122 (1999).
 Chiang, M.K. & Melton, D.A. Singlecell transcript analysis of pancreas development. Dev. Cell 4, 383–393 (2003).
 Tang, F. et al. RNAseq analysis to capture the transcriptome landscape of a single cell. Nat. Protoc. 5, 516–535 (2010).
 Islam, S. et al. Characterization of the singlecell transcriptional landscape by highly multiplex RNAseq. Genome Res. 21, 1160–1167 (2011).
 Islam, S. et al. Quantitative singlecell RNAseq with unique molecular identifiers. Nat. Methods 11, 163–166 (2014).
 Yan, L. et al. Singlecell RNAseq profiling of human preimplantation embryos and embryonic stem cells. Nat. Struct. Mol. Biol. 20, 1131–1139 (2013).
 Tang, F. et al. Tracing the derivation of embryonic stem cells from the inner cell mass by singlecell RNAseq analysis. Cell Stem Cell 6, 468–478 (2010).
 Shalek, A.K. et al. Singlecell transcriptomics reveals bimodality in expression and splicing in immune cells. Nature 498, 236–240 (2013).
 Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381–386 (2014).
 Pollen, A.A. et al. Lowcoverage singlecell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat. Biotechnol. 32, 1053–1058 (2014).
 Brennecke, P. et al. Accounting for technical noise in singlecell RNAseq experiments. Nat. Methods 10, 1093–1095 (2013).
 Leek, J.T. & Storey, J.D. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3, e161 (2007).
 Stegle, O., Parts, L., Durbin, R. & Winn, J. A Bayesian framework to account for complex nongenetic factors in gene expression levels greatly increases power in eQTL studies. PLoS Comput. Biol. 6, e1000770 (2010).
 Li, S. et al. Detecting and correcting systematic variation in largescale RNA sequencing data. Nat. Biotechnol. 32, 888–895 (2014).
 Xue, Z. et al. Genetic programs in human and mouse early embryos revealed by singlecell RNA sequencing. Nature 500, 593–597 (2013).
 Mahata, B. et al. Singlecell RNA sequencing reveals T helper cells synthesizing steroids de novo to contribute to immune homeostasis. Cell Reports 7, 1130–1142 (2014).
 Newman, J.R. et al. Singlecell proteomic analysis of S. cerevisiae reveals the architecture of biological noise. Nature 441, 840–846 (2006).
 Gold, D., Mallick, B. & Coombes, K. Realtime gene expression: statistical challenges in design and inference. J. Comput. Biol. 15, 611–623 (2008).
 Singh, A.M. et al. Cellcycle control of developmentally regulated transcription factors accounts for heterogeneity in human pluripotent cells. Stem Cell Reports 1, 532–544 (2013).
 Lippert, C. et al. FaST linear mixed models for genomewide association studies. Nat. Methods 8, 833–835 (2011).
 Fusi, N., Stegle, O. & Lawrence, N.D. Joint modelling of confounding factors and prominent genetic regulators provides increased accuracy in genetical genomics studies. PLoS Comput. Biol. 8, e1002330 (2012).
 Lawrence, N.D. Gaussian process latent variable models for visualisation of high dimensional data. Adv. Neural Inf. Process. Syst. 16, 329–336 (2004).
 Sasagawa, Y. et al. QuartzSeq: a highly reproducible and sensitive singlecell RNA sequencing method, reveals nongenetic geneexpression heterogeneity. Genome Biol. 14, R31 (2013).
 Grün, D., Kester, L. & van Oudenaarden, A. Validation of noise models for singlecell transcriptomics. Nat. Methods 11, 637–640 (2014).
 Fox, C.J., Hammerman, P.S. & Thompson, C.B. Fuel feeds function: energy metabolism and the Tcell response. Nat. Rev. Immunol. 5, 844–852 (2005).
 Nelms, K., Keegan, A.D., Zamorano, J., Ryan, J.J. & Paul, W.E. The IL4 receptor: signaling mechanisms and biologic functions. Annu. Rev. Immunol. 17, 701–738 (1999).
 Zhu, J., Yamane, H., CoteSierra, J., Guo, L. & Paul, W.E. GATA3 promotes T_{H}2 responses through three different mechanisms: induction of T_{H}2 cytokine production, selective growth of T_{H}2 cells and inhibition of Th1 cellspecific factors. Cell Res. 16, 3–10 (2006).
 Stritesky, G.L. et al. The transcription factor STAT3 is required for T helper 2 cell development. Immunity 34, 39–49 (2011).
 Zhou, M. et al. Kruppellike transcription factor 13 regulates T lymphocyte survival in vivo. J. Immunol. 178, 5496–5504 (2007).
 Betz, B.C. et al. Batf coordinates multiple aspects of B and T cell function required for normal antibody responses. J. Exp. Med. 207, 933–942 (2010).
 Sahoo, A. et al. Stat6 and cJun mediate T_{H}2 cellspecific IL24 gene expression. J. Immunol. 186, 4098–4109 (2011).
 Jensen, L.J. et al. STRING 8–a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res. 37, D412–D416 (2009).
 Chang, C.H. et al. Posttranscriptional control of T cell effector function by aerobic glycolysis. Cell 153, 1239–1251 (2013).
 GarciaSanz, J.A., Mikulits, W., Livingstone, A., Lefkovits, I. & Mullner, E.W. Translational control: a general mechanism for gene regulation during T cell activation. FASEB J. 12, 299–306 (1998).
 Bird, J.J. et al. Helper T cell differentiation is controlled by the cell cycle. Immunity 9, 229–237 (1998).
 Wilson, C.B., Makar, K.W. & PerezMelgosa, M. Epigenetic regulation of T cell fate and function. J. Infect. Dis. 185 (suppl. 1), S37–S45 (2002).
 Stegle, O., Teichmann, S.A. & Marioni, J.C. Computational and analytical challenges in singlecell transcriptomics. Nat. Rev. Genet. (in the press).
 Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).
 GagnonBartsch, J.A. & Speed, T.P. Using control genes to correct for unwanted variation in microarray data. Biostatistics 13, 539–552 (2012).
 Zhou, X. & Stephens, M. Genomewide efficient mixedmodel analysis for association studies. Nat. Genet. 44, 821–824 (2012).
 Buettner, F. & Theis, F.J. A novel approach for resolving differences in singlecell gene expression patterns from zygote to blastocyst. Bioinformatics 28, i626–i632 (2012).
Acknowledgments
We thank S. Anders and A. Baud for helpful discussions. We also thank the SangerEBI Single Cell Centre for technical support. We acknowledge support of the European Research Council (Starting grant no. 260507 thSWITCH to S.A.T., Starting Grant LatentCauses to F.J.T., Marie Curie FP7 fellowship 253524 to O.S.), the SangerEBI Single Cell Centre (K.N.N. & A.S.) and the European Molecular Biology Organization (shortterm fellowship to F.B.).
Author information
Author footnotes
These authors contributed equally to this work.
 Florian Buettner &
 Kedar N Natarajan
Affiliations

Helmholtz Zentrum München–German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany.
 Florian Buettner &
 Fabian J Theis

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.
 Florian Buettner,
 Kedar N Natarajan,
 F Paolo Casale,
 Valentina Proserpio,
 Antonio Scialdone,
 Sarah A Teichmann,
 John C Marioni &
 Oliver Stegle

Wellcome Trust Sanger Institute, Hinxton, UK.
 Kedar N Natarajan,
 Valentina Proserpio,
 Antonio Scialdone,
 Sarah A Teichmann &
 John C Marioni

Department of Mathematics, Technische Universität München, Munich, Germany.
 Fabian J Theis
Contributions
F.B. developed the method, performed the analysis and wrote the paper. K.N.N. performed the mESC experiments and contributed to the analysis. F.P.C. and A.S. contributed to method development and analysis. V.P., S.A.T. and F.J.T. helped interpret the biological results. S.A.T. and V.P. designed the mouse T_{H}2 differentiation experiment. J.C.M. and O.S. designed and supervised this study, contributed to the method development and wrote the paper.
Competing financial interests
The authors declare no competing financial interests.
Author details
Florian Buettner
Search for this author in:
Kedar N Natarajan
Search for this author in:
F Paolo Casale
Search for this author in:
Valentina Proserpio
Search for this author in:
Antonio Scialdone
Search for this author in:
Fabian J Theis
Search for this author in:
Sarah A Teichmann
Search for this author in:
John C Marioni
Search for this author in:
Oliver Stegle
Search for this author in:
Supplementary information
PDF files
 Supplementary Text and Figures (34,355 KB)
Supplementary Figures 1–22, Supplementary Tables 3, 6, 8–9 and Supplementary Notes
Excel files
 Supplementary Table 1 (62 KB)
List of genes annotated from cell cycle, either from GO or from CycleBase.
 Supplementary Table 2 (496 KB)
List of contribution of all variance components for the Tcell data.
 Supplementary Table 4 (48 KB)
List of significantly differentially expressed genes between identified cell sub clusters.
 Supplementary Table 5 (34 KB)
Manually curated list of 122 Th2 signature genes.
 Supplementary Table 7 (35 KB)
List of genes with more than 5% of the variance explained by interaction between cell cycle and differentiation.
 Supplementary Data 1 (9 KB)
Corrected and uncorrected expression values for Tcell data.
 Supplementary Data 2 (38 KB)
Corrected and uncorrected expression values for the newly generated mouse ESC data.