Abstract
Singlecell RNAsequencing (scRNAseq) is a powerful highthroughput technique that enables researchers to measure genomewide transcription levels at the resolution of single cells. Because of the low amount of RNA present in a single cell, some genes may fail to be detected even though they are expressed; these genes are usually referred to as dropouts. Here, we present a general and flexible zeroinflated negative binomial model (ZINBWaVE), which leads to lowdimensional representations of the data that account for zero inflation (dropouts), overdispersion, and the count nature of the data. We demonstrate, with simulated and real data, that the model and its associated estimation procedure are able to give a more stable and accurate lowdimensional representation of the data than principal component analysis (PCA) and zeroinflated factor analysis (ZIFA), without the need for a preliminary normalization step.
Introduction
Singlecell RNAsequencing (scRNAseq) is a powerful and relatively young technique enabling the characterization of the molecular states of individual cells through their transcriptional profiles^{1}. It represents a major advance with respect to standard “bulk” RNAsequencing, which is only capable of measuring average gene expression levels within a cell population. Such averaged gene expression profiles may be enough to characterize the global state of a tissue, but completely mask signal coming from individual cells, ignoring tissue heterogeneity. Assessing celltocell variability in expression is crucial for disentangling complex heterogeneous tissues^{2,3,4} and for understanding dynamic biological processes, such as embryo development^{5} and cancer^{6}. Despite the early successes of scRNAseq, to fully exploit the potential of this new technology, it is essential to develop statistical and computational methods specifically designed for the unique challenges of this type of data^{7}.
Because of the tiny amount of RNA present in a single cell, the input material needs to go through many rounds of amplification before being sequenced. This results in strong amplification bias, as well as dropouts, i.e., genes that fail to be detected even though they are expressed in the sample^{8}. The inclusion in the library preparation of unique molecular identifiers (UMIs) reduces amplification bias^{9}, but does not remove dropout events, nor the need for data normalization^{10,11}. In addition to the host of unwanted technical effects that affect bulk RNAseq, scRNAseq data exhibit much higher variability between technical replicates, even for genes with medium or high levels of expression^{12}.
The large majority of published scRNAseq analyses include a dimensionality reduction step. This achieves a twofold objective: (i) the data become more tractable, both from a statistical (cf. curse of dimensionality) and computational point of view; (ii) noise can be reduced while preserving the often intrinsically lowdimensional signal of interest. Dimensionality reduction is used in the literature as a preliminary step prior to clustering^{3,13,14}, the inference of developmental trajectories^{15,16,17,18}, spatiotemporal ordering of the cells^{5,19}, and, of course, as a visualization tool^{20,21}. Hence, the choice of dimensionality reduction technique is a critical step in the data analysis process.
A natural choice for dimensionality reduction is principal component analysis (PCA), which projects the observations onto the space defined by linear combinations of the original variables with successively maximal variance. However, several authors have reported on shortcomings of PCA for scRNAseq data. In particular, for real data sets, the first or second principal components often depend more on the proportion of detected genes per cell (i.e., genes with at least one read) than on an actual biological signal^{22,23}. In addition to PCA, dimensionality reduction techniques used in the analysis of scRNAseq data include independent components analysis (ICA)^{15}, Laplacian eigenmaps^{18,24}, and tdistributed stochastic neighbor embedding (tSNE)^{2,4,25}. Note that none of these techniques can account for dropouts, nor for the count nature of the data. Typically, researchers transform the data using the logarithm of the (possibly normalized) read counts, adding an offset to avoid taking the log of zero.
Recently, Pierson & Yau^{26} proposed a zeroinflated factor analysis (ZIFA) model to account for the presence of dropouts in the dimensionality reduction step. Although the method accounts for the zero inflation typically observed in scRNAseq data, the proposed model does not take into account the count nature of the data. In addition, the model makes a strong assumption regarding the dependence of the probability of detection on the mean expression level, modeling it as an exponential decay. The fit on real data sets is not always good and, overall, the model lacks flexibility, with its inability to include covariates and/or normalization factors.
Here, we propose a general and flexible method that uses a zeroinflated negative binomial (ZINB) model to extract lowdimensional signal from the data, accounting for zero inflation (dropouts), overdispersion, and the count nature of the data. We call this approach ZeroInflated Negative Binomialbased Wanted Variation Extraction (ZINBWaVE). The proposed model includes a samplelevel intercept, which serves as a globalscaling normalization factor, and gives the user the ability to include both genelevel and samplelevel covariates. The inclusion of observed and unobserved samplelevel covariates enables normalization for complex, nonlinear effects (often referred to as batch effects), whereas genelevel covariates may be used to adjust for sequence composition effects, such as gene length and GCcontent effects. ZINBWaVE is an extension of the RUV model^{27,28}, which accounts for zero inflation and overdispersion and for which unobserved samplelevel covariates may either capture variation of interest or unwanted variation. We demonstrate, with simulated and real data, that the model and its associated estimation procedure are able to give a more stable and accurate lowdimensional representation of the data than PCA and ZIFA, without the need for a preliminary normalization step. The approach is implemented in the opensource R package zinbwave, publicly available through the Bioconductor Project (https://bioconductor.org/packages/zinbwave).
Results
ZINBWaVE is a general and flexible model for scRNAseq
ZINBWaVE is a general and flexible model for the analysis of highdimensional zeroinflated count data, such as those recorded in singlecell RNAseq assays. Given n samples (typically, n single cells) and J features (typically, J genes) that can be counted for each sample, we denote with Y_{ij} the count of feature j (j = 1,…, J) for sample i (i = 1, …, n). To account for various technical and biological effects, typical of singlecell sequencing technologies, we model Y_{ij} as a random variable following a ZINB distribution (see Methods for details).
Both the mean expression level (μ) and the probability of dropouts (π) are modeled in terms of observed samplelevel and genelevel covariates (X and V, respectively, Fig. 1). In addition, we include a set of unobserved samplelevel covariates (W) that need to be inferred from the data. The matrix X can include covariates that induce variation of interest, such as cell types, or covariates that induce unwanted variation, such as batch or quality control (QC) measures. It can also include a constant column of ones for an intercept that accounts for genespecific differences in mean expression level or dropout rate (cf. scaling in PCA). The matrix V can also accommodate an intercept to account for cellspecific global effects, such as size factors representing differences in library sizes (i.e., total number of reads per sample). In addition, V can include genelevel covariates, such as gene length or GCcontent.
The unobserved matrix W contains unknown samplelevel covariates, which could correspond to unwanted variation as in RUV^{27,28} or could be of interest as in cluster analysis (e.g., cell type). The model extends the RUV framework to the ZINB distribution (thus far, RUV had only been implemented for linear^{27} and loglinear regression^{28}). It differs in interpretation from RUV in the Wα term, which is not necessarily considered unwanted; this term generally refers to unknown lowdimensional variation, that could be due to unwanted technical effects (as in RUV), such as batch effects, or to biological effects of interest, such as cell cycle or cell differentiation.
It is important to note that although W is the same, the matrices X and V could differ in the modeling of μ and π, if we assume that some known factors do not affect both. When X = 1_{n} and V = 1_{J}, the model is a factor model akin to PCA, where W is a factor matrix and α_{μ} and α_{π} are loading matrices. However, the model is more general, allowing the inclusion of additional sample and genelevel covariates that might help to infer the unknown factors.
ZINBWaVE leads to biologically meaningful clusters
We applied the ZINBWaVE procedure to several publicly available real data sets, from microfluidics, platebased, and dropletbased platforms (see Methods).
As previously shown^{22,23}, the first few principal components of scRNAseq data, even after normalization, can be influenced by technical rather than biological features, such as the proportion of genes with at least one read (detection rate) or the total number of reads (sequencing depth) per sample. Figure 2 illustrates this using the V1 data set: although the first two principal components somewhat segregated the data by layer of origin (Fig. 2a), the clustering was far from perfect. This is at least partly due to unwanted technical effects, such as sequencing depth and amount of starting material. To quantify such technical effects, we computed a set of QC measures, such as detection rate and total number of reads (see Methods). The first two principal components are especially correlated with detection rate (Fig. 2b).
ZIFA suffered from the same problem: the clustering of the samples in two dimensions was not qualitatively different from PCA (Fig. 2c) and the second component was highly correlated with detection rates (Fig. 2d).
Conversely, ZINBWaVE led to tighter clusters, grouping the cells by layer of origin (Fig. 2e). Furthermore, the two components inferred by ZINBWaVE showed lower correlation with the QC features (Fig. 2f), highlighting that the clusters shown in Fig. 2e are not driven by technical effects.
We repeated these analyses on the S1/CA1 data set (Supplementary Fig. 1), mESC data set (Supplementary Fig. 2), and glioblastoma data set (Supplementary Fig. 3). For all data sets, ZINBWaVE led to tighter clusters in two dimensions. However, it did not always lead to a decrease in correlation with the QC measures. See also Hicks et al.^{22} for additional data sets in which principal components are strongly correlated with detection rate.
As a measure of the goodness of the clustering results, we used the average silhouette width (see Methods), computed using the labels available from the original study: these were either known a priori (e.g., the patient ID in the glioblastoma data set) or inferred from the data (and validated) by the authors of the original publication (e.g., the cell types in the S1/CA1 data set). For all four data sets, ZINBWaVE led to generally tighter clusters, as shown by an increased percluster average silhouette width in the majority of the groups (Supplementary Fig. 5).
ZINBWaVE leads to novel biological insights
To demonstrate the ability of ZINBWaVE to lead to novel biological insights, we focused on two inferential questions typical of scRNAseq studies: (i) the identification of developmental lineages and (ii) the characterization of rare cell types.
First, we reanalyzed a set of cells from the mouse olfactory epithelium (OE) that were collected to identify the developmental trajectories that generate olfactory neurons (mOSN), sustentacular cells (mSUS), and microvillous cells (MV)^{29}. In the original publication, the data were normalized by fullquantile normalization, followed by a regressionbased adjustment for sample quality. Clustering on the first 50 principal components identified cellular states that were then ordered into developmental lineages by Slingshot^{30} using the first five PCs. In the original analysis, Slingshot was able to correctly infer the lineages in its supervised mode, by manually specifying lineage endpoints (i.e., clusters corresponding to the mature cell types). Figure 3a shows the minimum spanning tree (MST) obtained with the described supervised analysis. When running Slingshot in unsupervised mode, however, the inferred MST only correctly identified the neuronal (mOSN) and microvillous (MV) lineages, while it was unable to identify sustentacular (mSUS) cells as a mature cell type (Fig. 3b). By repeating the clustering and lineage reconstruction with Slingshot on the first 50 factors of ZINBWaVE, we were able to infer the correct lineages even in unsupervised mode (Fig. 3c). This suggests that the lowdimensional signal revealed by ZINBWaVE more closely matches the true developmental process. See Perraudeau et al.^{31} for a complete workflow that involves ZINBWaVE for dimensionality reduction and Slingshot for lineage reconstruction.
We next focused on a set of 68,579 peripheral blood mononuclear cells (PBMCs) assayed with the 10× Genomics Chromium system^{32}. This example allowed us to demonstrate how ZINBWaVE can be applied to a stateoftheart data set that comprises tens of thousands of cells, while also illustrating that existing clustering algorithms can be used on the lowrank matrix inferred by ZINBWaVE. In particular, we applied a popular clustering method, based on the identification of shared nearest neighbors^{33}, similar to that of Macosko et al.^{2} and implemented in the R package Seurat^{34}. We modified the clustering procedure to use ZINBWaVE (with K = 10) instead of PCA as the dimensionality reduction step. To visualize the clustering results, we applied tSNE to the inferred W matrix (Fig. 3d).
Our approach recapitulates the major cell populations found in Zheng et al.^{32} (Fig. 3d, e). In particular, 80% of the cells are Tcells (Clusters 0–1 CD4+ Tcells; Clusters 2–6 CD8+ Tcells). As in Zheng et al.^{32}, we are able to identify populations of activated cytotoxic Tcells (Cluster 5; 9%), natural killer cells (Cluster 7; 6%), and Bcells (Cluster 8; 5%) (Fig. 3d). In addition, we are able to identify subpopulations of myeloid cells that were not completely characterized in the original publication. In particular, we identified clusters corresponding to: CD14+ monocytes (Cluster 9; 3%), characterized by the expression of CD14, S100A9, LYZ^{35}; CD16+ monocytes (Cluster 10; 2.5%), which express FCGR3A/CD16, AIF1, FTL, LST1^{35}; CD1C+ dendritic cells (DC; Cluster 11; 1%), which express CD1C, LYZ, HLADR^{35}; plasmacytoid dentritic cells (pDC; Cluster 13; 0.5%), expressing GZMB, SERPINF1, ITM2C^{35}; megakaryocyte (Cluster 15; 0.2%), expressing PF4^{32} (Fig. 3e). We also identified several small clusters (collectively comprising about 1% of the cells) that were either enriched for mitochondrial genes (Cluster 14) or ribosomal genes (Clusters 17 and 18), or for which we could not find any markers (Clusters 12 and 16). We hypothesize that these clusters represent either doublets or lower quality libraries.
The above analysis differs from that of the original publication not only in terms of dimensionality reduction, but also in terms of clustering (Zheng et al.^{32} used kmeans on the first 50 PCs). We hence repeated the clustering using a sequential procedure based on kmeans (see Methods for details). This led to similar (albeit noisier) results (Supplementary Fig. 6), highlighting that ZINBWaVE is able to extract meaningful biological signal from the data, as input to a variety clustering procedures. In fact, extracting a twodimensional signal from ZINBWaVE already allowed us to identify most major cell types (Supplementary Fig. 7).
Impact of normalization
As for any highthroughput genomic technology, an important aspect of scRNAseq data analysis is normalization. Indeed, there are a variety of steps in scRNAseq experiments that can introduce bias in the data and whose effects need to be corrected for^{11}. Some of these effects, e.g., sequencing depth, can be captured by a globalscaling factor (usually referred to as size factor). Other more complex, nonlinear effects, such as those collectively known as batch effects, require more sophisticated normalization^{28}. Accordingly, a typical scRNAseq pipeline will include a normalization step, i.e., a linear or nonlinear transformation of read counts, to make the distributions of expression measures comparable between samples. The normalization step is usually carried out prior to any inferential procedure (e.g., clustering, differential expression analysis, or pseudotime ordering). In this work, we considered three popular betweensample normalization methods: total count (TC), trimmed mean of M values (TMM), and fullquantile (FQ) normalization (see Methods); we also compared the results to unnormalized data (RAW). TC, along with the related methods transcripts per million (TPM) and fragments per kilobase million (FPKM), is by far the most widely used normalization in the scRNAseq literature; hence, TCnormalized data were used for the results shown in Fig. 2 and in Supplementary Figs. 1–3.
Normalization highly affected the projection of the data in lower dimensions (Supplementary Figs. 8–15) and, consequently, the clustering results varied greatly between normalization methods in all four data sets (Fig. 4). Strikingly, the choice of normalization method was more critical than the choice between PCA and ZIFA. For instance, for the mESC data set (Fig. 4d), FQ normalization followed by either PCA or ZIFA led to very high average silhouette width. Critically, the ranking of normalization methods was not consistent across data sets (Fig. 4), highlighting that the identification of the best normalization method for a given data set is a difficult problem for scRNAseq^{11,36}.
One important feature of the ZINBWaVE model is that the samplelevel intercept γ (corresponding to a column of ones in the genelevel covariate matrix V, see Methods) acts as a globalscaling normalization factor, making a preliminary globalscaling step unnecessary. As a consequence, ZINBWaVE can be directly applied to unnormalized read counts, hence preserving the distributional properties of count data. ZINBWaVE applied to unnormalized counts led to results that were comparable, in terms of average silhouette width, to PCA and ZIFA applied using the best performing normalization (Fig. 4). In particular, ZINBWaVE outperformed PCA and ZIFA on the S1/CA1 (Fig. 4b) and glioblastoma (Fig. 4c) data sets, while it was slightly worse than PCA (after FQ normalization) on the mESC data set (Fig. 4d) and than ZIFA (after FQ normalization) on the V1 data set (Fig. 4a). Interestingly, the overall average silhouette width was lower for the S1/CA1 and glioblastoma data sets than it was for the V1 and mESC data sets, suggesting that ZINBWaVE leads to better clustering in more complex situations (e.g., lower sequencing depth).
Although the overall average silhouette width is a useful metric to rank the different methods in terms of how well they represent known global biological structure, looking only at the average across many different cell types may be misleading. For the V1 data set, for instance, ZINBWaVE led to a slightly lower overall average silhouette width than ZIFA (Fig. 4a). However, ZINBWaVE yielded higher clusterlevel average silhouette widths for the L6a, L5b, and L5 samples, whereas ZIFA produced higher average silhouette widths for the L4 and L5a samples (Supplementary Fig. 5a). In fact, certain cell types may be easier to cluster than others, leading to silhouette widths that differ greatly between different clusters, as is the case for the S1/CA1 data set: oligodendrocytes are much easier to cluster than the other cell types and all methods were able to identify them (Supplementary Fig. 5b); on the other hand, only ZINBWaVE was able to achieve a positive silhouette width for the pyramidal SS and CA1 neurons and it performed much better than PCA and ZIFA in the interneuron cluster; finally, certain groups, such as the endothelialmural and astrocytes, were simply too hard to distinguish in three dimensions (Supplementary Fig. 5b).
Accounting for batch effects
Although the samplelevel intercept γ (corresponding to a column of ones in the genelevel covariate matrix V) can act as a globalscaling normalization factor, this may not be sufficient to accurately account for complex, nonlinear technical effects that may influence the data (e.g., batch effects). Hence, we explored the possibility of including additional samplelevel covariates in X to account for such effects (see Methods).
We illustrate this using the mESC data set, which is a subset of the data described in Kolodziejczyk et al.^{37} and which comprises two batches of mESCs grown in three different media (see Methods). ZINBWaVE applied with no additional covariates led to three clusters, corresponding to the three media (Fig. 5a; Supplementary Fig. 2). However, the clusters corresponding to media 2i and a2i are further segregated by batch (Fig. 5a). Including indicator variables for batch in ZINBWaVE (as columns of X) removed such batch effects, consolidating the clustering by medium (Fig. 5b). This led to slightly larger average silhouette widths, both overall (Fig. 4d) and at the cluster level (Fig. 5c). Conversely, the average silhouette width for batch, a measure of how well the cells cluster by batch, was much larger for the model that did not include the batch covariate (Fig. 5d), indicating that explicitly accounting for batch in the ZINBWaVE model effectively removed the dependence of the inferred lowdimensional space on batch effects. Similar results were obtained by normalizing the data with the ComBat batch correction method^{38}, implemented in the Bioconductor R package sva^{39} (Supplementary Fig. 16). The advantage of ZINBWaVE is the ability of including batch effects in the same model used for dimensionality reduction, without the need for prior data normalization.
The mESC data set is an example of good experimental design, where each batch includes cells from each biological condition (this is known as a factorial design). Hence, it is relatively easy to correct for batch effects and, unsurprisingly, both ComBat and ZINBWaVE successfully do so. The glioblastoma data set is an example of a more complex situation, in which there is confounding between batch and biology, each patient being processed separately^{22}. Luckily, the glioblastoma design is not completely confounded, as patient MGH26 was processed in two batches (Fig. 5e). We used this feature of the data to test whether ZINBWaVE was able to adjust for batch effects even in the presence of confounding. ComBat was not able to correctly account for batch, removing the patient effects along with the batch effects (Supplementary Fig. 17a). Including the batch variable as a covariate in the ZINBWaVE model led to similar, unsatisfactory results (Supplementary Fig. 17b). One key observation, recently made by Townes et al.^{40}, is that the detection rate is markedly different between the two batches of MGH26 (Supplementary Fig. 17c): including the detection rate as a covariate in the ZINBWaVE model led to the removal of the batch effect, while preserving the biological differences between patients (Fig. 5f). Note that this is analogous of the inclusion of the “cellular detection rate” in the MAST model^{23}. Although this helped adjusting for batch effects, the confounding between patient and batch is still present in the data because of the poor experimental design and no normalization will be able to completely account for such confounding.
Goodnessoffit of ZINBWaVE model
We compared the goodnessoffit of our ZINBWaVE model to that of a negative binomial (NB) model (as implemented in edgeR^{41}) and a hurdle model (as implemented in MAST^{23}).
As previously noted in the literature, NB models, which are quite successful for bulk RNAseq, are not appropriate for singlecell RNAseq, as they cannot accommodate zero inflation. In particular, NB models poorly fit the data in terms of the overall mean count and zero probability (Supplementary Figs. 18 and 19) and appears to handle the excess of zeros by overestimating the dispersion parameter (Supplementary Fig. 20).
We also examined the goodnessoffit of the MAST hurdle model, which is specifically designed for scRNAseq. However, a direct comparison of MAST with NB and ZINB is cumbersome, due to differences in parameterization. In particular, MAST models log2(TPM + 1) rather than counts and does not have a dispersion parameter but only a variance parameter for the Gaussian component (see Methods for details). We found that MAST underestimated the overall mean log2(TPM + 1) and overestimated the zero probability, but had stable variance estimates over the observed proportion of zero counts (Supplementary Fig. 21).
By contrast, our ZINB model lead to better fit in terms of both the overall mean count and zero probability, as well as more stable estimators of the dispersion parameter (Supplementary Figs. 18–20).
ZINBWaVE estimators are asymptotically unbiased and robust
We next turned to simulations to explore in greater detail the performance of ZINBWaVE. First, we evaluated the ZINBWaVE estimation procedure on simulated data from a ZINB distribution, to assess both accuracy under a correctly specified model and robustness to model misspecification. The approach involves computing maximum likelihood estimators (MLE) for the parameters of the model of Fig. 1, namely, α, β, γ, and W, and hence μ and π. MLE are asymptotically unbiased and efficient estimators. However, because the likelihood function of our ZINBWaVE model is not convex, our numerical optimization procedure may converge to a local maximum, rather than to the true MLE (see Methods). We observed that our estimators are asymptotically unbiased and have decreasing variance as the number of cells n increases (Supplementary Fig. 22). This suggests that our estimates are not far from the true MLE.
To assess the robustness of the ZINBWaVE procedure, we examined bias and mean squared error (MSE) for estimators of μ and π, as well as the loglikelihood function, the Akaike information criterion (AIC), and the Bayesian information criterion (BIC), for models with misspecified number of unobserved covariates K (i.e., number of columns of W), genelevel covariate matrix V, and dispersion parameter φ (Fig. 6; Supplementary Figs. 23–25). In Fig. 6, the data were simulated with K = 2 unknown factors, X = 1_{n}, V = 1_{J}, and genewise dispersion. The model parameters were then estimated with K = 1, 2, 3, 4, both for a model that included a celllevel intercept (V = 1_{J}) and one that did not (V = 0_{J}). When the intercept was correctly included in the model, misspecification of K (with K > 2) resulted in no or very small bias (Fig. 6a, b), small MSE (Fig. 6c, d), and a greater impact on AIC and especially BIC (Supplementary Fig. 23). When no intercept was included in the fitted model, the sensitivity to K became more important. However, although the data were simulated with K = 2, specifying K ≥ 3 led to small bias and low MSE (Fig. 6). This is likely because one column of W acted as a de facto intercept, overcoming the absence of V and explaining why AIC and BIC are minimized at K = 4 (Supplementary Fig. 23). In addition, we observed robustness of the results to the choice of dispersion parameter (genewise or common), even though the data were simulated with genewise dispersion (Fig. 6). The estimators for ln(μ) and π were unbiased over the whole range of mean expression and zero inflation probability (Supplementary Fig. 26).
ZINBWaVE is more accurate than stateoftheart methods
We next compared ZINBWaVE to PCA and ZIFA, in terms of their ability to recover the true underlying lowdimensional signal and clustering structure. For data sets simulated from a ZINB model, our estimation procedure with correctly specified K led to almost perfect correlation between distances in the true and estimated lowdimensional space (Fig. 7a, b). The correlation remained high even for misspecified K, in most cases higher than for PCA and ZIFA. ZINBWaVE performed consistently well across a range of simulation scenarios, including different numbers of cells n, different zero fractions, and varying cluster tightness (Supplementary Fig. 28). We observed a consistent ranking, although noisier, when the methods were compared in terms of silhouette widths (Fig. 7c, d).
As with the real data sets, the choice of normalization influenced the simulation results. Critically, there was not an overall best normalization method; rather, the performance of the normalization methods depended on the dimensionality reduction method and on intrinsic characteristics of the data, such as the fraction of zero counts and the number and tightness of the clusters (Fig. 7; Supplementary Fig. 28). For instance, TC normalization worked best for PCA on data simulated from the V1 data set (Fig. 7a, c), whereas FQ and TMM normalization worked best for PCA and ZIFA, respectively, on data simulated from the S1/CA1 data set (Fig. 7b, d).
It is important to note that the previous results were obtained for data simulated from the ZINBWaVE model underlying our estimation procedure. It is hence not surprising that ZINBWaVE outperformed its competitors. To provide a fairer comparison, we also assessed the methods on data simulated from the model proposed by Lun & Marioni^{42}. Although this model is also based on a ZINB distribution, the distribution is parameterized differently and, in particular, does not involve regression on genelevel and samplelevel covariates (see Methods).
When the data were simulated to have a moderate fraction of zeros (namely, 40%), all methods performed well in terms of average silhouette width (Fig. 7e–g) and precision and recall (Supplementary Fig. 29). However, the performance of PCA decreased dramatically with 60% of zeros, independently of the number of cells n. Although ZIFA worked well with 60% or fewer zeros, its performance too decreased at 80% of zeros, especially when only 100 cells were simulated (Fig. 7e). Conversely, the performance of ZINBWaVE remained good even when simulating data with 80% of zeros, independently of the sample size.
Discussion
Recent advances in singlecell technologies, such as dropletbased methods^{2}, make it easy and inexpensive to collect hundreds of thousands of scRNAseq profiles, allowing researchers to study very complex biological systems at high resolution. The resulting libraries are often sequenced at extremely low depth (tens of thousands of reads per cell, only), making the corresponding read count data truly sparse. Hence, there is a growing need for developing reliable statistical methods that are scalable and that can account for zero inflation.
ZINBWaVE is a general and flexible approach to extract lowdimensional signal from noisy, zeroinflated data, such as those from scRNAseq experiments. We have shown with simulated and real data analyses that ZINBWaVE leads to robust and unbiased estimators of the underlying biological signals. The better performance of ZINBWaVE with respect to PCA comes at a computational cost, as we need to numerically optimize a nonconvex likelihood function. However, we empirically found that the computing time was approximately linear in both the number of cells and the number of genes, and approximately quadratic in the number of latent factors (Supplementary Fig. 30). The algorithm benefits from parallelization on multicore machines and takes a few minutes on a modern laptop to converge for thousands of cells.
One major difference between ZINBWaVE and previously proposed factor analysis models (such as PCA and ZIFA) is the ability to include samplelevel and genelevel covariates. In particular, by including a column of ones in the genelevel covariate matrix, the corresponding celllevel intercept acts as a globalscaling normalization factor, allowing the modeling of raw count data, with no need for prior normalization.
However, there is no guarantee that the lowdimensional signal extracted by ZINBWaVE is biologically relevant: If unwanted technical variation affects the data and is not accounted for in the model (or in prior normalization), the lowrank matrix W inferred by ZINBWaVE will capture such confounding effects. It is therefore important to explore the correlation between the latent factors estimated by our procedure and known measures of quality control that can be computed for scRNAseq libraries, using, for instance, the Bioconductor R package scater^{43} (see Fig. 2f). If one observes high correlation between one or more latent factors and some QC measures, it may be beneficial to include these QC measures as covariates in the model.
Several authors have recognized that highdimensional genomic data are affected by a variety of unwanted technical effects (e.g., batch effects) that can be confounded with the biological signal of interest, and have proposed methods to account for such effects in either a supervised^{38} or unsupervised way^{27,44}. Recently, Lin et al.^{45} proposed a model that can extend PCA to adjust for confounding factors. This model, however, does not seem to be ideal for zeroinflated count data. In the scRNAseq literature, MAST^{23} uses the inferred cellular detection rate to adjust for the main source of confounding, in a differential expression setting, but is not designed to infer lowdimensional signal.
The removal of batch effects is an important example of how including additional covariates in the ZINBWaVE model may lead to better lowdimensional representations of the data. However, ZINBWaVE is not limited to including batch effects, as other samplelevel (e.g., QC metrics) and/or genelevel (e.g., GCcontent) covariates may be included in the model. Although we did not find any compelling examples in which adding a genelevel covariate leads to improve signal extraction, it is interesting to note the relationship between GCcontent and batch effects^{46}. With large collaborative efforts, such as the Human Cell Atlas^{47}, on the horizon, we anticipate that the ability of our model to include genelevel covariates that can potentially help accounting for differences in protocols will prove important.
Although the lowdimensional signal inferred by ZINBWaVE can be used to visually inspect hidden structure in the data, visualization is not the main point of our proposed method. The lowdimensional factors are intended to be the closest possible approximation to the true signal, which is assumed to be intrinsically lowdimensional. Such a lowdimensional representation can be used in downstream analyses, such as clustering or pseudotime ordering of the cells^{15}.
Visualization of highdimensional data sets is an equally important area of research and many algorithms are available, among which tSNE^{25} has become the most popular for scRNAseq data. Recently, Wang et al.^{48} have proposed a novel visualization algorithm that can account for zero inflation and showed improvement over tSNE. As tSNE takes as input a matrix of cell pairwise distances, which may be noisy in high dimensions, a typical pipeline involves computing such distances in PCA space, selecting, for example, the first 50 PCs. An alternative approach is to derive such distances from the lowdimensional space defined by the factors inferred by ZINBWaVE. This strategy was used effectively in Fig. 3c to visualize the PBMC data set.
In this article, we have focused on an unsupervised setting, where the goal is to extract a lowdimensional signal from noisy zeroinflated data. However, our proposed ZINB model is more general and can be used, in principle, for supervised differential expression analysis, where the parameters of interest are regression coefficients β corresponding to known samplelevel covariates in the matrix X (e.g., cell type, treatment/control status). Differentially expressed genes may be identified via likelihood ratio tests or Wald tests, with standard errors of estimators of β obtained from the Hessian matrix of the likelihood function. In addition, posterior dropout probabilities can be readily derived from the model and used as weights to unlock standard bulk RNAseq methods^{49}, such as edgeR^{41}. We envision a future version of the zinbwave package with this added capability.
Methods
ZINBWaVE model
For any μ ≥ 0 and θ > 0, let f_{NB}(⋅; μ, θ) denote the probability mass function (PMF) of the negative binomial (NB) distribution with mean μ and inverse dispersion parameter θ, namely:
Note that another parametrization of the NB PMF is in terms of the dispersion parameter φ = θ^{−1} (although θ is also sometimes called dispersion parameter in the literature). In both cases, the mean of the NB distribution is μ and its variance is:
In particular, the NB distribution boils down to a Poisson distribution when φ = 0 ⇔ θ = +∞.
For any π ∈ [0, 1], let f_{ZINB}(⋅ ; μ, θ, π) be the PMF of the ZINB distribution given by:
where δ_{0}(⋅) is the Dirac function. Here, π can be interpreted as the probability that a 0 is observed instead of the actual count, resulting in an inflation of zeros compared to the NB distribution, hence the name ZINB.
Given n samples (typically, n single cells) and J features (typically, J genes) that can be counted for each sample, let Y_{ij} denote the count of feature j (for j = 1,…, J) for sample i (i = 1,…, n). To account for various technical and biological effects frequent, in particular, in singlecell sequencing technologies, we model Y_{ij} as a random variable following a ZINB distribution with parameters μ_{ij}, θ_{ij}, and π_{ij}, and consider the following regression models for the parameters:
where
and elements of the regression models are as follows.
X is a known n × M matrix corresponding to M celllevel covariates and β = (β_{μ}, β_{π}) its associated M × J matrices of regression parameters. X can typically include covariates that induce variation of interest, such as cell types, or covariates that induce unwanted variation, such as batch or quality control measures. It can also include a constant column of ones, 1_{n}, to account for genespecific intercepts.
V is a known J × L matrix corresponding to J genelevel covariates, such as gene length or GCcontent, and γ = (γ_{μ}, γ_{π}) its associated L × n matrices of regression parameters. V can also include a constant column of ones, 1_{J}, to account for cellspecific intercepts, such as size factors representing differences in library sizes.
W is an unobserved n × K matrix corresponding to K unknown celllevel covariates, which could be of “unwanted variation” as in RUV^{27,28} or of interest (such as cell type), and α = (α_{μ}, α_{π}) its associated K × J matrices of regression parameters.
O_{μ} and O_{π} are known n × J matrices of offsets. \(\zeta \in {\Bbb R}^J\) is a vector of genespecific dispersion parameters on the log scale. This model deserves a few comments.
By default, X and V contain a constant column of ones, to account, respectively, for genespecific (e.g., baseline expression level) and cellspecific (e.g., library size) variation. In that case, X and V are of the form X = [1_{n}, X^{0}] and V = [1_{J}, V^{0}] and we can similarly decompose the corresponding parameters as β = [β^{1}, β^{0}] and γ = [γ^{1}, γ^{0}], where \(\beta ^1 \in {\Bbb R}^{1 \times J}\) is a vector of genespecific intercepts and \(\gamma ^1 \in {\Bbb R}^{1 \times n}\) a vector of cellspecific intercepts. The representation \({\bf{1}}_n\beta ^1 + \left( {{\bf{1}}_J\gamma ^1} \right)^ \top\) is then not unique, but could be made unique by adding a constant and constraining β^{1} and γ^{1} to each have elements summing to zero.
Although W is the same, the matrices X and V could differ in the modeling of μ and π, if we assume that some known factors do not affect both μ and π. To keep notation simple and consistent, we use the same matrices, but will implicitly assume that some parameters may be constrained to be 0 if needed.
By allowing the models to differ for μ and π, we can model and test for differential expression in terms of either the NB mean or the ZI probability.
We limit ourselves to a genedependent dispersion parameter. More complicated models for θ_{ij} could be investigated, such as a model similar to μ_{ij} or a functional of the form θ_{ij} = f(μ_{ij}), but we restrict ourselves to a simpler model that has been shown to be largely sufficient in bulk RNAseq analysis.
ZINBWaVE estimation procedure
The input to the model are the matrices X, V, O_{μ}, and O_{π} and the integer K; the parameters to be inferred are β = (β_{μ}, β_{π}), γ = (γ_{μ}, γ_{π}), W, α = (α_{μ}, α_{π}), and ζ. Given an n × J matrix of counts Y, the loglikelihood function is
where μ_{ij}, θ_{ij}, and π_{ij} depend on (β, γ, W, α, ζ) through Eq. (4).
To infer the parameters, we follow a penalized maximum likelihood approach, by trying to solve
where Pen(⋅) is a regularization term to reduce overfitting and improve the numerical stability of the optimization problem in the setting of many parameters. For nonnegative regularization parameters \(\left( {{\it{\epsilon }}_\beta ,{\it{\epsilon }}_\gamma ,{\it{\epsilon }}_W,{\it{\epsilon }}_\alpha ,{\it{\epsilon }}_\zeta } \right)\), we set
where β^{0} and γ^{0} denote the matrices β and γ without the rows corresponding to the intercepts if an unpenalized intercept is included in the model, \(\left\ \cdot \right\\) is the Frobenius matrix norm (\(\left\ A \right\ = \sqrt {{\mathrm{tr}}\left( {A^ \ast A} \right)}\), where A^{*} denotes the conjugate transpose of A), and \({\mathrm{Var}}(\zeta ) = 1{\mathrm{/}}(J  1)\mathop {\sum }\limits_{i = 1}^J \left( {\zeta _i  \left( {\mathop {\sum }\limits_{j = 1}^J \zeta _j} \right){\mathrm{/}}J} \right)^2\) is the variance of the elements of ζ (using the unbiased sample variance statistic). The penalty tends to shrink the estimated parameters to 0, except for the cell and genespecific intercepts which are not penalized and the dispersion parameters which are not shrunk towards 0 but instead towards a constant value across genes. Note also that the likelihood only depends on W and α through their product R = Wα and that the penalty ensures that at the optimum W and α have the structure described in the following result, which generalizes standard results such as Srebro et al.^{50} (Lemma 1) and Mazumder et al.^{51} (Lemma 6).
Lemma 1: For any matrix R and positive scalars s and t, the following holds:
where \(\left\ A \right\_ \ast = {\mathrm{tr}}\left( {\sqrt {A^ \ast A} } \right)\). If R = R_{L}R_{Σ}R_{R} is a singular value decomposition (SVD) of R, then a solution to this optimization problem is:
Proof: Let \(\tilde S = \sqrt s S\), \(\tilde T = \sqrt t T\), and \(\tilde R = \sqrt {st} R\). Then, \(\left\ {\tilde S} \right\^2 = s\left\ S \right\^2\), \(\left\ {\tilde T} \right\^2 = t\left\ T \right\^2\), and \(\tilde S\tilde T = \sqrt {st} ST\), so that the optimization problem is equivalent to:
which by Mazumder et al.^{51} (Lemma 6) has optimum value \(\left\ {\tilde R} \right\_ \ast = \sqrt {st} \left\ R \right\_ \ast\) reached at \(\tilde S = \tilde R_L\tilde R_\Sigma ^{\frac{1}{2}}\) and \(\tilde T = \tilde R_\Sigma ^{\frac{1}{2}}\widetilde {R_R}\), where \(\tilde R_L\tilde R_\Sigma \tilde R_R\) is a SVD of \(\tilde R\). Observing that \(\tilde R_L = R_L\), \(\tilde R_R = R_R\), and \(\tilde R_\Sigma = \sqrt {st} R_\Sigma\), gives that a solution of the optimization problem is \(S = s^{  1/2}\tilde S = s^{  1/2}R_L(st)^{1/4}R_\Sigma ^{1/2} = (t{\mathrm{/}}s)^{1/4}R_LR_\Sigma ^{1/2}\). A similar argument for T concludes the proof.
This lemma implies in particular that at any local maximum of the penalized loglikelihood, W and \(\alpha ^ \top\) have orthogonal columns, which is useful for visualization or interpretation of latent factors.
To balance the penalties applied to the different matrices in spite of their different sizes, a natural choice is to fix \({\it{\epsilon }} > 0\) and set
In particular, from Lemma 1, we easily deduce the following characterization of the penalty on W and α, which shows that the entries in the matrices W and α have similar standard deviation after optimization:
Corollary 1: For any n × J matrix R and positive scalars \({\it{\epsilon }}\), the following holds.
If R = R_{L}R_{Σ}R_{R} is a SVD decomposition of R, then a solution to this optimization problem is:
In particular, for any i = 1, …, min(n, J),
The penalized likelihood is however not concave, making its maximization computationally challenging. We instead find a local maximum, starting from a smart initialization and iterating a numerical optimization scheme until local convergence, as described below.
Initialization: To initialize the set of parameters we approximate the count distribution by a lognormal distribution and explicitly separate zero and nonzero values, as follows:

1.
Set \({\cal P} = \left\{ {(i,j):Y_{ij} > 0} \right\}\).

2.
Set L_{ij} = ln(Y_{ij}) − (O_{μ})_{ij} for all \((i,j) \in {\cal P}\).

3.
Set \(\hat Z_{ij} = 0\) if \((i,j) \in {\cal P}\), \(\hat Z_{ij} = 1\) otherwise.

4.
Estimate β_{μ} and γ_{μ} by solving the convex ridge regression problem:
This is a standard ridge regression problem, but with a potentially huge design matrix, with up to nJ rows and MJ + nL columns. To solve it efficiently, we alternate the estimation of β_{μ} and γ_{μ}. Specifically, we initialize parameter values as:
and repeat the following two steps a few times (or until convergence):

a.
Optimization in γ_{μ}, which can be performed independently and in parallel for each cell:
$$\hat \gamma _\mu \in \mathop {{{\mathrm{argmin}}}}\limits_{\gamma _\mu } \mathop {\sum }\limits_{(i,j) \in {\cal P}} \left( {L_{ij}  \left( {X\hat \beta _\mu } \right)_{ij}  \left( {V\gamma _\mu } \right)_{ji}} \right)^2 + \frac{{{\it{\epsilon }}_\gamma }}{2}\left\ {\gamma _\mu ^0} \right\^2.$$ 
b.
Optimization in β_{μ}, which can be performed independently and in parallel for each gene:
$$\hat \beta _\mu \in \mathop {{{\mathrm{argmin}}}}\limits_{\beta _\mu } \mathop {\sum }\limits_{(i,j) \in {\cal P}} \left( {L_{ij}  \left( {V\hat \gamma _\mu } \right)_{ji}  \left( {X\beta _\mu } \right)_{ij}} \right)^2 + \frac{{{\it{\epsilon }}_\beta }}{2}\left\ {\beta _\mu ^0} \right\^2.$$ 
5.
Estimate W and α_{μ} by solving
$$\left( {\hat W,\hat \alpha _\mu } \right) \in \mathop {{{\mathrm{argmin}}}}\limits_{W,\alpha _\mu } \mathop {\sum }\limits_{(i,j) \in {\cal P}} \left( {L_{ij}  \left( {X\hat \beta _\mu } \right)_{ij}  \left( {V\hat \gamma _\mu } \right)_{ji}  \left( {W\alpha _\mu } \right)_{ij}} \right)^2 \\ + \frac{{{\it{\epsilon }}_W}}{2}\left\ W \right\^2 + \frac{{{\it{\epsilon }}_\alpha }}{2}\left\ {\alpha _\mu } \right\^2.$$Denoting by \(D = L  X\hat \beta  (V\hat \gamma )^ \top\), this problem can be rewritten as:
$$\mathop {{{\mathrm{min}}}}\limits_{W,\alpha } \left\ {D  W\alpha } \right\_{\cal P}^2 + \frac{1}{2}\left( {{\it{\epsilon }}_W\left\ W \right\^2 + {\it{\epsilon }}_\alpha \left\ \alpha \right\^2} \right),$$where \(\left\ A \right\_{\cal P}^2 = \mathop {\sum }\nolimits_{(i,j) \in {\cal P}} A_{ij}^2\). By Lemma 1, if K is large enough, one can first solve the convex optimization problem:
$$\hat R \in \mathop {{{\mathrm{argmin}}}}\limits_{R:{\mathrm{rank}}(R) \le K} \left\ {D  R} \right\_{\cal P}^2 + \sqrt {{\it{\epsilon }}_W{\it{\epsilon }}_\alpha } \left\ R \right\_ \ast$$(6)and set
$$W = \left( {\frac{{{\it{\epsilon }}_\alpha }}{{{\it{\epsilon }}_W}}} \right)^{\frac{1}{4}}R_LR_\Sigma ^{\frac{1}{2}},\quad \alpha = \left( {\frac{{{\it{\epsilon }}_W}}{{{\it{\epsilon }}_\alpha }}} \right)^{\frac{1}{4}}R_\Sigma ^{\frac{1}{2}}R_R,$$where \(\hat R = R_LR_\Sigma R_R\) is the SVD of \(\hat R\). This solution is exact when K is at least equal to the rank of the solution of the unconstrained problem (6), which we solve with the softImpute::softImpute() function^{51}. If K is smaller, then (6) becomes a nonconvex optimization problem whose global optimum may be challenging to find. In that case we also use the rankconstrained version of softImpute::softImpute() to obtain a good local optimum.

6.
Estimate β_{π}, γ_{π}, and α_{π} by solving the regularized logistic regression problem:
This is a standard ridge logistic regression problem, but with a potentially huge design matrix, with up to nJ rows and MJ + nL columns. To solve it efficiently, we alternate the estimation of β_{π}, γ_{π}, and α_{π}. Specifically, we initialize parameter values as:
and repeat the following two steps a few times (or until convergence):

a.
Optimization in γ_{π}:
$$\begin{array}{*{20}{c}} {\hat \gamma _\pi \in \mathop {{{\mathrm{argmin}}}}\limits_{\gamma _\pi } \mathop {\sum }\limits_{(i,j)} \left[ {  \hat Z_{ij}\left( {X\hat \beta _\pi + \left( {V\gamma _\pi } \right)^ \top + \hat W\hat \alpha _\pi } \right)} \right._{ij}} \\ {\left. { + {\mathrm{ln}}\left( {1 + e^{\left( {X\hat \beta _\pi + \left( {V\gamma _\pi } \right)^ \top + \hat W\hat \alpha _\pi } \right)_{ij}}} \right)} \right] + \frac{{{\it{\epsilon }}_\gamma }}{2}\left\ {\gamma _\pi } \right\^2.} \end{array}$$(8)Note that this problem can be solved for each cell (i) independently and in parallel. When there is no gene covariate besides the constant intercept, the problem is easily solved by setting \(\left( {\hat \gamma _\pi } \right)_i\) to the logit of the proportion of zeros in each cell.

b.
Optimization in β_{π} and α_{π}:

7.
Initialize \(\hat \zeta = 0\).
Optimization: After initialization, we maximize locally the penalized loglikelihood by alternating optimization over the dispersion parameters and left and rightfactors, iterating the following steps until convergence:

1.
Dispersion optimization:
$$\hat \zeta \leftarrow \mathop {{{\mathrm{argmax}}}}\limits_\zeta \left\{ {\ell \left( {\hat \beta ,\hat \gamma ,\hat W,\hat \alpha ,\zeta } \right)  \frac{{{\it{\epsilon }}_\zeta }}{2}{\mathrm{Var}}(\zeta )} \right\}.$$
To solve this problem, we start by estimating a common dispersion parameter for all the genes, by maximizing the objective function under the constraint that Var(ζ) = 0; in practice, we use a derivativefree onedimensional optimization vector over the range [−50, 50]. We then optimize the objective function by a quasiNewton optimization scheme starting from the constant solution found by the first step. To derive the gradient of the objective function used by the optimization procedure, note that the derivative of the NB logdensity is:
where Ψ(z) = Γ′(z)/Γ(z) is the digamma function. We therefore get the derivative of the ZINB density as follows, for any π ∈ [0, 1]:

If y > 0, f_{ZINB}(y; μ, θ, π) = (1 − π)f_{NB}(y; μ, θ) therefore
$$\frac{\partial }{{\partial \theta }}{\mathrm{ln}}f_{\mathrm{ZINB}}(y;\mu ,\theta ) = {{\Psi }}(y + \theta )  {{\Psi }}(\theta ) + {\mathrm{ln}}\theta + 1  {\mathrm{ln}}(\mu + \theta )  \frac{{y + \theta }}{{\mu + \theta }}.$$ 
For y = 0, \(\frac{\partial }{{\partial \theta }}{\mathrm{ln}}f_{\mathrm{NB}}(0;\mu ,\theta ) = {\mathrm{ln}}\theta + 1  {\mathrm{ln}}(\mu + \theta )  \frac{\theta }{{\mu + \theta }}\), therefore
The derivative of the objective function w.r.t. ζ_{j}, for j = 1,…, J, is then easily obtained by
(Note that the J − 1 term in the denominator comes from the use of the unbiased sample variance statistic in the penalty for ζ.)

2.
Leftfactor (cellspecific) optimization:
$$\left( {\hat \gamma ,\hat W} \right) \leftarrow \mathop {{{\mathrm{argmax}}}}\limits_{\left( {\gamma ,W} \right)} \left\{ {\ell \left( {\hat \beta ,\gamma ,W,\hat \alpha ,\hat \zeta } \right)  \frac{{{\it{\epsilon }}_\gamma }}{2}\left\ {\gamma ^0} \right\^2  \frac{{{\it{\epsilon }}_W}}{2}\left\ W \right\^2} \right\}.$$(10)Note that this optimization can be performed independently and in parallel for each cell i = 1, …, n. For this purpose, we consider a subroutine solveZinbRegression (y, A_{μ}, B_{μ}, C_{μ}, A_{π}, B_{π}, C_{π}, C_{θ}) to find a set of vectors (a_{μ}, a_{π}, b) that locally maximize the loglikelihood of a ZINB model for a vector of counts y parametrized as follows:
$$\begin{array}{*{20}{l}} {{\mathrm{ln}}(\mu )} \hfill & = \hfill & {A_\mu a_\mu + B_\mu b + C_\mu ,} \hfill \\ {{\mathrm{logit}}(\pi )} \hfill & = \hfill & {A_\pi a_\pi + B_\pi b + C_\pi ,} \hfill \\ {{\mathrm{ln}}(\theta )} \hfill & = \hfill & {C_\theta .} \hfill \end{array}$$We give more details on how to solve solveZinbRegression in the next section. To solve (10) for cell i we call solveZinbRegression with the following parameters:
$$\left\{ \begin{array}{l}a_\mu = \gamma _\mu [.,i]\\ a_\pi = \gamma _\pi [.,i]\\ b = W[i,.]^ \top \\ y = Y[i,]^ \top \\ A_\mu = V_\mu \\ B_\mu = \alpha _\mu ^ \top \\ C_\mu = \left( {X_\mu [i,.]\beta _\mu + O_\mu [i,.]} \right)^ \top \\ A_\pi = V_\pi \\ B_\pi = \alpha _\pi ^ \top \\ C_\pi = \left( {X_\pi [i,.]\beta _\pi + O_\pi [i,.]} \right)^ \top \\ C_\theta = \zeta \end{array} \right..$$ 
3.
Rightfactor (genespecific) optimization:
$$\left( {\hat \beta ,\hat \alpha } \right) \leftarrow \mathop {{{\mathrm{argmax}}}}\limits_{\left( {\beta ,\alpha } \right)} \left\{ {\ell \left( {\beta ,\hat \gamma ,\hat W,\alpha ,\hat \zeta } \right)  \frac{{{\it{\epsilon }}_\beta }}{2}\left\ {\beta ^0} \right\^2  \frac{{{\it{\epsilon }}_\alpha }}{2}\left\ \alpha \right\^2} \right\}.$$Note that this optimization can be performed independently and in parallel for each gene j = 1,…, J, by calling
solveZinbRegression
with the following parameters:
$$\left\{ {\begin{array}{*{20}{l}} {a_\mu } \hfill & { = \left( {{\mathrm{\beta }}_{\mathrm{\mu }}[.,{\mathrm{j}}];{\mathrm{\alpha }}_\mu [.,{\mathrm{j}}]} \right)} \hfill \\ {a_\pi } \hfill & { = \left( {{\mathrm{\beta }}_{\mathrm{\pi }}[.,{\mathrm{j}}];{\mathrm{\alpha }}_{\mathrm{\pi }}[.,{\mathrm{j}}]} \right)} \hfill \\ b \hfill & { = \emptyset } \hfill \\ y \hfill & { = {\mathrm{Y}}[.,{\mathrm{j}}]} \hfill \\ {A_\mu } \hfill & { = \left[ {{\mathrm{X}}_{\mathrm{\mu }},{\mathrm{W}}} \right]} \hfill \\ {B_\mu } \hfill & { = \emptyset } \hfill \\ {C_\mu } \hfill & { = \left( {{\mathrm{V}}_{\mathrm{\mu }}\left[ {{\mathrm{j}},.} \right]{\mathrm{\gamma }}_{\mathrm{\mu }}} \right)^ \top + {\mathrm{O}}_{\mathrm{\mu }}[.,{\mathrm{j}}]} \hfill \\ {A_\pi } \hfill & { = [{\mathrm{X}}_{\mathrm{\pi }},{\mathrm{W}}]} \hfill \\ {B_\pi } \hfill & { = \emptyset } \hfill \\ {C_\pi } \hfill & { = \left( {{\mathrm{V}}_{\mathrm{\pi }}\left[ {{\mathrm{j}},.} \right]{\mathrm{\gamma }}_{\mathrm{\pi }}} \right)^ \top + {\mathrm{O}}_{\mathrm{\pi }}[.,{\mathrm{j}}]} \hfill \\ {C_\theta } \hfill & { = {\mathrm{\zeta }}_{\mathrm{j}}{\bf{1}}_n} \hfill \end{array}} \right..$$ 
4.
Orthogonalization:
This is obtained by applying Lemma 1, starting from an SVD decomposition of the current \(\hat W\hat \alpha\). Note that this step not only allows to maximize locally the penalized loglikelihood, but also ensures that the columns of W stay orthogonal to each other during optimization.
Solving
solveZinbRegression Given a Ndimensional matrix of counts \(y \in {\Bbb N}^N\), and matrix \(A_\mu \in {\Bbb R}^{N \times p}\), \(B_\mu \in {\Bbb R}^{N \times r}\), \(C_\mu \in {\Bbb R}^N\), \(A_\pi \in {\Bbb R}^{N \times q}\), \(B_\pi \in {\Bbb R}^{N \times r}\), \(C_\pi \in {\Bbb R}^N\), and \(C_\theta \in {\Bbb R}^N\), for some integers p,q,r, the function attempts to find parameters \((a_\mu ,a_\pi ,b) \in {\Bbb R}^p \times {\Bbb R}^q \times {\Bbb R}^r\) that maximize the ZINB loglikelihood of y with parameters:
Starting from an initial guess (as explained in the different steps above), we perform a local minimization of this function F(a_{μ}, a_{π}, b) using the Broyden–Fletcher–Goldfarb–Shanno (BFGS) quasiNewton method. Let us now give more details on how the gradient of F is computed.
Given a single count y (i.e., N = 1), we first explicit the derivatives of the loglikelihood of y with respect to the (μ,π) parameters of the ZINB distribution. We first observe that
and that by definition of the ZINB distribution the following holds:
Let us explicit these expressions, depending on whether or not y is null:

If y > 0, then δ_{0}(y) = 0 and f_{ZINB}(y;μ, θ, π) = (1 − π)f_{NB}(y; μ, θ), so we obtain:
$$\begin{array}{l}\frac{\partial }{{\partial \mu }}{\mathrm{ln}}f_{\mathrm{ZINB}}(y;\mu ,\theta ,\pi ) = \frac{y}{\mu }  \frac{{y + \theta }}{{\mu + \theta }},\\ \frac{\partial }{{\partial \pi }}{\mathrm{ln}}f_{\mathrm{ZINB}}(y;\mu ,\theta ,\pi ) = \frac{{  1}}{{1  \pi }}.\end{array}$$ 
If y = 0, then δ_{0}(y) = 1, and we get
When N ≥ 1, using standard calculus for the differentiation of compositions and the facts that:
we finally get that
where G and H are the Ndimensional vectors given by
Simulated data sets
Simulating from the ZINBWaVE model: In order to simulate realistic data, we fitted our ZINBWaVE model to two real data sets (V1 and S1/CA1) and used the resulting parameter estimates as the truth to be estimated in the simulation. Genes that did not have at least five reads in at least five cells were filtered out and J = 1000 genes were then sampled at random for each data set. The ZINBWaVE model was fit to the count matrix Y with the number of unknown celllevel covariates set to K = 2, genewise dispersion (ζ = lnθ = −lnϕ), X_{μ}, X_{π}, V_{μ}, and V_{π} as columns of ones (i.e., intercept only), and no offset matrices, to get estimates for W, α_{μ}, α_{π}, β_{μ}, β_{π}, γ_{μ}, γ_{π}, and ζ. The parameters which were varied in the simulations are the number of cells, the proportion of zero counts, and the ratio of within to betweencluster sums of squares for W. Details on the parameter choices follow.
The number of cells was set to n = 100; 1000; 10,000.
The proportion of zero counts, \(zfrac = \mathop {\sum }\limits_{i,j} 1(Y_{ij} = 0){\mathrm{/}}nJ\), was set via the parameter γ_{π}: zfrac ≈ 0.25, 0.50, 0.75. As logit(π) = Xβ_{π} + (Vγ_{π})^{T} + Wα_{π}, the value of γ_{π} is directly linked to the dropout probability π, thus to the zero fraction. Note that by changing only γ_{π} but not γ_{μ}, we change the dropout rate but not the underlying, unobserved mean expression, i.e., this varies the number of technical zeros but not biological zeros.
The ratio of within to betweencluster sums of squares for W was set in the following way. Let C denote the number of clusters and n_{c} the number of cells in cluster c. For a given column of W (out of K columns), let W_{ic} denote the value for cell i in cluster c, \(\bar W\) the overall average across all n cells, \(\bar W_c\) the average for cells in cluster c, and TSS the total sum of squares. Then,
with \(\mathrm{WSS} = \mathop {\sum }\nolimits_{c = 1}^C \mathop {\sum }\nolimits_{i = 1}^{n_c} \left( {W_{ic}  \bar W_c} \right)^2\) and \(\mathrm{BSS} = \mathop {\sum }\nolimits_{c = 1}^C n_c\left( {\bar W_c  \bar W} \right)^2\) the within and betweencluster sums of squares, respectively. The level of difficulty of the clustering problem can be controlled by the ratio of within to betweencluster sums of squares. However, we want to keep the overall mean \(\bar W\) and overall variance (i.e., TSS) constant, so that the simulated values of W stay in the same range as the estimated W from the real data set; this prevents us from simulating an unrealistic count matrix Y.
Let us scale the betweencluster sum of squares by a^{2} and the withincluster sum of squares by a^{2}b^{2}, i.e., replace \(\left( {\bar W_c  \bar W} \right)\) by \(a\left( {\bar W_c  \bar W} \right)\) and \(\left( {W_{ic}  \bar W_c} \right)\) by \(ab\left( {W_{ic}  \bar W_c} \right)\), with a ≥ 0 and b ≥ 0 such that TSS and \(\bar W\) are constant. The total sum of squares TSS remains constant, i.e.,
provided
Requiring the overall mean \(\bar W\) to remain constant implies that
where the * superscript refers to the transformed W. Thus,
The above transformation results in a scaling of the ratio of within to betweencluster sums of squares by b^{2}, while keeping the overall mean and variance constant.
In our simulations, we fixed C = 3 clusters and considered three values for b^{2}, where the same value of b^{2} is applied to each of the K columns of W: b^{2} = 1, corresponding to the case where the within and betweencluster sums of squares are the same as the ones of the fitted W from the real data set; b^{2} = 5, corresponding to a larger ratio of within to betweencluster sums of squares and hence a harder clustering problem; b^{2} = 10, corresponding to a very large ratio of within to betweencluster sums of squares and hence almost no clustering.
Overall, 2 (real data sets)) × 3 (n) × 3 (zfrac)) × 3 (clustering) = 54 scenarios were considered in the simulation.
For each of the 54 scenarios, we simulated B = 10 data sets, resulting in a total of 54 × 10 = 540 data sets. Using the fitted W, α_{μ}, α_{π}, β_{μ}, β_{π}, γ_{μ}, γ_{π}, and ζ from one of the two real data sets, the data sets were simulated according to the following steps.

1.
Simulate W with desired clustering strength. First fit a Kvariate Gaussian mixture distribution to W inferred from one of the real data sets using the R function Mclust from the mclust package and specifying the number of clusters C. Then, for each of B = 10 data sets, simulate W cluster by cluster from Kvariate Gaussian distributions using the mvrnorm function from the MASS package, with the cluster means, covariance matrices, and frequencies output by Mclust. Transform W as in Eq. (11) to get the desired ratio of within to betweencluster sums of squares.

2.
Simulate γ_{μ} and γ_{π} to get the desired zero fraction. We only considered celllevel intercept nvectors γ_{μ} and γ_{π}, i.e., L = 1 and a matrix V of genelevel covariates consisting of a single column of ones. As the fitted γ_{μ} and γ_{π} from the original data sets are correlated nvectors, fit a bivariate Gaussian distribution to γ_{μ} and γ_{π} using the function Mclust from the mclust package with C = 1 cluster. Then, for each of B = 10 data sets, simulate γ_{μ} and γ_{π} from a bivariate Gaussian distribution using the mvrnorm function from the MASS package, with the mean and covariance matrix output by Mclust. To increase/decrease the zero fraction, increase/decrease each element of the mean vector for γ_{π} inferred from Mclust (shifts of {0,2,5} for V1 data set and {−1.5, 0.5, 2} for S1/CA1 data set).

3.
Create the ZINBWaVE model using the function zinbModel from the package zinbwave.

4.
Simulate counts using the function zinbSim from the package zinbwave.
Simulating from the Lun & Marioni^{42} model: To simulate data sets from a different model than our ZINBWaVE model, we simulated counts using the procedure described in Lun & Marioni^{42} (details in Supplementary Materials of original publication and code available from the Github repository https://github.com/MarioniLab/PlateEffects2016). Although the Lun & Marioni^{42} model is also based on a ZINB distribution, the distribution is parameterized differently and also fit differently, gene by gene. In particular, the negative binomial mean is parameterized as a product of the expression level of interest and two nuisance technical effects, a genelevel effect assumed to have a lognormal distribution and a celllevel effect (cf. library size) whose distribution is empirically derived. The zeroinflation probability is assumed to be constant across cells for each gene and is estimated independently of the negative binomial mean. The ZINB distribution is fit gene by gene using the zeroinfl function from the R package pscl. We used the raw genelevel read counts for the mESC data set as input to the script reference/submitter.sh, to create a simulation function constructed by fitting a ZINB distribution to these counts. The script simulations/submitter.sh was then run to simulate counts based on the estimated parameters (negative binomial mean, zero inflation probability, and genewise dispersion parameter). We simulated C = 3 clusters with equal number of cells per cluster. The parameters which were varied in the simulations are as follows.
The number of cells was set to n = 100; 1000; 10,000.
The proportion of zero counts, \(\mathrm{zfrac} = \mathop {\sum }\nolimits_{i,j} 1(Y_{ij} = 0){\mathrm{/}}nJ\), was set via the zero inflation probability: zfrac ≈ 0.4, 0.6, 0.8. For zfrac = 0.4, we did not modify the code in Lun & Marioni^{42}. However, to simulate data sets with greater zero fractions, namely zfrac = 0.6 and zfrac = 0.8, we added respectively 0.3 and 0.6 to the zeroinflation probability (\(p_i^\prime\), in their notation).
For each of the 3 (n)) × 3 (zfrac) = 9 scenarios, we simulated B (B = 10) data sets, resulting in 9) × 10 = 90 simulated data sets in total.
Dimensionality reduction methods
Three different dimensionality reduction methods were applied to the real and simulated data sets: ZINBWaVE, zeroinflated factor analysis, and PCA. For all the methods, we selected K = 2 dimensions, unless specified otherwise. A notable exception is the S1/CA1 data set, for which, given the large number of cells and the complexity of the signal, we specified K = 3 dimensions.
ZINBWaVE: We applied the ZINBWaVE procedure using the function zinbFit from our R package zinbwave, with the following parameter choices. Number of unknown celllevel covariates K: K = 1, 2, 3, 4. Genelevel covariate matrix V: not included or set to a column of ones 1_{J}. Celllevel covariate matrix X: set to a column of ones 1_{n}. For the mESC data set, we also considered including batch covariates in X. Dispersion parameter ζ: the same for all genes (common dispersion) or specific to each gene (genewise dispersion).
Zeroinflated factor analysis: We used the zeroinflated factor analysis (ZIFA) method^{26}, as implemented in the ZIFA python package (Version 0.1) available at https://github.com/epierson9/ZIFA, with the block algorithm (function block_ZIFA.fitModel, with default parameters). The output of ZIFA is an n × K matrix corresponding to a projection of the counts onto a latent lowdimensional space of dimension K.
PCA: We used the function prcomp from the R package stats for the simulation study and, for computational efficiency, the function jsvds from the R package rARPACK for the real data sets.
Normalization methods
As normalization is essential, especially for zeroinflated distributions, PCA and ZIFA were applied to both raw and normalized counts. The following normalization methods were used.
Total count normalization (TC): Counts are divided by the total number of reads sequenced for each sample and multiplied by the mean total count across all the samples. This method is related to the popular transcripts per million (TPM)^{52} and fragments per kilobase million (FPKM)^{53} methods.
Fullquantile normalization (FQ)^{54}: The quantiles of the distributions of the genelevel read counts are matched across samples. We used the function between Lane Normalization from the Bioconductor R package EDASeq.
Trimmed mean of M values (TMM)^{55}: The TMM globalscaling factor is computed as the weighted mean of logratios between each sample and a reference sample. If the majority of the genes are not differentially expressed (DE), TMM should be close to 1, otherwise, it provides an estimate of the correction factor that must be applied in order to fulfill this hypothesis. We used the function calcNormFactors from the Bioconductor R package edgeR to compute these scaling factors.
Clustering methods
Clustering of the OE data set: We used the resamplingbased sequential ensemble clustering (RSEC) framework implemented in the RSEC function from the Bioconductor R package clusterExperiment^{56}. Briefly, RSEC implements a consensus clustering algorithm which generates and aggregates a collection of clusterings, based on resampling cells and using a sequential tight clustering algorithm^{57}. See Fletcher et al.^{29} for details on the parameters used for RSEC in the original analysis and Perraudeau et al.^{31} for details on the parameters used for RSEC in the ZINBWaVE workflow.
Clustering of the 10× Genomics 68k PBMCs data set: We used two different clustering methods to assess the ability of ZINBWaVE to extract biologically meaningful signal from the data.
First, we used a a clustering procedure similar to the one implemented in the R package Seurat^{34} (Version 2.0.1). In particular, we used ZINBWaVE (K = 10) instead of PCA for dimensionality reduction. The clustering is based on shared nearest neighbor modularity^{33}. We used the following parameters: k.param=10, k.scale=10, resolution=0.6. All the other parameters were left at their default values.
We also clustered the data using a sequential kmeans clustering approach^{57}, implemented in the clusterSingle function of the clusterExperiment package^{56}. We used the following parameters: sequential=TRUE, subsample=FALSE, k0=15, beta=0.95, clusterFunction="kmeans".
Models for count data
We compared the goodnessoffit of our ZINBWaVE model to that of two other models for count data: a standard negative binomial model, that does not account for zero inflation, and the MAST hurdle model, that is specifically designed for scRNAseq data^{23}.
Goodnessoffit was assessed on the V1 data set using meandifference plots (MDplots) of estimated vs. observed mean count and zero probability, as well as plots of the estimated dispersion parameter against the observed zero frequency.
ZINBWaVE model: For our ZINBWaVE model, the overall mean and zero probability are
ZINBWaVE was fit to the V1 data set using the function zinbFit from our R package zinbwave, with the following parameter choices: K = 0 unknown celllevel covariates, genelevel intercept (X = 1_{n}), celllevel intercept (V = 1_{J}), and genewise dispersion.
Negative binomial model: The negative binomial (NB) distribution is a special case of the ZINB distribution that does not account for zero inflation, i.e., for which π = 0. Thus,
A NB distribution was fit gene by gene to the V1 data set, after fullquantile normalization, using the Bioconductor R package edgeR^{41} (Version 3.16.5) with only an intercept (i.e., default value for the design argument as a single column of ones) and genewise dispersion.
Modelbased analysis of singlecell transcriptomics: The modelbased analysis of singlecell transcriptomics (MAST) hurdle model proposed by Finak et al.^{23} is defined as follows:
where Y_{ij} is log2(TPM +1) for cell i and gene j, \(X_i \in {\Bbb R}^k\) is a known covariate vector, and Z_{ij} indicates whether gene j is truly expressed in cell i. For each gene j, the parameters of the MAST model are: the regression coefficients \(\beta _j^D \in {\Bbb R}^k\) for the discrete part and the regression coefficients \(\beta _j^C \in {\Bbb R}^k\) and variance \(\sigma _j^2 \in {\Bbb R}\) for the Gaussian continuous part. Note that we follow the notation in Finak et al.^{23}, which is different from that in the ZINBWaVE model of Eq. (4).
For the MAST model, the overall mean and zero probability are
MAST is implemented in the Bioconductor R package MAST^{58} (which allows different covariate vectors X_{i} for the continuous and discrete components). We use the function zlm to fit MAST with an intercept and a covariate for the cellular detection rate (as recommended in the MAST vignette for the MAIT data analysis) for both the discrete and continuous parts.
Note that in contrast to the NB and ZINB models, the MAST hurdle model is for log2(TPM+1) instead of counts and has no dispersion parameter. To be able to compare the fits of the three models, the MAST goodnessoffit plots display estimated vs. observed mean log2(TPM+1) and estimated variance \(\sigma _j^2\) vs. observed zero frequency (Supplementary Fig. 21). Although not a direct comparison, we think this allows a fair assessment of the goodnessoffit of the different models.
Evaluation criteria
Clustering: Silhouette width: Given a set of labels for the cells (e.g., biological condition, batch), silhouette widths provide a measure of goodness of the clustering of the cells with respect to these labels. Silhouette widths may be averaged within clusters or across all observations to assess clustering strength at the level of clusters or overall, respectively. The silhouette width s_{i} of a sample i is defined as follows:
where \(a_i = d\left( {i,{\cal C}_{cl(i)}} \right)\), \(b_i = {\mathrm{min}}_{l \ne cl(i)}d\left( {i,{\cal C}_l} \right)\), \({\cal C}_{cl(i)}\) is the cluster to which i belongs, and \(d(i,{\cal C}_l)\) is the average distance between sample i and the samples in cluster \({\cal C}_l\).
Average silhouette widths were used to compare ZINBWaVE, PCA, and ZIFA on both real and simulated data sets. For simulated data, the cluster labels correspond to the true simulated W and, for each scenario, silhouette widths were computed and averaged over B data sets. For real data, the authors’ cluster labels or known cell types were used.
Clustering: Precision and recall: Two different clusterings (i.e., partitions) may be compared quantitatively using the precision and recall coefficients. These measures involve assessing whether pairs of cells cluster together in each of the two clusterings. Let YY (respectively NY; YN; and NN) be the total number of pairs of cells which are in the same cluster in both Clustering 1 and Clustering 2 (respectively, in different clusters in Clustering 1 but in the same cluster in Clustering 2; in the same cluster in Clustering 1 but in different clusters in Clustering 2; in different clusters in both Clustering 1 and Clustering 2). Then, using Clustering 1 as a reference, the precision coefficient is defined as the proportion of cells clustered together in Clustering 2 which are also (correctly) clustered together in Clustering 1
Similarly, the recall coefficient is defined as the proportion of cells (correctly) clustered together in Clustering 1 which are also clustered together in Clustering 2
Precision and recall were used to compare ZINBWaVE, PCA, and ZIFA on simulated data sets, where the reference Clustering 1 corresponds to the true simulated labels and Clustering 2 was performed using the R function kmeans in the reduceddimensional space: the first two columns of W for ZINBWaVE, the first two principal components for PCA, and the first two latent variables for ZIFA. We simulated 3 clusters and set the number of clusters equal to 3 in kmeans (argument centers). Precision and recall were computed using the extCriteria function from the R package clusterCrit^{59}.
Dimensionality reduction: Correlation with QC measures: To evaluate the dependence of the inferred lowdimensional signal on unwanted variation, we computed the absolute correlation between each dimension (e.g., principal component) and a set of quality control (QC) measures.
For the V1 data set, FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and Picard tools (https://broadinstitute.github.io/picard/) were used to compute a set of 16 QC measures. These measures are available as part of the Bioconductor R package scRNAseq (https://bioconductor.org/packages/scRNAseq). For the glioblastoma and mESC data sets, we used the scater Bioconductor R package^{60} (Version 1.2.0) to compute a set of 7 QC measures. For the S1/CA1 data set, we used a set of 6 QC measures provided by the authors (http://linnarssonlab.org/cortex/).
Dimensionality reduction: Correlation of pairwise distances between observations: For simulated data, we assessed different dimensionality reduction methods (ZINBWaVE, PCA, and ZIFA) in terms of the correlation between pairwise distances between observations in the true and in the estimated reduceddimensional space. In particular, we monitored the influence of the number of unknown celllevel covariates K when the other parameters are set correctly (V is a column of ones 1_{J} and genewise dispersion). For each simulation scenario, the correlation between true and estimated pairwise distances was computed and averaged over B data sets.
Bias and MSE of the ZINBWaVE estimators: For data simulated from the ZINBWaVE model, let θ and \(\hat \theta _b\), b = 1,…, B, respectively denote the true parameter and an estimator of this parameter for the bth simulated data set. Then, for each scenario, the performance of the estimator \(\hat \theta\) can be assessed in terms of bias and mean squared error (MSE) as follows:
Bias and MSE were computed for β_{μ}, β_{π}, γ_{μ}, γ_{π}, ζ, Wα_{μ}, Wα_{π}, ln(μ), and logit(π). When the parameter to be estimated was an n × J matrix, the matrix was converted to a 1 × nJ row vector and bias and MSE were averaged over the elements of the vector.
Model selection for ZINBWaVE: The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are widely used for model selection and are defined as follows:
where \(\ell \left( {\hat \beta ,\hat \gamma ,\hat W,\hat \alpha ,\hat \zeta } \right)\) is the loglikelihood function evaluated at the MLE (Eq. (5)), n is the sample size, and N is the total number of estimated parameters, i.e., N = J(M_{μ} + M_{π}) + n(L_{μ} + L_{π}) + 2KJ + nK + J when the model is fit with genewise dispersion and N = J(M_{μ} + M_{π}) + n(L_{μ} + L_{π}) + 2KJ + nK + 1 when the model is fit with common dispersion.
We use AIC and BIC in the simulation to select the number of unknown celllevel covariates (K).
Note that because of the complex nonconvex likelihood function for the ZINBWaVE model, there is no closedform expression for the MLE and our numerical optimization procedure only provides an approximation of the AIC and BIC. In practice, however, we found that our results closely approximated the true MLE (see Results and Supplementary Fig. 22).
Real data sets
V1 data set: Tasic et al.^{3} characterized more than 1600 cells from the primary visual cortex (V1) in adult male mice, using a set of established Cre lines. Single cells were isolated by FACS into 96well plates and RNA was reverse transcribed and amplified using the SMARTer kit. Sequencing was performed using the Illumina HiSeq platform, yielding 100 bplong reads. We selected a subset of three Cre lines, Ntsr1Cre, Rbp4Cre, and Scnn1aTg3Cre, that label layer 4, layer 5, and layer 6 excitatory neurons, respectively. This subset consists of 379 cells, grouped by the authors into 17 clusters; we excluded the cells that did not pass the authors’ quality control filters and that were classified by the authors as “intermediate” cells between two clusters, retaining a total of 285 cells. Gene expression was quantified by genelevel read counts. Raw genelevel read counts and QC metrics (see below) are available as part of the scRNAseq Bioconductor R package (https://bioconductor.org/packages/scRNAseq). We applied the dimensionality reduction methods to the 1000 most variable genes.
S1/CA1 data set: Zeisel et al.^{4} characterized 3005 cells from the primary somatosensory cortex (S1) and the hippocampal CA1 region, using the Fluidigm C1 microfluidics cell capture platform followed by Illumina sequencing. Gene expression was quantified by UMI counts. In addition to gene expression measures, we have access to metadata that can be used to assess the methods: batch, sex, number of mRNA molecules. Raw UMI counts and metadata were downloaded from http://linnarssonlab.org/cortex/.
mESC data set: Kolodziejczyk et al.^{37} sequenced the transcriptome of 704 mouse embryonic stem cells (mESCs), across three culture conditions (serum, 2i, and a2i), using the Fluidigm C1 microfluidics cell capture platform followed by Illumina sequencing. We selected only the cells from the second and third batch, after excluding the samples that did not pass the authors’ QC filtering. This allowed us to have cells from each culture condition in each batch and resulted in a total of 169 serum cells, 141 2i cells, and 159 a2i cells. In addition to gene expression measures, we have access to batch and plate information that can be included as covariates in our model. Raw genelevel read counts were downloaded from http://www.ebi.ac.uk/teichmannsrv/espresso/. Batch and plate information was extracted from the sample names, as done in Lun & Marioni^{42}. We applied the dimensionality reduction methods to the 1000 most variable genes.
Glioblastoma data set: Patel et al.^{6} collected 672 cells from five dissociated human glioblastomas. Transcriptional profiles were generated using the SMARTSeq protocol. We analyzed only the cells that passed the authors’ QC filtering. The raw data were downloaded from the NCBI GEO database (accession GSE57872). Reads were aligned using TopHat with the following parameters: –rglibrary Illumina –rgplatform Illumina –keepfastaorder G N 3 –readeditdist 3 –nocoveragesearch x 1 M p 12. Counts were obtained using htseqcount with the following parameters (http://wwwhuber.embl.de/HTSeq/doc/count.html): a 10 q s no m union. We applied the dimensionality reduction methods to the 1000 most variable genes.
OE data set: Fletcher et al.^{29} characterized 849 FACSpurified cells from the mouse OE, using the Fluidigm C1 microfluidics cell capture platform followed by Illumina sequencing. Genelevel read counts were downloaded from GEO (GSE95601; file GSE95601_oeHBCdiff_Cufflinks_eSet_counts_table.txt.gz). As done in Perraudeau et al.^{31}, we filtered the cells that exhibited poor sample quality using SCONE^{36} (v. 1.1.2). A total of 747 cells passed this filtering procedure. To compare with the original results, we also reanalyze the final repertoire of 13 stable clusters found in Fletcher et al.^{29}, consisting of 616 cells, downloaded from https://github.com/rufletch/p63HBCdiff. See Fletcher et al.^{29} for details on the original analysis and Perraudeau et al.^{31} for details on the ZINBWaVE based workflow.
10× Genomics 68k PBMCs data set: Zheng et al.^{32} characterized 68,579 peripheral blood mononuclear cells (PBMCs) from a healthy donor. The genelevel UMI counts were downloaded from https://www.10xgenomics.com/singlecell/ using the cellrangerRkit R package (Version 1.1.0). We applied the dimensionality reduction methods to the 1000 most variable genes.
Data availability
The ZINBWaVE method is implemented in the opensource R package zinbwave, available as part of the Bioconductor Project (https://bioconductor.org/packages/zinbwave). See the package vignette for a detailed example of a typical use. The code to reproduce all the analyses and figures of this article is available at https://github.com/drisso/zinb_analysis.
Change history
04 February 2019
The original PDF version of this Article contained errors in two equations. In Eq. (1), all Γ symbols were inadvertently omitted. In the second equation in the subsection entitled ‘1. Dispersion optimization’ within the Methods section ‘ZINBWaVE estimation procedure’, all Ψ symbols were inadvertently omitted. These errors have been corrected in the PDF version of the Article; the HTML version was correct from the time of publication.
References
Kolodziejczyk, A. A., Kim, J. K., Svensson, V., Marioni, J. C. & Teichmann, S. A. The technology and biology of singlecell RNA sequencing. Mol. Cell 58, 610–620 (2015).
Macosko, E. Z. et al. Highly parallel genomewide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
Tasic, B. et al. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat. Neurosci. 19, 335–346 (2016).
Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by singlecell RNAseq. Science 347, 1138–1142 (2015).
Deng, Q., Ramsköld, D., Reinius, B. & Sandberg, R. Singlecell RNAseq reveals dynamic, random monoallelic gene expression in mammalian cells. Science 343, 193–196 (2014).
Patel, A. P. et al. Singlecell RNAseq highlights intratumoral heterogeneity in primary glioblastoma. Science 344, 1396–1401 (2014).
Bacher, R. & Kendziorski, C. Design and computational analysis of singlecell RNAsequencing experiments. Genome Biol. 17, 1 (2016).
Kharchenko, P. V., Silberstein, L. & Scadden, D. T. Bayesian approach to singlecell differential expression analysis. Nat. Methods 11, 740–742 (2014).
Islam, S. et al. Quantitative singlecell RNAseq with unique molecular identifiers. Nat. Methods 11, 163–166 (2014).
Tung, P.Y. et al. Batch effects and the effective design of singlecell gene expression studies. Sci. Rep. 7, 39921 (2017).
Vallejos, C. A., Risso, D., Scialdone, A., Dudoit, S. & Marioni, J. C. Normalizing singlecell RNA sequencing data: challenges and opportunities. Nat. Methods 14, 565–571 (2017).
Marinov, G. K. et al. From singlecell to cellpool transcriptomes: stochasticity in gene expression and RNA splicing. Genome Res. 24, 496–510 (2014).
Pollen, A. A. et al. Lowcoverage singlecell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat. Biotechnol. 32, 1053–1058 (2014).
Buettner, F. et al. Computational analysis of celltocell heterogeneity in singlecell RNAsequencing data reveals hidden subpopulations of cells. Nat. Biotechnol. 33, 155–160 (2015).
Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381–386 (2014).
Ji, Z. & Ji, H. TSCAN: Pseudotime reconstruction and evaluation in singlecell RNAseq analysis. Nucleic Acids Res. 44, e117 (2016).
Shin, J. et al. Singlecell RNASeq with waterfall reveals molecular cascades underlying adult neurogenesis. Cell Stem Cell 17, 360–372 (2015).
Campbell, K., Ponting, C. P. & Webber, C. Laplacian eigenmaps and principal curves for high resolution pseudotemporal ordering of singlecell RNAseq profiles. bioRxiv https://doi.org/10.1101/027219 027219 (2015).
Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Spatial reconstruction of singlecell gene expression data. Nat. Biotechnol. 33, 495–502 (2015).
Shalek, A. K. et al. Singlecell RNAseq reveals dynamic paracrine control of cellular variation. Nature 510, 363–369 (2014).
Gaublomme, J. T. et al. Singlecell genomics unveils critical regulators of Th17 cell pathogenicity. Cell 163, 1400–1412 (2015).
Hicks, S. C., Teng, M. & Irizarry, R. A. On the widespread and critical impact of systematic bias and batch effects in singlecell RNASeq data. bioRxiv https://doi.org/10.1101/025528 025528 (2015).
Finak, G. et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in singlecell RNA sequencing data. Genome Biol. 16, 278 (2015).
Belkin, M. & Niyogi, P. Laplacian Eigenmaps for dimensionality reduction and data representation. Neural Comput. 15, 1373–1396 (2003).
Van Der Maaten, L. & Hinton, G. Visualizing data using tSNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Pierson, E. & Yau, C. Dimensionality reduction for zeroinflated single cell gene expression analysis. Genome Biol. 16, 241 (2015).
GagnonBartsch, J. a. & Speed, T. P. Using control genes to correct for unwanted variation in microarray data. Biostatistics 13, 539–552 (2012).
Risso, D., Ngai, J., Speed, T. P. & Dudoit, S. Normalization of RNAseq data using factor analysis of control genes or samples. Nat. Biotechnol. 32, 896–902 (2014).
Fletcher, R. B. et al. Deconstructing olfactory stem cell trajectories at singlecell resolution. Cell Stem Cell 20, 817–830 (2017).
Street, K. et al. Slingshot: cell lineage and pseudotime inference for singlecell transcriptomics. bioRxiv 128843 (2017).
Perraudeau, F., Risso, D., Street, K., Purdom, E. & Dudoit, S. Bioconductor workflow for singlecell rna sequencing: Normalization, dimensionality reduction, clustering, and lineage inference. F1000Research 6, 1158 (2017).
Zheng, G. X. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
Waltman, L. & van Eck, N. J. A smart local moving algorithm for largescale modularitybased community detection. Eur. Phys. J. B 86, 471 (2013).
Satija, R., Butler, A. & Hoffman, P. Seurat: Tools for Single Cell Genomics (2017).
Villani, A.C. et al. Singlecell rnaseq reveals new types of human blood dendritic cells, monocytes, and progenitors. Science 356, eaah4573 (2017).
Cole, M. & Risso, D. Scone: Single Cell Overview of Normalized Expression Data (2017).
Kolodziejczyk, A. A. et al. Single cell RNAsequencing of pluripotent states unlocks modular transcriptional variation. Cell. Stem. Cell. 17, 471–485 (2015).
Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
Leek, J. T., Johnson, W. E., Parker, H. S., Jaffe, A. E. & Storey, J. D. The sva package for removing batch effects and other unwanted variation in highthroughput experiments. Bioinformatics 28, 882–883 (2012).
Townes, F. W., Hicks, S. C., Aryee, M. J. & Irizarry, R. A. Varyingcensoring aware matrix factorization for single cell RNAsequencing. bioRxiv 166736 (2017).
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139 (2010).
Lun, A. T. L. & Marioni, J. C. Overcoming confounding plate effects in differential expression analyses of singlecell RNAseq data. Biostatistics 18, 451–464 (2016).
McCarthy, D. J., Campbell, K. R., Lun, A. T. L. & Wills, Q. F. Scater: preprocessing, quality control, normalisation and visualisation of singlecell RNAseq data in R. Bioinformatics 33, 1179–1186 (2016).
Leek, J. T. & Storey, J. D. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3, e161 (2007).
Lin, Z. et al. Simultaneous dimension reduction and adjustment for confounding variation. Proc. Natl Acad. Sci. USA 113, 14662–14667 (2016).
Love, M. I., Hogenesch, J. B. & Irizarry, R. A. Modeling of RNAseq fragment sequence bias reduces systematic errors in transcript abundance estimation. Nat. Biotechnol. 34, 1287 (2016).
Regev, A. et al. The human cell atlas. bioRxiv 121202 (2017).
Wang, B., Zhu, J., Pierson, E., Ramazzotti, D. & Batzoglou, S. Visualization and analysis of singlecell RNAseq data by kernelbased similarity learning. Nat. Methods 14, 414–416 (2017).
Van den Berge, K., Soneson, C., Love, M. I., Robinson, M. D. & Clement, L. ZingeR: unlocking rnaseq tools for zeroinflation and single cell applications. bioRxiv 157982 (2017).
Srebro, N., Rennie, J. D. M. & Jaakkola, T. S. Maximummargin matrix factorization. Adv. Neural Inf. Process. Syst. 17, 1329–1336 (2005).
Mazumder, R., Hastie, T. & Tibshirani, R. Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11, 2287–2322 (2010).
Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNASeq data with or without a reference genome. BMC Bioinformatics 12, 323 (2011).
Trapnell, C. et al. Transcript assembly and quantification by RNASeq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).
Bullard, J. H., Purdom, E., Hansen, K. D. & Dudoit, S. Evaluation of statistical methods for normalization and differential expression in mRNASeq experiments. BMC Bioinformatics 11, 94 (2010).
Robinson, M. D. & Oshlack, A. A scaling normalization method for differential expression analysis of RNAseq data. Genome Biol. 11, R25 (2010).
Purdom, E. & Risso, D. clusterExperiment: Compare Clusterings for SingleCell Sequencing (2017).
Tseng, G. C. & Wong, W. H. Tight clustering: A resamplingbased approach for identifying stable and tight patterns in data. Biometrics 61, 10–16 (2005).
McDavid, A., Finak, G. & Yajima, M. MAST: ModelBased Analysis of Single Cell Transcriptomics (2017).
Desgraupes, B. clusterCrit: Clustering Indices (2016).
Mccarthy, D. J., Campbell, K. R., Lun, A. T. L. & Wills, Q. F. Scater: preprocessing, quality control, normalization and visualization of singlecell RNAseq data in R. Bioinformatics 33, 1179–1186 (2017).
Acknowledgements
We would like to thank Olivier Mirabeau from the Curie Institute for the preprocessing of the Glioblastoma data set. We also thank Aaron Lun and the Marioni Lab for making the code needed for the simulations available online. Finally, we are grateful to Russell Fletcher for his help with the reanalysis of the OE data. D.R. and S.D. are supported by the National Institutes of Health BRAIN Initiative (Grant U01 MH105979, PI: John Ngai). S.G. and J.P.V. are supported by the French National Research Agency (Grant ABS4NGS ANR11BINF0001). J.P.V. is supported by the European Research Council (Grant ERCSMAC280032), the Miller Institute for Basic Research in Science, and the Fulbright Foundation.
Author information
Authors and Affiliations
Contributions
D.R., S.D., and J.P.V. formulated the statistical model. S.G. and J.P.V. conceived and implemented the optimization algorithm. D.R., S.G., F.P., and J.P.V. wrote the R package. D.R. performed the real data analysis. D.R., F.P., S.D., and J.P.V. designed the simulation study. F.P. performed the simulations and analyzed the simulated data. D.R., F.P., S.D., and J.P.V. wrote the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Additional information
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Risso, D., Perraudeau, F., Gribkova, S. et al. A general and flexible method for signal extraction from singlecell RNAseq data. Nat Commun 9, 284 (2018). https://doi.org/10.1038/s41467017025545
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41467017025545
This article is cited by

The shaky foundations of simulating singlecell RNA sequencing data
Genome Biology (2023)

SRTsim: spatial pattern preserving simulations for spatially resolved transcriptomics
Genome Biology (2023)

siVAE: interpretable deep generative models for singlecell transcriptomes
Genome Biology (2023)

Evaluating imputation methods for singlecell RNAseq data
BMC Bioinformatics (2023)

Multibatch singlecell comparative atlas construction by deep learning disentanglement
Nature Communications (2023)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.