scImpute: accurate and robust imputation for single cell RNA-seq data

The emerging single cell RNA sequencing (scRNA-seq) technologies enable the investigation of transcriptomic landscapes at single-cell resolution. The analysis of scRNA-seq data is complicated by excess zero or near zero counts, the so-called dropouts due to the low amounts of mRNA sequenced within individual cells. Downstream analysis of scRNA-seq would be severely biased if the dropout events are not properly corrected. We introduce scImpute, a statistical method to accurately and robustly impute the dropout values in scRNA-seq data. ScImpute automatically identifies gene expression values affected by dropout events, and only perform imputation on these values without introducing new bias to the rest data. ScImpute also detects outlier or rare cells and excludes them from imputation. Evaluation based on both simulated and real scRNA-seq data on mouse embryos, mouse brain cells, human blood cells, and human embryonic stem cells suggests that scImpute is an effective tool to recover transcriptome dynamics masked by dropout events. scImpute is shown to correct false zero counts, enhance the clustering of cell populations and subpopulations, improve the accuracy of differential expression analysis, and aid the study of gene expression dynamics.

similarity matrix of single cells. In the imputation of a single cell, the weights of the other cells are determined through the transition matrix. During the preparation of this manuscript, we also noticed another imputation method SAVER [13], which borrows information across genes using a Bayesian approach to estimate (unobserved) true expression levels of genes. Both MAGIC and SAVER would alter all gene expression levels including those not affected by dropouts, and this would potentially introduce new biases into the data and possibly eliminate meaningful biological variation. We think it is also inappropriate to treat all zero counts as missing values, since some of them may reflect true biological non-expression. Therefore, we propose a new imputation method for scRNA-seq data, scImpute, to simultaneously determine which values are affected by dropout events in data and perform imputation only on dropout entries. To achieve this goal, scImpute first learns each gene's dropout probability in each cell based on a mixture model. Next, scImpute imputes the (highly probable) dropout values in a cell by borrowing information of the same gene in other similar cells, which are selected based on the genes unlikely affected by dropout events

ScImpute recovers gene expression affected by dropout events
A key reason for performing imputation on scRNA-seq data is to recover biologically meaningful transcriptome dynamics in single cells so that we can determine cell identity and identify differentially expressed (DE) genes among different cell types. We first use three examples to illustrate scImpute's efficacy in imputing gene expressions. (All the imputation results in Section 2 are obtained without using true cell type information unless otherwise noted.) First, we show that scImpute recovers the true expression of the ERCC spike-in transcripts [14], especially low abundance transcripts that are impacted by dropout events. The ERCC spikeins are synthesized RNA molecules with known concentrations, which serve as gold standards of true expression levels, so the read counts can be compared with the true expression for accuracy evaluation. The dataset contains 3, 005 cells from the mouse somatosensory cortex region [15].
After imputation, the median correlation among the 57 transcripts' read counts and their true concentration increases from 0.92 to 0.95, and the minimum correlation increases from 0.81 to 0.89 (Figure S1). The read counts and true concentrations also present a stronger linear relationship in each single cell (Figure 2  The ERCC spike-ins' log 10(counts) and log 10(concentration) in four randomly selected mouse cortex cells. Imputed data presents a better linear relationship between the true concentration and the observed counts.
Second, we show that scImpute correctly imputes the dropout values of 892 annotated cellcycle genes in 182 embryonic stem cells (ESCs) that had been staged for cell-cycle phases (G1, S and G2M) [16]. These genes are known to modulate the cell cycle and are expected to have non-zero expression during different stages of the cell cycle. Before imputation, 22.5% raw counts of the cell-cycle genes are zeros, which are highly likely due to dropouts. After imputation, most of the dropout values are corrected, and true dynamics of these genes in the cell cycle are revealed (Figure S2 and S3). The imputed counts also better represents the true biological variation in these cell-cycle genes ( Figure 3).
Third, we use a simulation study to illustrate the efficacy of scImpute in enhancing the identification of cell types. We simulate expression data of three cell types c 1 , c 2 , and c 3 , each with 50 cells, and 810 among 20, 000 genes are truly differentially expressed (DE) (details in Methods).
Even though the three cell types are clearly distinguishable when we apply principal component analysis (PCA) to the complete data, they become less separated in the raw data with dropout events. However, the relationships among the 150 cells are clarified after we apply scImpute. The other two methods MAGIC and SAVER are also able to distinguish the three cell types, but MAGIC introduces artificial signals that largely alter the data and thus the PCA result, while SAVER only slightly improve the clustering result over that of the raw data (Figure 4). In addition, the dropout events obscure the differential pattern and thus increase the difficulty of detecting DE genes.
The imputed data by scImpute lead to a clearer comparison between the up-regulated genes in different cell types, while the imputed data by MAGIC and SAVER fail to recover this pattern (Figure 4). We also assess how the prevalence of dropout values influences the performance of scImpute. As expected, the DE analysis based on the imputed data has increased accuracy as the dropout proportion decreases. Yet scImpute still achieves > 80.0% area under the curve even when the proportion of zero count in raw data is 75.0% ( Figure S4).

ScImpute improves the identification of cell subpopulations in real data
To illustrate scImpute's capacity in aiding the identification of cell types or cell subpopulations, we apply our method to two real scRNA-seq datasets. The first one is a smaller dataset of mouse preimplantation embryos [17]. It contains RNA-seq profiles of 268 single cells from 10 developmental stages. Partly due to dropout events, 70.0% of read counts in the raw count matrix are zeros. To illustrate the dropout phenomenon, we plot the log 10-transformed read counts of two 16-cell stage cells as an example in Figure S5. Even though the two cells come from the same We compare the imputation results by investigating the clustering accuracy in the first two principal components (PCs). Although it is possible to differentiate the major developmental stages from the raw data, the imputed data by scImpute output more compact clusters ( Figure   5). MAGIC gives a clean pattern of developmental stages, but it has a high risk of removing biologically meaningful variation, given that many cells of the same stage have almost identical scores in the first two PCs. scImpute is the only method that is able to detect outlier cells. We then compare the clustering results of the spectral clustering [18] algorithm on the first two PCs. Since We also apply scImpute to a large dataset generated by the high-throughput droplet-based system [22]. The dataset contains 4, 500 peripheral blood mononuclear cells (PBMCs) of nine immune cell types, with 500 cells of each type. In the raw data, 92.6% read counts are exactly zeros. large and two small clusters, and we find that the three clusters reveal dynamics of two signature genes, FCER1A, which accumulates during the dendritic cell differentiation from monocytes [24], and S100A8, whose expression differs between subsets of human monocytes [25] (Figure S8 and 6). The large cluster (label 10) is characterized by high expression of S100A8 and moderate expression of FCER1A; one of the small clusters (label 1) presents high expression of both S100A8 and FCER1A, while in the other small cluster (label 2) FCER1A is largely non-expressed.
We also investigate the three clusters (labels 6, 9 and 12) of regulatory/memory/helper T cells.
The three clusters are supported by the expression of eight potential marker genes: cells in the same cluster have a similar expression pattern ( Figure S9). This example shows that scImpute provides an opportunity to discover new cell subpopulations and their marker genes.

ScImpute assists differential gene expression analysis on scRNA-seq data
ScRNA-seq data provide insights into the stochastic nature of gene expression in single cells, but suffer from a relatively low signal-to-noise ratio compared with bulk RNA-seq data. Thus an effective imputation method should lead to a better agreement between scRNA-seq and bulk RNA-seq data of the same biological condition on genes known to have little cell-to-cell heterogeneity. To evaluate whether the DE genes identified from single-cell data are more accurate after imputation, we utilize a real dataset with both bulk and single-cell RNA-seq experiments on human embryonic stem cells (ESC) and definitive endorderm cells (DEC) [26]. This dataset includes 6 samples of We apply scImpute, MAGIC, and SAVER to impute the gene expression for each cell type respectively, and then perform DE analysis on the raw data and the imputed data by each method, respectively. We use the R package DESeq2 [27] to identify DE genes from the bulk data, and the R packages DESeq2 and MAST [28] to identify DE genes from the scRNA-seq data. Inspecting the top 200 DE genes from the bulk data, we find that their expression profiles in the scRNA-seq data have stronger concordance with those in the bulk data after imputation by scImpute ( Figure S10). We apply different thresholds to false discovery rates (FDRs) of genes in the bulk data to obtain a DE gene list for every threshold. The same thresholds are applied to the FDRs of genes calculated from the raw and imputed scRNA-seq datasets to obtain DE gene lists respectively. Then we compare the DE gene lists obtained from the scRNA-seq data with those from the bulk data (i.e., the standard) to calculate precision and recall rates and F scores ( Figure S11). ScImpute leads to more similar DE gene lists to those from the bulk data, and achieves around 10% higher F scores compared with results on the raw data. We find that scImpute makes a good balance between the precision and recall rate, while MAGIC has low precision, and SAVER has low recall rate and is barely distinguishable from no imputation. We conclude that scImpute is preferred when users have a priority on the overall accuracy of the DE genes.
A comparison between the expression profiles of DEC and ESC marker genes [26,29,30] shows that the imputed data by scImpute best reflect the gene expression signatures by removing undesirable technical variation resulted from dropouts (Figure 7a and S13). To determine if the DE genes identified in scRNA-seq data are biologically meaningful, we performed gene ontology (GO) enrichment analysis [31]. In the ∼ 300 DEC up-regulated genes that are only detected in the imputed data by scImpute but not in the raw data, enriched GO terms are highly relevant to the functions of DECs (Figures 7c, S12, S14, and S15). However, in the ∼ 300 DEC up-regulated genes that are only detected in the raw data, enriched GO terms are general and not characteristic to DECs (Figures S16 and S17). These results also demonstrate that scImpute can facilitate the usage of DE method that were not designed for single-cell data.

ScImpute recovers gene expression dynamics in time course scRNA-seq data
Aside from the data used in Section 2.3, Chu et al. [26] also generated bulk and single-cell time-course RNA-seq data profiled at 0, 12, 24, 36, 72, and 96 h of differentiation during DEC emergence (Supplementary Table S2). We utilize this dataset to show that scImpute can help recover the DE signals that are difficult to identify in the raw time-course data, and reduce false discoveries resulted from dropouts. We first apply scImpute to the raw scRNA-seq data with true cell type labels, and then study how the time-course expression patterns change in imputed data.
The imputed data better distinguish cells of different time points ( Figure S18), suggesting that imputed read counts reflect more accurate transcriptome dynamics along the time course. Even though the scRNA-seq data present more biological variation than the bulk data, it is reasonable to expect that the average gene expression signal across cells in scRNA-seq should correlate with the signal in bulk RNA-seq. For a genome wide comparison, the imputed data have significantly higher Pearson correlations with the bulk data ( Figure S19). We study 70 genes associated with the GO term "endoderm development" [32] and found that a subset of these genes that are likely affected by dropout events show higher expression and better consistency with the bulk data after the imputation by scImpute (Figure 7b and S20). Similarly, we also study the marker genes (e.g.,

FOXA2
, HHEX, and CXCR4) of DECs [26, 29, 30] and these genes' expression levels at time point 96h are recovered by scImpute even though they have a median read count of zero in the raw data ( Figure S21).

Discussion
We propose a statistical method scImpute to address the dropout events prevalent in scRNAseq data. ScImpute focuses on imputing the missing expression values of dropout genes, while retaining the expression levels of genes that are largely unaffected by dropout events. Hence, scImpute can reduce technical variation resulted from scRNA-seq and better represent cell-to-cell biological variation, while it also avoids introducing excess bias during its imputation process. To achieve the above goal, scImpute first learns each gene's dropout probability in each cell by fitting a mixture model for each cell type. Next, scImpute imputes the (highly probable) dropout values of genes in a cell by borrowing information of the same gene in other similar cells, which are selected based on the genes not severely affected by dropout events. Comprehensive studies on both simulated and real data suggest that compared with the raw scRNA-seq data, the imputed data by scImpute better present cell type identity and lead to more accurate DE analysis results.
An attractive advantage of scImpute is that it can be incorporated into any existing pipelines or downstream analysis of scRNA-seq data, such as normalization [3,33], differential expression analysis [6], clustering and classification [8,9], etc. scImpute takes the raw read count matrix as input and outputs an imputed count matrix of the same dimensions, so it can be seamlessly combined with other computational tools without data reformatting or transformation. Another important feature of scImpute is that it only involves two parameters that can be easily understood and selected. The first parameter K denotes the potential number of cell populations. It can be selected based on clustering of raw data and the resolution level desired by the users. The second parameter is a threshold t on dropout probabilities. We show in a sensitivity analysis that scImpute is robust to the different parameters ( Figure S24), and a default threshold value 0.5 is sufficient for most scRNA-seq data. Moreover, cell type information is not necessary for the scImpute method. When cell type information is available, separate imputation on each cell type is expected to produce more accurate results. But as illustrated by simulation and real data results, scImpute is able to infer cell-type-specific expression even when the true labels are not supplied.
scImpute scales up well when the number of cells increases, and the computation efficiency can be largely improved if a filtering step on cells can be performed based on biological knowledge.
Aside from computational complexity, another future direction is to further improve imputation efficiency when dropout rates in raw data are severely high, as with the droplet-based technologies. Imputation task becomes more difficult when proportion of missing values increases. More complicated models that account for gene similarities may yield more accurate imputation results, but the prevalence of dropout events may require additional prior knowledge on similar genes to assist modeling. Despite the availability of computational methods that directly model zeroinflation in data [6,28], scImpute takes the imputation perspective to improve the data quality, and its applicability is not restricted to a specific task. Hence, scImpute is an useful tool that benefits all types of scRNA-seq downstream analyses.

Data processing and normalization
The input of our method is a count matrix X C with rows representing genes and columns representing cells, and our eventual goal is to construct an imputed count matrix with the same dimensions. We start by normalizing the count matrix by the library size of each sample (cell) so that all samples have one million reads. Denote the normalized matrix by X N , we then make a matrix X by taking log 10 transformation with a pseudo count 1.01: where I is the total number of genes and J is the total number of cells. The pseudo count is added to avoid infinite values in parameter estimation in a later step.

Dectection of cell subpopulations and outliers
Since scImpute borrows information of the same gene from similar cells to impute the dropout values, a critical step is to first determine which cells are from the same subpopulation. Due to excess zero counts in scRNA-seq data, it is difficult to accurately cluster cells into true cell types.
Hence, the goal of this step is to find a candidate pool of "neighbors" for each cell. ScImpute will select similar cells from the candidate neighbors in a subsequent imputation step. Suppose that scImpute clusters the cells in a dataset into K subpopulations in this step. For each cell, its candidate neighbors are the other cells in the same cluster.
1. PCA is performed on matrix X for dimension reduction and the resulting matrix is denoted as Z, where columns representing cells and rows representing principal components (PCs).
The purpose of dimension reduction is to reduce the impact of large portions of dropout values. The PCs are selected such that at least 40% of the variance in data could be explained.
2. Based on the PCA-transformed data Z, the distance matrix D J×J between the cells could be calculated. For each cell j, we denote its distance to the nearest neighbor as l j . For the set L = {l 1 , . . . , l J }, we denote its first quartile as Q 1 , and third quartile as Q 3 . The outlier cells are those cells which do not have close neighbors: For each outlier cell, we set its candidate neighbor set N j = ∅. Please note that the outlier cells could be a result of experimental/technical error or bias, but they may also represent real biological variation as rare cell types. ScImpute would not impute gene expression values in outlier cells, nor use them to impute gene expression values in other cells.
3. The remaining cells {1, . . . , J}\O are clustered into K groups by spectral clustering [18]. We denote g j = k if cell j is assigned to cluster k (k = 1, . . . , K). Hence, cell j has the candidate neighbor set N j = {j : g j = g j , j = j}.

Identification of dropout values
Once we obtain the transformed gene expression matrix X and the candidate neighbors of each cell N j , the next step is to infer which genes are affected by the dropout events in which cells.
Instead of treating all zero values as dropout events, we construct a statistical model to systematically determine whether a zero value comes from a dropout event or not. With the existence of dropout events, most genes have a bimodal expression pattern across similar cells, and that pattern can be described by a mixture model of two components (Supplementary Figure S22).
The first component is a Gamma distribution used to account for the dropouts, while the second component is a Normal distribution to represent the actual gene expression levels. For each gene, the proportions and parameters of the two components could be different in various cell types, so we construct separate mixture models for different cell subpopulations.
For each gene i, its expression in cell subpopulation k is modeled as a random variable X with density function are the shape and rate parameters of Gamma distribution, and µ are the mean and standard deviation of Normal distribution. The intuition behind this mixture model is that if a gene has high expression and low variation in most cells within a cell subpopulation, a zero count is more likely to be a dropout value; on the other hand, if a gene has constantly low or medium expression with high variation, then a zero count may reflect real biological variability. An advantage of this model is that it does not assume an empirical relationship between dropout rates and mean expression levels of genes, as [6] did, allowing more flexibility in the model estimation. The parameters in the mixture model can be estimated by the Expectation-Maximization (EM) algorithm and we denote their estimates i . It follows that the dropout probability of gene i in cell j, which belongs to subpopulation k, can be estimated as Therefore, each gene i has an overall dropout rateλ

Imputation of dropout values
Now we impute the gene expressions cell by cell. For each cell j, we select a gene set A j in need of imputation based on the genes' dropout probabilities in cell j: Recall that N j represents the indices of cells that are candidate neighbors of cell j. The response X B j ,j is a vector representing the B j rows in the j-th column of X, the design matrix X B j ,N j is a sub-matrix of X with dimensions |B j | × |N j |, and the coefficients β (j) is a vector of length |N j |.
Note that NNLS itself has the property of leading to a sparse estimateβ (j) , whose components may have exact zeros [35], so NNLS can be used to select similar cells of cell j from its neighbors N j . Finally, the estimated coefficientsβ (j) from the set B j are used to impute the expression of gene set A j in cell j: We construct a separate regression model for each cell to impute the expression of genes with high dropout probabilities ( Figure 1). This method simultaneously determines the values that need imputation, and would not introduce bias to the high expressions of accurately measured genes.
The application of scImpute involves two parameters. The first parameter is K, which determines the number of initial clusters to help identify candidate neighbors of each cell. The imputation results does not heavily rely on the choice of K is since scImpute uses a model-based method to select similar cells in a later stage. However, setting K to a value close to the true number of cell subpopulations can assist the selection of similar cells. The second parameter is a threshold t, and the imputation is only applied to the genes with dropout probabilities larger than t in a cell to avoid over-imputation. Please note that the threshold is set on the dropout probability (the probability that a gene being a dropout in a cell), not on the dropout rate (the proportion of cells in which the gene is affected by dropout events). The sensitivity analysis based on the mouse embryo data [17] suggests that scImpute is robust to varying parameter values (Supplementary

Generation of simulated scRNA-seq data
We suppose there are three cell types c 1 , c 2 , and c 3 , each with 50 cells, and there are 20, 000 genes in total. In the gene population, only 810 genes are truly differentially expressed, with one third having higher expression in each cell type respectively. We directly generate genes' log 10- standard deviation parameters obtained in the first two steps. We refer to the resulting gene expression data as the complete data. Finally, we suppose the dropout rate of each gene follows a double exponential function exp(−0.1 * mean expression 2 ), as assumed in [11]. Zero values are then introduced into the simulated data for each gene based on a Bernoulli distribution defined by the dropout rate of the gene, resulting in a gene expression matrix with excess zeros and in need of imputation. We refer to the gene expression data after introducing zero values as the raw data.
Please note that the generation of gene expression values does not directly follow the mixture model used in scImpute, so that we use this simulation to investigate the efficacy and robustness of scImpute in a fair way.

Availability of data and software
The scRNA-seq data used in this manuscript are all publicly available and their sources are summarized in Supplementary Table S3  We use U = {u 1 , . . . , u P } to denote the true partition of P class and V = {v 1 , . . . , v K } to denote the partition given by K-means clustering results. Let n i· and n ·j be the number of observations in class u i and cluster v j respectively, and n is the total number of observations.
The normalized mutual information is calculated as where I(U, V ) is mutual information, and H(U ) and H(V ) are the entroy of partition U and V .
The purity score is calculated as