Controlling for Confounding Effects in Single Cell RNA Sequencing Studies Using both Control and Target Genes

Single cell RNA sequencing (scRNAseq) technique is becoming increasingly popular for unbiased and high-resolutional transcriptome analysis of heterogeneous cell populations. Despite its many advantages, scRNAseq, like any other genomic sequencing technique, is susceptible to the influence of confounding effects. Controlling for confounding effects in scRNAseq data is a crucial step for accurate downstream analysis. Here, we present a novel statistical method, which we refer to as scPLS (single cell partial least squares), for robust and accurate inference of confounding effects. scPLS takes advantage of the fact that genes in a scRNAseq study often can be naturally classified into two sets: a control set of genes that are free of effects of the predictor variables and a target set of genes that are of primary interest. By modeling the two sets of genes jointly using the partial least squares regression, scPLS is capable of making full use of the data to improve the inference of confounding effects. With extensive simulations and comparisons with other methods, we demonstrate the effectiveness of scPLS. Finally, we apply scPLS to analyze two scRNAseq data sets to illustrate its benefits in removing technical confounding effects as well as for removing cell cycle effects.

experimental protocol and can cause amplification bias, but can be easily mitigated by using new protocols 26 . Yet other confounding effects are due to observable batches and can be adjusted for by including batch labels and technician ids as covariates or dealt with other statistical methods 27,28 . However, many confounding factors are hidden and are difficult or even impossible to measure. Common hidden confounding factors include various technical artifacts during library preparation and sequencing, and unwanted biological confounders such as cell cycle status. These hidden confounding factors can cause systematic bias, are notoriously difficult to control for, and are the focus of the present study.
To effectively infer and control for hidden confounding factors in scRNAseq studies, we develop a novel statistical method, which we refer to as scPLS (single cell partial least squares). scPLS takes advantage of the fact that genes in a scRNAseq study can often be naturally classified into two sets: a control set of genes that are free of effects of the predictor variables and a target set of genes that are of primary interest. By modeling the two sets of genes jointly using the partial least squares regression, scPLS is capable of making full use of the data to improve the inference of confounding factors. scPLS is closely related to and bridges between two existing subcategories of methods for transcriptome analysis: a subcategory of methods that treat control and target genes in the same fashion (e.g. PCA [29][30][31][32] and LMM [33][34][35], and another subcategory of methods that use control genes alone for inferring confounding factors (e.g. RUV 29,36 and scLVM 37 ). By bridging between the two subcategories of methods, scPLS enjoys robust performance across a range of application scenarios. scPLS is also computationally efficient: with a new block-wise expectation maximization (EM) algorithm, it is scalable to thousands of cells and tens of thousands of genes. Using simulations and two real data applications, we show how scPLS can be used to remove confounding effects and enable accurate down-stream analysis in scRNAseq studies. Our method is implemented as a part of the Citrus project and is freely available at: http://chenmengjie.github.io/Citrus/.
The paper is organized as follows. In the Review of Previous Methods section, we provide a brief review of existing statistical methods for removing confounding effects in transcriptome analysis and describe how scPLS is related to and motivated from these methods. In the Method Overview Section, we provide a methodological description of the scPLS model, with inference details provided in the Methods Section. In the Simulations section we present comparisons between scPLS and several existing methods using simulations. In Real Data Applications section, we apply scPLS to two real scRNAseq data sets to remove technical confounding effects or cell cycle effects. Finally, we conclude the paper with a summary and discussion.

Review of Previous Methods
Many statistical methods have been developed in sequencing-and array-based genomic studies to infer hidden confounding factors and control for hidden confounding effects. Based on their targeted application, these statistical methods can be generally classified into two categories.
The first category of methods are supervised and application-specific: these methods are designed to infer the confounding factors in the presence of a known predictor variable, and to correct for the confounding effects without removing the effects of the predictor variable. For example, scientists are often interested in identifying genes that are differentially expressed between two pre-determined treatment conditions or that are associated with a measured predictor variable of interest. To remove the confounding effects in these applications, methods, include SVA 30 , sparse regression models 38,39 , and, more recently, RUV 40,41 , are developed. Although these application-specific methods are widely applied in many genomics studies, their usage is naturally restricted to cases where the primary variable of interest is known. The application-specific methods become inconvenient in cases where there are multiple variables of interest (e.g. in eQTL mapping problems). They also become inapplicable when the primary variable of interest is not observed (e.g. in clustering problems).
The second category of methods are unsupervised, and are designed to infer the confounding factors without knowing or using the predictor variable of interest. Our scPLS belongs to this category. Notable applications of unsupervised methods in scRNAseq studies include cell type clustering and classification [1][2][3][4][5][6][7][8] . Existing unsupervised statistical methods can be further classified into two subcategories. The first subcategory of methods treat all genes in the same fashion and use all of them to infer the confounding factors. For example, the principal component analysis (PCA) or the factor model extracts the principal components or factors from all genes (or all highly variable genes) as surrogates for the confounding factors [29][30][31][32] . The inferred factors are treated as covariates whose effects are further removed from gene expression levels before downstream analyses. Similarly, the linear mixed models (LMMs) construct a sample relatedness matrix based on all genes to capture the influence of the confounding factors [33][34][35] . The relatedness matrix are then included in the downstream analyses to control for the confounding effects. In contrast, the second subcategory of unsupervised methods are recently developed to take advantage of a set of control genes for inferring the confounding factors 29,37 . These methods divide genes into two sets: a control set of genes that are known to be free of effects of interest a priori and a target set of genes that are of primary interest. Unlike the first subcategory, the second subcategory of methods treat the two gene sets differently in inferring the confounding factors: the confounding factors are only inferred from the control set, and are then used to remove the confounding effects in the target genes for subsequent downstream analysis. For example, scRNAseq studies often add ERCC spike-in controls prior to the PCR amplification and sequencing steps. The spike-in controls can be used to capture the hidden confounding technical factors associated with the experimental procedures, which are further used to remove technical confounding effects (e.g. reverse transcription or PCR amplification confounding effects) from the target genes 33 . Similarly, most scRNAseq studies include a set of control genes that are known to have varying expression levels across cell cycles. These cell cycle genes can be used to capture the unmeasured cell cycle status of each cell, which are further used to remove cell cycle effects in the target genes 37 . Prominent methods in the second subcategory include the unsupervised version of RUV 29,36 and scLVM 37 .
The two subcategories of unsupervised methods use different strategies to infer the confounding factors. Therefore, these two sets of methods are expected to perform well in different settings. Specifically, the first subcategory of methods have the advantage of using information contained in all genes to accurately infer the confounding effects. However, when the predictor variable of interest influences a large number of genes, then this subcategory of methods may incorrectly remove the primary effects of interest. On the other hand, the second subcategory of methods infer confounding factors only from the control genes and are thus not prone to mistakenly removing the primary effects of interest. However, these methods overlook one important fact-that the hidden confounding factors not only influence the control genes but also the target genes, i.e. the exact reason that we need to remove such confounding effects in the first place. Because the confounding factors influence both control and target genes, using control genes alone to infer the confounding factors can be suboptimal as it fails to use the information from target genes.
To more effectively infer and control for hidden confounding factors in scRNAseq studies, we develop a novel statistical method, which we refer to as scPLS (single cell partial least squares). scPLS bridges between the two subcategories of unsupervised methods and effectively includes each as a special case. Like the first subcategory of methods, scPLS models both control and target genes jointly to infer the confounding factors. Like the second subcategory of methods, scPLS is capable of taking advantage of a control set to guild the inference of confounding factors. scPLS builds upon the partial least squares regression model and relies on a key modeling assumption that only target genes contain the primary effects of interest or other systematic biological variations. By incorporating such systematic variations in the target genes only, we can jointly model both control and target genes to infer the confounding effects while avoiding mis-removing the primary effects of interest. Therefore, scPLS has the potential to make full use of the data to improve the inference of confounding factors and the removal of confounding effects.

Results
scPLS Method Overview. We provide modeling details for scPLS here. While the formulation of scPLS is general, we focus on its application in scRNAseq. The scRNAseq data resembles that of the bulk RNAseq data and consists of a gene expression matrix on n cells and p q + genes. We consider dividing the genes into two sets: a control set that contains q control genes and a target set that contains p genes of primary interest. The control genes are selected based on the purpose of the analysis. For example, the control set would contain ERCC spike-ins if we want to remove technical confounding factors, and would contain cell cycle genes if we want to remove cell cycle effects. We use the following partial least squares regression to jointly model both control and target genes: where for i'th individual cell, x i is a q-vector of expression levels for q control genes; y i is a p-vector of expression levels for p target genes; z i is k z -vector of unknown confounding factors that affect both control and target genes; the coefficients of the confounding factors are represented by the q by k z loading matrix Λ x for the control genes and the p by k z loading matrix y Λ for the target genes; u i is a k u -vector of unknown factors in the target genes and potentially represents the predictors of interest or other structured variations (see below); Λ u is a p by k u loading matrix; ε xi is a q-vector of idiosyncratic error with covariance diag( , , ) ; yi ε is a p-vector of idiosyncratic error with covariance diag( , , ) ; MVN denotes the multivariate normal distribution. We assume that ε xi , ε yi , z i , and u i are all independent from each other. Following standard latent factor models, we further assume that z I MVN(0, ) i~ and ũ I MVN(0, ) i . We model transformed data instead of the raw read counts. We also assume that the expression levels of each gene have been centered to have mean zero, which allows us to ignore the intercept. scPLS includes two types of unknown latent factors. The first set of factors, z i , represents the unknown confounding factors that affect both control and target genes. The effects of z i on the control and target genes are captured in the loading matrices Λ x and y Λ , respectively. We call z i the confounding factors throughout the text, and we aim to remove the confounding effects Λ z y i from the target genes. The second set of factors, u i , aims to capture a low dimensional structure of the expression level of p target genes. The factors u i can represent the unknown predictor variables of interest, specific experimental perturbations, cell subpopulations, gene signatures or other intermediate factors that coordinately regulate a set of genes. Therefore, the factors u i can be interpreted as cell subtypes, treatment status, transcription factors or regulators of biological pathways in different studies [42][43][44][45][46] . Although u i could be of direct biological interest in many data sets, we do not explicitly examine the inferred u i here. Rather, we view modeling u i in the target genes as a way to better capture the complex variance structure there and to facilitate the precise estimation of confounding factors z i . For simplicity, we call u i the biological factors throughout the text, though we note that u i could well represent non-biological processes such as treatment or environmental effects. Thus, the expression levels of the control genes can be described by a linear combination of the confounding factors z i and residual errors; the expression levels of the target genes can be described by a linear combination of the confounding factors z i , the biological factors u i and residual errors. For both types of confounding factors, we are interested in inferring the factor effects z y i Λ and Λ u u i rather than the individual factors z i and u i . Therefore, unlike in standard factor models, we are not concerned with the identifiability of the factors. Figure 1 shows an illustration of scPLS.
scPLS is closely related to the two subcategories of unsupervised methods described in the previous Section. Specifically, without the biological effects term Λ u u i , scPLS effectively reduces to the first subcategory of methods that treat all genes in the same fashion for inferring the confounding factors. Without the Equation 2 term, scPLS effectively reduces to the second subcategory of methods that use only control genes for inference. (Note that, after inferring the confounding factors z i from Equation 1, the second subcategory of methods still use a reduced version of Equation 2 without the biological effects term Λ u u i to remove the confounding effects.) By including both modeling terms, scPLS can robustly control for confounding effects across a range of scenarios. Therefore, scPLS provides a flexible modeling framework that effectively includes the two subcategories of unsupervised methods as special cases and has the potential to outperform these previous methods.

Simulations.
We performed a simulation study to compare scPLS with other methods. Specifically, we simulated gene expression levels for 50 control genes and 1,000 target genes for 200 cells. These 200 cells come from two equal-sized groups, representing two treatment conditions or two cell subpopulations. Among the 1,000 target genes, only 200 of them are differentially expressed (DE) between the two groups and thus represent the signature of the two groups. The effect sizes of the DE genes were simulated from a normal distribution and we scaled the effects further so that the group label explains twenty percent of phenotypic variation (PVE) in expression levels in the DE genes. In addition to the group effects, we set k k 2, 5 z u = = and simulated each element of z i and u i from a standard normal distribution. We simulated each element of x Λ from σ − . N( 0 25, ) l 2 and each element of y Λ from σ . N(0 25, ) l 2 . Note that Λ x and Λ y were simulated differently to capture the fact that the effect sizes of the confounding factors could be different for control and target genes. We simulated each element of Λ u from N(0, ) b 2 σ . We simulated each element of ε xi and ε yi from a standard normal distribution. We set σ = . to ensure that, in non-DE genes, the confounding factors z i explain 20% PVE in either the control or the target genes; the biological factors u i explain 30% PVE of the target genes; and the residual errors to explain the rest of PVE. To vary signal strength in the data, we also created a series of sub data sets by varying the number of non DE genes in the data, so that the proportion of variance explained by DE genes in total equal to a fixed percentage (PDE, in the range of 20-100%, with 10% increments). After we simulated gene expression levels, we further converted these continuous values into count data by using a Poisson distribution: the final observation for ith cell and jth gene c ij is from ij ij , with w ij being the continuous gene expression levels simulated above and N 500000, log (10/500000) µ = = , which ensures an average read count of 10. Note that, because of the residual errors, the resulting count data are over-dispersed with respect to a Poisson distribution.
We considered three different simulation scenarios. In scenario I, the confounding factors z i are independent of group labels. In scenario II, the confounding factors are correlated with group labels. To simulate correlated data, we simulated each element of z i from N(0, 1) if the corresponding sample belongs to the first group, but from − . N( 0 25, 1) if the corresponding sample belongs to the second group. Finally, we also considered a scenario III where there is no biological factor (i.e. data were simulated effectively under the PCA modeling assumption and all genes could be used to infer the confounding factors). We performed 10 simulation replicates for each scenario. For scenario I and II, we further introduced dropout events that are commonly observed in scRNAseq data. This was done by going through one gene at a time and setting the expression level for j th gene c ij to zero with probability ij π that depends on the expression level through log c ij We compared scPLS to four different methods: (1) PCA and (2) LMM (implemented in GEMMA 47,48 ) use all genes to infer the confounding effects; while (3) RUVseq (version 1.2.0); which we simply refer to as RUV in the following text) and (4) scLVM (version 0.99.1) use only control genes to infer the confounding effects. We note that while some of these methods are developed not specifically for scRNAseq, these methods represent a range of strategies to deal with confounding factors. We used default settings in all the above methods. We used the ij ) for all other methods. For PCA and RUV, we set the number of latent factors to be the true number (i.e. 2). Such number is determined automatically by the software itself for scLVM, and is not needed for LMM. We compared different methods based on clustering performance. In particular, for each of these methods, we obtained corrected data and applied k-means method to cluster cells into two subpopulations. We the compared the clusters inferred from the corrected data with the truth and used adjusted rand index (ARI) to measure clustering performance. ARI is computed across a range of signal strength that is measured as PDE explained above. Intuitively, if a method performs well in removing confounding factors, then the corrected data from this method can be used to better infer the two cell subpopulations and thus yields a higher ARI score.
Overall, scPLS performs the best in both scenarios I and II, with or without dropout events (Fig. 2a). The addition of dropout events in either of the two scenarios reduces the performance of all methods but does not change their relative rank of performance. The superior performance of scPLS also suggests that properly using both control and target genes can lead to effective removal of confounding effects. Among the rest of the methods, PCA and LMM performs better than RUV and scLVM, suggesting that target genes contain a substantial amount of information for removing confounding effects. Beside the comparison of clustering performance, for each gene in turn, we also used different methods to estimate the proportion of gene expression variance contributed by confounding factors. Consistent with the clustering performance comparison, we found that scPLS also yielded more accurate proportion of variance estimates (Fig. 2b).
To examine the robustness of scPLS, we applied scPLS to the same data but with a reduced number of control genes (Fig. 3a). Because scPLS does not completely rely on the information contained in the control genes, it achieves robust performance even if we only use a much smaller subset of control genes. We also examined the performance of scPLS in Scenario III where there is no biological effects (Fig. 3b) and found that scPLS performs well there. As it is often unknown whether a low-rank structural variation exists in a real data set, our simulation suggests that we can always include the biological factors u i in the model even in the absence of such factors. In addition, scPLS is not sensitive with respect to the number of biological factors used in fitting the model, and achieves similar power for a range of reasonable k u values (Fig. 3c).
Real Data Applications. Next, we applied scPLS to two real data sets. The first dataset is used to demonstrate the effectiveness of scPLS in removing the technical confounding effects by using ERCC spike-ins. Removing technical confounding effects is a common and important task in transcriptome analysis. The second dataset is used to demonstrate the effectiveness of scPLS in removing cell cycle effects by using a known set of cell cycle genes. Removing cell cycle effects can reveal gene expression heterogeneity that is otherwise obscured.
Removing Technical Confounding Factors. The first dataset consists of 251 samples from 22 . Among these, 119 are mouse embryonic stem cells (mESCs), including 74 mESCs cultured in a two-inhibitor (2i) medium and 45 mESCs cultured in a serum medium. The remaining 132 cells are control "cells" that are obtained by mixing single cells cultured in each condition (i.e. these control "cells" are similar to bulk seq data in terms of consisting a mixture of cell types, but are prepared and sequenced using single cell protocol). The control cells include 76 cells cultured in 2i and 56 cells cultured in serum. Because the control cells are homogeneous within each culture condition, when we cluster these cell, we would expect the only true cluster detectable among these cells is the culture condition. Therefore, we decide to focus on these control cells to compare the performance of different methods for removing technical effects.
We obtained the raw UMI counts data directly from the authors. The data contains measurements for 92 ERCC spike-ins and 23,459 genes. Due to the low coverage of this dataset (median coverage equals one), we filtered out lowly expressed genes and selected only genes that have at least five counts and spike-ins that have at least one count in more than a third of the cells. This filtering step resulted in a total of 32 ERCC spike-ins that were used as the controls and 2,795 genes that were used as the targets.
As in the simulations, we log transformed the count data and centered the transformed values for scPLS, PCA, LMM and scLVM. We used the count data for RUV. In this data, scPLS infers k 1 z = confounding factors and k 1 u = biological factors. In the target genes, the confounding factors and structured biological factors explain a median of 18% and 30% of gene expression variance, respectively. The PVE by the confounding and biological factors can be as high as 73.7% and 77.9%, respectively, in the target genes.
We applied scPLS and the other four methods to remove confounding effects in the data. Since control cells are homogeneous within each culture condition, we reasoned that if the method is effective in removing confounding effects, then the corrected data from the corresponding method could be used to better reveal two clusters that correspond to the two known culture conditions. For the clustering analysis, we applied the four different clustering approaches on the uncorrected or corrected data from different methods. The four clustering approaches include: (1) kmeans, where we applied the k-means method directly on the uncorrected or corrected data; (2) PCA, where we extracted the top five PCs from either the uncorrected or corrected data and then applied the k-means method using the top PCs; (3) tSNE, where we used tSNE to either the uncorrected or corrected data and then applied the k-means method on the extracted tSNE factors; (4) SC3, where we used a recently developed state-of-the-art single cell clustering method single cell census clustering (SC3) 49 . For all these clustering approaches, we set the number of clusters to two and measured clustering performance by the adjusted Rand Index (ARI). The results are shown in Table 1 and are overall consistent with the simulations. Specifically, scPLS outperforms the other methods in three out of the four clustering approaches. scPLS performs slightly worse than RUV when tSNE was used to cluster data-but tSNE works extremely poorly in this data presumably because tSNE's non-linearity assumption does not fit the data well.
Scientific RepoRts | 7: 13587 | DOI:10.1038/s41598-017-13665-w Removing Cell Cycle Effects. Our method can also be used to remove cell cycle effects. To demonstrate its effectiveness there, we applied scPLS and several other methods to a second dataset that was used for demonstrating cell cycle influence 37 . This dataset contains gene expression measurements on 9,570 genes from 182 embryonic stem cells (ESCs) with pre-determined cell-cycle phases (G1, S and G2M). The uncorrected data we obtained are already pre-processed by the original study to remove the technical effects and are thus continuous. Therefore, we did not apply RUV here. To remove cell cycle effects, we used 629 annotated cell-cycle genes as controls and the other genes as targets. scPLS infers k 1 z = cell cycle confounding factors, and = k 1 u biological factors. These factors explain a median of 0.4% and 0.1% of gene expression variance, respectively. The PVE by cell cycle factors and biological factors can be as high as 7% and 2%, respectively. We visualized the uncorrected data and scPLS corrected data on a PCA plot (Fig. 4). In the uncorrected data, there is a clear separation of cells according to cell-cycle stage. Such separation of cells is not observed in the corrected data, indicating that the cell cycle related expression signature is effectively removed.
We compared scPLS and the other three methods in their effectiveness in removing cell cycle effects. Following the original study 37 , we evaluated method performance based on the following criteria. Specifically, we computed for each gene the proportion of expression variance explained by the cell cycle factor. We denote this quantity as PVEi, which stands for inferred PVE. Because the cell-cycle stage of each cell had been experimentally determined in this data set, we further computed the variance explained by the true cell cycle labels. We denote this quantity as PVEt, which stands for true PVE. For scPLS, PVEi and PVEt are highly correlated ( = .
r 0 94 2 ), demonstrating the efficacy of scPLS. The correlation remains the same whether we use the full control set or with a subset of 300 controls. The correlation between PVEi and PVEt in scPLS is slightly higher, with statistical significance, than scLVM (r 0 92 2 = . ; p-value 10 16 < − comparing scPLS vs scLVM), LMM ( = . r 0 92 2 ; p-value 10 16 < − comparing scPLS vs LMM), and PCA ( = . r 0 92 2 ; p-value < − 10 16 comparing scPLS vs PCA). In addition, as an alternative measurement, the median of the absolute difference between PVEi and PVEt across genes from scPLS, scLVM, LMM and PCA are 0.018, 0.023, 0.019 and 0.019, respectively, again supporting a small advantage of scPLS. However, we do want to acknowledge that all methods work reasonably well in this data (which is    Table 1. Clustering performance on the uncorrected data or data corrected by different methods (columns). Different clustering approaches (rows) are applied in order to examine the robustness of the comparison results. Clustering performance is measured by the adjusted Rand Index. All performance measurements are averaged across 10 runs and are multiplied by a factor of 100. The top performer is colored blue. consistent with the low variance explained by the confounding factors), suggesting that removing cell cycle effects is a relatively trivial task at least in this data set.

Discussion
We have presented scPLS for removing hidden confounding effects in scRNAseq studies. scPLS models both control and target genes jointly to infer the confounding factors and shows robust performance across a range of application scenarios. With simulations and applications to two real data sets, we have demonstrated its effectiveness for removing technical confounding effects or cell cycle effects in scRNAseq studies.
Although we have focused on its applications to scRNAseq studies, scPLS can be readily applied to other genomic sequencing studies. For instance, our method can be used to remove confounding effects from gene expression levels in bulk RNAseq studies 50 or from methylation levels in bisulfite sequencing studies 51 . The main requirement of our method is a set of pre-specified control genes that are measured together with the target genes in the sequencing studies. It is often straightforward to obtain such control genes. For example, many scRNAseq studies include a set of ERCC spike-in controls that could be used to model and remove technical confounding effects 33 . Even when such ERCC spike-in controls are not present or when they are unreliable 29 , we can select a known set of house-keeping genes as controls to remove technical confounding 29 . Similarly, we can use a set of known cell cycle genes to remove cell cycle effects. Importantly, the performance of scPLS is robust to the number of genes included in the control set and yields comparable results even when a much smaller number of control genes is used. This is because scPLS not only uses information from control genes but also relies on information from target genes. Insensitivity to the control set makes scPLS especially suited to removing confounding factors in studies where a control set is not clearly defined. Because of its effectiveness and robustness, we expect scPLS to be useful in removing confounding effects in a wide variety of sequencing studies.
One important feature of scPLS is that it includes a low-rank component to model the structured biological variation often observed in real data. By decomposing the (residual) gene expression variation into a low-rank structured component that is likely to be contributed by a sparse set of biological factors, and an unstructured component that reflects the remaining variation, scPLS can better model the residual error structure for accurate inference of confounding effects. Although here we have focused on using the biological factors to better infer the confounding effects, we note that the low-rank biology factors themselves could be of direct interest. In fact, low-rank factors inferred from many data sets using standard factor models have been linked to important biological pathways or transcription factors [42][43][44][45][46] . Inferring the biological factors using scPLS is not feasible at the moment, however: because of model identifiability, scPLS can only be used to infer the biological effects (i.e. Λ u u i ) but not the biological factors (i.e. u i ). That said, additional assumptions can be made on the structure of the factors or the factor loading matrices to make factor inference possible 52 . For example, we could impose sparsity assumptions on the low-rank factors to facilitate the inference of a parsimonious set of biological factors. Exploring the use of biological factors in scPLS is an interesting avenue for future research.
We have been mainly focused on comparing the performance of different confounding effects removing methods by evaluating the clustering performance as the target downstream analysis. It has been well recognized that the choice of data normalization in scRNA-Seq is highly dependent on the specific biological question and the target downstream analysis 53 . Indeed, different downstream analysis (e.g. differential expression, lineage reconstruction, detecting allele-specific expression, spatial reconstruction etc.) can be affected differently by different choices of normalization. While evaluating the performance of various confounding effects removing methods for other downstream analysis is beyond the scope of the present study, we acknowledge that the "best" confounding effects removing method may vary depending on the question of interest. Therefore, it would be important to evaluate the performance of scPLS in other analysis settings in future studies. Nevertheless, we believe scPLS represent an important addition to the existing tools for removing confounding effects. Finally, in simulations we have also mainly focused on using the k-means clustering method to evaluate the clustering performance. Many other clustering methods are being developed recently, some of which are specifically targeted to single cell RNAseq studies. Those methods include RaceID 54 , SCUBA 55 , SNN-Cliq 56 , ZIFA 57 , t-SNE 4 , SC3 49 ; just to name a few. Because scPLS does not rely on a particular clustering method, scPLS can be paired with any clustering methods to take advantage of their benefits. Indeed, we have applied different clustering approaches to measure the performance of scPLS and other methods for removing confounding effects in the real data and obtained consistent results.
Like many other methods for scRNAseq 21 or bulk 58,59 RNAseq studies, scPLS requires a data transformation step that converts the count data into quantitative expression data. Different transformation methods can affect the interpretation of the data and are advantageous in different situations 16 . Because scPLS does not rely on a particular transformation procedure, scPLS can also be paired with any transformation methods to take advantage of their benefits. One potential disadvantage of scPLS is that it does not model raw count data directly. In bulk RNAseq studies, despite the count nature of sequencing data, it has been show that there is often a limited advantage of modeling the raw read counts directly, at least for RNAseq studies 60,61 . Statistical methods that convert and model the quantitative expression data have been shown to be robust 58,59 and most large scale bulk RNAseq studies in recent years have used transformed data instead of count data 31,[62][63][64] . However, we note that, unlike bulk RNAseq studies, single cell RNAseq data often come with low read depth. In low read depth cases, modeling count data while accounting for over-dispersion or dropout events in single cell RNAseq studies may have added benefits 17,18 . Therefore, extending our framework to modeling count data 65,66 is another promising avenue for future research.

Methods
EM Algorithms for scPLS. We develop an expectation-maximization (EM) algorithm for inference in scPLS. Specifically, we first initialize the factor loading matrices Λ Λ Λ ( , , ) x y u based on sequential single value decompositions on the gene expression matrices    Table 2. Comparison of the naive EM algorithm and the EM-in-chunks algorithm in terms of accuracy and speed. The EM-in-chunks algorithm uses either a chunk size of 500 genes or a chunk size of 1,000 genes. Accuracy is measured by the estimation error of the loading matrix in terms of the normalized Frobenius norm ). Because of the dimensionality of the loading matrix, the estimation error is not guaranteed to decrease with increasing sample size n. Speed is measured by CPU time in seconds for 100 iterations on an Intel Xeon E5-2670 2.6 GHz CPU. Standard deviations across 10 replicates are listed inside parenthesis. s: number of genes per chunk. n: the number of cells. p: the number of genes in the target set. The number of genes in the control set is q = 50 in all simulations.
We refer to the above algorithm (Algorithm 2) as the naive EM algorithm. The naive EM algorithm is computationally expensive: it scales quadratically with the number of genes and linearly with the number of cells/samples. To improve the computational speed, we develop a new EM-in-chunks algorithm (Algorithm 3). Our algorithm is based on the observation that the expression levels of the target genes are determined by the same set of underlying factors and that these factors can be estimated accurately even with a small subset set of target genes. This allows us to randomly divide target genes into dozens of chunks, compute the expectation of the factors in each chunk separately in the E-step, and then average these expectations across chunks. With the averaged expectations, we then update the factor loading matrices in the M-step. Thus, our new algorithm modifies the E-step in the naive algorithm and becomes K times faster than the naive one, where K is the number of chunks. This same idea has also been applied in the ZIFA algorithm 57 . Simulations (detailed in the simulations Section) show that our EM-in-chunks algorithm yields almost comparable results to the naive EM algorithm with respect to estimation errors, but can be close to an order of magnitude faster (Table 2). With the EM-in-chunks algorithm, our method is easily scalable to handle tens of thousands of cells (Fig. 5). For example, on a single Xeon desktop CPU, we can analyze 10,000 cells and 1,000 genes using our method in approximately 40 min. Therefore, we apply the EM-in-chunks algorithm with chunk size 500 throughout the rest of the paper.  Finally, we use the Bayesian information criterion (BIC) to determine the number of confounding factors k z and the number of biological factors k u . Specifically, we evaluate the likelihood on a grid of k z (1 to 3) and k u values (1 to 10) and choose the optimal combination that minimizes the BIC. After estimating the model parameters on the optimal set of k z and k u , we use the residuals Λ = −ˆŷ y z i i y i as the de-noised values for subsequent analysis. Note that the residuals are only free of the confounding effects Λ z y i but still contain the biological effects u u i Λ .

EM Algorithm Derivation.
To derive the EM algorithm, we first integrate out the latent variables z i and u i and obtain The latent variable x i and z i follow a joint normal distribution In the M step, we maximize the above expectation. To do so, we take derivatives of the log-likelihood function with respect to x Λ , Λ y and Λ u , and obtain where the conditional expectations are The above equations form the basis of our EM algorithms.