Abstract
Despite their widespread applications, single-cell RNA-sequencing (scRNA-seq) experiments are still plagued by batch effects and dropout events. Although the completely randomized experimental design has frequently been advocated to control for batch effects, it is rarely implemented in real applications due to time and budget constraints. Here, we mathematically prove that under two more flexible and realistic experimental designs, the reference panel and the chain-type designs, true biological variability can also be separated from batch effects. We develop Batch effects correction with Unknown Subtypes for scRNA-seq data (BUSseq), which is an interpretable Bayesian hierarchical model that closely follows the data-generating mechanism of scRNA-seq experiments. BUSseq can simultaneously correct batch effects, cluster cell types, impute missing data caused by dropout events, and detect differentially expressed genes without requiring a preliminary normalization step. We demonstrate that BUSseq outperforms existing methods with simulated and real data.
Introduction
Single-cell RNA-sequencing (scRNA-seq) technologies enable the measurement of the transcriptome of individual cells, which provides unprecedented opportunities to discover cell types and understand cellular heterogeneity^{1}. However, like the other high-throughput technologies^{2,3,4}, scRNA-seq experiments can suffer from severe batch effects^{5}. Moreover, compared with bulk RNA-seq data, scRNA-seq data can have an excessive number of zeros that result from either biological zeros (a gene is not expressed in a given cell) or dropout events (the expression of some genes is not detected even though they are actually expressed in the cell, owing to amplification failure prior to sequencing)^{6}. Consequently, despite the widespread adoption of scRNA-seq experiments, the design of a valid scRNA-seq experiment that allows the batch effects to be removed, the biological cell types to be discovered, and the missing data to be imputed remains an open problem.
One of the major tasks of scRNA-seq experiments is to identify cell types for a population of cells^{1}. The cell type of each individual cell is unknown and is often the target of inference. Classic batch effects correction methods, such as ComBat^{7} and SVA^{8,9}, are designed for bulk experiments and require knowledge of the subtype information of each sample a priori. For scRNA-seq data, this subtype information corresponds to the cell type of each individual cell. These methods are therefore infeasible for scRNA-seq data. Alternatively, if one has knowledge of a set of control genes whose expression levels are constant across cell types, then it is possible to apply RUV^{10,11}. However, selecting control genes is still challenging for scRNA-seq experiments, and recently there has been active research on identifying stably expressed genes that are reproducible and conserved across species for single cells^{12}.
To identify unknown subtypes, MetaSparseKmeans^{13} jointly clusters samples across batches. Unfortunately, MetaSparseKmeans requires all subtypes to be present in each batch. Suppose that we conduct scRNA-seq experiments for blood samples from a healthy individual and a leukemia patient, one person per batch. Although we can anticipate that the two batches will share T cells and B cells, we do not expect the healthy individual to have cancer cells as the leukemia patient does. Therefore, MetaSparseKmeans is too restrictive for many scRNA-seq experiments.
The mutual-nearest-neighbor (MNN) based approaches, including MNN^{14} and Scanorama^{15}, allow each batch to contain some but not all cell types. However, these methods require batch effects to be almost orthogonal to the biological subspaces and much smaller than the biological variations between different cell types^{14}. These are strong assumptions and cannot be validated at the design stage of the experiments. Seurat^{16,17}, LIGER^{18}, and scMerge^{19} attempt to identify shared variations across batches by low-dimensional embeddings and treat them as shared cell types. However, they may mistake technical artifacts for the biological variability of interest if some batches share certain technical noise, for example when each patient is measured in several batches. To handle severe batch effects for microarray data, Luo and Wei^{20} developed BUS to simultaneously cluster samples across multiple batches and correct batch effects. However, none of the above methods considers features unique to scRNA-seq data, such as the count nature of the data, overdispersion^{21}, dropout events^{6}, or cell-specific size factors^{22}. ZIFA^{23} and ZINB-WaVE^{24} incorporate dropout events into the factor model, whereas scVI^{25} and SAVER-X^{26} couple the modeling of dropout events with neural networks. However, as is the case with the other state-of-the-art methods, these papers do not discuss the designs of scRNA-seq experiments under which their methods are applicable.
Nevertheless, it is crucial to understand the conditions under which biological variability can be separated from technical artifacts. For completely confounded designs (for example, one in which batch 1 measures cell types 1 and 2, whereas batch 2 measures cell types 3 and 4), no method is applicable.
Here, we propose Batch effects correction with Unknown Subtypes for scRNA-seq data (BUSseq), an interpretable hierarchical model that simultaneously corrects batch effects, clusters cell types, and takes care of the count data nature, the overdispersion, the dropout events, and the cell-specific size factors of scRNA-seq data. We mathematically prove that it is legitimate to conduct scRNA-seq experiments under not only the commonly advocated completely randomized design^{1,5,27,28}, in which each batch measures all cell types, but also the reference panel design and the chain-type design, which allow some cell types to be missing from some batches. Furthermore, we demonstrate that BUSseq outperforms the existing approaches in both simulated data and real applications. The theoretical results answer the question of when we can integrate multiple scRNA-seq datasets and analyze them jointly. We envision that the proposed experimental designs will be able to guide biomedical researchers and help them to design better scRNA-seq experiments.
Results
BUSseq is an interpretable hierarchical model for scRNA-seq
We develop a hierarchical model BUSseq that closely mimics the data-generating procedure of scRNA-seq experiments (Fig. 1a, Supplementary Fig. 1 and Supplementary Note 1). Given that we have measured B batches of cells, each with a sample size of n_{b}, let us denote the underlying gene expression level of gene g in cell i of batch b as X_{big}. X_{big} follows a negative binomial distribution with mean expression level μ_{big} and a gene-specific and batch-specific overdispersion parameter ϕ_{bg}. The mean expression level is determined by the cell type W_{bi} with the cell type effect β_{gk}, the log-scale baseline expression level α_{g}, the location batch effect ν_{bg}, and the cell-specific size factor δ_{bi}. The cell-specific size factor δ_{bi} characterizes the impact of cell size, library size and sequencing depth. It is of note that the cell type W_{bi} of each individual cell is unknown and is our target of inference. Therefore, we assume that a cell in batch b comes from cell type k with probability Pr(W_{bi} = k) = π_{bk}, and the proportions of cell types (π_{b1}, ⋯ , π_{bK}) vary among batches.
Unfortunately, it is not always possible to observe the expression level X_{big}. Without dropout (Z_{big} = 0), we can directly observe Y_{big} = X_{big}. However, if a dropout event occurs (Z_{big} = 1), then we observe Y_{big} = 0 instead of X_{big}. In other words, when we observe a zero read count Y_{big} = 0, there are two possibilities: a non-expressed gene (a biological zero) or a dropout event. When gene g is not expressed in cell i of batch b (X_{big} = 0), we always have Y_{big} = 0; when gene g is actually expressed in cell i of batch b (X_{big} > 0) but a dropout event occurs, we can only observe Y_{big} = 0, and hence Z_{big} = 1. It has been noted that highly expressed genes are less likely to suffer from dropout events^{6}. We thus model the dependence of the dropout rate Pr(Z_{big} = 1∣X_{big}) on the expression level using a logistic regression with batch-specific intercept γ_{b0} and log-odds ratio γ_{b1}.
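The data-generating mechanism described above can be sketched in a few lines of Python. This is a minimal illustration rather than the BUSseq implementation: it assumes an additive log-scale mean, log μ_{big} = α_{g} + β_{gk} + ν_{bg} + δ_{bi}, and draws the negative binomial as a gamma-Poisson mixture; all function names and parameter values are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate only for modest means."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_negative_binomial(mu, phi, rng):
    """Gamma-Poisson mixture with mean mu and variance mu + mu**2 / phi."""
    rate = rng.gammavariate(phi, mu / phi)
    return sample_poisson(rate, rng)


def dropout_probability(x, gamma0, gamma1):
    """Logistic dropout rate Pr(Z = 1 | X = x), computed in a stable way."""
    t = gamma0 + gamma1 * x
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def simulate_cell(alpha, beta_k, nu_b, delta_bi, phi_b, gamma0, gamma1, rng):
    """Return per-gene lists (x, z, y) for one cell of type k in batch b."""
    xs, zs, ys = [], [], []
    for a_g, b_gk, nu_bg, phi_bg in zip(alpha, beta_k, nu_b, phi_b):
        # ASSUMPTION: the mean is additive on the log scale.
        mu = math.exp(a_g + b_gk + nu_bg + delta_bi)
        x = sample_negative_binomial(mu, phi_bg, rng)  # underlying count
        z = 1 if rng.random() < dropout_probability(x, gamma0, gamma1) else 0
        y = 0 if z == 1 else x  # a dropout masks the count as zero
        xs.append(x)
        zs.append(z)
        ys.append(y)
    return xs, zs, ys
```

With a negative γ_{b1}, the simulated dropout probability decreases as the underlying count grows, which matches the observation that highly expressed genes are less likely to drop out. An observed zero in `ys` can come either from `x == 0` or from `z == 1`.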
Notably, BUSseq includes the negative binomial distribution without zero inflation as a special case. When all cells are from a single cell type and the cell-specific size factor δ_{bi} is estimated a priori according to spike-in genes, BUSseq can reduce to a form similar to BASiCS^{21}.
We only observe Y_{big} for all cells in the B batches and the total G genes. We conduct statistical inference under the Bayesian framework and adopt the Metropolis-within-Gibbs algorithm^{29} for the Markov chain Monte Carlo (MCMC) sampling^{30} (Supplementary Note 2). Based on the parameter estimates, we can learn the cell type for each individual cell, impute the missing underlying expression levels X_{big} for dropout events, and identify genes that are differentially expressed among cell types. Moreover, our algorithm can automatically detect the total number of cell types K that exist in the dataset according to the Bayesian information criterion (BIC)^{31}. BUSseq also provides a batch-effect-corrected version of count data, which can be used for downstream analysis as if all of the data were measured in a single batch (“Methods”).
Valid experimental designs for scRNA-seq experiments
If a study design is completely confounded, as shown in Fig. 1b, then no method can separate biological variability from technical artifacts, because different combinations of batch-effect and cell-type-effect values can lead to the same probabilistic distribution for the observed data, which in statistics is termed a non-identifiable model. Formally, a model is said to be identifiable if each probability distribution can arise from only one set of parameter values^{32}. Statistical inference is impossible for non-identifiable models because two sets of distinct parameter values can give rise to the same probability distribution function. We prove that the BUSseq model is identifiable under conditions that are very easily met in reality. Thus, a wide range of designs of scRNA-seq experiments are valid, as their batch effects can be adjusted at least by BUSseq.
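A small worked example may make the non-identifiability concrete. It assumes, for illustration only, that the mean is additive on the log scale, \(\log \mu_{big} = \alpha_g + \beta_{gk} + \nu_{bg} + \delta_{bi}\) for a cell of type k (the exact model is given in Supplementary Note 1); \(c_g\) is an arbitrary, hypothetical constant.

```latex
% Completely confounded design: batch 1 holds cell types 1 and 2,
% batch 2 holds cell types 3 and 4.
\begin{align*}
\log \mu_{2ig} &= \alpha_g + \beta_{gk} + \nu_{2g} + \delta_{2i},
  \qquad k \in \{3, 4\}, \\
               &= \alpha_g + (\beta_{gk} - c_g) + (\nu_{2g} + c_g) + \delta_{2i}.
\end{align*}
```

Because cell types 3 and 4 are observed only in batch 2, the shifted values \((\beta_{g3} - c_g,\ \beta_{g4} - c_g,\ \nu_{2g} + c_g)\) produce the same means, and hence the same distribution of the observed data, for every \(c_g\). Under this assumed form, the data cannot tell a batch effect from a cell type effect in such a design.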
For the complete setting, in which each batch measures all of the cell types (Fig. 1c and Theorem 1 in “Methods”), BUSseq is identifiable as long as: (I) the log-odds ratios γ_{b1} in the logistic regressions for the dropout rates are negative for all of the batches, (II) every two cell types have more than one differentially expressed gene, and (III) the ratios of mean expression levels between two cell types \(\left(\frac{\exp(\beta_{1k})}{\exp(\beta_{1\tilde{k}})},\cdots,\frac{\exp(\beta_{Gk})}{\exp(\beta_{G\tilde{k}})}\right)\) are different for each cell-type pair \((k,\tilde{k})\). Condition (I) requires that the highly expressed genes are less likely to have dropout events, which is routinely observed for scRNA-seq data^{6}. Condition (II) always holds in reality. Because scRNA-seq experiments measure the whole transcriptome of a cell, condition (III) is also always met in real data. For example, if there exists one gene g such that for any two distinct cell-type pairs (k_{1}, k_{2}) and (k_{3}, k_{4}) their mean expression levels’ ratios \(\frac{\exp(\beta_{gk_{1}})}{\exp(\beta_{gk_{2}})}\) and \(\frac{\exp(\beta_{gk_{3}})}{\exp(\beta_{gk_{4}})}\) are not the same, then condition (III) is already satisfied.
The commonly advocated completely randomized experimental design is a special case of the complete setting design. In a completely randomized design, cells are assigned to different batches completely at random. As a result, all of the batches have similar compositions of cell populations. In contrast, under the complete setting design, cells from different cell types can be distributed to different batches very unevenly. The requirement that each batch has similar cellular compositions is crucial for traditional batch effects correction methods developed for bulk experiments, such as ComBat^{7}, to work well for scRNA-seq data. In contrast, BUSseq is not limited to this balanced design constraint and is applicable to not only the completely randomized design but also the general complete setting design.
Ideally, we would wish to adopt completely randomized experimental designs. However, in reality, it is often very challenging to implement complete randomization due to time and budget constraints. For example, when we recruit patients sequentially, we often have to conduct scRNA-seq experiments patient-by-patient rather than randomize the cells from all of the patients to each batch, and the patients may not have the same set of cell types. Fortunately, we can prove that BUSseq also applies to two sets of flexible experimental designs, which allow cell types to be measured in only some but not all of the batches.
Assuming that conditions (I)–(III) are satisfied, if there exists one batch that contains cells from all cell types and the other batches have at least two cell types (Fig. 1d), then BUSseq can tease out the batch effects and identify the true biological variability (see Theorem 2 in “Methods”). We call this setting the reference panel design.
Sometimes, it can still be difficult to obtain a reference batch that collects all cell types. In this case, we can turn to the chain-type design, which requires every two consecutive batches to share two cell types (Fig. 1e). Under the chain-type design, given that conditions (I)–(III) hold, BUSseq is also identifiable and can estimate the parameters well (see Theorem 3 in “Methods”).
A special case of the chain-type design is when two common cell types are shared by all of the batches, which is frequently encountered in real applications. For instance, when blood samples are assayed, even if we perform scRNA-seq experiments patient-by-patient with one patient per batch, we know a priori that each batch will contain at least both T cells and B cells, thus satisfying the requirement of the chain-type design.
The key insight is that despite batch effects, differences between cell types remain constant across batches. The differences between a pair of cell types allow us to distinguish batch effects from biological variability for those batches that measure both cell types. Therefore, BUSseq can separate batch effects from cell type effects under more general designs beyond the easily understood and commonly encountered reference panel design and chain-type design. If we regard each batch as a node in a graph and connect two nodes with an edge if the two batches share at least two cell types, then BUSseq is identifiable as long as the resulting graph is connected (Supplementary Fig. 2 and Theorem 4 in “Methods”).
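The graph condition can be checked mechanically before an experiment is run. The sketch below is a hypothetical helper, not part of BUSseq: it takes the set of cell types expected in each batch, links batches that share at least two cell types, and tests connectivity with a depth-first search. It covers only the graph condition; conditions (I)–(III) must hold separately.

```python
from itertools import combinations


def design_is_connected(batches, min_shared=2):
    """Check the connected-design condition.

    batches: list of sets, one set of cell-type labels per batch.
    Two batches are linked when they share at least `min_shared` cell types.
    """
    n = len(batches)
    if n == 0:
        return False
    adjacency = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if len(batches[i] & batches[j]) >= min_shared:
            adjacency[i].add(j)
            adjacency[j].add(i)
    # Depth-first search from batch 0; the design is connected
    # if every batch is reachable.
    seen, stack = {0}, [0]
    while stack:
        node = stack.pop()
        for neighbor in adjacency[node] - seen:
            seen.add(neighbor)
            stack.append(neighbor)
    return len(seen) == n


# Illustrative designs with hypothetical cell-type labels.
confounded = [{1, 2}, {3, 4}]
reference_panel = [{1, 2, 3, 4}, {1, 2}, {3, 4}]
chain_type = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]
```

In these examples the completely confounded design fails the check, whereas the reference panel and chain-type layouts pass it.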
For scRNA-seq data, the dropout rate depends on the underlying expression levels^{6}. Such a missing data mechanism is called missing not at random (MNAR) in statistics. It is very challenging to establish identifiability for MNAR. Miao et al.^{33} showed that in many cases, even when both the outcome distribution and the missing data mechanism have parametric forms, the model can be non-identifiable. Fortunately, despite the dropout events and the cell-specific size factors, we are able to prove Theorems 1–4 (Supplementary Note 3). The reference panel design, the chain-type design, and the connected design liberate researchers from the ideal but often unrealistic requirement of the completely randomized design.
BUSseq accurately learns the parameters and the missing data
We first evaluated the performance of BUSseq via a simulation study. We simulated a dataset with four batches and a total of five cell types under the chain-type design (Fig. 2a–d and Theorem 3). Every two consecutive batches share at least two cell types, but none of the batches contains all of the cell types. The sample sizes for each batch are (n_{1}, n_{2}, n_{3}, n_{4}) = (300, 300, 200, 200) (Supplementary Table 1), and there are a total of 3000 genes, out of which 500 genes are differentially expressed between cell types. The remaining 2500 genes have no biological differences between different cell types, so they are pure noise with only batch effects. In real datasets, batch effects are often much larger than the cell type effects (Fig. 3a) and not orthogonal to the cell type effects (Supplementary Fig. 3). In the simulation study, we choose the magnitude of the batch effects, the cell type effects, the dropout rates, and the cell-specific size factors to mimic real data scenarios (Fig. 3a). The simulated observed data suffer from severe batch effects and dropout events (Figs. 2d, 3c). The dropout rates for the four batches are 26.79%, 24.53%, 28.36%, and 31.29%, with the corresponding total zero proportions given by 44.13%, 48.85%, 53.07%, and 61.38%.
BUSseq correctly identifies the presence of five cell types among the cells (Fig. 2e). Moreover, despite the dropout events, BUSseq accurately estimates the cell type effects β_{gk} (Fig. 2a, f), the batch effects ν_{bg} (Fig. 2b, g), and the cell-specific size factors δ_{bi} (Fig. 2j). In particular, BUSseq outperforms existing normalization methods, including DESeq normalization^{34}, trimmed mean of M-values (TMM) normalization^{35}, library size normalization, and the deconvolution normalization method^{36}, in estimating the cell-specific size factors δ_{bi} (Supplementary Fig. 4 and Supplementary Note 4). When controlling the Bayesian False Discovery Rate (FDR) at 5%^{37,38}, we identify all intrinsic genes that differentiate cell types with the true FDR being 2% (“Methods”).
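One common way to control a Bayesian FDR is sketched below; it is a generic illustration and not necessarily the exact procedure described in “Methods”. It assumes that for each gene we already have a posterior probability that the gene is not intrinsic (the hypothetical input `null_probs`), and it selects the largest set of genes whose average null probability stays at or below the target level.

```python
def select_by_bayesian_fdr(null_probs, target=0.05):
    """Return indices of genes called intrinsic at the given Bayesian FDR.

    null_probs[g] is the (assumed) posterior probability that gene g is NOT
    intrinsic. The estimated FDR of a selected set is the mean of its null
    probabilities.
    """
    order = sorted(range(len(null_probs)), key=lambda g: null_probs[g])
    selected, running_sum = [], 0.0
    for rank, g in enumerate(order, start=1):
        running_sum += null_probs[g]
        # Genes are visited in ascending order, so the running mean never
        # decreases; we can stop at the first violation.
        if running_sum / rank > target:
            break
        selected.append(g)
    return selected
```

For example, with null probabilities `[0.01, 0.9, 0.02, 0.5, 0.03]` and a 5% target, genes 0, 2, and 4 are selected, with an estimated FDR of 0.02.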
Figure 2h demonstrates that BUSseq can learn the underlying expression levels X_{big} well based on the observed data Y_{big}, which are subject to dropout events. This success arises because BUSseq uses an integrative model to borrow strength both across genes and across cells from all batches. For comparison, we also benchmarked BUSseq against three state-of-the-art imputation methods for scRNA-seq data: SAVER^{39}, DrImpute^{40}, and scImpute^{41}. Once again, BUSseq performs the best in identifying the true biological zeros and recovering the underlying expression levels X_{big} for the dropout events (Supplementary Table 2 and Supplementary Note 5).
ComBat offers a version of data that have been adjusted for batch effects^{7}. Here, we also provide batch-effects-corrected count data based on quantile matching (“Methods”). The adjusted count data no longer suffer from batch effects and dropout events, and they do not even need further cell-specific normalization (Fig. 2i). Therefore, they can be treated as if measured in a single batch for downstream analysis.
To evaluate the robustness of BUSseq, we conducted extensive sensitivity analyses, and they show that BUSseq is robust to the choice of hyperparameters, high zero rates, model misspecification and gene filtering (Supplementary Figs. 5–7, Supplementary Tables 3 and 4, and Supplementary Note 6).
BUSseq outperforms existing methods in simulation study
We benchmarked BUSseq against the state-of-the-art methods for batch effects correction for scRNA-seq data: LIGER^{18}, MNN^{14}, Scanorama^{15}, scVI^{25}, Seurat^{17}, and ZINB-WaVE^{24}. The adjusted Rand index (ARI) measures the consistency between two clustering results; it equals one for identical partitions and is close to zero for random labelings, with a higher value indicating better consistency^{42} (Supplementary Note 7). The ARI between the inferred cell types \({\widehat{W}}_{bi}\) by BUSseq and the true underlying cell types W_{bi} is one. Thus, BUSseq can perfectly recover the true cell type of each cell. For comparison, we applied each of the compared methods to the dataset and then performed their own clustering approaches (Supplementary Note 8). The ARI is able to compare the consistency of two clustering results even if the numbers of clusters differ; therefore, we chose the number of cell types by the default approach of each method rather than setting it to a common number. The resulting ARIs are 0.837 for LIGER, 0.654 for MNN, 0.521 for Scanorama, 0.480 for scVI, 0.632 for Seurat, and 0.571 for ZINB-WaVE. Moreover, the t-SNE plots (Fig. 3c, d) show that only BUSseq can perfectly cluster the cells by cell types rather than batches. We also calculated the silhouette score for each cell for each compared method (Supplementary Note 7). A high silhouette score indicates that the cell is well matched to its own cluster and separated from neighboring clusters. Figure 3b shows that BUSseq gives the best segregated clusters.
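For readers who want to reproduce this kind of comparison, the ARI can be computed directly from the contingency table of two labelings. The sketch below is a plain standard-library implementation of the usual pair-counting formula; the function name is ours and it is not taken from the BUSseq code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Pairs of cells that are grouped together in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs grouped together within each labeling on its own.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (sum_cells - expected) / (max_index - expected)
```

The index does not depend on how clusters are named, and it remains defined when the two labelings have different numbers of clusters. A labeling that agrees with the truth less often than expected by chance receives a value below zero.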
BUSseq outperforms existing methods on hematopoietic data
We reanalyzed the two hematopoietic datasets^{43,44} previously studied by Haghverdi et al.^{14} (Fig. 4a and Supplementary Fig. 8a,b). The two datasets share at least three cell types, namely the common myeloid progenitors (CMP), megakaryocyte-erythrocyte progenitors (MEP) and granulocyte-monocyte progenitors (GMP); thus, they follow the chain-type design.
BUSseq fits the zero rates (Table 1 and Supplementary Note 9) and the mean-variance trends (Fig. 5a, Supplementary Fig. 9 and Supplementary Note 7) of the real data very well. In order to compare BUSseq with existing methods, we compute the ARIs between the clustering of each method and the FACS labels. The resulting ARIs are 0.582 for BUSseq, 0.307 for LIGER, 0.575 for MNN, 0.518 for Scanorama, 0.197 for scVI, 0.266 for Seurat, and 0.348 for ZINB-WaVE (Supplementary Table 5 and Supplementary Note 8). BUSseq thus outperforms all of the other methods in being consistent with FACS labeling. BUSseq also has silhouette coefficients that are comparable to those of MNN, which are better than those of all the other methods (Fig. 4b and Supplementary Fig. 10a, b).
Specifically, BUSseq learns six cell types from the dataset. According to the FACS labels (Methods), Cluster 2, Cluster 5, and Cluster 6 correspond to CMP, MEP, and GMP, respectively (Figs. 4c, 6a–c). Cluster 1 is composed of long-term hematopoietic stem cells and multipotent progenitors (MPP). These are cells from the early stage of differentiation. Cluster 4 consists of a mixture of MEP and CMP, while Cluster 3 is dominated by cells labeled as other. Comparison between the subpanel for BUSseq in Figs. 4c and 6b indicates that Cluster 4 consists of cells from an intermediate cell type between CMP and MEP. In particular, according to Fig. 6e, the marker genes Apoe and Gata2 are highly expressed in Cluster 4 but not in CMP (Cluster 2) and MEP (Cluster 5), and the marker gene Ctse is expressed in MEP (Cluster 5) but not in Cluster 4 and CMP (Cluster 2). Therefore, cells in Cluster 4 do form a unique group with distinct expression patterns. This intermediate cell stage between CMP and MEP is missed by all of the other methods considered. Moreover, we find that well-known B-cell lineage genes^{45}, Ebf1, Vpreb1, Vpreb3, and Igll1, are highly expressed in Cluster 3, but not in the other clusters (Fig. 6c, e). To identify Cluster 3, which is dominated by cells labeled as other by Nestorowa et al.^{43}, we map the mean expression profile of each cluster learned by BUSseq to the Haemopedia RNA-seq dataset^{46}. It turns out that Cluster 3 aligns well to common lymphoid progenitors (CLP) that give rise to T-lineage cells, B-lineage cells and natural killer cells (Fig. 6d). Therefore, Cluster 3 represents cells that differentiate from lymphoid-primed multipotent progenitors (LMPP)^{44}. Once again, all the other methods fail to identify these cells as a separate group.
Thus, although BUSseq does not assume any temporal ordering between cell types, it is able to preserve the differentiation trajectories (Fig. 6a, b); although BUSseq assumes that each cell belongs to one cell type rather than conducting semi-soft clustering^{47}, it is capable of capturing the subtle changes across cell types and within a cell type due to continuous processes such as development and differentiation (Supplementary Fig. 11 and Supplementary Note 10).
We further inspect the functions of the intrinsic genes that distinguish different cell types. BUSseq detects 1,419 intrinsic genes at the Bayesian false discovery rate (FDR) cutoff of 0.05 (“Methods”). The gene set enrichment analysis^{48} shows that 51 Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways^{49} are enriched among the intrinsic genes (p values < 0.05) (Supplementary Note 11). The highest ranked pathway is the Hematopoietic Cell Lineage Pathway, which corresponds to the exact biological process studied in the two datasets. Among the remaining 50 pathways, 13 are related to the immune system, and another 9 are associated with cell growth and differentiation (Supplementary Table 6). Therefore, the pathway analysis demonstrates that BUSseq is able to capture the underlying true biological variability, even if the batch effects are severe, as shown in Figs. 3a and 4a.
BUSseq outperforms existing methods on pancreas data
We further studied the four scRNA-seq datasets of human pancreas cells^{50,51,52} analyzed in Haghverdi et al.^{14}. These cells were isolated from deceased organ donors with and without type 2 diabetes. As each patient has at least two pancreas cell types, alpha cells and beta cells, the four datasets follow the chain-type design. We obtained 7095 cells after quality control (Methods) and treated each dataset as a batch following Haghverdi et al.^{14}.
BUSseq recapitulates the properties of real scRNA-seq data very well in terms of the zero rates (Table 1 and Supplementary Note 9) and the mean-variance trend (Fig. 5b and Supplementary Fig. 12). In particular, the posterior predictive check shows that BUSseq fits the zero rates much better than a model that ignores dropout events, especially when scRNA-seq data are assayed by protocols that do not incorporate UMI counts, such as SMART-seq2.
We can compare the clustering results from each batch effects correction method with the cell-type labels provided by Segerstolpe et al.^{52} and Lawlor et al.^{51} (Fig. 7a, b and Supplementary Fig. 8c, d). The pancreas is highly heterogeneous and consists of two major categories of cells: islet cells and non-islet cells. Islet cells include alpha, beta, gamma, and delta cells, while non-islet cells include acinar and ductal cells. BUSseq identifies a total of eight cell types: five for islet cells, two for non-islet cells and one for the cells labeled as other. Specifically, the five islet cell types identified by BUSseq correspond to three groups of alpha cells, a group of beta cells, and a group of delta and gamma cells. The two non-islet cell types identified by BUSseq correspond exactly to the acinar and ductal cells. Compared with all of the other methods, BUSseq gives the best separation between islet and non-islet cells, as well as the best segregation within islet cells. In particular, the median silhouette coefficient by BUSseq is higher than that of any other method (Fig. 7c and Supplementary Fig. 10c).
The ARIs of all methods are 0.608 for BUSseq, 0.542 for LIGER, 0.279 for MNN, 0.527 for Scanorama, 0.282 for scVI, 0.287 for Seurat, and 0.380 for ZINB-WaVE (“Methods” and Supplementary Table 5). Thus, BUSseq outperforms all of the other methods in being consistent with the cell-type labels according to marker genes. In Fig. 7d, the locally high expression levels of marker genes for each cell type show that BUSseq correctly clusters cells according to their biological cell types.
BUSseq identifies 426 intrinsic genes at the Bayesian FDR cutoff of 5% (Methods). We conducted the gene set enrichment analysis^{48} with the KEGG pathways^{49} (Supplementary Note 11). There are 14 enriched pathways (p values < 0.05). Among them, three are diabetes pathways; two are pancreatic and insulin secretion pathways; and another two pathways are related to metabolism (Supplementary Table 7). Recall that the four datasets assayed pancreas cells from individuals with type 2 diabetes and healthy individuals; therefore, the pathway analysis once again confirms that BUSseq provides biologically and clinically valid cell typing.
BUSseq is applicable to droplet-based scRNA-seq data
We further analyzed a dataset that contains samples assayed by droplet-based scRNA-seq protocols. Comparing the performance of different methods on real scRNA-seq data is challenging due to the lack of true cell type labels in real applications. Fortunately, Tian et al.^{53} created scRNA-seq datasets with known cell type labels by profiling cells from cancer cell lines. In one experiment, they assayed three lung adenocarcinoma (LUAD) cell lines (HCC827, H1975, and H2228) on three platforms with CEL-seq2, 10x Chromium and Drop-seq protocols, respectively. As a result, a total of 1401 cells were measured in three batches. Each batch consists of all of the three cell types, and data from different batches have different levels of sparsity. Consequently, this study satisfies the complete setting, which is a special case of both the reference-panel design and the chain-type design.
We selected the top 6000 highly variable genes (HVGs) within each batch and obtained 2267 common HVGs across three batches (“Methods”). The t-SNE and PCA plots of the raw count data show that significant batch effects occur across the three protocols (Fig. 8a, b and Supplementary Fig. 13a, b). We applied BUSseq and varied the number of cell types K from 2 to 6. Although the BIC selects four cell types instead of three cell lines (Supplementary Fig. 14), two of the four identified clusters correspond to two subpopulations of the H1975 cell line (Supplementary Table 8). We further visualized the log-scale mean expression levels of intrinsic genes of the four learned cell types (Fig. 8e). The first two cell types have similar expression patterns, but some differentially expressed genes are observed between them. Moreover, the t-SNE (Fig. 8c, d) and PCA (Supplementary Fig. 13c, d) plots demonstrate the high level of similarity of the first two estimated cell types and confirm that the corrected count data \({\tilde{x}}_{big}\) obtained by BUSseq cluster cells by cell type instead of by batch (Fig. 8f).
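The gene-filtering step amounts to ranking genes within each batch and intersecting the per-batch selections. The sketch below illustrates that logic only; the variability score used here (variance of log1p counts) is a stand-in assumption, and the actual HVG procedure is the one described in “Methods”. Function names are hypothetical.

```python
import math
from statistics import pvariance


def top_variable_genes(counts, gene_names, top_n):
    """Return the top_n most variable genes in one batch.

    counts: list of cells, each a list of per-gene counts aligned with
    gene_names. ASSUMPTION: variability is scored as the variance of
    log1p-transformed counts.
    """
    scores = {
        name: pvariance([math.log1p(cell[g]) for cell in counts])
        for g, name in enumerate(gene_names)
    }
    ranked = sorted(gene_names, key=lambda name: scores[name], reverse=True)
    return set(ranked[:top_n])


def common_hvgs(batches, gene_names, top_n):
    """Genes that are among the top_n most variable in every batch."""
    per_batch = [top_variable_genes(b, gene_names, top_n) for b in batches]
    return set.intersection(*per_batch)
```

Because the intersection keeps only genes selected in every batch, the number of common HVGs can be far smaller than the per-batch cutoff, as with the 2267 genes retained here from 6000 per batch.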
We also applied the benchmarked methods to compare their clustering accuracy. The ARIs of all methods are 0.841 for BUSseq, 0.825 for LIGER, 0.650 for MNN, 0.637 for Scanorama, 0.429 for scVI, 0.324 for Seurat, and 0.398 for ZINB-WaVE. Thus, BUSseq outperforms all of the other methods in clustering accuracy. We further compared BUSseq with a recently proposed semi-supervised batch-effect-correction method, CellAssign. CellAssign requires the number of cell types and the input of a set of marker genes for each cell type. It then annotates scRNA-seq data into pre-defined or de novo cell types^{54}. To allow a fair comparison, we also set the number of cell types for BUSseq to three, the value known a priori, and the resulting ARI for BUSseq becomes 0.993. Even though CellAssign is semi-supervised whereas BUSseq is unsupervised, BUSseq outperforms CellAssign in the LUAD dataset as well (ARI for CellAssign is 0.972, Supplementary Table 9). Thus, BUSseq also works very well for scRNA-seq data with high levels of sparsity, such as those generated by droplet-based protocols.
Discussion
For the completely randomized experimental design, it seems that everyone is talking but no one is listening. Due to time and budget constraints, it is always difficult to implement a completely randomized design in practice. Consequently, researchers often pretend to be blind to the issue when carrying out their scRNA-seq experiments. In this paper, we mathematically prove and empirically show that under the more realistic reference panel and chain-type designs, batch effects can also be adjusted for scRNA-seq experiments. We hope that our results will alert researchers to confounded experimental designs and encourage them to implement valid designs for scRNA-seq experiments in real applications.
BUSseq provides onestop services. In contrast, most existing methods are multistage approaches—clustering can only be performed after the batch effects have been corrected and the differential expressed genes can only be called after the cells have been clustered. The major issue with multistage methods is that uncertainties in the previous stages are often ignored. For instance, when cells have been first clustered into different cell types and then differential gene expression identification is conducted, the clustering results are taken as if they were the underlying truth. As the clustering results may be prone to errors in practice, this can lead to false positives and false negatives. In contrast, BUSseq simultaneously corrects batch effects, clusters cell types, imputes missing data, and identifies intrinsic genes that differentiate cell types. BUSseq thus accounts for all uncertainties and fully exploits the information embedded in the data. As a result, BUSseq is able to capture subtler changes between cell types, such as the cluster corresponding to LMPP lineage that is missed by all of the stateoftheart methods.
BUSseq employs MCMC algorithm for statistical inference. Although MCMC algorithms are wellknown for heavy computation load, fortunately, the computational complexity of BUSseq is \(O(\mathop{\sum }\nolimits_{b = 1}^{B}{n}_{b}GK)\), which is both linear in the number of genes G and in the total number of cells \(N=\mathop{\sum }\nolimits_{b = 1}^{B}{n}_{b}\). Moreover, most steps of the MCMC algorithm for BUSseq are parallelizable (Supplementary Note 12). We implement a parallel multicoreCPU version and a parallel GPU version of the algorithm, respectively. Running the GPU version of the algorithm with a single core of an Intel Xeon Gold 6132 Processor and one NVIDIA Tesla P100 GPU took 0.35, 1.15, 1.5 h for the simulation, the hematopoietic, and the human pancreas data, respectively (Supplementary Table 10). Experiments show that the running time and randomaccess memory (RAM) usage are indeed linear in the number of genes G and the number of cells N for both the CPU and the GPU parallel version of BUSseq (Fig. 9 and Supplementary Note 13). Moreover, by writing the posterior samples to the hard disk every a few iterations, we can further reduce the RAM usage so that BUSseq is affordable by a commonly available cluster node rather than a highend one (Supplementary Table 11 and Supplementary Fig. 15). Compared with the time for preparing samples and conducting the scRNAseq experiments, the computation time of BUSseq is affordable and worthwhile for the accuracy.
Practical and valid experimental designs are urgently required for scRNAseq experiments. We envision that the flexible reference panel and the chaintype designs will be widely adopted in scRNAseq experiments and BUSseq will greatly facilitate the analysis of scRNAseq data.
Methods
BUSseq model
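The hierarchical model of BUSseq can be summarized as:
\({W}_{bi} \sim {\rm{Multinomial}}({\pi }_{b1},\cdots ,{\pi }_{bK}),\)
\({X}_{big} \mid {W}_{bi}=k \sim {\rm{NB}}({\mu }_{bigk},{\phi }_{bg}),\quad \log {\mu }_{bigk}={\alpha }_{g}+{\beta }_{gk}+{\nu }_{bg}+{\delta }_{bi},\)
\({Z}_{big} \mid {X}_{big}=x \sim {\rm{Bernoulli}}({p}_{big}(x)),\quad {\rm{logit}}\,{p}_{big}(x)={\gamma }_{b0}+{\gamma }_{b1}x,\)
\({Y}_{big}=0\) if \({Z}_{big}=1\), and \({Y}_{big}={X}_{big}\) if \({Z}_{big}=0\).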
Collectively, \({\bf{Y}}={\{{Y}_{big}\}}_{b = 1,\cdots ,B;i = 1,\ \cdots ,{n}_{b}}^{g = 1,\cdots ,G}\) are the observed data; the underlying expression levels \({\bf{X}}={\{{X}_{big}\}}_{b = 1,\cdots ,B;i = 1,\cdots ,{n}_{b}}^{g = 1,\cdots ,G}\), the dropout indicators \({\bf{Z}}={\{{Z}_{big}\}}_{b = 1,\cdots , B;i = 1, \cdots , {n}_{b}}^{g = 1, \cdots , G}\) and the cell type indicators \({\bf{W}}={\{{W}_{bi}\}}_{b = 1, \cdots , B;i = 1, \cdots , {n}_{b}}\) are all missing data; the logscale baseline gene expression levels \({\boldsymbol{\alpha }}={\{{\alpha }_{g}\}}_{g = 1, \cdots , G}\), the cell type effects \({\boldsymbol{\beta }}={\{{\beta }_{gk}\}}_{k = 2, \cdots , K}^{g = 1,\cdots ,G}\), the location batch effects \({\boldsymbol{\nu }}={\{{\nu }_{bg}\}}_{b = 2, \cdots , B}^{g = 1, \cdots , G}\), the overdispersion parameters \({\boldsymbol{\phi }}={\{{\phi }_{bg}\}}_{b = 1, \cdots , B}^{g = 1, \cdots , G}\), the cellspecific size factors \({\boldsymbol{\Delta }}={\{{\delta }_{bi}\}}_{b = 1, \cdots , B}^{i = 2, \cdots , {n}_{b}}\), the dropout parameters \({\boldsymbol{\Gamma }}={\{{\gamma }_{b0},{\gamma }_{b1}\}}_{b = 1, \cdots , B}\) and the cell compositions \({\boldsymbol{\pi }}={\{{\pi }_{bk}\}}_{b = 1, \cdots , B}^{k = 1, \cdots , K}\) are the parameters. Without loss of generality, for model identifiability, we assume that the first batch is the reference batch measured without batch effects with ν_{1g} = 0 for every gene and the first cell type is the baseline cell type with β_{g1} = 0 for every gene. Similarly, we take the cellspecific size factor δ_{b1} = 0 for the first cell of each batch. We gather all the parameters as Θ = {α, β, ν, ϕ, Δ, Γ, π}.
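Consequently, the observed data likelihood function becomes
\({L}_{o}({\boldsymbol{\Theta }} \mid {\bf{y}})=\mathop{\prod }\nolimits_{b = 1}^{B}\mathop{\prod }\nolimits_{i = 1}^{{n}_{b}}\left\{\mathop{\sum }\nolimits_{k = 1}^{K}{\pi }_{bk}\mathop{\prod }\nolimits_{g = 1}^{G}\Pr ({Y}_{big}={y}_{big} \mid {\boldsymbol{\Theta }},{W}_{bi}=k)\right\},\)
where, for y_{big} > 0,
\(\Pr ({Y}_{big}={y}_{big} \mid {\boldsymbol{\Theta }},{W}_{bi}=k)=\frac{1}{1+\exp ({\gamma }_{b0}+{\gamma }_{b1}{y}_{big})}{f}_{{\rm{NB}}}({y}_{big};\exp ({\alpha }_{g}+{\beta }_{gk}+{\nu }_{bg}+{\delta }_{bi}),{\phi }_{bg}),\)
for y_{big} = 0,
\(\Pr ({Y}_{big}=0 \mid {\boldsymbol{\Theta }},{W}_{bi}=k)={f}_{{\rm{NB}}}(0;\exp ({\alpha }_{g}+{\beta }_{gk}+{\nu }_{bg}+{\delta }_{bi}),{\phi }_{bg})+\mathop{\sum }\nolimits_{x = 1}^{\infty }\frac{\exp ({\gamma }_{b0}+{\gamma }_{b1}x)}{1+\exp ({\gamma }_{b0}+{\gamma }_{b1}x)}{f}_{{\rm{NB}}}(x;\exp ({\alpha }_{g}+{\beta }_{gk}+{\nu }_{bg}+{\delta }_{bi}),{\phi }_{bg}),\)
and \({f}_{{\rm{NB}}}(x;\mu ,\phi )={C}_{x}^{\phi +x-1}{(\frac{\mu }{\mu +\phi })}^{x}{(\frac{\phi }{\mu +\phi })}^{\phi }\) denotes the probability mass function of the negative binomial distribution NB(μ, ϕ). For y_{big} = 0, \({f}_{{\rm{NB}}}(0;\exp ({\alpha }_{g}+{\beta }_{gk}+{\nu }_{bg}+{\delta }_{bi}),{\phi }_{bg})\) corresponds to a biological zero, whereas \(\mathop{\sum }\nolimits_{x = 1}^{\infty }\frac{\exp ({\gamma }_{b0}+{\gamma }_{b1}x)}{1+\exp ({\gamma }_{b0}+{\gamma }_{b1}x)}{f}_{{\rm{NB}}}(x;\exp ({\alpha }_{g}+{\beta }_{gk}+{\nu }_{bg}+{\delta }_{bi}),{\phi }_{bg})\) corresponds to a dropout event.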
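As an illustration, the probability of one observed count given the cell type can be evaluated numerically. The following is a minimal sketch, not the authors' C++ or CUDA implementation: the function names and the truncation point `x_max` of the infinite sum are assumptions made for this example, and `mu` stands for \(\exp ({\alpha }_{g}+{\beta }_{gk}+{\nu }_{bg}+{\delta }_{bi})\).

```python
import math


def nb_pmf(x, mu, phi):
    """Negative binomial pmf with mean mu and overdispersion phi."""
    log_p = (math.lgamma(phi + x) - math.lgamma(phi) - math.lgamma(x + 1)
             + x * math.log(mu / (mu + phi))
             + phi * math.log(phi / (mu + phi)))
    return math.exp(log_p)


def dropout_prob(x, gamma0, gamma1):
    """Logistic dropout probability exp(t) / (1 + exp(t)), t = gamma0 + gamma1 * x."""
    t = gamma0 + gamma1 * x
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)  # numerically stable branch for very negative t
    return e / (1.0 + e)


def prob_observed(y, mu, phi, gamma0, gamma1, x_max=2000):
    """Pr(Y = y | cell type), with the infinite sum truncated at x_max (assumption)."""
    if y > 0:
        # A positive count is seen only if the read did not drop out.
        return (1.0 - dropout_prob(y, gamma0, gamma1)) * nb_pmf(y, mu, phi)
    biological_zero = nb_pmf(0, mu, phi)
    dropout = sum(dropout_prob(x, gamma0, gamma1) * nb_pmf(x, mu, phi)
                  for x in range(1, x_max + 1))
    return biological_zero + dropout
```

Summing `prob_observed` over all y returns one, because every underlying count x > 0 is either observed as itself or turned into a zero; the observed data likelihood then mixes products of these terms over the K cell types with weights π_{bk}.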
Experimental designs
By creating a set of functions similar to the probability generating function, we prove that BUSseq is identifiable not only for the complete setting but also for the reference panel and the chaintype designs; that is, if two sets of parameters are different, then their probability distribution functions for the observed data are different (see the proofs in Supplementary Note 3).
Theorem 1 (The Complete Setting) If π_{bk} > 0 for every batch b and cell type k, given that (I) γ_{b1} < 0 for every b, (II) for any two cell types k_{1} and k_{2}, there exist at least two differentially expressed genes g_{1} and g_{2}, that is, \({\beta }_{{g}_{1}{k}_{1}}\ne \ {\beta }_{{g}_{1}{k}_{2}}\) and \({\beta }_{{g}_{2}{k}_{1}}\ne \ {\beta }_{{g}_{2}{k}_{2}}\), and (III) for any two distinct celltype pairs (k_{1}, k_{2}) ≠ (k_{3}, k_{4}), their differences in celltype effects are not the same \({{\boldsymbol{\beta }}}_{{k}_{1}}-{{\boldsymbol{\beta }}}_{{k}_{2}}\ne \ {{\boldsymbol{\beta }}}_{{k}_{3}}-{{\boldsymbol{\beta }}}_{{k}_{4}}\), then BUSseq is identifiable (up to label switching) in the sense that L_{o}(Θ∣y) = L_{o}(Θ^{*}∣y) for any y implies that \({\pi }_{bk}={\pi }_{b\rho (k)}^{* },({\gamma }_{b0},{\gamma }_{b1})=({\gamma }_{b0}^{* },{\gamma }_{b1}^{* }),{\alpha }_{g}+{\beta }_{gk}={\alpha }_{g}^{* }+{\beta }_{g\rho (k)}^{* },{\nu }_{bg}={\nu }_{bg}^{* },{\delta }_{bi}={\delta }_{bi}^{* }\) and \({\phi }_{bg}={\phi }_{bg}^{* }\) for every gene g and batch b, where ρ is a permutation of {1, 2, ⋯ , K}.
In the following, we denote the cell types that are present in batch b as C_{b} and count the number of cell types existing in batch b as K_{b}= ∣C_{b}∣.
Theorem 2 (The Reference Panel Design) If there are a total of K cell types \({\cup }_{b = 1}^{B}{C}_{b}=\{1,2, \cdots , K\}\), K_{b }≥ 2 for every batch b, and there exists a batch \(\tilde{b}\) such that it contains all of the cell types \({C}_{\tilde{b}}=\{1,2, \cdots , K\}\), then given that conditions (I)–(III) hold, BUSseq is identifiable (up to label switching).
Theorem 3 (The Chaintype Design) If there are a total of K cell types \({\cup }_{b = 1}^{B}{C}_{b}=\{1,2, \cdots , K\}\) and every two consecutive batches share at least two cell types ∣C_{b} ∩ C_{b−1}∣ ≥ 2 for all b ≥ 2, then given that conditions (I)–(III) hold, BUSseq is identifiable (up to label switching).
Therefore, even for the reference panel and chaintype designs that do not assay all cell types in each batch, batch effects can be removed; cell types can be clustered; and missing data due to dropout events can be imputed. Both the reference panel design and the chaintype design belong to the more general connected design.
Theorem 4 (The Connected Design) We define a batch graph G = (V, E). Each node b ∈ V represents a batch. There is an edge e ∈ E between two nodes b_{1} and b_{2} if and only if batches b_{1} and b_{2} share at least two cell types. If the batch graph is connected and conditions (I)–(III) hold, then BUSseq is identifiable (up to label switching).
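The graph condition of Theorem 4 is easy to check for a planned experiment. The sketch below is only an illustration: the function name and the input format (one collection of cell type labels per batch) are assumptions, and it verifies the connectivity of the batch graph only, not conditions (I)–(III).

```python
from itertools import combinations


def batch_graph_is_connected(cell_types_per_batch, min_shared=2):
    """Return True if the batch graph of Theorem 4 is connected.

    Two batches are joined by an edge when they share at least
    `min_shared` cell types.
    """
    sets = [set(c) for c in cell_types_per_batch]
    n = len(sets)
    if n == 0:
        return True
    adjacency = {b: set() for b in range(n)}
    for b1, b2 in combinations(range(n), 2):
        if len(sets[b1] & sets[b2]) >= min_shared:
            adjacency[b1].add(b2)
            adjacency[b2].add(b1)
    # Depth-first search starting from the first batch.
    seen, stack = {0}, [0]
    while stack:
        b = stack.pop()
        for neighbor in adjacency[b] - seen:
            seen.add(neighbor)
            stack.append(neighbor)
    return len(seen) == n
```

For instance, a reference panel design such as `[{1, 2, 3, 4}, {1, 2}, {3, 4}]` and a chaintype design such as `[{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]` both pass the check, whereas `[{1, 2}, {2, 3}]` does not because the two batches share only one cell type.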
Statistical inference
We conduct the statistical inference under the Bayesian framework. We assign independent priors to each component of Θ as follows (Supplementary Table 3):
We are interested in detecting genes that differentiate cell types. Therefore, we impose a spikeandslab prior^{55} using a normal mixture on the celltype effect β_{gk}. The spike component concentrates on zero with a small variance \({\tau }_{\beta 0}^{2}\), whereas the slab component tends to deviate from zero, thus having a larger variance \({\tau }_{\beta 1}^{2}\). We introduce another latent variable L_{gk} to indicate which component β_{gk} comes from. L_{gk} = 0 if gene g is not differentially expressed between cell type k and cell type one, and L_{gk} = 1, otherwise. We further define \({D}_{g}=\mathop{\sum }\nolimits_{k = 2}^{K}{L}_{gk}\). If D_{g} > 0, then the expression level of gene g does not stay the same across cell types. Following Huo et al.^{13}, we call such genes intrinsic genes, which differentiate cell types. To control for multiple hypothesis testing, we let L_{gk} ~ Bernoulli(p) and assign a conjugate prior Beta(a_{p}, b_{p}) to p. We set τ_{β1} to a large number and let τ_{β0} follow an inversegamma prior Inv-Gamma(a_{τ}, b_{τ}) with a small prior mean.
We develop an MCMC algorithm to sample from the posterior distribution (Supplementary Note 2). After the burnin period, we take the mean of the posterior samples to estimate γ_{b}, α_{g}, β_{gk}, ν_{bg}, δ_{bi}, and ϕ_{bg} and use the mode of posterior samples of W_{bi} to infer the cell type for each cell.
We also implemented an ExpectationMaximization (EM) algorithm^{56} for a simplified version of the BUSseq model. Unfortunately, consistent with the literature^{57,58}, we found that inference by the EM algorithm can be very sensitive to small perturbations of the observed data and to the initial values. Thus, we choose to use the MCMC algorithm for inference. An extra benefit of the MCMC algorithm is that it not only provides point estimates but also explores the entire posterior distribution and hence allows users to quantify the uncertainty of the estimates.
Identification of intrinsic genes
When inferring the differential expression indicator L_{gk}, we control the Bayesian FDR^{37} defined as
\({\rm{FDR}}(\kappa )=\frac{\mathop{\sum }\nolimits_{g = 1}^{G}\mathop{\sum }\nolimits_{k = 2}^{K}{\xi }_{gk}I({\xi }_{gk}\le \kappa )}{\mathop{\sum }\nolimits_{g = 1}^{G}\mathop{\sum }\nolimits_{k = 2}^{K}I({\xi }_{gk}\le \kappa )},\)
where ξ_{gk} = Pr(L_{gk} = 0∣y) is the posterior marginal probability that gene g is not differentially expressed between cell type k and cell type one, which can be estimated by the T posterior samples \({L}_{gk}^{(t)}\)s collected after the burnin period as \(\frac{1}{T}\mathop{\sum }\nolimits_{t = 1}^{T}(1-{L}_{gk}^{(t)})\). Given a control level α such as 0.1, we search for the largest κ_{0} ≤ 0.5 such that the estimated \(\widehat{{\rm{FDR}}}(\kappa )\) based on \({\widehat{\xi }}_{gk}\)s is smaller than α and declare \({\widehat{L}}_{gk}=1\) if \({\widehat{\xi }}_{gk}\le {\kappa }_{0}\). The upper bound 0.5 for κ_{0} prevents us from calling differentially expressed genes with small posterior probability Pr(L_{gk} = 1∣y). Consequently, we identify the genes with \({\widehat{D}}_{g}=\mathop{\sum }\nolimits_{k = 2}^{K}{\widehat{L}}_{gk}> 0\) as the intrinsic genes. We set α = 0.05 in both the simulation study and the real applications. Here, we follow Huo et al.^{13} to define intrinsic genes as genes that are differentially expressed between at least two cell types. In contrast, marker genes are genes that feature certain cell types according to the literature. For example, in the pancreas study, GCG gene is known to be highly expressed in alpha islet cells, so this gene often serves as a marker to label alpha islet cells^{51}.
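The thresholding rule described above can be sketched as follows. This is an illustrative reading of the procedure rather than the released code: the function names and the dictionary input keyed by (gene, cell type) are assumptions. The estimated FDR at a threshold κ is taken to be the average of the \({\widehat{\xi }}_{gk}\)s not exceeding κ, which is nondecreasing in κ, so a single pass over the sorted values suffices.

```python
def estimate_xi(indicator_samples):
    """Estimate Pr(L_gk = 0 | y) from posterior samples of L_gk (0 or 1)."""
    return sum(1 - l for l in indicator_samples) / len(indicator_samples)


def call_de_pairs(xi, alpha=0.05, kappa_max=0.5):
    """Find the largest threshold kappa0 <= kappa_max with estimated FDR < alpha.

    xi maps (gene, cell_type) to the estimated posterior probability of
    no differential expression. Returns (kappa0, called pairs).
    """
    values = sorted(p for p in xi.values() if p <= kappa_max)
    kappa0, running_sum = None, 0.0
    for i, p in enumerate(values):
        running_sum += p
        last_of_tie = i + 1 == len(values) or values[i + 1] > p
        if last_of_tie:
            if running_sum / (i + 1) < alpha:
                kappa0 = p  # threshold still satisfies the FDR bound
            else:
                break       # the running mean only grows from here on
    called = set()
    if kappa0 is not None:
        called = {key for key, p in xi.items() if p <= kappa0}
    return kappa0, called


def intrinsic_genes(called):
    """Genes declared differentially expressed for at least one cell type."""
    return {gene for gene, _ in called}
```

With `xi = {("g1", 2): 0.01, ("g2", 2): 0.02, ("g4", 2): 0.04, ("g3", 2): 0.3}` and α = 0.05, the first three pairs are called, because adding the fourth would raise the estimated FDR to about 0.09.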
Convergence of the MCMC algorithm
To rigorously assess the convergence of the Markov chain, we adopt the EPSR factors criterion^{59} (Supplementary Note 14). We are interested in the logscale baseline expression level {α_{g}, g = 1, 2, ⋯ , G}, the cell type effects {β_{gk}, g = 1, 2, ⋯ , G, k = 2, 3, ⋯ , K}, the location batch effects {ν_{bg}, g = 1, 2, ⋯ , G, b = 2, 3, ⋯ , B} and the overdispersion parameters {ϕ_{bg}, g = 1, 2, ⋯ , G, b = 1, 2, ⋯ , B}. To avoid the impact of label switching of cell types (Supplementary Fig. 16 and Supplementary Note 14), we consider the logscale celltypespecific expression level θ_{gk} = α_{g} + β_{gk}, g = 1, 2, ⋯ , G, k = 1, 2, ⋯ , K and match the cell type indicators in different chains such that most cells in the different chains are assigned to the same cell types. If the EPSR factors of most parameters are close to one, we treat the posterior sampling as having reached stationarity. Thus, we use the following rule to diagnose the convergence of the MCMC algorithm for BUSseq:
 1. More than 80% of {EPSR(θ_{gk})} are <1.3;
 2. More than 80% of {EPSR(ν_{bg})} are <1.3;
 3. More than 80% of {EPSR(ϕ_{bg})} are <1.3.
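The diagnostic rule can be sketched in a few lines. The exact EPSR definition used in this paper is given in Supplementary Note 14; the version below assumes the standard potential scale reduction factor of Gelman and Rubin computed from several chains of equal length, and the function names are hypothetical.

```python
import statistics


def epsr(chains):
    """Potential scale reduction factor for one scalar parameter.

    chains: list of equally long lists of post burnin samples, one per chain.
    Assumes the standard Gelman and Rubin form.
    """
    n = len(chains[0])
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between_over_n = statistics.variance([statistics.fmean(c) for c in chains])
    pooled = (n - 1) / n * within + between_over_n
    return (pooled / within) ** 0.5


def has_converged(epsr_by_group, threshold=1.3, min_fraction=0.8):
    """Apply the rule: in every group, more than 80% of factors are below 1.3."""
    return all(
        sum(r < threshold for r in values) / len(values) > min_fraction
        for values in epsr_by_group.values()
    )
```

Here `epsr_by_group` would hold one list of factors for each of θ_{gk}, ν_{bg}, and ϕ_{bg}, computed after the cell type labels have been matched across chains.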
Implementation of the MCMC algorithm
In the simulation study, we ran the MCMC algorithm for 4000 iterations and discarded the first 2000 iterations as burnins. In the three real data analyses, we ran BUSseq for 8000 iterations and discarded the first 4000 iterations as burnins. Both the estimated potential scale reduction (EPSR) factors (Supplementary Table 12) and the acceptance rates of the Metropolis steps of the MCMC algorithm (Supplementary Tables 13 and 14, and Supplementary Note 14) demonstrate that the Markov chain has converged with good mixing.
Selection of cell type numbers
BUSseq allows the user to input the total number of cell types K according to prior knowledge. When K is unknown, BUSseq selects the number of cell types \(\widehat{K}\) such that it achieves the minimum BIC^{31}. BIC adds a penalty term to the observed data loglikelihood \({L}_{o}(\widehat{{\boldsymbol{\Theta }}} \mid {\bf{y}})\) as Eq. (1).
where \(\widehat{{\boldsymbol{\Theta }}}=(\widehat{{\boldsymbol{\alpha }}},\widehat{{\boldsymbol{\beta }}},\widehat{{\boldsymbol{\gamma }}},\widehat{{\boldsymbol{\nu }}},\widehat{{\boldsymbol{\phi }}},\widehat{{\boldsymbol{\delta }}},\widehat{{\boldsymbol{\pi }}})\) denotes the posterior mean of parameters. As a result, the penalty in BIC helps the model selection to balance between goodnessoffit and the model complexity (Supplementary Figs. 17–19, Supplementary Tables 15 and 16, and Supplementary Note 15).
Inference of dropout events
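In the BUSseq model, a dropout event occurs for gene g in cell i of batch b if the observed value y_{big} = 0 but the imputed count data \({\widehat{x}}_{big}> 0\). The identification allows us to calculate the frequency of dropout events in each batch. We calculate the zero rate of each batch as follows:
\({\rm{zero}}\ {\rm{rate}}_{b}=\frac{\mathop{\sum }\nolimits_{i = 1}^{{n}_{b}}\mathop{\sum }\nolimits_{g = 1}^{G}I({y}_{big}=0)}{{n}_{b}G},\)
and compute the dropout rate as the proportion of dropout events among the observations with zero counts:
\({\rm{dropout}}\ {\rm{rate}}_{b}=\frac{\mathop{\sum }\nolimits_{i = 1}^{{n}_{b}}\mathop{\sum }\nolimits_{g = 1}^{G}I({y}_{big}=0,{\widehat{x}}_{big}> 0)}{\mathop{\sum }\nolimits_{i = 1}^{{n}_{b}}\mathop{\sum }\nolimits_{g = 1}^{G}I({y}_{big}=0)}.\)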
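Both rates follow directly from the observed counts and the imputed counts of one batch. The sketch below is a plain illustration with a hypothetical function name, taking the two matrices as lists of rows (cells) over the same genes.

```python
def zero_and_dropout_rates(y, x_hat):
    """Zero rate and dropout rate of one batch.

    y: observed counts, x_hat: imputed counts, both as lists of rows
    (one row per cell, one column per gene).
    """
    n_entries = n_zero = n_dropout = 0
    for y_row, x_row in zip(y, x_hat):
        for y_val, x_val in zip(y_row, x_row):
            n_entries += 1
            if y_val == 0:
                n_zero += 1
                if x_val > 0:
                    n_dropout += 1  # zero observed, but expression imputed
    zero_rate = n_zero / n_entries
    dropout_rate = n_dropout / n_zero if n_zero else 0.0
    return zero_rate, dropout_rate
```

For example, with `y = [[0, 3, 0], [2, 0, 0]]` and `x_hat = [[0, 3, 5], [2, 1, 0]]`, four of the six entries are zero and two of those four are dropout events, giving a zero rate of 2/3 and a dropout rate of 1/2.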
Posterior predictive check
We evaluate how well BUSseq fits the data via posterior predictive checks^{60}. In particular, we focus on the zero rates. In the posterior predictive check, we take MCMC samples of all the parameters after the burnin iterations to simulate replicated datasets \({Y}_{j}^{rep},j=1,2, \cdots , J\) for G genes and \(N=\mathop{\sum }\nolimits_{b = 1}^{B}{n}_{b}\) cells, where J denotes the total number of collected iterations after burnins. In our real data analyses, we ran 8000 iterations with the first 4000 iterations as burnins, so we generated J = 8000 − 4000 = 4000 replicated datasets for both the hematopoietic and Pancreas studies. For each generated replicate dataset, we calculated the zero rates of each batch according to Eq. (2). Finally, we average the zero rates over all the J iterations to calculate the posterior mean \({\widehat{\rho }}_{0}\) of the zero rate of each batch and compare it with the corresponding observed zero rate. Moreover, we also compare BUSseq with a reduced model of BUSseq which ignores dropout events and hence uses negative binomial distribution without zero inflation, abbreviated as BUSseqnzf (Supplementary Note 9), via the posterior predictive check (Supplementary Note 16).
Batcheffectscorrected values
To facilitate further downstream analysis, we also provide a version of count data \(\widetilde{{\bf{X}}}={\{{\widetilde{X}}_{big}\}}_{b = 1, \cdots , B;i = 1, \cdots , {n}_{b}}^{g = 1, \cdots , G}\) for which the batch effects are removed and the biological variability is retained similar to that of ComBat^{7}. Ideally, if x_{big} is the αth percentile of \({\rm{NB}}(\exp ({\widehat{\alpha }}_{g}+{\widehat{\beta }}_{g{\widehat{w}}_{bi}}+{\widehat{\nu }}_{bg}+{\widehat{\delta }}_{bi}),{\widehat{\phi }}_{bg})\), we aim to take the αth percentile of \({\rm{NB}}(\exp ({\widehat{\alpha }}_{g}+{\widehat{\beta }}_{g{\widehat{w}}_{bi}}),{\widehat{\phi }}_{bg})\) as the corrected value \({\tilde{x}}_{big}\). However, the negative binomial distribution is a discrete distribution. As a result, several \(\tilde{x}\)s can lie between the Pr(x ≤ x_{big} − 1)percentile and Pr(x ≤ x_{big})percentile of the distribution of \({\widetilde{X}}_{big}\). For example, if \({X}_{big} \sim {\rm{NB}}(\exp (2),3)\), \({\widetilde{X}}_{big} \sim {\rm{NB}}(\exp (3),5)\), and our observed value x_{big} = 8, then Pr(x_{big} ≤ 7) and Pr(x_{big} ≤ 8) correspond to the 58.67th and 65.76th percentiles of \({\rm{NB}}(\exp (2),3)\). However, three numbers—21, 22, and 23—lie between 58.67th and 65.76th percentile of \({\rm{NB}}(\exp (3),5)\). Thus, to avoid bias, we draw one number uniformly from 21, 22, and 23 rather than take the maximum or the minimum to calculate \({\widetilde{x}}_{big}\).
Thus, we develop a quantile matching approach based on inverse sampling. Specifically, given the fitted model and the inferred underlying expression level \({\widehat{x}}_{big}\), we first sample u_{big} from \({\rm{Unif}}[{F}_{{\rm{NB}}}({\widehat{x}}_{big}-1;\exp ({\widehat{\alpha }}_{g}+{\widehat{\beta }}_{g{\widehat{w}}_{bi}}+{\widehat{\nu }}_{bg}+{\widehat{\delta }}_{bi}),{\widehat{\phi }}_{bg}),{F}_{{\rm{NB}}}({\widehat{x}}_{big};\exp ({\widehat{\alpha }}_{g}+{\widehat{\beta }}_{g{\widehat{w}}_{bi}}+{\widehat{\nu }}_{bg}+{\widehat{\delta }}_{bi}),{\widehat{\phi }}_{bg})]\) where Unif[a, b] denotes the uniform distribution on the interval [a, b] and F_{NB}( ⋅ ; μ, r) denotes the cumulative distribution function of a negative binomial distribution with mean μ and overdispersion parameter r. Next, we calculate the \({u}_{big}\)th quantile of \({\rm{NB}}(\exp ({\widehat{\alpha }}_{g}+{\widehat{\beta }}_{g{\widehat{w}}_{bi}}),{\widehat{\phi }}_{1g})\) as the corrected value \({\widetilde{x}}_{big}\).
In the corrected data \(\widetilde{{\bf{X}}}\), the batch effects are removed and the missing data due to dropout events are imputed. Moreover, further cellspecific normalization is not needed. Meanwhile, the biological variability is retained thanks to the quantile transformation and sampling step. Therefore, we can directly perform downstream analysis on \(\widetilde{{\bf{X}}}\).
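The quantile matching step can be sketched for a single count as follows. This is a minimal illustration in pure Python, not the authors' implementation: the function names and the safety cap `x_cap` are assumptions, `mu_batch` stands for the fitted mean including the batch effect and the size factor, and `mu_ref` for the mean without them.

```python
import math
import random


def nb_pmf(x, mu, phi):
    """Negative binomial pmf with mean mu and overdispersion phi."""
    log_p = (math.lgamma(phi + x) - math.lgamma(phi) - math.lgamma(x + 1)
             + x * math.log(mu / (mu + phi))
             + phi * math.log(phi / (mu + phi)))
    return math.exp(log_p)


def nb_cdf(x, mu, phi):
    """Pr(X <= x); returns 0.0 for x < 0."""
    return sum((nb_pmf(v, mu, phi) for v in range(0, x + 1)), 0.0)


def nb_quantile(u, mu, phi, x_cap=10**6):
    """Smallest x with Pr(X <= x) >= u (search capped at x_cap)."""
    x, cumulative = 0, nb_pmf(0, mu, phi)
    while cumulative < u and x < x_cap:
        x += 1
        cumulative += nb_pmf(x, mu, phi)
    return x


def corrected_count(x_hat, mu_batch, phi_batch, mu_ref, phi_ref, rng=random):
    """Map an imputed count to its batch-corrected value by quantile matching."""
    lower = nb_cdf(x_hat - 1, mu_batch, phi_batch)
    upper = nb_cdf(x_hat, mu_batch, phi_batch)
    u = rng.uniform(lower, upper)  # random quantile within the count's cdf step
    return nb_quantile(u, mu_ref, phi_ref)
```

Calling `corrected_count(8, math.exp(2), 3, math.exp(3), 5)` reproduces the worked example in the text: the sampled quantile lies between roughly the 58.67th and 65.76th percentiles, so the corrected value is 21, 22, or 23.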
Preprocessing of the real datasets
Gene filtering
A common practice of scRNAseq data analysis is to focus on the set of HVGs^{14,17,18,24,25} or the genes with the high mean expression levels across cells^{61}. Although BUSseq is robust to gene filtering strategies in real studies (Supplementary Tables 17 and 18), comparing the ARIs resulting from the two gene filtering strategies, we recommend selecting HVGs in preprocessing (Supplementary Note 17). An intrinsic gene that well distinguishes cell types may be highly expressed in one cell type but lowly expressed in other cell types. As a result, its mean expression level across all of the cells may be low, and hence such a gene will be missed by filtering according to mean expression levels. Conversely, filtering genes according to mean expression levels is likely to select genes whose expression levels are high but remain the same across all of the cell types. Unfortunately, such genes can provide very limited information for differentiating cell types. We therefore select HVGs for the downstream analysis in real data analyses.
Hematopoietic study
For the two hematopoietic datasets, we downloaded the read count matrix of the 1920 cells profiled by Paul et al.^{44} and the 2729 cells labeled as myeloid progenitor cells by Nestorowa et al.^{43} from the NCBI Gene Expression Omnibus (GEO) with the accession numbers GSE72857 and GSE81682. Following Brennecke et al.^{62}, we first labeled cells using FACS labels and then performed the size factor normalization within each batch. Next, we selected the common HVGs identified by Nestorowa et al.^{43} between the two datasets. These HVGs were denoted by Ensembl ID. The genes in the GSE81682 dataset were named by Ensembl ID, but the genes in the GSE72857 dataset were named by Gene Symbol. The R package biomaRt was used to query the corresponding Gene Symbol by Ensembl ID. Finally, we obtained 3,470 common HVGs shared by the two datasets.
Pancreas study
Two of the pancreas datasets profiled by the CELseq2 platform were downloaded from GEO with accession numbers GSE81076^{50} and GSE85241^{63}. The two datasets assayed by the SMARTseq2 platform were obtained from GSE86473^{51} and from ArrayExpress accession number EMTAB5061^{52}. Following Haghverdi et al.^{14}, we excluded cells with low library sizes (<100,000 reads), low numbers of expressed genes (>40% total counts from ribosomal RNA genes), or high ERCC content (>20% of total counts from spikein transcripts) resulting in 7095 cells. We selected the 2480 HVGs shared by the four datasets according to Brennecke et al.^{62} by sorting the ratio of variance and mean expression level after adjusting technical noise with the variances of spikein transcripts. GSE86473 and EMTAB5061 have the cell type labels for all of the cells, but the cell type labels of GSE81076 and GSE85241 were inferred by the marker genes used in the original publications by Lawlor et al.^{51} and Grün et al.^{50}.
To assign cell type labels for the GSE81076 and GSE85241 datasets, following Haghverdi et al.^{14}, we first extracted the normalized expression levels of the selected HVGs within each dataset. Next, we obtained the low dimensional embedding of HVGs by tSNE for visualization. At the same time, we applied robust kmeans clustering to the normalized expression levels of the selected HVGs using the pam function in the R package cluster. The number of clusters was set as nine. Next, we drew tSNE plots colored by the expression levels of the marker genes. It is known that GCG is highly expressed in alpha islet cells, INS in beta islet cells, SST in delta islet cells, PPY in gamma islet cells (pancreatic polypeptide cells), PRSS1 in acinar cells, KRT19 in ductal cells and COL1A1 in mesenchymal cells^{50,51}, so we labeled each cluster by its corresponding highly expressed marker gene.
LUAD cancer cell line study
We downloaded the raw count data from the GitHub repository https://github.com/LuyiTian/sc_mixology with accession number GSE118767. We selected the top 6000 HVGs within each batch using the trendVar and decomposeVar functions in the R package scran^{64} and obtained 2,267 common HVGs across three batches (Supplementary Note 15).
Naming clusters learned by BUSseq according to FACS labels
In the two real data examples, we first identify the cell type of each individual cell according to FACS labeling. Then, for each cluster learned by BUSseq, we calculate the proportion of labeled cell types. If a cell type accounts for more than onethird of the cells in a given cluster, we assign this cell type to the cluster. Although a cluster may be assigned more than one cell type, most identified clusters by BUSseq are dominated by only one cell type. For example, in the hematopoietic study, BUSseq identifies 1165 cells for Cluster 5. According to FACS labels, 1127 of the 1165 cells are megakaryocyteerythrocyte progenitors (MEP). Therefore, we name Cluster 5 as MEP.
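The naming rule can be written compactly. The following sketch uses a hypothetical function name and assumes one cluster assignment and one FACS label per cell; it is an illustration of the one third rule rather than the authors' script.

```python
from collections import Counter


def name_clusters(cluster_ids, facs_labels, min_fraction=1 / 3):
    """Assign to each cluster every FACS label covering more than one third of its cells."""
    counts_by_cluster = {}
    for cluster, label in zip(cluster_ids, facs_labels):
        counts_by_cluster.setdefault(cluster, Counter())[label] += 1
    names = {}
    for cluster, counts in counts_by_cluster.items():
        total = sum(counts.values())
        names[cluster] = sorted(
            label for label, n in counts.items() if n / total > min_fraction
        )
    return names
```

Applied to the example above, a cluster with 1127 MEP cells among 1165 cells receives the single name MEP, whereas a cluster split 40%, 40%, and 20% among three labels would receive the first two.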
Mapping clusters to haemopedia
Haemopedia is a database of gene expression profiles from diverse types of hematopoietic cells^{46}. It collected flow sorted cell populations from healthy mice.
To understand Cluster 3 learned by BUSseq for the hematopoietic data, which is dominated by cells classified as other according to the FACS labeling, we mapped the cluster means learned by BUSseq to the Haemopedia RNAseq dataset.
We first applied TMM normalization^{35} to all the samples in the Haemopedia RNAseq dataset. Then, we extracted seven types of hematopoietic stem and progenitor cells from Haemopedia, including Lin^{−}Sca1^{+}cKit^{+} cells, shortterm hematopoietic stem cells, MPP, CLP, CMP, MEP, and GMP. Each selected cell type had two RNAseq samples in Haemopedia, so we averaged over the two replicates for each cell type. Further, we added one to the normalized expression levels as a pseudo read count to handle genes with zero read count and logtransformed the data. Finally, we scaled the data across the seven cell types for each gene. To be comparable, we transformed the cluster mean learned by BUSseq as \({m}_{gk}=\mathrm{log}\,(1+\exp ({\alpha }_{g}+{\beta }_{gk}))\) for gene g in the cluster k and scaled m_{gk} across all cell types as well. We then calculated the correlation between the cluster means inferred by BUSseq and the reference expression profiles in Haemopedia for 37 marker genes. The 37 marker genes were retrieved from Paul et al.^{44} (31 marker genes for HSPC) and Herman et al.^{45} (6 marker genes for LMPP).
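The transformation and comparison of cluster means can be sketched as below. Two points are assumptions of this example rather than statements from the text: scaling is taken to mean centering and dividing by the standard deviation for each gene, and the correlation is taken to be the Pearson correlation. The function names are hypothetical.

```python
import math
import statistics


def cluster_means(alpha, beta):
    """m_gk = log(1 + exp(alpha_g + beta_gk)); beta[g][0] is 0 for the baseline type."""
    return [[math.log1p(math.exp(a + b)) for b in beta_g]
            for a, beta_g in zip(alpha, beta)]


def scale_rows(matrix):
    """Center and scale each gene (row) across cell types (assumed z-score)."""
    scaled = []
    for row in matrix:
        mean, sd = statistics.fmean(row), statistics.stdev(row)
        scaled.append([(v - mean) / sd for v in row])
    return scaled


def pearson(a, b):
    """Pearson correlation between two equally long profiles."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)
```

A cluster profile is then the column of the scaled matrix for that cluster, restricted to the marker genes, and it is correlated with the corresponding column of each reference cell type.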
Software availability
The C++ source code of the parallel multicoreCPU version of BUSseq is available on GitHub https://github.com/songfd2018/BUSseq1.0, and the CUDA C source code of the GPU version of BUSseq is available on GitHub https://github.com/Anguscgm/BUSseq_gpu. All code for producing the results and figures in this manuscript is also available on GitHub (https://github.com/songfd2018/BUSseq1.1_implementation). Furthermore, we wrap the C++ source code as an R package, BUSseq (https://github.com/songfd2018/BUSseqRpackage).
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
The published datasets used in this manuscript are available through the following accession numbers: SMARTseq2 platform hematopoietic data with GEO GSE81682 by Nestorowa et al.^{43}; MARSseq platform hematopoietic data with GEO GSE72857 by Paul et al.^{44}; CELseq platform pancreas data with GEO GSE81076 by Grün et al.^{50}; CELseq2 platform pancreas data with GEO GSE85241 by Muraro et al.^{63}; SMARTseq2 platform pancreas data with GEO GSE86473 by Lawlor et al.^{51}; and SMARTseq2 platform pancreas data with ArrayExpress EMTAB5061 by Segerstolpe et al.^{52}; human lung adenocarcinoma cell line data with GEO GSE118767 by Tian et al.^{65}. The parameter settings for the simulation study and the simulated data are available on GitHub (https://github.com/songfd2018/BUSseq1.1_implementation).
References
 1.
Bacher, R. & Kendziorski, C. Design and computational analysis of singlecell RNAsequencing experiments. Genome Biol. 17, 63 (2016).
 2.
Irizarry, R. A. et al. Multiplelaboratory comparison of microarray platforms. Nat. Methods 2, 345–350 (2005).
 3.
Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in highthroughput data. Nat. Rev. Genet. 11, 733–739 (2010).
 4.
Taub, M. A., CorradaBravo, H. & Irizarry, R. A. Overcoming bias and systematic errors in next generation sequencing data. Genome Med. 2, 87 (2010).
 5.
Hicks, S. C., Townes, F. W., Teng, M. & Irizarry, R. A. Missing data and technical variability in singlecell RNAsequencing experiments. Biostatistics 19, 562–578 (2018).
 6.
Kharchenko, P. V., Silberstein, L. & Scadden, D. T. Bayesian approach to singlecell differential expression analysis. Nat. Methods 11, 740 (2014).
 7.
Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
 8.
Leek, J. T. & Storey, J. D. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3, e161 (2007).
 9.
Leek, J. T. Svaseq: removing batch effects and other unwanted noise from sequencing data. Nucleic Acids Res. 42, e161–e161 (2014).
 10.
Risso, D., Ngai, J., Speed, T. P. & Dudoit, S. Normalization of RNAseq data using factor analysis of control genes or samples. Nat. Biotechnol. 32, 896 (2014).
 11.
Jacob, L., GagnonBartsch, J. A. & Speed, T. P. Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed. Biostatistics 17, 16–28 (2015).
 12.
Lin, Y. et al. Evaluating stably expressed genes in single cells. GigaScience 8, giz106 (2019).
 13.
Huo, Z., Ding, Y., Liu, S., Oesterreich, S. & Tseng, G. Metaanalytic framework for sparse kmeans to identify disease subtypes in multiple transcriptomic studies. J. Am. Stat. Assoc. 111, 27–42 (2016).
 14.
Haghverdi, L., Lun, A. T., Morgan, M. D. & Marioni, J. C. Batch effects in singlecell RNAsequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421 (2018).
 15.
Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous singlecell transcriptomes using Scanorama. Nat. Biotechnol. 37, 685 (2019).
 16.
Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating singlecell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411 (2018).
 17.
Stuart, T. et al. Comprehensive integration of singlecell data. Cell 177, 1888–1902.e21 (2019).
 18.
Welch, J. D. et al. Singlecell multiomic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887.e17 (2019).
 19.
Lin, Y. et al. scmerge leverages factor analysis, stable expression, and pseudoreplication to merge multiple singlecell RNAseq datasets. Proc. Natl Acad. Sci. USA. 116, 9775–9784 (2019).
 20.
Luo, X. & Wei, Y. Batch effects correction with unknown subtypes. J. Am. Stat. Assoc. 114, 581–594 (2019).
 21.
Vallejos, C. A., Marioni, J. C. & Richardson, S. BASiCS: Bayesian analysis of singlecell sequencing data. PLoS Comput. Biol. 11, e1004333 (2015).
 22.
Wang, J. et al. Gene expression distribution deconvolution in singlecell RNA sequencing. Proc. Natl Acad. Sci. USA. 115, E6437–E6446 (2018).
 23.
Pierson, E. & Yau, C. Zifa: dimensionality reduction for zeroinflated singlecell gene expression analysis. Genome Biol. 16, 241 (2015).
 24.
Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J.P. A general and flexible method for signal extraction from singlecell RNAseq data. Nat. Commun. 9, 284 (2018).
 25.
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for singlecell transcriptomics. Nat. Methods 15, 1053 (2018).
 26.
Wang, J. et al. Data denoising with transfer learning in singlecell transcriptomics. Nat. Methods 16, 875–878 (2019).
 27.
BaranGale, J., Chandra, T. & Kirschner, K. Experimental design for singlecell RNA sequencing. Brief. Funct Genom. 17, 233–239 (2017).
 28.
Dal, M. A. & Di, C. B. How to design a singlecell RNAsequencing experiment: pitfalls, challenges and perspectives. Brief. Bioinform. 20, 1384–1394 (2018).
 29.
Tierney, L. Markov chains for exploring posterior distributions. Ann. Stat. 22, 1701–1728 (1994).
 30.
Robert, C., Casella, G. Monte Carlo Statistical Methods (Springer Science, Business Media, 2013).
 31.
Schwarz, G. et al. Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978).
 32.
Casella, G., Berger, R. L. Statistical Inference, vol. 2 (Duxbury Pacific Grove, CA, 2002).
 33.
Miao, W., Ding, P. & Geng, Z. Identifiability of normal and normal mixture models with nonignorable missing data. J. Am. Stat. Assoc. 111, 1673–1683 (2016).
 34.
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
 35.
Robinson, M. D. & Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11, R25 (2010).
 36.
Lun, A. T., Bach, K. & Marioni, J. C. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17, 75 (2016).
 37.
Newton, M. A., Noueiry, A., Sarkar, D. & Ahlquist, P. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics 5, 155–176 (2004).
 38.
Peterson, C., Stingo, F. C. & Vannucci, M. Bayesian inference of multiple Gaussian graphical models. J. Am. Stat. Assoc. 110, 159–174 (2015).
 39.
Huang, M. et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat. Methods 15, 539 (2018).
 40.
Gong, W., Kwak, I.-Y., Pota, P., Koyano-Nakagawa, N. & Garry, D. J. DrImpute: imputing dropout events in single cell RNA sequencing data. BMC Bioinform. 19, 220 (2018).
 41.
Li, W. V. & Li, J. J. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nat. Commun. 9, 1–9 (2018).
 42.
Rand, W. M. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66, 846–850 (1971).
 43.
Nestorowa, S. et al. A single cell resolution map of mouse haematopoietic stem and progenitor cell differentiation. Blood 128, e20–e31 (2016).
 44.
Paul, F. et al. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell 163, 1663–1677 (2015).
 45.
Herman, J. S., Sagar & Grün, D. FateID infers cell fate bias in multipotent progenitors from single-cell RNA-seq data. Nat. Methods 15, 379 (2018).
 46.
Choi, J. et al. Haemopedia RNA-seq: a database of gene expression during haematopoiesis in mice and humans. Nucleic Acids Res. 47, D780–D785 (2018).
 47.
Zhu, L., Lei, J., Klei, L., Devlin, B. & Roeder, K. Semisoft clustering of single-cell data. Proc. Natl Acad. Sci. USA 116, 466–471 (2019).
 48.
Huang, D. W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44 (2009).
 49.
Kanehisa, M. & Goto, S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000).
 50.
Grün, D. et al. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell 19, 266–277 (2016).
 51.
Lawlor, N. et al. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes. Genome Res. 27, 208–222 (2017).
 52.
Segerstolpe, Å. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 24, 593–607 (2016).
 53.
Tian, L. et al. Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nat. Methods 16, 479–487 (2019).
 54.
Zhang, A. W. et al. Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling. Nat. Methods 16, 1007–1015 (2019).
 55.
George, E. I. & McCulloch, R. E. Variable selection via Gibbs sampling. J. Am. Stat. Assoc. 88, 881–889 (1993).
 56.
Dempster, A. P., Laird, N. M. & Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Series B (Methodological) 39, 1–22 (1977).
 57.
Willson, L., Folks, J. & Young, J. Complete sufficiency and maximum likelihood estimation for the two-parameter negative binomial distribution. Metrika 33, 349–362 (1986).
 58.
Saha, K. & Paul, S. Bias-corrected maximum likelihood estimator of the negative binomial dispersion parameter. Biometrics 61, 179–185 (2005).
 59.
Gelman, A. et al. Bayesian Data Analysis (Chapman and Hall/CRC, 2013).
 60.
Gelman, A., Meng, X.-L. & Stern, H. Posterior predictive assessment of model fitness via realized discrepancies. Stat. Sin. 6, 733–760 (1996).
 61.
Duò, A., Robinson, M. D. & Soneson, C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Research 7, 1141 (2018).
 62.
Brennecke, P. et al. Accounting for technical noise in single-cell RNA-seq experiments. Nat. Methods 10, 1093 (2013).
 63.
Muraro, M. J. et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst. 3, 385–394 (2016).
 64.
Lun, A. T., McCarthy, D. J. & Marioni, J. C. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Research 5, 2122 (2016).
 65.
Tian, L. et al. scPipe: a flexible R/Bioconductor preprocessing pipeline for single-cell RNA-sequencing data. PLoS Comput. Biol. 14, e1006361 (2018).
Acknowledgements
This work was supported by the Hong Kong Ph.D. Fellowship PF1517417 and the General Research Funds 14306417 and 14305319 from the Hong Kong Research Grants Council of the Hong Kong Special Administrative Region of the People’s Republic of China and Direct Grants from the Research Committee of the Chinese University of Hong Kong. We acknowledge Dr. Xiangyu Luo for helpful comments on an early version of our paper.
Author information
Contributions
F.S. developed the method and the proof, implemented the algorithm, prepared the software package, analyzed the data, and wrote the paper. G.M.A.C. implemented the algorithm and analyzed the data. Y.W. conceived and supervised the study, developed the method and the proof, and wrote the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Communications thanks F. William Townes and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Song, F., Chan, G.M.A. & Wei, Y. Flexible experimental designs for valid single-cell RNA-sequencing experiments allowing batch effects correction. Nat Commun 11, 3274 (2020). https://doi.org/10.1038/s41467-020-16905-2