Flexible experimental designs for valid single-cell RNA-sequencing experiments allowing batch effects correction

Song, Fangda; Chan, Ga Ming Angus; Wei, Yingying

doi:10.1038/s41467-020-16905-2

Article
Open access
Published: 01 July 2020

Nature Communications volume 11, Article number: 3274 (2020)

Abstract

Despite their widespread applications, single-cell RNA-sequencing (scRNA-seq) experiments are still plagued by batch effects and dropout events. Although the completely randomized experimental design has frequently been advocated to control for batch effects, it is rarely implemented in real applications due to time and budget constraints. Here, we mathematically prove that under two more flexible and realistic experimental designs—the reference panel and the chain-type designs—true biological variability can also be separated from batch effects. We develop Batch effects correction with Unknown Subtypes for scRNA-seq data (BUSseq), which is an interpretable Bayesian hierarchical model that closely follows the data-generating mechanism of scRNA-seq experiments. BUSseq can simultaneously correct batch effects, cluster cell types, impute missing data caused by dropout events, and detect differentially expressed genes without requiring a preliminary normalization step. We demonstrate that BUSseq outperforms existing methods with simulated and real data.

Introduction

Single-cell RNA-sequencing (scRNA-seq) technologies enable the measurement of the transcriptome of individual cells, which provides unprecedented opportunities to discover cell types and understand cellular heterogeneity¹. However, like other high-throughput technologies^2,3,4, scRNA-seq experiments can suffer from severe batch effects⁵. Moreover, compared with bulk RNA-seq data, scRNA-seq data can have an excessive number of zeros that result from either biological zeros (a gene is not expressed in a given cell) or dropout events (the expression of a gene is not detected, even though the gene is actually expressed in the cell, because of amplification failure prior to sequencing)⁶. Consequently, despite the widespread adoption of scRNA-seq experiments, the design of a valid scRNA-seq experiment that allows the batch effects to be removed, the biological cell types to be discovered, and the missing data to be imputed remains an open problem.

One of the major tasks of scRNA-seq experiments is to identify cell types for a population of cells¹. The cell type of each individual cell is unknown and is often the target of inference. Classic batch effects correction methods, such as ComBat⁷ and SVA^8,9, are designed for bulk experiments and require knowledge of the subtype information of each sample a priori. For scRNA-seq data, this subtype information corresponds to the cell type of each individual cell. These methods are therefore infeasible for scRNA-seq data. Alternatively, if one has knowledge of a set of control genes whose expression levels are constant across cell types, then it is possible to apply RUV^10,11. However, selecting control genes is still challenging for scRNA-seq experiments, and recently there has been active research on identifying stably expressed genes that are reproducible and conserved across species for single cells¹².

To identify unknown subtypes, MetaSparseKmeans¹³ jointly clusters samples across batches. Unfortunately, MetaSparseKmeans requires all subtypes to be present in each batch. Suppose that we conduct scRNA-seq experiments for blood samples from a healthy individual and a leukemia patient, one person per batch. Although we can anticipate that the two batches will share T cells and B cells, we do not expect the healthy individual to have cancer cells as the leukemia patient does. Therefore, MetaSparseKmeans is too restrictive for many scRNA-seq experiments.

The mutual-nearest-neighbor (MNN) based approaches, including MNN¹⁴ and Scanorama¹⁵, allow each batch to contain some but not all cell types. However, these methods require batch effects to be almost orthogonal to the biological subspaces and much smaller than the biological variations between different cell types¹⁴. These are strong assumptions and cannot be validated at the design stage of the experiments. Seurat^16,17, LIGER¹⁸, and scMerge¹⁹ attempt to identify shared variations across batches by low-dimensional embeddings and treat them as shared cell types. However, they may mistake technical artifacts for the biological variability of interest if some batches share certain technical noise, for example when each patient is measured in several batches. To handle severe batch effects for microarray data, Luo and Wei²⁰ developed BUS to simultaneously cluster samples across multiple batches and correct batch effects. However, none of the above methods considers features unique to scRNA-seq data, such as the count nature of the data, overdispersion²¹, dropout events⁶, or cell-specific size factors²². ZIFA²³ and ZINB-WaVE²⁴ incorporate dropout events into the factor model, whereas scVI²⁵ and SAVER-X²⁶ couple the modeling of dropout events with neural networks. However, as is the case with the other state-of-the-art methods, these papers do not discuss the designs of scRNA-seq experiments under which their methods are applicable.

Nevertheless, it is crucial to understand the conditions under which biological variability can be separated from technical artifacts. Obviously, for completely confounded designs (for example, one in which batch 1 measures cell types 1 and 2, whereas batch 2 measures cell types 3 and 4) no method is applicable.

Here, we propose Batch effects correction with Unknown Subtypes for scRNA-seq data (BUSseq), an interpretable hierarchical model that simultaneously corrects batch effects, clusters cell types, and takes care of the count data nature, the overdispersion, the dropout events, and the cell-specific size factors of scRNA-seq data. We mathematically prove that it is legitimate to conduct scRNA-seq experiments under not only the commonly advocated completely randomized design^1,5,27,28, in which each batch measures all cell types, but also the reference panel design and the chain-type design, which allow some cell types to be missing from some batches. Furthermore, we demonstrate that BUSseq outperforms the existing approaches in both simulation data and real applications. The theoretical results answer the question about when we can integrate multiple scRNA-seq datasets and analyze them jointly. We envision that the proposed experimental designs will be able to guide biomedical researchers and help them to design better scRNA-seq experiments.

Results

BUSseq is an interpretable hierarchical model for scRNA-seq

We develop a hierarchical model, BUSseq, that closely mimics the data-generating procedure of scRNA-seq experiments (Fig. 1a, Supplementary Fig. 1 and Supplementary Note 1). Given that we have measured B batches of cells, each with a sample size of n_b, let us denote the underlying gene expression level of gene g in cell i of batch b as X_big. X_big follows a negative binomial distribution with mean expression level μ_big and a gene-specific and batch-specific overdispersion parameter ϕ_bg. The mean expression level is determined by the cell type W_bi with the cell type effect β_gk, the log-scale baseline expression level α_g, the location batch effect ν_bg, and the cell-specific size factor δ_bi. The cell-specific size factor δ_bi characterizes the impact of cell size, library size and sequencing depth. It is of note that the cell type W_bi of each individual cell is unknown and is our target of inference. Therefore, we assume that a cell in batch b comes from cell type k with probability Pr(W_bi = k) = π_bk and that the proportions of cell types (π_b1, ⋯ , π_bK) vary among batches.
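To make the roles of these terms concrete, the following minimal sketch assumes that they combine additively on the log scale, that is, log μ_big = α_g + β_gk + ν_bg + δ_bi with k = W_bi. This additive form, the function name, and the numeric values are assumptions made for illustration; the exact specification is given in “Methods” and Supplementary Note 1.

```python
import math

def mean_expression(alpha_g, beta_gk, nu_bg, delta_bi):
    """Mean expression mu_big under an assumed additive log-scale model."""
    return math.exp(alpha_g + beta_gk + nu_bg + delta_bi)

# Hypothetical values: the ratio between two cell types for the same gene
# does not change with the batch effect nu_bg or the size factor delta_bi.
ratio_batch1 = mean_expression(1.0, 0.5, 0.2, 0.1) / mean_expression(1.0, 0.0, 0.2, 0.1)
ratio_batch2 = mean_expression(1.0, 0.5, -0.7, 0.3) / mean_expression(1.0, 0.0, -0.7, 0.3)
print(ratio_batch1, ratio_batch2)  # both equal exp(0.5) up to rounding
```

Under this assumed form, the batch effect and the size factor cancel in the ratio between two cell types, which is why between-cell-type differences carry the information needed to separate biology from batch.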

**Fig. 1: Illustration of the BUSseq model and various types of experimental designs.**

Unfortunately, it is not always possible to observe the expression level X_big. Without dropout (Z_big = 0), we can directly observe Y_big = X_big. However, if a dropout event occurs (Z_big = 1), then we observe Y_big = 0 instead of X_big. In other words, when we observe a zero read count Y_big = 0, there are two possibilities: a non-expressed gene (a biological zero) or a dropout event. When gene g is not expressed in cell i of batch b (X_big = 0), we always have Y_big = 0; when gene g is actually expressed in cell i of batch b (X_big > 0) but a dropout event occurs, we can only observe Y_big = 0, and hence Z_big = 1. It has been noted that highly expressed genes are less likely to suffer from dropout events⁶. We thus model the dependence of the dropout rate Pr(Z_big = 1∣X_big) on the expression level using a logistic regression with batch-specific intercept γ_b0 and log-odds ratio γ_b1.

Notably, BUSseq includes the negative binomial distribution without zero inflation as a special case. When all cells are from a single cell type and the cell-specific size factor δ_bi is estimated a priori according to spike-in genes, BUSseq can reduce to a form similar to BASiCS²¹.

We only observe Y_big for all cells in the B batches and all G genes. We conduct statistical inference under the Bayesian framework and adopt the Metropolis-within-Gibbs algorithm²⁹ for the Markov chain Monte Carlo (MCMC) sampling³⁰ (Supplementary Note 2). Based on the parameter estimates, we can learn the cell type for each individual cell, impute the missing underlying expression levels X_big for dropout events, and identify genes that are differentially expressed among cell types. Moreover, our algorithm can automatically detect the total number of cell types K that exist in the dataset according to the Bayesian information criterion (BIC)³¹. BUSseq also provides a batch-effect corrected version of count data, which can be used for downstream analysis as if all of the data were measured in a single batch (“Methods”).
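As a generic illustration of BIC-based selection of K, the sketch below uses the textbook definition BIC = −2 log L + p log n and picks the candidate K with the smallest value. The log-likelihoods, parameter counts, and number of observations are hypothetical placeholders, and how BUSseq evaluates these quantities is described in “Methods” rather than here.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Textbook Bayesian information criterion; smaller is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

def select_k(fits, n_obs):
    """fits maps K -> (maximized log-likelihood, number of free parameters)."""
    scores = {k: bic(ll, p, n_obs) for k, (ll, p) in fits.items()}
    return min(scores, key=scores.get), scores

# Hypothetical fits for K = 2, 3, 4 (not taken from the paper).
hypothetical_fits = {2: (-5200.0, 40), 3: (-5000.0, 60), 4: (-4990.0, 80)}
best_k, scores = select_k(hypothetical_fits, n_obs=1000)
```

In this toy input, moving from K = 3 to K = 4 improves the fit only slightly, so the penalty term dominates and K = 3 is selected.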

Valid experimental designs for scRNA-seq experiments

If a study design is completely confounded, as shown in Fig. 1b, then no method can separate biological variability from technical artifacts, because different combinations of batch-effect and cell-type-effect values can lead to the same probabilistic distribution for the observed data, which in statistics is termed a non-identifiable model. Formally, a model is said to be identifiable if each probability distribution can arise from only one set of parameter values³². Statistical inference is impossible for non-identifiable models because two sets of distinct parameter values can give rise to the same probability distribution function. We prove that the BUSseq model is identifiable under conditions that are very easily met in reality. Thus, a wide range of designs of scRNA-seq experiments are valid as their batch effects can be adjusted at least by BUSseq.

For the complete setting, in which each batch measures all of the cell types (Fig. 1c and Theorem 1 in “Methods”), BUSseq is identifiable as long as: (I) the log-odds ratios γ_b1 in the logistic regressions for the dropout rates are negative for all of the batches, (II) every two cell types have more than one differentially expressed gene, and (III) the ratios of mean expression levels between two cell types $\left(\frac{\exp(\beta_{1k})}{\exp(\beta_{1\tilde{k}})}, \cdots, \frac{\exp(\beta_{Gk})}{\exp(\beta_{G\tilde{k}})}\right)$ are different for each cell-type pair $(k, \tilde{k})$. Condition (I) requires that the highly expressed genes are less likely to have dropout events, which is routinely observed for scRNA-seq data⁶. Condition (II) always holds in reality. Because scRNA-seq experiments measure the whole transcriptome of a cell, condition (III) is also always met in real data. For example, if there exists one gene g such that for any two distinct cell-type pairs (k₁, k₂) and (k₃, k₄) the ratios of their mean expression levels $\frac{\exp(\beta_{gk_1})}{\exp(\beta_{gk_2})}$ and $\frac{\exp(\beta_{gk_3})}{\exp(\beta_{gk_4})}$ are not the same, then condition (III) is already satisfied.

The commonly advocated completely randomized experimental design is a special case of the complete setting design. In a completely randomized design, cells are assigned to different batches completely at random. As a result, all of the batches have similar compositions of cell populations. In contrast, under the complete setting design, cells from different cell types can be distributed to different batches very unevenly. The requirement that each batch has similar cellular compositions is crucial for traditional batch effects correction methods developed for bulk experiments such as ComBat⁷ to work well for scRNA-seq data. In contrast, BUSseq is not limited to this balanced design constraint and is applicable to not only the completely randomized design but also the general complete setting design.

Ideally, we would wish to adopt completely randomized experimental designs. However, in reality, it is always very challenging to implement complete randomization due to time and budget constraints. For example, when we recruit patients sequentially, we often have to conduct scRNA-seq experiments patient-by-patient rather than randomize the cells from all of the patients to each batch, and the patients may not have the same set of cell types. Fortunately, we can prove that BUSseq also applies to two sets of flexible experimental designs, which allow cell types to be measured in only some but not all of the batches.

Assuming that conditions (I)–(III) are satisfied, if there exists one batch that contains cells from all cell types and the other batches have at least two cell types (Fig. 1d), then BUSseq can tease out the batch effects and identify the true biological variability (see Theorem 2 in “Methods”). We call this setting the reference panel design.

Sometimes, it can still be difficult to obtain a reference batch that collects all cell types. In this case, we can turn to the chain-type design, which requires every two consecutive batches to share two cell types (Fig. 1e). Under the chain-type design, given that conditions (I)–(III) hold, BUSseq is also identifiable and can estimate the parameters well (see Theorem 3 in “Methods”).

A special case of the chain-type design is when two common cell types are shared by all of the batches, which is frequently encountered in real applications. For instance, when blood samples are assayed, even if we perform scRNA-seq experiments patient-by-patient with one patient per batch, we know a priori that each batch will contain at least both T cells and B cells, thus satisfying the requirement of the chain-type design.

The key insight is that despite batch effects, differences between cell types remain constant across batches. The differences between a pair of cell types allow us to distinguish batch effects from biological variability for those batches that measure both cell types. Therefore, BUSseq can separate batch effects from cell type effects under more general designs beyond the easily understood and commonly encountered reference panel design and chain-type design. If we regard each batch as a node in a graph and connect two nodes with an edge if the two batches share at least two cell types, then BUSseq is identifiable as long as the resulting graph is connected (Supplementary Fig. 2 and Theorem 4 in “Methods”).
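The graph condition can be checked mechanically at the design stage. The sketch below builds the batch graph described above (an edge whenever two batches share at least two cell types) and tests connectivity with a breadth-first search. The function name and example designs are hypothetical, and the check covers only the graph condition, not conditions (I)–(III).

```python
from collections import deque

def is_connected_design(batches, min_shared=2):
    """Check whether the batch graph is connected.

    batches: list of sets, each holding the cell types measured in one batch.
    Two batches are linked when they share at least `min_shared` cell types.
    """
    n = len(batches)
    if n == 0:
        return False
    seen = {0}
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for j in range(n):
            if j not in seen and len(batches[i] & batches[j]) >= min_shared:
                seen.add(j)
                queue.append(j)
    return len(seen) == n

# Hypothetical designs with cell types labelled by integers.
chain = [{1, 2}, {1, 2, 3}, {2, 3, 4}, {3, 4, 5}]      # chain-type design
reference = [{1, 2, 3, 4}, {1, 2}, {3, 4}]             # reference panel design
confounded = [{1, 2}, {3, 4}]                          # completely confounded

print(is_connected_design(chain))       # True
print(is_connected_design(reference))   # True
print(is_connected_design(confounded))  # False
```

In the reference panel example, the second and third batches share no cell types with each other, yet the graph is still connected through the reference batch.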

For scRNA-seq data, the dropout rate depends on the underlying expression levels⁶. Such a missing data mechanism is called missing not at random (MNAR) in statistics. It is very challenging to establish identifiability under MNAR. Miao et al.³³ showed that in many cases, even when both the outcome distribution and the missing data mechanism have parametric forms, the model can be nonidentifiable. Fortunately, despite the dropout events and the cell-specific size factors, we are able to prove Theorems 1–4 (Supplementary Note 3). The reference panel design, the chain-type design, and the connected design free researchers from the ideal but often unrealistic requirement of the completely randomized design.

BUSseq accurately learns the parameters and the missing data

We first evaluated the performance of BUSseq via a simulation study. We simulated a dataset with four batches and a total of five cell types under the chain-type design (Fig. 2a–d and Theorem 3). Every two consecutive batches share at least two cell types, but none of the batches contains all of the cell types. The sample sizes for the four batches are (n₁, n₂, n₃, n₄) = (300, 300, 200, 200) (Supplementary Table 1), and there are a total of 3000 genes, of which 500 are differentially expressed between cell types. The remaining 2500 genes have no biological differences between cell types, so they are pure noise with only batch effects. In real datasets, batch effects are often much larger than the cell type effects (Fig. 3a) and not orthogonal to the cell type effects (Supplementary Fig. 3). In the simulation study, we choose the magnitude of the batch effects, the cell type effects, the dropout rates, and the cell-specific size factors to mimic real data scenarios (Fig. 3a). The simulated observed data suffer from severe batch effects and dropout events (Figs. 2d, 3c). The dropout rates for the four batches are 26.79%, 24.53%, 28.36%, and 31.29%, with the corresponding total zero proportions given by 44.13%, 48.85%, 53.07%, and 61.38%.

**Fig. 2: Patterns of the simulation study.**

**Fig. 3: Comparison of batch effects correction methods in the simulation study.**

BUSseq correctly identifies the presence of five cell types among the cells (Fig. 2e). Moreover, despite the dropout events, BUSseq accurately estimates the cell type effects β_gks (Fig. 2a, f), the batch effects ν_bgs (Fig. 2b, g), and the cell-specific size factors δ_bis (Fig. 2j). In particular, BUSseq outperforms existing normalization methods, including DESeq normalization³⁴, trimmed mean of M-values (TMM) normalization³⁵, library size normalization, and the deconvolution normalization method³⁶, in estimating the cell-specific size factors δ_bis (Supplementary Fig. 4 and Supplementary Note 4). When controlling the Bayesian False Discovery Rate (FDR) at 5%^37,38, we identify all intrinsic genes that differentiate cell types with the true FDR being 2% (“Methods”).
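One common way to control a Bayesian FDR is sketched below, assuming each gene has a posterior probability of being an intrinsic gene. Genes are ranked by that probability, and the largest top-ranked set whose average posterior error (one minus the probability) stays at or below the target level is selected. The function name and the probabilities are hypothetical, and the exact procedure used by BUSseq is the one described in “Methods”.

```python
def select_by_bayesian_fdr(post_probs, alpha=0.05):
    """Return sorted indices of the largest top-ranked gene set whose
    estimated Bayesian FDR (mean of 1 - posterior probability) is <= alpha."""
    order = sorted(range(len(post_probs)), key=lambda g: post_probs[g], reverse=True)
    error_sum = 0.0
    best = 0
    for rank, g in enumerate(order, start=1):
        error_sum += 1.0 - post_probs[g]
        if error_sum / rank <= alpha:
            best = rank
    return sorted(order[:best])

# Hypothetical posterior probabilities for six genes.
probs = [0.99, 0.20, 0.97, 0.95, 0.60, 0.999]
chosen = select_by_bayesian_fdr(probs, alpha=0.05)
print(chosen)  # [0, 2, 3, 5]
```

Adding the gene with probability 0.60 would raise the estimated FDR of the selected set to roughly 0.098, so the selection stops at four genes.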

Figure 2h demonstrates that BUSseq can learn the underlying expression levels X_bigs well based on the observed data Y_bigs, which are subject to dropout events. This success arises because BUSseq uses an integrative model to borrow strengths both across genes and across cells from all batches. In comparison, we also benchmarked BUSseq with three state-of-the-art imputation methods for scRNA-seq data—SAVER³⁹, DrImpute⁴⁰, and scImpute⁴¹. Once again, BUSseq performs the best in identifying the true biological zeros and recovering the underlying expression levels X_bigs for the dropout events (Supplementary Table 2 and Supplementary Note 5).

ComBat offers a version of data that have been adjusted for batch effects⁷. Here, we also provide batch-effects-corrected count data based on quantile matching (“Methods”). The adjusted count data no longer suffer from batch effects and dropout events, and they do not even need further cell-specific normalization (Fig. 2i). Therefore, they can be treated as if measured in a single batch for downstream analysis.

To evaluate the robustness of BUSseq, we conducted extensive sensitivity analyses, and they show that BUSseq is robust to the choice of hyperparameters, high zero rates, model misspecification and gene filtering (Supplementary Figs. 5–7, Supplementary Tables 3 and 4, and Supplementary Note 6).

BUSseq outperforms existing methods in simulation study

We benchmarked BUSseq against the state-of-the-art methods for batch effects correction for scRNA-seq data: LIGER¹⁸, MNN¹⁴, Scanorama¹⁵, scVI²⁵, Seurat¹⁷, and ZINB-WaVE²⁴. The adjusted Rand index (ARI) measures the consistency between two clustering results; it is at most one, with values near zero indicating chance-level agreement and higher values indicating better consistency⁴² (Supplementary Note 7). The ARI between the cell types ${\widehat{W}}_{bi}$ inferred by BUSseq and the true underlying cell types W_bi is one. Thus, BUSseq can perfectly recover the true cell type of each cell. In comparison, we applied each of the compared methods to the dataset and then performed their own clustering approaches (Supplementary Note 8). Because the ARI can compare two clustering results even if the numbers of clusters differ, we chose the number of cell types by the default approach of each method rather than setting it to a common number. The resulting ARIs are 0.837 for LIGER, 0.654 for MNN, 0.521 for Scanorama, 0.480 for scVI, 0.632 for Seurat, and 0.571 for ZINB-WaVE. Moreover, the t-SNE plots (Fig. 3c, d) show that only BUSseq can perfectly cluster the cells by cell types rather than batches. We also calculated the silhouette score for each cell for each compared method (Supplementary Note 7). A high silhouette score indicates that the cell is well matched to its own cluster and separated from neighboring clusters. Figure 3b shows that BUSseq gives the best segregated clusters.
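For readers unfamiliar with the ARI, the sketch below computes it from the contingency table of two label vectors using the standard pair-counting formula. It is a plain reference implementation for illustration, not the code used in the benchmark, and the toy labels are hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare
    # Count pairs of cells that are co-clustered in both, in A, and in B.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:  # degenerate case, e.g. one cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])      # relabelled but identical
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])   # worse than chance
unequal = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])   # different cluster counts
```

The index ignores how clusters are named, so a relabelled but identical partition scores one, and it remains well defined when the two clusterings use different numbers of clusters.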

BUSseq outperforms existing methods on hematopoietic data

We re-analyzed the two hematopoietic datasets^43,44 previously studied by Haghverdi et al.¹⁴ (Fig. 4a and Supplementary Fig. 8a,b). The two datasets shared at least three cell types, including the common myeloid progenitors (CMP), megakaryocyte-erythrocyte progenitors (MEP) and granulocyte-monocyte progenitors (GMP), thus they follow the chain-type design.

**Fig. 4: t-SNE and Principal Component Analysis (PCA) plots for the hematopoietic data.**

BUSseq fits the zero rates (Table 1 and Supplementary Note 9) and the mean-variance trends (Fig. 5a, Supplementary Fig. 9 and Supplementary Note 7) of the real data very well. In order to compare BUSseq with existing methods, we compute the ARIs between the clustering of each method and the FACS labels. The resulting ARIs are 0.582 for BUSseq, 0.307 for LIGER, 0.575 for MNN, 0.518 for Scanorama, 0.197 for scVI, 0.266 for Seurat, and 0.348 for ZINB-WaVE (Supplementary Table 5 and Supplementary Note 8). BUSseq thus outperforms all of the other methods in being consistent with FACS labeling. BUSseq also has silhouette coefficients that are comparable to those of MNN, which are better than those of all the other methods (Fig. 4b and Supplementary Fig. 10a, b).

Table 1 Zero-count rates and dropout rates of the hematopoietic and pancreas studies.

**Fig. 5: BUSseq recapitulates the mean-variance trends of the real data.**

Specifically, BUSseq learns six cell types from the dataset. According to the FACS labels (Methods), Cluster 2, Cluster 5, and Cluster 6 correspond to CMP, MEP, and GMP, respectively (Figs. 4c, 6a–c). Cluster 1 is composed of long-term hematopoietic stem cells and multi-potent progenitors (MPP). These are cells from the early stage of differentiation. Cluster 4 consists of a mixture of MEP and CMP, while Cluster 3 is dominated by cells labeled as other. Comparison between the subpanel for BUSseq in Figs. 4c and 6b indicates that Cluster 4 are cells from an intermediate cell type between CMP and MEP. In particular, according to Fig. 6e, the marker genes Apoe and Gata2 are highly expressed in Cluster 4 but not in CMP (Cluster 2) and MEP (Cluster 6), and the marker gene Ctse is expressed in MEP (Cluster 6) but not in Cluster 4 and CMP (Cluster 2). Therefore, cells in Cluster 4 do form a unique group with distinct expression patterns. This intermediate cell stage between CMP and MEP is missed by all of the other methods considered. Moreover, we find that well known B-cell lineage genes⁴⁵, Ebf1, Vpreb1, Vpreb3, and Igll1, are highly expressed in Cluster 3, but not in the other clusters (Fig. 6c, e). To identify Cluster 3, which is dominated by cells labeled as other by Nestorowa et al.⁴³, we map the mean expression profile of each cluster learned by BUSseq to the Haemopedia RNA-seq dataset⁴⁶. It turns out that Cluster 3 aligns well to common lymphoid progenitors (CLP) that give rise to T-lineage cells, B-lineage cells and natural killer cells (Fig. 6d). Therefore, Cluster 3 represents cells that differentiate from lymphoid-primed multipotent progenitors (LMPP)⁴⁴. Once again, all the other methods fail to identify these cells as a separate group.

**Fig. 6: BUSseq preserves the hematopoietic stem and progenitor cells (HSPC) differentiation trajectories.**

Thus, although BUSseq does not assume any temporal ordering between cell types, it is able to preserve the differentiation trajectories (Fig. 6a, b); although BUSseq assumes that each cell belongs to one cell type rather than conducts semisoft clustering⁴⁷, it is capable of capturing the subtle changes across cell types and within a cell type due to continuous processes such as development and differentiation (Supplementary Fig. 11 and Supplementary Note 10).

We further inspect the functions of the intrinsic genes that distinguish different cell types. BUSseq detects 1,419 intrinsic genes at the Bayesian false discovery rate (FDR) cutoff of 0.05 (“Methods”). The gene set enrichment analysis⁴⁸ shows that 51 Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways⁴⁹ are enriched among the intrinsic genes (p values < 0.05) (Supplementary Note 11). The highest ranked pathway is the Hematopoietic Cell Lineage Pathway, which corresponds to the exact biological process studied in the two datasets. Among the remaining 50 pathways, 13 are related to the immune system, and another 9 are associated with cell growth and differentiation (Supplementary Table 6). Therefore, the pathway analysis demonstrates that BUSseq is able to capture the underlying true biological variability, even if the batch effects are severe, as shown in Figs. 3a and 4a.

BUSseq outperforms existing methods on pancreas data

We further studied the four scRNA-seq datasets of human pancreas cells^50,51,52 analyzed in Haghverdi et al.¹⁴. These cells were isolated from deceased organ donors with and without type 2 diabetes. As each patient has at least two pancreas cell types—alpha cells and beta cells, the four datasets follow the chain-type design. We obtained 7095 cells after quality control (Methods) and treated each dataset as a batch following Haghverdi et al.¹⁴.

BUSseq recapitulates the properties of real scRNA-seq data very well in terms of the zero rates (Table 1 and Supplementary Note 9) and the mean-variance trend (Fig. 5b and Supplementary Fig. 12). In particular, the posterior predictive check shows that BUSseq fits the zero rates much better than a model that ignores dropout events, especially when scRNA-seq data are assayed by protocols that do not incorporate UMI counts, such as SMART-seq2.

We can compare the clustering results from each batch effects correction method with the cell-type labels provided by Segerstolpe et al.⁵² and Lawlor et al.⁵¹ (Fig. 7a, b and Supplementary Fig. 8c, d). The pancreas is highly heterogeneous and consists of two major categories of cells: islet cells and non-islet cells. Islet cells include alpha, beta, gamma, and delta cells, while non-islet cells include acinar and ductal cells. BUSseq identifies a total of eight cell types: five for islet cells, two for non-islet cells and one for the labeled other cells. Specifically, the five islet cell types identified by BUSseq correspond to three groups of alpha cells, a group of beta cells, and a group of delta and gamma cells. The two non-islet cell types identified by BUSseq correspond exactly to the acinar and ductal cells. Compared with all of the other methods, BUSseq gives the best separation between islet and non-islet cells, as well as the best segregation within islet cells. In particular, the median silhouette coefficient by BUSseq is higher than that of any other method (Fig. 7c and Supplementary Fig. 10c).

**Fig. 7: t-SNE plots for the pancreas data.**

The ARIs of all methods are 0.608 for BUSseq, 0.542 for LIGER, 0.279 for MNN, 0.527 for Scanorama, 0.282 for scVI, 0.287 for Seurat, and 0.380 for ZINB-WaVE (“Methods” and Supplementary Table 5). Thus, BUSseq outperforms all of the other methods in being consistent with the cell-type labels according to marker genes. In Fig. 7d, the locally high expression levels of marker genes for each cell type show that BUSseq correctly clusters cells according to their biological cell types.

BUSseq identifies 426 intrinsic genes at the Bayesian FDR cutoff of 5% (Methods). We conducted the gene set enrichment analysis⁴⁸ with the KEGG pathways⁴⁹ (Supplementary Note 11). There are 14 enriched pathways (p values < 0.05). Among them, three are diabetes pathways; two are pancreatic and insulin secretion pathways; and another two pathways are related to metabolism (Supplementary Table 7). Recall that the four datasets assayed pancreas cells from type 2 diabetes patients and healthy individuals. The pathway analysis therefore once again confirms that BUSseq provides biologically and clinically valid cell typing.

BUSseq is applicable to droplet-based scRNA-seq data

We further analyzed a dataset that contains samples assayed by droplet-based scRNA-seq protocols. Comparing the performance of different methods on real scRNA-seq data is challenging due to the lack of true cell type labels in real applications. Fortunately, Tian et al.⁵³ created scRNA-seq datasets with known cell type labels by profiling cells from cancer cell lines. In one experiment, they assayed three lung adenocarcinoma (LUAD) cell lines (HCC827, H1975, and H2228) on three platforms with CELseq2, 10x Chromium and Drop-seq protocols, respectively. As a result, a total of 1401 cells were measured across three batches. Each batch consists of all of the three cell types, and data from different batches have different levels of sparsity. Consequently, this study satisfies the complete setting, which is a special case of both the reference-panel design and the chain-type design.

We selected the top 6000 highly variable genes (HVGs) within each batch and obtained 2267 common HVGs across three batches (“Methods”). The t-SNE and PCA plots of the raw count data show that significant batch effects occur across the three protocols (Fig. 8a, b and Supplementary Fig. 13a, b). We applied BUSseq and varied the number of cell types K from 2 to 6. Although the BIC selects four cell types instead of three cell lines (Supplementary Fig. 14), two of the four identified clusters correspond to two subpopulations of the H1975 cell line (Supplementary Table 8). We further visualized the log-scale mean expression levels of intrinsic genes of the four learned cell types (Fig. 8e). The first two cell types have similar expression patterns, but some differentially expressed genes are observed between them. Moreover, the t-SNE (Fig. 8c, d) and PCA (Supplementary Fig. 13c, d) plots demonstrate the high level of similarity of the first two estimated cell types and confirm that the corrected count data $\tilde{x}_{big}$ obtained by BUSseq cluster cells by cell type instead of by batch (Fig. 8f).
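The gene-selection step amounts to ranking genes within each batch and intersecting the per-batch top lists. The sketch below uses the raw population variance as a stand-in ranking criterion, which is an assumption: the actual HVG criterion is described in “Methods”. The function names and toy expression values are hypothetical.

```python
from statistics import pvariance

def top_variable_genes(expr, n_top):
    """expr maps gene name -> list of expression values in one batch."""
    ranked = sorted(expr, key=lambda g: pvariance(expr[g]), reverse=True)
    return set(ranked[:n_top])

def common_hvgs(batches, n_top):
    """Intersect the per-batch sets of the n_top most variable genes."""
    per_batch = [top_variable_genes(expr, n_top) for expr in batches]
    return set.intersection(*per_batch)

# Toy data: three genes measured in two batches (values are made up).
batch1 = {"A": [0, 10, 0, 10], "B": [1, 1, 1, 2], "C": [0, 5, 0, 5]}
batch2 = {"A": [3, 3, 3, 3], "B": [0, 8, 0, 8], "C": [0, 6, 1, 7]}
shared = common_hvgs([batch1, batch2], n_top=2)
print(shared)  # {'C'}
```

Because the intersection keeps only genes that rank highly in every batch, the common set can be much smaller than the per-batch lists, as with the 2267 genes retained from three lists of 6000.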

We also applied the benchmarked methods to compare their clustering accuracy. The ARIs of all methods are 0.841 for BUSseq, 0.825 for LIGER, 0.650 for MNN, 0.637 for Scanorama, 0.429 for scVI, 0.324 for Seurat, and 0.398 for ZINB-WaVE. Thus, BUSseq outperforms all of the other methods in clustering accuracy. We further compared BUSseq with a recently proposed semi-supervised batch-effect-correction method, CellAssign. CellAssign requires the number of cell types and a set of marker genes for each cell type as input. It then annotates scRNA-seq data into predefined or de novo cell types⁵⁴. To allow a fair comparison, we also set the number of cell types for BUSseq to the a priori known value of three, and the resulting ARI for BUSseq becomes 0.993. Even though CellAssign is semi-supervised whereas BUSseq is unsupervised, BUSseq outperforms CellAssign in the LUAD dataset as well (ARI for CellAssign is 0.972, Supplementary Table 9). Thus, BUSseq also works very well for scRNA-seq data with high levels of sparsity, such as those generated by droplet-based protocols.

Discussion

For the completely randomized experimental design, it seems that everyone is talking but no one is listening. Due to time and budget constraints, it is always difficult to implement a completely randomized design in practice. Consequently, researchers often pretend to be blind to the issue when carrying out their scRNA-seq experiments. In this paper, we mathematically prove and empirically show that under the more realistic reference panel and chain-type designs, batch effects can also be adjusted for in scRNA-seq experiments. We hope that our results will alert researchers to confounded experimental designs and encourage them to implement valid designs for scRNA-seq experiments in real applications.

BUSseq provides a one-stop solution. In contrast, most existing methods are multi-stage approaches: clustering can only be performed after the batch effects have been corrected, and the differentially expressed genes can only be called after the cells have been clustered. The major issue with multi-stage methods is that uncertainties in the previous stages are often ignored. For instance, when cells are first clustered into different cell types and differential gene expression is then identified, the clustering results are taken as if they were the underlying truth. As the clustering results may be prone to errors in practice, this can lead to false positives and false negatives. In contrast, BUSseq simultaneously corrects batch effects, clusters cell types, imputes missing data, and identifies intrinsic genes that differentiate cell types. BUSseq thus accounts for all uncertainties and fully exploits the information embedded in the data. As a result, BUSseq is able to capture subtler changes between cell types, such as the cluster corresponding to the LMPP lineage that is missed by all of the state-of-the-art methods.

BUSseq employs an MCMC algorithm for statistical inference. Although MCMC algorithms are well known for their heavy computational load, the computational complexity of BUSseq is $O(\mathop{\sum }\nolimits_{b = 1}^{B}{n}_{b}GK)$, which is linear in both the number of genes G and the total number of cells $N=\mathop{\sum }\nolimits_{b = 1}^{B}{n}_{b}$. Moreover, most steps of the MCMC algorithm for BUSseq are parallelizable (Supplementary Note 12). We implement a parallel multi-core CPU version and a parallel GPU version of the algorithm. Running the GPU version of the algorithm with a single core of an Intel Xeon Gold 6132 Processor and one NVIDIA Tesla P100 GPU took 0.35, 1.15, and 1.5 h for the simulation, the hematopoietic, and the human pancreas data, respectively (Supplementary Table 10). Experiments show that the running time and random-access memory (RAM) usage are indeed linear in the number of genes G and the number of cells N for both the CPU and the GPU parallel versions of BUSseq (Fig. 9 and Supplementary Note 13). Moreover, by writing the posterior samples to the hard disk every few iterations, we can further reduce the RAM usage so that BUSseq can run on a commonly available cluster node rather than a high-end one (Supplementary Table 11 and Supplementary Fig. 15). Compared with the time for preparing samples and conducting the scRNA-seq experiments, the computation time of BUSseq is affordable and worthwhile given the gain in accuracy.

**Fig. 9: The trend of running time and RAM usage with respect to gene and cell numbers.**

Practical and valid experimental designs are urgently required for scRNA-seq experiments. We envision that the flexible reference panel and the chain-type designs will be widely adopted in scRNA-seq experiments and BUSseq will greatly facilitate the analysis of scRNA-seq data.

Methods

BUSseq model

The hierarchical model of BUSseq can be summarized as:

$$\Pr ({W}_{bi}=k)={\pi }_{bk},\mathop{\sum }\limits_{k=1}^{K}{\pi }_{bk} =1;\\ {X}_{big}| {W}_{bi}=k \sim {\rm{NB}}({\mu }_{big},{\phi }_{bg}),{\mathrm{log}}\,({\mu }_{big}) ={\alpha }_{g}+{\beta }_{gk}+{\nu }_{bg}+{\delta }_{bi};\\ {Z}_{big}| {X}_{big} ={x}_{big} \sim {\rm{Bernoulli}}({p}_{big}),{\mathrm{log}}\,(\frac{{p}_{big}}{1-{p}_{big}}) ={\gamma }_{b0}+{\gamma }_{b1}{x}_{big}; \\ {Y}_{big}={X}_{big}| {Z}_{big}=0,{Y}_{big} =0| {Z}_{big}=1.$$

Collectively, ${\bf{Y}}={\{{Y}_{big}\}}_{b = 1,\cdots ,B;i = 1,\ \cdots ,{n}_{b}}^{g = 1,\cdots ,G}$ are the observed data; the underlying expression levels ${\bf{X}}={\{{X}_{big}\}}_{b = 1,\cdots ,B;i = 1,\cdots ,{n}_{b}}^{g = 1,\cdots ,G}$, the dropout indicators ${\bf{Z}}={\{{Z}_{big}\}}_{b = 1,\cdots , B;i = 1, \cdots , {n}_{b}}^{g = 1, \cdots , G}$ and the cell type indicators ${\bf{W}}={\{{W}_{bi}\}}_{b = 1, \cdots , B;i = 1, \cdots , {n}_{b}}$ are all missing data; the log-scale baseline gene expression levels ${\boldsymbol{\alpha }}={\{{\alpha }_{g}\}}_{g = 1, \cdots , G}$, the cell type effects ${\boldsymbol{\beta }}={\{{\beta }_{gk}\}}_{k = 2, \cdots , K}^{g = 1,\cdots ,G}$, the location batch effects ${\boldsymbol{\nu }}={\{{\nu }_{bg}\}}_{b = 2, \cdots , B}^{g = 1, \cdots , G}$, the overdispersion parameters ${\boldsymbol{\phi }}={\{{\phi }_{bg}\}}_{b = 1, \cdots , B}^{g = 1, \cdots , G}$, the cell-specific size factors ${\boldsymbol{\Delta }}={\{{\delta }_{bi}\}}_{b = 1, \cdots , B}^{i = 2, \cdots , {n}_{b}}$, the dropout parameters ${\boldsymbol{\Gamma }}={\{{\gamma }_{b0},{\gamma }_{b1}\}}_{b = 1, \cdots , B}$ and the cell compositions ${\boldsymbol{\pi }}={\{{\pi }_{bk}\}}_{b = 1, \cdots , B}^{k = 1, \cdots , K}$ are the parameters. Without loss of generality, for model identifiability, we assume that the first batch is the reference batch measured without batch effects with ν_1g = 0 for every gene and the first cell type is the baseline cell type with β_g1 = 0 for every gene. Similarly, we take the cell-specific size factor δ_b1 = 0 for the first cell of each batch. We gather all the parameters as Θ = {α, β, ν, ϕ, Δ, Γ, π}.
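To make the data-generating mechanism concrete, the following Python sketch simulates the underlying counts, dropout indicators, and observed counts of a single cell whose cell type and batch are taken as given. It is an illustration under stated assumptions rather than the BUSseq implementation: the function and argument names are hypothetical, the negative binomial draw uses a gamma-Poisson mixture, and the simple Poisson sampler is only suitable for moderate means.

```python
import math
import random


def sample_nb(mu, phi, rng):
    """Draw from NB(mu, phi) as a gamma-Poisson mixture (mean mu, overdispersion phi)."""
    lam = rng.gammavariate(phi, mu / phi)  # Gamma with shape phi and mean mu
    # Knuth's multiplication method for Poisson(lam); fine for moderate lam.
    limit, count, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def dropout_prob(x, gamma0, gamma1):
    """Numerically stable logistic function of gamma0 + gamma1 * x."""
    t = gamma0 + gamma1 * x
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    return math.exp(t) / (1.0 + math.exp(t))


def simulate_cell(alpha, beta_k, nu_b, delta_bi, phi_b, gamma_b, rng):
    """Simulate one cell of type k in batch b; returns (x, z, y) as lists over genes."""
    xs, zs, ys = [], [], []
    for g in range(len(alpha)):
        mu = math.exp(alpha[g] + beta_k[g] + nu_b[g] + delta_bi)
        x = sample_nb(mu, phi_b[g], rng)                   # underlying count X_big
        z = int(rng.random() < dropout_prob(x, *gamma_b))  # dropout indicator Z_big
        xs.append(x)
        zs.append(z)
        ys.append(0 if z == 1 else x)                      # observed count Y_big
    return xs, zs, ys
```

With γ_b1 < 0, the dropout probability decreases as the underlying count grows, so lowly expressed genes are the most likely to be observed as zeros.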

Consequently, the observed data likelihood function becomes

$${L}_{o}({\boldsymbol{\Theta }}| {\bf{y}})=\mathop{\prod }\limits_{b=1}^{B}\mathop{\prod }\limits_{i=1}^{{n}_{b}}[\mathop{\sum }\limits_{k=1}^{K}{\pi }_{bk}\mathop{\prod }\limits_{g=1}^{G}\Pr ({Y}_{big}={y}_{big}| {\boldsymbol{\Theta }})],$$

(1)

where

$$\Pr ({Y}_{big}={y}_{big}| {\boldsymbol{\Theta }})=\left\{\begin{array}{ll}\mathop{\sum }\nolimits_{x = 1}^{\infty }\frac{\exp ({\gamma }_{b0}+{\gamma }_{b1}x)}{1+\exp ({\gamma }_{b0}+{\gamma }_{b1}x)}{f}_{{\rm{NB}}}(x;\exp ({\alpha }_{g}+{\beta }_{gk}+{\nu }_{bg}+{\delta }_{bi}),{\phi }_{bg})&\\ +{f}_{{\rm{NB}}}(0;\exp ({\alpha }_{g}+{\beta }_{gk}+{\nu }_{bg}+{\delta }_{bi}),{\phi }_{bg})&{y}_{big}=0,\\ \frac{1}{1+\exp ({\gamma }_{b0}+{\gamma }_{b1}{y}_{big})}{f}_{{\rm{NB}}}({y}_{big};\exp ({\alpha }_{g}+{\beta }_{gk}+{\nu }_{bg}+{\delta }_{bi}),{\phi }_{bg})&{y}_{big}\,> \, 0.\end{array}\right.$$

and ${f}_{{\rm{NB}}}(x;\mu ,\phi )={C}_{x}^{\phi +x-1}{(\frac{\mu }{\mu +\phi })}^{x}{(\frac{\phi }{\mu +\phi })}^{\phi }$ denotes the probability mass function of the negative binomial distribution NB(μ, ϕ). For y_big = 0, ${f}_{{\rm{NB}}}(0;\exp ({\alpha }_{g}+{\beta }_{gk}+{\nu }_{bg}+{\delta }_{bi}),{\phi }_{bg})$ corresponds to a biological zero, whereas $\mathop{\sum }\nolimits_{x = 1}^{\infty }\frac{\exp ({\gamma }_{b0}+{\gamma }_{b1}x)}{1+\exp ({\gamma }_{b0}+{\gamma }_{b1}x)}{f}_{{\rm{NB}}}(x;\exp ({\alpha }_{g}+{\beta }_{gk}+{\nu }_{bg}+{\delta }_{bi}),{\phi }_{bg})$ corresponds to a dropout event.
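As a numerical illustration of the piecewise probability above, the sketch below evaluates the probability of one observed count given its cell type, with μ = exp(α_g + β_gk + ν_bg + δ_bi) passed in directly. The function names and the truncation point `x_max` of the infinite sum are assumptions made for this example and are not part of the BUSseq software.

```python
import math


def nb_pmf(x, mu, phi):
    """f_NB(x; mu, phi): negative binomial pmf with mean mu and overdispersion phi."""
    log_coef = math.lgamma(phi + x) - math.lgamma(phi) - math.lgamma(x + 1)
    log_p = log_coef + x * math.log(mu / (mu + phi)) + phi * math.log(phi / (mu + phi))
    return math.exp(log_p)


def dropout_prob(x, gamma0, gamma1):
    """Numerically stable logistic function of gamma0 + gamma1 * x."""
    t = gamma0 + gamma1 * x
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    return math.exp(t) / (1.0 + math.exp(t))


def observed_prob(y, mu, phi, gamma0, gamma1, x_max=2000):
    """Probability of observing y for one gene in one cell, given its cell type."""
    if y > 0:
        # The count was expressed and did not drop out.
        return (1.0 - dropout_prob(y, gamma0, gamma1)) * nb_pmf(y, mu, phi)
    # y == 0: a biological zero, or a positive count lost to dropout
    # (the infinite sum is truncated at x_max).
    dropout = sum(dropout_prob(x, gamma0, gamma1) * nb_pmf(x, mu, phi)
                  for x in range(1, x_max + 1))
    return nb_pmf(0, mu, phi) + dropout
```

Summing `observed_prob` over y = 0, 1, 2, ... recovers a total probability of one, which provides a quick sanity check that the two branches are consistent with each other.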

Experimental designs

By creating a set of functions similar to the probability generating function, we prove that BUSseq is identifiable, in other words, if two sets of parameters are different, then their probability distribution functions for the observed data are different, for not only the complete setting but also the reference panel and the chain-type designs (see the proofs in Supplementary Note 3).

Theorem 1 (The Complete Setting) If π_bk > 0 for every batch b and cell type k, given that (I) γ_b1 < 0 for every b, (II) for any two cell types k₁ and k₂, there exist at least two differentially expressed genes g₁ and g₂, that is, ${\beta }_{{g}_{1}{k}_{1}}\ne \ {\beta }_{{g}_{1}{k}_{2}}$ and ${\beta }_{{g}_{2}{k}_{1}}\ne \ {\beta }_{{g}_{2}{k}_{2}}$, and (III) for any two distinct cell-type pairs (k₁, k₂) ≠ (k₃, k₄), their differences in cell-type effects are not the same, that is, ${{\boldsymbol{\beta }}}_{{k}_{1}}-{{\boldsymbol{\beta }}}_{{k}_{2}}\ne \ {{\boldsymbol{\beta }}}_{{k}_{3}}-{{\boldsymbol{\beta }}}_{{k}_{4}}$, then BUSseq is identifiable (up to label switching) in the sense that L_o(Θ∣y) = L_o(Θ^*∣y) for any y implies that ${\pi }_{bk}={\pi }_{b\rho (k)}^{* },({\gamma }_{b0},{\gamma }_{b1})=({\gamma }_{b0}^{* },{\gamma }_{b1}^{* }),{\alpha }_{g}+{\beta }_{gk}={\alpha }_{g}^{* }+{\beta }_{g\rho (k)}^{* },{\nu }_{bg}={\nu }_{bg}^{* },{\delta }_{bi}={\delta }_{bi}^{* }$ and ${\phi }_{bg}={\phi }_{bg}^{* }$ for every gene g and batch b, where ρ is a permutation of {1, 2, ⋯ , K}.

In the following, we denote the cell types that are present in batch b as C_b and count the number of cell types existing in batch b as K_b= ∣C_b∣.

Theorem 2 (The Reference Panel Design) If there are a total of K cell types ${\cup }_{b = 1}^{B}{C}_{b}=\{1,2, \cdots , K\}$, K_b≥ 2 for every batch b, and there exists a batch $\tilde{b}$ such that it contains all of the cell types ${C}_{\tilde{b}}=\{1,2, \cdots , K\}$, then given that conditions (I)–(III) hold, BUSseq is identifiable (up to label switching).

Theorem 3 (The Chain-type Design) If there are a total of K cell types ${\cup }_{b = 1}^{B}{C}_{b}=\{1,2, \cdots , K\}$ and every two consecutive batches share at least two cell types ∣C_b ∩ C_b−1∣ ≥ 2 for all b ≥ 2, then given that conditions (I)–(III) hold, BUSseq is identifiable (up to label switching).

Therefore, even for the reference panel and chain-type designs that do not assay all cell types in each batch, batch effects can be removed; cell types can be clustered; and missing data due to dropout events can be imputed. Both the reference panel design and the chain-type design belong to the more general connected design.

Theorem 4 (The Connected Design) We define a batch graph G = (V, E). Each node b ∈ V represents a batch. There is an edge e ∈ E between two nodes b₁ and b₂ if and only if batches b₁ and b₂ share at least two cell types. If the batch graph is connected and conditions (I)–(III) hold, then BUSseq is identifiable (up to label switching).
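The graph condition of Theorem 4 is easy to check for a planned experiment. The following sketch, with hypothetical function and argument names, builds the batch graph implicitly and tests its connectivity by breadth-first search. It only verifies the design condition on shared cell types; conditions (I)–(III) concern the model parameters and are not checked here.

```python
from collections import deque


def is_connected_design(cell_types_per_batch, min_shared=2):
    """Return True if the batch graph of Theorem 4 is connected.

    cell_types_per_batch: one collection of cell type labels per batch.
    Two batches are joined by an edge when they share at least min_shared cell types.
    """
    sets = [set(types) for types in cell_types_per_batch]
    if not sets:
        return False
    seen, queue = {0}, deque([0])
    while queue:
        b1 = queue.popleft()
        for b2 in range(len(sets)):
            if b2 not in seen and len(sets[b1] & sets[b2]) >= min_shared:
                seen.add(b2)
                queue.append(b2)
    return len(seen) == len(sets)
```

For example, a reference panel design such as `[{1, 2, 3, 4}, {1, 2}, {3, 4}]` and a chain-type design such as `[{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]` both pass the check, whereas a confounded design such as `[{1, 2}, {3, 4}]` does not.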

Statistical inference

We conduct the statistical inference under the Bayesian framework. We assign independent priors to each component of Θ as follows (Supplementary Table 3):

$${{\boldsymbol{\pi }}}_{b}=({\pi }_{b1}, \cdots , {\pi }_{bK}) \sim {\rm{Dirichlet}}(\xi , \cdots , \xi ),1\le b\le B;\\ {\gamma }_{b0} \sim {\rm{N}}(0,{\sigma }_{z0}^{2}),1\le b\le B;\\ -{\gamma }_{b1} \sim {\rm{Gamma}}({a}_{\gamma },{b}_{\gamma }),1\le b\le B;\\ {\alpha }_{g} \sim {\rm{N}}({m}_{a},{\sigma }_{a}^{2}),1\le g\le G;\\ {\nu }_{bg} \sim {\rm{N}}({m}_{c},{\sigma }_{c}^{2}),2\le b\le B,g=1, \cdots , G;\\ {\delta }_{bi} \sim {\rm{N}}({m}_{d},{\sigma }_{d}^{2}),1\le b\le B,2\le i\le {n}_{b};\\ {\phi }_{bg} \sim {\rm{Gamma}}(\kappa ,\tau ),1\le b\le B,1\le g\le G.$$

We are interested in detecting genes that differentiate cell types. Therefore, we impose a spike-and-slab prior⁵⁵ using a normal mixture to the cell-type effect β_gk. The spike component concentrates on zero with a small variance ${\tau }_{\beta 0}^{2}$, whereas the slab component tends to deviate from zero, thus having a larger variance ${\tau }_{\beta 1}^{2}$. We introduce another latent variable L_gk to indicate which component β_gk comes from. L_gk = 0 if gene g is not differentially expressed between cell type k and cell type one, and L_gk = 1 otherwise. We further define ${D}_{g}=\mathop{\sum }\nolimits_{k = 2}^{K}{L}_{gk}$. If D_g > 0, then the expression level of gene g does not stay the same across cell types. Following Huo et al.¹³, we call such genes intrinsic genes, which differentiate cell types. To control for multiple hypothesis testing, we let L_gk ~ Bernoulli(p) and assign a conjugate prior Beta(a_p, b_p) to p. We set τ_β1 to a large number and let τ_β0 follow an inverse-gamma prior Inv-Gamma(a_τ, b_τ) with a small prior mean.

We develop an MCMC algorithm to sample from the posterior distribution (Supplementary Note 2). After the burn-in period, we take the mean of the posterior samples to estimate γ_b, α_g, β_gk, ν_bg, δ_bi, and ϕ_bg and use the mode of posterior samples of W_bi to infer the cell type for each cell.

We also implemented an Expectation-Maximization (EM) algorithm⁵⁶ for a simplified version of the BUSseq model. Unfortunately, consistent with the literature^57,58, we found that inference by the EM algorithm can be very sensitive to small perturbations of the observed data and to the initial values. Thus, we chose to use the MCMC algorithm for inference. An extra benefit of the MCMC algorithm is that it not only provides point estimates but also explores the entire posterior distribution and hence allows users to quantify the uncertainty of the estimates.

Identification of intrinsic genes

When inferring the differential expression indicator L_gk, we control the Bayesian FDR³⁷ defined as

$${\rm{FDR}}(\kappa )=\frac{\mathop{\sum }\limits_{g=1}^{G}\mathop{\sum }\limits_{k=2}^{K}{\xi }_{gk}I({\xi }_{gk}\le \kappa )}{\mathop{\sum }\limits_{g=1}^{G}\mathop{\sum }\limits_{k=2}^{K}I({\xi }_{gk}\le \kappa )},$$

where ξ_gk = Pr(L_gk = 0∣y) is the posterior marginal probability that gene g is not differentially expressed between cell type k and cell type one, which can be estimated by the T posterior samples ${L}_{gk}^{(t)}$s collected after the burn-in period as $\frac{1}{T}\mathop{\sum }\nolimits_{t = 1}^{T}(1-{L}_{gk}^{(t)})$. Given a control level α such as 0.1, we search for the largest κ₀ ≤ 0.5 such that the estimated $\widehat{{\rm{FDR}}}(\kappa )$ based on ${\widehat{\xi }}_{gk}$s is smaller than α and declare ${\widehat{L}}_{gk}=1$ if ${\widehat{\xi }}_{gk}\le {\kappa }_{0}$. The upper bound 0.5 for κ₀ prevents us from calling differentially expressed genes with small posterior probability Pr(L_gk = 1∣y). Consequently, we identify the genes with ${\widehat{D}}_{g}=\mathop{\sum }\nolimits_{k = 2}^{K}{\widehat{L}}_{gk}> 0$ as the intrinsic genes. We set α = 0.05 in both the simulation study and the real applications. Here, we follow Huo et al.¹³ to define intrinsic genes as genes that are differentially expressed between at least two cell types. In contrast, marker genes are genes that feature certain cell types according to the literature. For example, in the pancreas study, GCG gene is known to be highly expressed in alpha islet cells, so this gene often serves as a marker to label alpha islet cells⁵¹.
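The thresholding rule described above can be sketched in a few lines of Python. The code below is an illustration with hypothetical function names, not the BUSseq implementation: it estimates ξ_gk from posterior samples of L_gk and then searches over the observed values of ξ for the largest κ₀ ≤ 0.5 whose estimated FDR is below α. It relies on the fact that the estimated FDR is a running mean of the sorted ξ values and is therefore nondecreasing in κ.

```python
def estimate_xi(l_samples):
    """Posterior probability of no differential expression from 0/1 samples of L_gk."""
    return sum(1 - l for l in l_samples) / len(l_samples)


def bayesian_fdr_threshold(xi, alpha=0.05, kappa_max=0.5):
    """Return (kappa0, calls); calls[i] is True when entry i is declared DE.

    xi is a flat list of estimated xi_gk values over all (g, k) pairs.
    kappa0 is None when no threshold achieves an estimated FDR below alpha.
    """
    candidates = sorted(v for v in xi if v <= kappa_max)
    kappa0, total = None, 0.0
    for n, v in enumerate(candidates, start=1):
        total += v
        # Evaluate the FDR only once all entries tied at the value v are included.
        last_of_tie = n == len(candidates) or candidates[n] > v
        if last_of_tie:
            if total / n < alpha:
                kappa0 = v
            else:
                break
    calls = [kappa0 is not None and v <= kappa0 for v in xi]
    return kappa0, calls
```

A gene g would then be called intrinsic if any of its (g, k) entries is declared differentially expressed.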

Convergence of the MCMC algorithm

To rigorously assess the convergence of the Markov chain, we adopt the estimated potential scale reduction (EPSR) factor criterion⁵⁹ (Supplementary Note 14). We are interested in the log-scale baseline expression level {α_g, g = 1, 2, ⋯ , G}, the cell type effects {β_gk, g = 1, 2, ⋯ , G, k = 2, 3, ⋯ , K}, the location batch effects {ν_bg, g = 1, 2, ⋯ , G, b = 2, 3, ⋯ , B} and the overdispersion parameters {ϕ_bg, g = 1, 2, ⋯ , G, b = 1, 2, ⋯ , B}. To avoid the impact of label switching of cell types (Supplementary Fig. 16 and Supplementary Note 14), we consider the log-scale cell-type-specific expression level θ_gk = α_g + β_gk, g = 1, 2, ⋯ , G, k = 1, 2, ⋯ , K and match the cell type indicators in different chains such that most cells in the different chains are assigned to the same cell types. If the EPSR factors of most parameters are close to one, we treat the posterior sampling as having reached stationarity. Thus, we use the following rule to diagnose the convergence of the MCMC algorithm for BUSseq:

1. More than 80% of {EPSR(θ_gk)} are <1.3;
2. More than 80% of {EPSR(ν_bg)} are <1.3;
3. More than 80% of {EPSR(ϕ_bg)} are <1.3.
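The rule above can be sketched as follows. The exact EPSR definition used by BUSseq is given in Supplementary Note 14; the code below assumes the standard Gelman-Rubin potential scale reduction factor computed from several chains of equal length for one scalar parameter, and the function names are hypothetical.

```python
import math
from statistics import mean, variance


def epsr(chains):
    """Gelman-Rubin factor for one scalar parameter.

    chains: list of equal-length lists of post-burn-in samples, one list per chain.
    Assumes the within-chain variance is positive.
    """
    n = len(chains[0])
    within = mean([variance(c) for c in chains])        # W: mean within-chain variance
    between_over_n = variance([mean(c) for c in chains])  # B / n: variance of chain means
    pooled = (n - 1) / n * within + between_over_n
    return math.sqrt(pooled / within)


def converged(epsr_by_group, cutoff=1.3, fraction=0.8):
    """Apply the rule: in every group, more than `fraction` of the factors are below `cutoff`."""
    return all(
        sum(r < cutoff for r in factors) / len(factors) > fraction
        for factors in epsr_by_group.values()
    )
```

Here `epsr_by_group` would hold one list of factors for each of the three groups θ_gk, ν_bg, and ϕ_bg, computed after matching cell type labels across chains.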

Implementation of the MCMC algorithm

In the simulation study, we ran the MCMC algorithm for 4000 iterations and discarded the first 2000 iterations as burn-in. In the three real data analyses, we ran BUSseq for 8000 iterations and discarded the first 4000 iterations as burn-in. Both the estimated potential scale reduction (EPSR) factors (Supplementary Table 12) and the acceptance rates of the Metropolis steps of the MCMC algorithm (Supplementary Tables 13 and 14, and Supplementary Note 14) indicate that the Markov chain has converged with good mixing.

Selection of cell type numbers

BUSseq allows the user to input the total number of cell types K according to prior knowledge. When K is unknown, BUSseq selects the number of cell types $\widehat{K}$ that achieves the minimum BIC³¹. The BIC adds a penalty term to minus twice the logarithm of the observed data likelihood ${L}_{o}(\widehat{{\boldsymbol{\Theta }}}| {\bf{y}})$ given in Eq. (1):

$${\rm{BIC}}(K)=-2{\mathrm{log}}\,({L}_{o}(\widehat{{\boldsymbol{\Theta }}}| {\bf{y}}))+[K(B+G)+2B+(2B-1)G+\mathop{\sum }\limits_{b=1}^{B}({n}_{b}-1)]\cdot {\mathrm{log}}\,(\mathop{\sum }\limits_{b=1}^{B}{n}_{b}G),$$

where $\widehat{{\boldsymbol{\Theta }}}=(\widehat{{\boldsymbol{\alpha }}},\widehat{{\boldsymbol{\beta }}},\widehat{{\boldsymbol{\gamma }}},\widehat{{\boldsymbol{\nu }}},\widehat{{\boldsymbol{\phi }}},\widehat{{\boldsymbol{\delta }}},\widehat{{\boldsymbol{\pi }}})$ denotes the posterior mean of parameters. As a result, the penalty in BIC helps the model selection to balance between goodness-of-fit and the model complexity (Supplementary Figs. 17–19, Supplementary Tables 15 and 16, and Supplementary Note 15).
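As a worked illustration of the BIC formula, the sketch below evaluates the penalty term and selects the number of cell types from a set of candidate values. The function names are hypothetical, and the log-likelihood values are assumed to have been computed elsewhere from Eq. (1) at the posterior mean of the parameters.

```python
import math


def busseq_bic(log_lik, K, B, G, n_cells):
    """BIC(K) as written above.

    log_lik: observed-data log-likelihood at the posterior mean of the parameters.
    n_cells: list with the number of cells n_b in each of the B batches.
    """
    n_params = K * (B + G) + 2 * B + (2 * B - 1) * G + sum(n - 1 for n in n_cells)
    return -2.0 * log_lik + n_params * math.log(sum(n_cells) * G)


def select_k(log_lik_by_k, B, G, n_cells):
    """Return the candidate K with the smallest BIC."""
    return min(log_lik_by_k, key=lambda k: busseq_bic(log_lik_by_k[k], k, B, G, n_cells))
```

Because each additional cell type adds B + G to the parameter count, an increase in K must improve the log-likelihood by more than (B + G)/2 times the logarithm of the total number of observations to be preferred.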

Inference of dropout events

In the BUSseq model, a dropout event occurs for gene g in cell i of batch b if the observed value y_big = 0 but the imputed count data ${\widehat{x}}_{big}> 0$. This identification allows us to calculate the frequency of dropout events in each batch. We calculate the zero rate of each batch as follows:

$${\rho }_{0}=\frac{1}{G\cdot {n}_{b}}\mathop{\sum }\limits_{g=1}^{G}\mathop{\sum }\limits_{i=1}^{{n}_{b}}I({y}_{big}=0),$$

(2)

and compute the dropout rate as the proportion of dropout events among the observations with zero counts:

$${\rho }_{d}=\frac{\mathop{\sum }\limits_{g=1}^{G}\mathop{\sum }\limits_{i=1}^{{n}_{b}}I({y}_{big}=0\ {\rm{and}}\ {\widehat{x}}_{big}> 0)}{\mathop{\sum }\limits_{g=1}^{G}\mathop{\sum }\limits_{i=1}^{{n}_{b}}I({y}_{big}=0)}.$$
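The two rates can be computed for one batch as in the following sketch, where the observed counts and the imputed counts are stored as lists of per-cell lists over genes. The function name and data layout are assumptions made for illustration.

```python
def zero_and_dropout_rates(y, x_hat):
    """Return (zero rate, dropout rate) for one batch.

    y, x_hat: observed and imputed counts, as lists of per-cell lists over genes.
    The dropout rate is reported as 0.0 when the batch contains no zeros.
    """
    zeros = dropouts = total = 0
    for y_cell, x_cell in zip(y, x_hat):
        for y_val, x_val in zip(y_cell, x_cell):
            total += 1
            if y_val == 0:
                zeros += 1
                if x_val > 0:  # observed zero with a positive imputed count
                    dropouts += 1
    zero_rate = zeros / total
    dropout_rate = dropouts / zeros if zeros else 0.0
    return zero_rate, dropout_rate
```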

Posterior predictive check

We evaluate how well BUSseq fits the data via posterior predictive checks⁶⁰. In particular, we focus on the zero rates. In the posterior predictive check, we take MCMC samples of all the parameters after the burn-in iterations to simulate replicated datasets ${Y}_{j}^{rep},j=1,2, \cdots , J$ for G genes and $N=\mathop{\sum }\nolimits_{b = 1}^{B}{n}_{b}$ cells, where J denotes the total number of collected iterations after burn-ins. In our real data analyses, we ran 8000 iterations with the first 4000 iterations as burn-ins, so we generated J = 8000 − 4000 = 4000 replicated datasets for both the hematopoietic and Pancreas studies. For each generated replicate dataset, we calculated the zero rates of each batch according to Eq. (2). Finally, we average the zero rates over all the J iterations to calculate the posterior mean ${\widehat{\rho }}_{0}$ of the zero rate of each batch and compare it with the corresponding observed zero rate. Moreover, we also compare BUSseq with a reduced model of BUSseq which ignores dropout events and hence uses negative binomial distribution without zero inflation, abbreviated as BUSseq-nzf (Supplementary Note 9), via the posterior predictive check (Supplementary Note 16).

Batch-effects-corrected values

To facilitate further downstream analysis, we also provide a version of count data $\widetilde{{\bf{X}}}={\{{\widetilde{X}}_{big}\}}_{b = 1, \cdots , B;i = 1, \cdots , {n}_{b}}^{g = 1, \cdots , G}$ for which the batch effects are removed and the biological variability is retained similar to that of ComBat⁷. Ideally, if x_big is the αth percentile of ${\rm{NB}}(\exp ({\widehat{\alpha }}_{g}+{\widehat{\beta }}_{g{\widehat{w}}_{bi}}+{\widehat{\nu }}_{bg}+{\widehat{\delta }}_{bi}),{\widehat{\phi }}_{bg})$, we aim to take the αth percentile of ${\rm{NB}}(\exp ({\widehat{\alpha }}_{g}+{\widehat{\beta }}_{g{\widehat{w}}_{bi}}),{\widehat{\phi }}_{bg})$ as the corrected value ${\tilde{x}}_{big}$. However, the negative binomial distribution is a discrete distribution. As a result, several $\tilde{x}$s can lie between the Pr(x ≤ x_big − 1)-percentile and Pr(x ≤ x_big)-percentile of the distribution of ${\widetilde{X}}_{big}$. For example, if ${X}_{big} \sim {\rm{NB}}(\exp (2),3)$, ${\widetilde{X}}_{big} \sim {\rm{NB}}(\exp (3),5)$, and our observed value x_big = 8, then Pr(x_big ≤ 7) and Pr(x_big ≤ 8) correspond to the 58.67th and 65.76th percentiles of ${\rm{NB}}(\exp (2),3)$. However, three numbers—21, 22, and 23—lie between 58.67th and 65.76th percentile of ${\rm{NB}}(\exp (3),5)$. Thus, to avoid bias, we draw one number uniformly from 21, 22, and 23 rather than take the maximum or the minimum to calculate ${\widetilde{x}}_{big}$.

Thus, we develop a quantile matching approach based on inverse sampling. Specifically, given the fitted model and the inferred underlying expression level ${\widehat{x}}_{big}$, we first sample u_big from ${\rm{Unif}}[{F}_{{\rm{NB}}}({\widehat{x}}_{big}-1;\exp ({\widehat{\alpha }}_{g}+{\widehat{\beta }}_{g{\widehat{w}}_{bi}}+{\widehat{\nu }}_{bg}+{\widehat{\delta }}_{bi}),{\widehat{\phi }}_{bg}),{F}_{{\rm{NB}}}({\widehat{x}}_{big};\exp ({\widehat{\alpha }}_{g}+{\widehat{\beta }}_{g{\widehat{w}}_{bi}}+{\widehat{\nu }}_{bg}+{\widehat{\delta }}_{bi}),{\widehat{\phi }}_{bg})]$ where Unif[a, b] denotes the uniform distribution on the interval [a, b] and F_NB( ⋅ ; μ, r) denotes the cumulative distribution function of a negative binomial distribution with mean μ and overdispersion parameter r. Next, we calculate the ${u}_{big}^{th}$ quantile of ${\rm{NB}}(\exp ({\widehat{\alpha }}_{g}+{\widehat{\beta }}_{g{\widehat{w}}_{bi}}),{\widehat{\phi }}_{1g})$ as the corrected value ${\widetilde{x}}_{big}$.
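The quantile matching step can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the BUSseq code: the function names are hypothetical, the quantile is taken as the smallest count whose cumulative probability reaches u, and the means and overdispersion parameters of the batch-specific and batch-free distributions are passed in directly.

```python
import math
import random


def nb_pmf(x, mu, phi):
    """Negative binomial pmf with mean mu and overdispersion phi."""
    log_coef = math.lgamma(phi + x) - math.lgamma(phi) - math.lgamma(x + 1)
    log_p = log_coef + x * math.log(mu / (mu + phi)) + phi * math.log(phi / (mu + phi))
    return math.exp(log_p)


def nb_cdf(x, mu, phi):
    """F_NB(x; mu, phi); returns 0.0 for x < 0."""
    cum = 0.0
    for t in range(x + 1):
        cum += nb_pmf(t, mu, phi)
    return cum


def nb_quantile(u, mu, phi):
    """Smallest x whose cumulative probability is at least u."""
    x, cum = 0, nb_pmf(0, mu, phi)
    while cum < u:
        x += 1
        p = nb_pmf(x, mu, phi)
        if p == 0.0 and x > mu:  # tail mass underflowed; stop rather than loop forever
            break
        cum += p
    return x


def correct_value(x_hat, mu_batch, phi_batch, mu_ref, phi_ref, rng):
    """Map an imputed count to its batch-corrected value by quantile matching."""
    lower = nb_cdf(x_hat - 1, mu_batch, phi_batch)
    upper = nb_cdf(x_hat, mu_batch, phi_batch)
    u = rng.uniform(lower, upper)  # random point within the probability mass of x_hat
    return nb_quantile(u, mu_ref, phi_ref)
```

When the batch-specific and batch-free distributions coincide, the procedure returns the imputed count unchanged, which is a convenient check of the implementation.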

The corrected data $\widetilde{{\bf{X}}}$ are not only free of batch effects but also have the missing data due to dropout events imputed. Moreover, further cell-specific normalization is not needed. Meanwhile, the biological variability is retained thanks to the quantile transformation and sampling step. Therefore, we can directly perform downstream analysis on $\widetilde{{\bf{X}}}$.

Preprocessing of the real datasets

Gene filtering

A common practice in scRNA-seq data analysis is to focus on the set of HVGs^{14,17,18,24,25} or on the genes with high mean expression levels across cells⁶¹. Although BUSseq is robust to gene filtering strategies in real studies (Supplementary Tables 17 and 18), a comparison of the ARIs resulting from the two gene filtering strategies leads us to recommend selecting HVGs in preprocessing (Supplementary Note 17). An intrinsic gene that distinguishes cell types well may be highly expressed in one cell type but lowly expressed in other cell types. As a result, its mean expression level across all of the cells may be low, and hence such a gene will be missed by filtering according to mean expression levels. Conversely, filtering genes according to mean expression levels is likely to select genes whose expression levels are high but remain the same across all of the cell types. Unfortunately, such genes can provide very limited information for differentiating cell types. We therefore select HVGs for the downstream analysis in the real data analyses.

Hematopoietic study

For the two hematopoietic datasets, we downloaded the read count matrix of the 1920 cells profiled by Paul et al.⁴⁴ and the 2729 cells labeled as myeloid progenitor cells by Nestorowa et al.⁴³ from the NCBI Gene Expression Omnibus (GEO) with the accession numbers GSE72857 and GSE81682. Following Brennecke et al.⁶², we first labeled cells using FACS labels and then performed the size factor normalization within each batch. Next, we selected the HVGs identified by Nestorowa et al.⁴³ that are common to the two datasets. These HVGs were denoted by Ensembl ID. The genes in the GSE81682 dataset were named by Ensembl ID, but the genes in the GSE72857 dataset were named by Gene Symbol. The R package biomaRt was used to query the corresponding Gene Symbol by Ensembl ID. Finally, we obtained 3,470 common HVGs shared by the two datasets.

Pancreas study

Two of the pancreas datasets, profiled by the CEL-seq and CEL-seq2 platforms, were downloaded from GEO with accession numbers GSE81076⁵⁰ and GSE85241⁶³. The two datasets assayed by the SMART-seq2 platform were obtained from GSE86473⁵¹ and from ArrayExpress accession number E-MTAB-5061⁵². Following Haghverdi et al.¹⁴, we excluded cells with low library sizes (<100,000 reads), low numbers of expressed genes (>40% total counts from ribosomal RNA genes), or high ERCC content (>20% of total counts from spike-in transcripts), resulting in 7095 cells. We selected the 2480 HVGs shared by the four datasets according to Brennecke et al.⁶² by sorting the ratio of variance and mean expression level after adjusting technical noise with the variances of spike-in transcripts. GSE86473 and E-MTAB-5061 have the cell type labels for all of the cells, but the cell type labels of GSE81076 and GSE85241 were inferred by the marker genes used in the original publications by Lawlor et al.⁵¹ and Grün et al.⁵⁰.

To assign cell type labels for the GSE81076 and GSE85241 datasets, following Haghverdi et al.¹⁴, we first extracted the normalized expression levels of the selected HVGs within each dataset. Next, we obtained the low dimensional embedding of HVGs by t-SNE for visualization. At the same time, we applied robust k-means clustering to the normalized expression levels of the selected HVGs using the pam (partitioning around medoids) function in the R package cluster. The number of clusters was set as nine. Next, we drew t-SNE plots colored by the expression levels of the marker genes. It is known that GCG is highly expressed in alpha islet cells, INS in beta islet cells, SST in delta islet cells, PPY in gamma islet cells (pancreatic polypeptide cells), PRSS1 in acinar cells, KRT19 in ductal cells and COL1A1 in mesenchymal cells^50,51, so we labeled each cluster by its corresponding highly expressed marker gene.

LUAD cancer cell line study

We downloaded the raw count data from the GitHub repository https://github.com/LuyiTian/sc_mixology with accession number GSE118767. We selected the top 6000 HVGs within each batch using the trendVar and decomposeVar functions in the R package scran⁶⁴ and obtained 2,267 common HVGs across three batches (Supplementary Note 15).

Naming clusters learned by BUSseq according to FACS labels

In the two real data examples, we first identify the cell type of each individual cell according to FACS labeling. Then, for each cluster learned by BUSseq, we calculate the proportion of labeled cell types. If a cell type accounts for more than one-third of the cells in a given cluster, we assign this cell type to the cluster. Although a cluster may be assigned more than one cell type, most identified clusters by BUSseq are dominated by only one cell type. For example, in the hematopoietic study, BUSseq identifies 1165 cells for Cluster 5. According to FACS labels, 1127 of the 1165 cells are megakaryocyte-erythrocyte progenitors (MEP). Therefore, we name Cluster 5 as MEP.
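The naming rule can be expressed compactly as in the sketch below, which takes the FACS labels of the cells in one learned cluster and returns every label that accounts for more than one-third of them. The function name is hypothetical, and the labels of the remaining 38 cells in the usage example are placeholders.

```python
from collections import Counter


def name_cluster(facs_labels, threshold=1 / 3):
    """Return the sorted FACS labels that each exceed `threshold` of the cluster."""
    counts = Counter(facs_labels)
    n = len(facs_labels)
    return sorted(label for label, c in counts.items() if c / n > threshold)
```

For instance, `name_cluster(["MEP"] * 1127 + ["other"] * 38)` returns `["MEP"]`, in line with the naming of Cluster 5 above.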

Mapping clusters to Haemopedia

Haemopedia is a database of gene expression profiles from diverse types of hematopoietic cells⁴⁶. It collected flow-sorted cell populations from healthy mice.

To understand Cluster 3 learned by BUSseq for the hematopoietic data, which is dominated by cells classified as other according to the FACS labeling, we mapped the cluster means learned by BUSseq to the Haemopedia RNA-seq dataset.

We first applied TMM normalization³⁵ to all the samples in the Haemopedia RNA-seq dataset. Then, we extracted seven types of hematopoietic stem and progenitor cells from Haemopedia, including Lin⁻Sca-1⁺c-Kit⁺ cells, short-term hematopoietic stem cells, MPP, CLP, CMP, MEP, and GMP. Each selected cell type had two RNA-seq samples in Haemopedia, so we averaged over the two replicates for each cell type. Further, we added one to the normalized expression levels as a pseudo read count to handle genes with zero read count and log-transformed the data. Finally, we scaled the data across the seven cell types for each gene. To be comparable, we transformed the cluster mean learned by BUSseq as ${m}_{gk}=\mathrm{log}\,(1+\exp ({\alpha }_{g}+{\beta }_{gk}))$ for gene g in cluster k and scaled m_gk across all cell types as well. Finally, we calculated the correlation between the cluster means inferred by BUSseq and the reference expression profiles in Haemopedia for 37 marker genes. The 37 marker genes were retrieved from Paul et al.⁴⁴ (31 marker genes for HSPC) and Herman et al.⁴⁵ (6 marker genes for LMPP).
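The transformation and comparison steps can be sketched as follows. The code makes two assumptions that the text does not specify: scaling is taken to mean centering and dividing by the sample standard deviation, and the correlation is taken to be the Pearson correlation. All function names are hypothetical.

```python
import math
from statistics import mean, stdev


def cluster_mean(alpha_g, beta_gk):
    """m_gk = log(1 + exp(alpha_g + beta_gk)), comparable to log(1 + normalized count)."""
    return math.log1p(math.exp(alpha_g + beta_gk))


def scale(values):
    """Center and scale one gene across cell types (assumes the values are not all equal)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def pearson(a, b):
    """Pearson correlation between two equal-length profiles."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    norm = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return cov / norm
```

In this sketch, each learned cluster would be represented by its vector of scaled m_gk values over the marker genes and correlated with the corresponding scaled reference profile of each Haemopedia cell type.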

Software availability

The C++ source code of the parallel multi-core-CPU version of BUSseq is available on GitHub https://github.com/songfd2018/BUSseq-1.0, and the CUDA C source code of the GPU version of BUSseq is available on GitHub https://github.com/Anguscgm/BUSseq_gpu. All code for producing the results and figures in this manuscript is also available on GitHub (https://github.com/songfd2018/BUSseq-1.1_implementation). Furthermore, we wrap the C++ source code as an R package, BUSseq (https://github.com/songfd2018/BUSseq-Rpackage).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

The published datasets used in this manuscript are available through the following accession numbers: SMART-seq2 platform hematopoietic data with GEO GSE81682 by Nestorowa et al.⁴³; MARS-seq platform hematopoietic data with GEO GSE72857 by Paul et al.⁴⁴; CEL-seq platform pancreas data with GEO GSE81076 by Grün et al.⁵⁰; CEL-seq2 platform pancreas data with GEO GSE85241 by Muraro et al.⁶³; SMART-seq2 platform pancreas data with GEO GSE86473 by Lawlor et al.⁵¹; and SMART-seq2 platform pancreas data with ArrayExpress E-MTAB-5061 by Segerstolpe et al.⁵²; human lung adenocarcinoma cell line data with GEO GSE118767 by Tian et al.⁶⁵. The parameter settings for the simulation study and the simulated data are available on GitHub (https://github.com/songfd2018/BUSseq-1.1_implementation).

References

Bacher, R. & Kendziorski, C. Design and computational analysis of single-cell RNA-sequencing experiments. Genome Biol. 17, 63 (2016).
Irizarry, R. A. et al. Multiple-laboratory comparison of microarray platforms. Nat. Methods 2, 345–350 (2005).
Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11, 733–739 (2010).
Taub, M. A., Corrada Bravo, H. & Irizarry, R. A. Overcoming bias and systematic errors in next generation sequencing data. Genome Med. 2, 87 (2010).
Hicks, S. C., Townes, F. W., Teng, M. & Irizarry, R. A. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 19, 562–578 (2018).
Kharchenko, P. V., Silberstein, L. & Scadden, D. T. Bayesian approach to single-cell differential expression analysis. Nat. Methods 11, 740 (2014).
Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
Leek, J. T. & Storey, J. D. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3, e161 (2007).
Leek, J. T. svaseq: removing batch effects and other unwanted noise from sequencing data. Nucleic Acids Res. 42, e161 (2014).
Risso, D., Ngai, J., Speed, T. P. & Dudoit, S. Normalization of RNA-seq data using factor analysis of control genes or samples. Nat. Biotechnol. 32, 896 (2014).
Jacob, L., Gagnon-Bartsch, J. A. & Speed, T. P. Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed. Biostatistics 17, 16–28 (2015).
Lin, Y. et al. Evaluating stably expressed genes in single cells. GigaScience 8, giz106 (2019).
Huo, Z., Ding, Y., Liu, S., Oesterreich, S. & Tseng, G. Meta-analytic framework for sparse k-means to identify disease subtypes in multiple transcriptomic studies. J. Am. Stat. Assoc. 111, 27–42 (2016).
Haghverdi, L., Lun, A. T., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421 (2018).
Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat. Biotechnol. 37, 685 (2019).
Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411 (2018).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e21 (2019).
Welch, J. D. et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887.e17 (2019).
Lin, Y. et al. scMerge leverages factor analysis, stable expression, and pseudoreplication to merge multiple single-cell RNA-seq datasets. Proc. Natl Acad. Sci. USA. 116, 9775–9784 (2019).
Luo, X. & Wei, Y. Batch effects correction with unknown subtypes. J. Am. Stat. Assoc. 114, 581–594 (2019).
Vallejos, C. A., Marioni, J. C. & Richardson, S. BASiCS: Bayesian analysis of single-cell sequencing data. PLoS Comput. Biol. 11, e1004333 (2015).
Wang, J. et al. Gene expression distribution deconvolution in single-cell RNA sequencing. Proc. Natl Acad. Sci. USA. 115, E6437–E6446 (2018).
Pierson, E. & Yau, C. ZIFA: dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol. 16, 241 (2015).
Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J.-P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun. 9, 284 (2018).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053 (2018).
Wang, J. et al. Data denoising with transfer learning in single-cell transcriptomics. Nat. Methods 16, 875–878 (2019).
Baran-Gale, J., Chandra, T. & Kirschner, K. Experimental design for single-cell RNA sequencing. Brief. Funct. Genom. 17, 233–239 (2017).
Dal Molin, A. & Di Camillo, B. How to design a single-cell RNA-sequencing experiment: pitfalls, challenges and perspectives. Brief. Bioinform. 20, 1384–1394 (2018).
Tierney, L. Markov chains for exploring posterior distributions. Ann. Stat. 22, 1701–1728 (1994).
Robert, C. & Casella, G. Monte Carlo Statistical Methods (Springer Science & Business Media, 2013).
Schwarz, G. Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978).
Casella, G. & Berger, R. L. Statistical Inference, vol. 2 (Duxbury, Pacific Grove, CA, 2002).
Miao, W., Ding, P. & Geng, Z. Identifiability of normal and normal mixture models with nonignorable missing data. J. Am. Stat. Assoc. 111, 1673–1683 (2016).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Robinson, M. D. & Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11, R25 (2010).
Lun, A. T., Bach, K. & Marioni, J. C. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17, 75 (2016).
Newton, M. A., Noueiry, A., Sarkar, D. & Ahlquist, P. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics 5, 155–176 (2004).
Peterson, C., Stingo, F. C. & Vannucci, M. Bayesian inference of multiple Gaussian graphical models. J. Am. Stat. Assoc. 110, 159–174 (2015).
Huang, M. et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat. Methods 15, 539 (2018).
Gong, W., Kwak, I.-Y., Pota, P., Koyano-Nakagawa, N. & Garry, D. J. DrImpute: imputing dropout events in single cell RNA sequencing data. BMC Bioinform. 19, 220 (2018).
Li, W. V. & Li, J. J. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nat. Commun. 9, 1–9 (2018).
Rand, W. M. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66, 846–850 (1971).
Nestorowa, S. et al. A single cell resolution map of mouse haematopoietic stem and progenitor cell differentiation. Blood 128, e20–e31 (2016).
Paul, F. et al. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell 163, 1663–1677 (2015).
Herman, J. S., Sagar & Grün, D. FateID infers cell fate bias in multipotent progenitors from single-cell RNA-seq data. Nat. Methods 15, 379 (2018).
Choi, J. et al. Haemopedia RNA-seq: a database of gene expression during haematopoiesis in mice and humans. Nucleic Acids Res. 47, D780–D785 (2018).
Zhu, L., Lei, J., Klei, L., Devlin, B. & Roeder, K. Semisoft clustering of single-cell data. Proc. Natl Acad. Sci. USA. 116, 466–471 (2019).
Huang, D. W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44 (2009).
Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27–30 (2000).
Grün, D. et al. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell 19, 266–277 (2016).
Lawlor, N. et al. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes. Genome Res. 27, 208–222 (2017).
Segerstolpe, Å et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 24, 593–607 (2016).
Tian, L. et al. Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nat. Methods 16, 479–487 (2019).
Zhang, A. W. et al. Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling. Nat. Methods 16, 1007–1015 (2019).
George, E. I. & McCulloch, R. E. Variable selection via Gibbs sampling. J. Am. Stat. Assoc. 88, 881–889 (1993).
Dempster, A. P., Laird, N. M. & Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Series B (Methodological) 39, 1–38 (1977).
Willson, L., Folks, J. & Young, J. Complete sufficiency and maximum likelihood estimation for the two-parameter negative binomial distribution. Metrika 33, 349–362 (1986).
Saha, K. & Paul, S. Bias-corrected maximum likelihood estimator of the negative binomial dispersion parameter. Biometrics 61, 179–185 (2005).
Gelman, A. et al. Bayesian Data Analysis (Chapman and Hall/CRC, 2013).
Gelman, A., Meng, X.-L. & Stern, H. Posterior predictive assessment of model fitness via realized discrepancies. Stat. Sin. 6, 733–760 (1996).
Duò, A., Robinson, M. D. & Soneson, C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Research 7, 1141 (2018).
Brennecke, P. et al. Accounting for technical noise in single-cell RNA-seq experiments. Nat. Methods 10, 1093 (2013).
Muraro, M. J. et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst. 3, 385–394 (2016).
Lun, A. T., McCarthy, D. J. & Marioni, J. C. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Research 5, 2122 (2016).
Tian, L. et al. scPipe: a flexible R/Bioconductor preprocessing pipeline for single-cell RNA-sequencing data. PLoS Comput. Biol. 14, e1006361 (2018).

Acknowledgements

This work was supported by the Hong Kong Ph.D. Fellowship PF15-17417 and the General Research Funds 14306417 and 14305319 from the Hong Kong Research Grants Council of the Hong Kong Special Administrative Region of the People’s Republic of China and Direct Grants from the Research Committee of the Chinese University of Hong Kong. We acknowledge Dr. Xiangyu Luo for helpful comments on an early version of our paper.

Author information

Authors and Affiliations

Department of Statistics, The Chinese University of Hong Kong, Hong Kong SAR, China
Fangda Song, Ga Ming Angus Chan & Yingying Wei

Contributions

F.S. developed the method and the proof, implemented the algorithm, prepared the software package, analyzed the data, and wrote the paper. G.M.A.C. implemented the algorithm and analyzed the data. Y.W. conceived and supervised the study, developed the method and the proof, and wrote the paper.

Corresponding author

Correspondence to Yingying Wei.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Communications thanks F. William Townes and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review File

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Song, F., Chan, G.M.A. & Wei, Y. Flexible experimental designs for valid single-cell RNA-sequencing experiments allowing batch effects correction. Nat Commun 11, 3274 (2020). https://doi.org/10.1038/s41467-020-16905-2

Received: 08 November 2019
Accepted: 29 May 2020
Published: 01 July 2020
DOI: https://doi.org/10.1038/s41467-020-16905-2
