Single-cell transcriptomics is a versatile tool for exploring heterogeneous cell populations, but as with all genomics experiments, batch effects can hamper data integration and interpretation. The success of batch-effect correction is often evaluated by visual inspection of low-dimensional embeddings, which are inherently imprecise. Here we present a user-friendly, robust and sensitive k-nearest-neighbor batch-effect test (kBET; https://github.com/theislab/kBET) for quantification of batch effects. We used kBET to assess commonly used batch-regression and normalization approaches, and to quantify the extent to which they remove batch effects while preserving biological variability. We also demonstrate the application of kBET to data from peripheral blood mononuclear cells (PBMCs) from healthy donors to distinguish cell-type-specific inter-individual variability from changes in relative proportions of cell populations. This has important implications for future data-integration efforts, central to projects such as the Human Cell Atlas.
We applied the batch estimates to several scRNA-seq datasets. In the inDrop publication, droplet-based sequencing was demonstrated on mESCs grown in LIF+ medium, together with two additional technical replicates12. In our analysis, we used two replicates that comprised 5,952 cells from two batches and 11,308 genes, each with more than 4 unique molecular identifier (UMI) reads in at least 2 cells. Data were downloaded as UMI-filtered read count matrices from accession GSE65525.
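The gene filter described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `filter_genes`, its parameters and the toy matrix are hypothetical, and the sketch assumes the criterion is applied per gene (a gene is kept if at least 2 cells each show more than 4 UMI reads for it).

```python
def filter_genes(counts, min_cells=2, min_umi=4):
    """Return the indices of genes that pass the expression filter.

    counts: list of per-gene rows, each a list of UMI counts per cell.
    A gene is kept if at least `min_cells` cells have more than `min_umi` UMIs.
    (Hypothetical helper; thresholds follow the text, layout is an assumption.)
    """
    kept = []
    for gene_idx, row in enumerate(counts):
        n_expressing = sum(1 for c in row if c > min_umi)
        if n_expressing >= min_cells:
            kept.append(gene_idx)
    return kept


# Toy genes x cells matrix (made-up values for illustration only)
toy = [
    [0, 5, 7, 1],  # two cells above 4 UMIs: kept
    [4, 4, 4, 4],  # exactly 4 is not "more than 4": dropped
    [9, 0, 0, 0],  # only one cell above the threshold: dropped
    [5, 5, 5, 0],  # three cells above the threshold: kept
]
print(filter_genes(toy))  # [0, 3]
```

Note the strict inequality: a count of exactly 4 does not satisfy "more than 4".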
Kolodziejczyk et al.14 explored heterogeneity in mESCs cultured with three different media (2i, a2i and LIF+) on full-length sequenced transcripts (Smart-seq). The three conditions included 219, 123 and 207 cells in 4, 2 and 3 batches, respectively. The mESC data sequenced with full-length Smart-seq14 were downloaded from ENA (project ID PRJEB6455) as FASTQ files and mapped to an Ensembl52 mouse transcriptome (GRCm38.p5.87, equivalent to UCSC mm10) with Salmon24. Cells were quality-controlled according to data derived from the Espresso database (http://www.ebi.ac.uk/teichmann-srv/espresso/).
Further, scRNA-seq has been widely applied in explorations of mouse embryonic development. To test the performance of batch correction for data integration, we collected single-cell RNA-seq data of mouse early embryonic development from eight different studies16,17,18,19,20,21,22,23, consisting of 56, 49, 124, 65, 15, 294, 17 and 15 cells, respectively. The early embryonic development data used have the following accession IDs: E-GEOD-57249, E-GEOD-70605, E-MTAB-3321, GSE53386, E-MTAB-2958, E-GEOD-45719, E-GEOD-44183 and E-GEOD-66582. All studies applied Smart-seq-based protocols for scRNA-seq. All FASTQ files were mapped to an Ensembl52 mouse transcriptome (version GRCm38.p5.87) with Salmon24 (version 0.8.2; k-mer = 21 to tolerate different read lengths). Here we considered the studies as batches while omitting the flowcell batches. We continued our analysis without further gene filtering or quality control.
Kang et al.26 studied genetic variation among PBMCs from eight individuals as a replacement for cell barcoding in droplet-based sequencing (10x Genomics). From that study, we used three experimental runs: 3,514 and 4,106 cells from four healthy donors each, and 5,832 cells from these eight healthy donors. Human PBMC data26 can be provided by the authors upon request. Count matrices are available under accession number GSE96583.
Tung, P.-Y. et al. Batch effects and the effective design of single-cell gene expression studies. Sci. Rep. 7, 39921 (2017).
Lun, A. T. L., McCarthy, D. J. & Marioni, J. C. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res. 5, 2122 (2016).
Heimberg, G., Bhatnagar, R., El-Samad, H. & Thomson, M. Low dimensionality in gene expression data enables the accurate extraction of transcriptional programs from shallow sequencing. Cell Syst. 2, 239–250 (2016).
Finak, G. et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 16, 278 (2015).
Hicks, S. C., Townes, F. W., Teng, M. & Irizarry, R. A. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 19, 562–578 (2018).
Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
Cressie, N. & Read, T. R. C. Pearson’s χ2 and the loglikelihood ratio statistic G2: a comparative review. Int. Stat. Rev. 57, 19–43 (1989).
Klein, A. M. et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201 (2015).
Brennecke, P. et al. Accounting for technical noise in single-cell RNA-seq experiments. Nat. Methods 10, 1093–1095 (2013).
Kolodziejczyk, A. A. et al. Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation. Cell Stem Cell 17, 471–485 (2015).
Angerer, P. et al. Single cells make big data: new challenges and opportunities in transcriptomics. Curr. Opin. Syst. Biol. 4, 85–91 (2017).
Biase, F. H., Cao, X. & Zhong, S. Cell fate inclination within 2-cell and 4-cell mouse embryos revealed by single-cell RNA sequencing. Genome Res. 24, 1787–1796 (2014).
Liu, W. et al. Identification of key factors conquering developmental arrest of somatic cell cloned embryos by combining embryo biopsy and single-cell sequencing. Cell Discov. 2, 16010 (2016).
Goolam, M. et al. Heterogeneity in Oct4 and Sox2 targets biases cell fate in 4-cell mouse embryos. Cell 165, 61–74 (2016).
Fan, X. et al. Single-cell RNA-seq transcriptome analysis of linear and circular RNAs in mouse preimplantation embryos. Genome Biol. 16, 148 (2015).
Boroviak, T. et al. Lineage-specific profiling delineates the emergence and progression of naive pluripotency in mammalian embryogenesis. Dev. Cell 35, 366–382 (2015).
Deng, Q., Ramsköld, D., Reinius, B. & Sandberg, R. Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science 343, 193–196 (2014).
Xue, Z. et al. Genetic programs in human and mouse early embryos revealed by single-cell RNA sequencing. Nature 500, 593–597 (2013).
Wu, J. et al. The landscape of accessible chromatin in mammalian preimplantation embryos. Nature 534, 652–657 (2016).
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).
Teng, M. et al. A benchmark for RNA-seq quantification pipelines. Genome Biol. 17, 74 (2016).
Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 36, 89–94 (2018).
Liu, Q. et al. Quantitative assessment of cell population diversity in single-cell landscapes. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/05/30/333393 (2018).
Risso, D., Ngai, J., Speed, T. P. & Dudoit, S. Normalization of RNA-seq data using factor analysis of control genes or samples. Nat. Biotechnol. 32, 896–902 (2014).
Bacher, R. et al. SCnorm: robust normalization of single-cell RNA-seq data. Nat. Methods 14, 584–586 (2017).
Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J.-P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun. 9, 284 (2018).
Buettner, F., Pratanwanich, N., McCarthy, D. J., Marioni, J. C. & Stegle, O. f-scLVM: scalable and versatile factor analysis for single-cell RNA-seq. Genome Biol. 18, 212 (2017).
Li, W. V. & Li, J. J. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nat. Commun. 9, 997 (2018).
Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Single cell RNA-seq denoising using a deep count autoencoder. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/04/13/300681 (2018).
Huang, M. et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat. Methods 15, 539–542 (2018).
van Dijk, D. et al. Recovering gene interactions from single-cell data using data diffusion. Cell 174, 716–729 (2018).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods: towards more accurate and robust tools. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/03/05/276907 (2018).
Bhaduri, A., Nowakowski, T. J., Pollen, A. A. & Kriegstein, A. R. Saturating single-cell datasets. bioRxiv Preprint at https://www.biorxiv.org/content/early/2017/11/12/218370 (2017).
Regev, A. et al. The Human Cell Atlas. eLife 6, e27041 (2017).
Tabula Muris Consortium. Single-cell transcriptomic characterization of 20 organs and tissues from individual mice creates a Tabula Muris. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/03/29/237446 (2018).
McCarthy, D. J., Campbell, K. R., Lun, A. T. L. & Wills, Q. F. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33, 1179–1186 (2017).
Baik, J. & Silverstein, J. W. Eigenvalues of large sample covariance matrices of spiked population models. J. Multivariate Anal. 97, 1382–1408 (2006).
Rousseeuw, P. J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).
Andrews, T. S. & Hemberg, M. Dropout-based feature selection for scRNASeq. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/05/17/065094 (2018).
Vallejos, C. A., Risso, D., Scialdone, A., Dudoit, S. & Marioni, J. C. Normalizing single-cell RNA sequencing data: challenges and opportunities. Nat. Methods 14, 565–571 (2017).
Lun, A. T. L., Bach, K. & Marioni, J. C. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17, 75 (2016).
Paulson, J. N. et al. Tissue-aware RNA-Seq processing and normalization for heterogeneous and sparse data. BMC Bioinformatics 18, 437 (2017).
Leek, J. T., Johnson, W. E., Parker, H. S., Jaffe, A. E. & Storey, J. D. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28, 882–883 (2012).
Stegle, O., Parts, L., Piipari, M., Winn, J. & Durbin, R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nat. Protoc. 7, 500–507 (2012).
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).
Aken, B. L. et al. Ensembl 2017. Nucleic Acids Res. 45, D635–D642 (2017).
We thank A. Böttcher for motivating this study, and T. Illicic for carrying out pilot analyses. We thank in particular M. Subramaniam and J. Ye (UCSF) for the PBMC data. We are grateful to the members of the Teichmann and Theis labs for valuable discussions and comments on the manuscript. M.B. is supported by a DFG Fellowship through the Graduate School of Quantitative Biosciences Munich (QBM). Z.M. is supported by a Single Cell Gene Expression Atlas grant from the Wellcome Trust (nr. 108437/Z/15/Z). F.A.W. acknowledges support by the Helmholtz Postdoc Programme, Initiative and Networking Fund of the Helmholtz Association. F.J.T. acknowledges financial support by the German Science Foundation (SFB 1243 and Graduate School QBM) and by the Bavarian government (BioSysNet). This collaboration was supported by a Helmholtz International Fellow Award to S.A.T.
The authors declare no competing interests.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Integrated supplementary information
Neighborhood size effect for two simulated datasets (1,000 genes, 500 cells, 2 batches of equal size). The dashed vertical line marks the optimal neighborhood size for batch-effect detection, that is, the size at which the rejection rate is maximal. Shaded areas represent the 95th percentile of 100 repeated kBET runs. In each run, the number of tested neighborhoods is 10% of the sample size. (a) For 1% of genes, the mean expression levels are varied across the batches. The observed rejection rate is low overall, and decreases with increasing neighborhood size. (b) For 20% of genes, the mean expression levels are varied across the batches. The observed rejection rate is almost 100% and decreases for neighborhood sizes larger than 75% of the sample size. All flavors of kBET return almost identical results. The exact test has slightly lower rejection rates than Pearson’s χ2-test for small neighborhoods (<10% sample size).
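To make the notion of a rejection rate concrete, the following is a simplified sketch of a neighborhood-based batch test in the spirit of kBET. It is not the kBET package implementation: it assumes exactly two batches (so that Pearson’s χ2 statistic has one degree of freedom), Euclidean distances on the raw coordinates, and a fixed significance level. The names `kbet_rejection_rate` and `simulate` and all parameter values are hypothetical.

```python
import math
import random
from collections import Counter


def kbet_rejection_rate(cells, batches, k, frac_tested=0.1, alpha=0.05, seed=0):
    """Fraction of tested neighborhoods whose batch mix deviates from the global mix.

    Simplified sketch for exactly two batches (1 degree of freedom).
    """
    rng = random.Random(seed)
    n = len(cells)
    labels = sorted(set(batches))
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two batches")
    global_freq = {b: c / n for b, c in Counter(batches).items()}

    # Test a random subset of neighborhoods (10% of the sample size by default)
    n_tested = max(1, round(frac_tested * n))
    tested = rng.sample(range(n), n_tested)
    rejected = 0
    for i in tested:
        # k nearest neighbors of cell i by Euclidean distance (cell i included)
        order = sorted(range(n), key=lambda j: math.dist(cells[i], cells[j]))
        local = Counter(batches[j] for j in order[:k])
        # Pearson's chi-squared statistic: observed vs. expected batch counts
        stat = sum(
            (local[b] - k * global_freq[b]) ** 2 / (k * global_freq[b])
            for b in labels
        )
        # Survival function of the chi-squared distribution with 1 d.o.f.
        p_value = math.erfc(math.sqrt(stat / 2))
        if p_value < alpha:
            rejected += 1
    return rejected / n_tested


def simulate(shift, n_per_batch=100, dims=5, seed=1):
    """Two Gaussian batches; batch 1 is offset by `shift` in every dimension."""
    rng = random.Random(seed)
    cells, batches = [], []
    for batch, offset in ((0, 0.0), (1, shift)):
        for _ in range(n_per_batch):
            cells.append([rng.gauss(offset, 1.0) for _ in range(dims)])
            batches.append(batch)
    return cells, batches


mixed = kbet_rejection_rate(*simulate(shift=0.0), k=30)
separated = kbet_rejection_rate(*simulate(shift=10.0), k=30)
print(f"well mixed: {mixed:.2f}, separated: {separated:.2f}")
```

In this toy setting, well-mixed batches give a low rejection rate, whereas fully separated batches are rejected in every tested neighborhood. Varying `k` mimics the neighborhood-size scan shown in the figure.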
Dropout is controlled by the shape parameter k, where batch 1 has k1 = –1 and k2 in batch 2 ranges from –0.9 to –3. Batches are always equally sized; sample sizes refer to the total number of samples in the dataset. (a) kBET rejection rates, variance explained and silhouette coefficient for each simulation (from top to bottom). (b) PCA plots for a large difference (left, Δk = 2) and a small difference (right, Δk = 0.04). (c-d) Mean relation (top), dropout relation (center), and cellular detection rate (CDR) effect (bottom) for a large difference (c) and a small difference (d). Blue lines indicate the linear fit with parameters depicted in the top left corner. R2 values denote the variance explained by the fit.
Additional noise is a factor that is multiplied onto the final gene expression means and is drawn from a log-normal distribution (LN) parameterized by a batch factor and a batch scale. We simulated 1,000 samples and drew subsets from this dataset. Batches were always equally sized. (a-b) Batch-effect analysis for several batch factors (a) and batch scales (b) using kBET, PC regression (variance explained by batch effect) and silhouette coefficients (from top to bottom). kBET rejection rates were computed for both the original data space and the default 50-dimensional PC space (top two plots). (c-d) Mean relation (top), dropout relation (center) and PCA plot for a batch factor of 0.1 (c) and a batch scale of 0.5 (d). Blue lines indicate the linear fit with parameters depicted in the top left corner. R2 values denote the variance explained by the fit and correspond to high correlation of mean and dropout of both batches.
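As an illustration of this kind of multiplicative batch noise, the sketch below scales a vector of gene means by log-normally distributed factors. It rests on assumptions that the caption does not spell out: the batch factor is taken as the location parameter and the batch scale as the scale parameter of the log-normal distribution, and one factor is drawn per gene. The function name `apply_batch_noise` and the example means are hypothetical.

```python
import random


def apply_batch_noise(gene_means, batch_factor, batch_scale, seed=0):
    """Multiply each gene mean by a log-normally distributed batch factor.

    Assumption: batch_factor is the location (mu) and batch_scale the
    scale (sigma) of the underlying normal distribution; one factor per gene.
    """
    rng = random.Random(seed)
    factors = [rng.lognormvariate(batch_factor, batch_scale) for _ in gene_means]
    return [mean * factor for mean, factor in zip(gene_means, factors)]


# Made-up gene expression means for a reference batch
base_means = [1.0, 5.0, 20.0, 100.0]

# Means for a second batch, perturbed with batch factor 0.1 and batch scale 0.5
batch2 = apply_batch_noise(base_means, batch_factor=0.1, batch_scale=0.5)
for before, after in zip(base_means, batch2):
    print(f"{before:8.2f} -> {after:8.2f}")
```

Because the factors are strictly positive, the perturbed means stay positive; a larger batch scale spreads the per-gene factors more widely and hence produces a stronger batch effect.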
Supplementary Figure 4 Highly variable genes and kBET results after batch regression (Klein et al.).
(a) Number of retained highly variable genes before and after batch correction. Reference: intersect of highly variable genes per batch with log(counts + 1) normalization. (b) Total number of highly variable genes after batch correction. (c) False positive rates on highly variable genes for all combinations of normalization and batch-correction methods. (d) Comparison of silhouette coefficient and kBET mean ‘acceptance rate’ (1 – rejection rate) from 100 kBET runs. (e) Comparison of PC regression and kBET mean ‘acceptance rate’ (1 – rejection rate) from 100 kBET runs.
Supplementary Figure 5 Deeply sequenced SMART-seq2/C1 mESC data have similar characteristics for batch correction (Kolodziejczyk et al.).
(a) Illustration of two full-length read datasets with replicates in 2i, LIF and a2i culture (219, 207 and 123 cells, respectively). (b) PCA plots for log(CPM + 1) ComBat-corrected data. (c) Percentage of retained highly variable genes versus kBET acceptance rate (equals 1 – rejection rate) for all combinations of normalization and batch-correction approaches. Best-performing normalization-regression strategies cluster in the top right corner, such as ComBat on log(CPM + 1) data. Isolated cells do not have mutual nearest neighbors and appear in some correction models. Seurat’s CCA alignment batch-corrects data only in a latent space as done in manifold learning, and we therefore could not compute highly variable genes and show only kBET values.
Supplementary Figure 6 Sequencing depth in mouse early development data varies by study rather than cell type.
Sequencing depth (library size) per developmental stage (shape) in eight different studies (color-coded) of mouse embryonic development.
PBMC data from eight unrelated individuals processed in three experiments (batches) (Chromium 10X Genomics device), with donor cell identity assigned with demuxlet. Note that with pooling of cells from multiple donors, between-donor processing batch effects are effectively excluded. We applied kBET as a sensitive measure of inter-individual variability. (a) t-SNE plot of all data. Cell types are annotated as in the original publication. (b) Cell-type frequencies per individual and batch. Cells were prepared once and processed in two separate runs. Inter-individual variability is stronger than preparation bias. (c) kBET acceptance rates (1 – rejection rate) for several subsets of the complete dataset. Subsample sizes were chosen from 10% to 100% of the data sample size. Subsampling was repeated threefold and kBET rejection rates were averaged across these replicates to reduce bias from subsampling. With decreasing sample size, we find decreasing rejection rates. This is because smaller samples reduce the statistical certainty of each tested neighborhood, so that the test fails to reject the null hypothesis more often.
(a-d) Comparison of cellular detection rate (CDR) effect and library size for inDrop UMI data (Klein et al.) (a-b) and C1 SMART-seq data (Kolodziejczyk et al.) (c-d). (a,c) CDR for count data. (b,d) CDR for CPM-normalized data, where a gene was counted as detected if CPM > 1. The blue line shows a linear fit of CDR (y) to library size (x) and batch variable (z), and R2 denotes the variance explained by the model. Gray shaded areas indicate the s.e. of the fit. In UMI data, library size largely explains the CDR effect, independent of normalization (a,b). In SMART-seq data, we observe a significant contribution of library size to the CDR effect (c) that is lost in CPM-normalized data (d).
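The two CDR definitions compared in this caption can be sketched as follows. The CPM > 1 threshold is taken from the caption; counting a gene as detected in raw count data when its count exceeds zero is an assumption, and the helper names `cpm` and `detection_rate` as well as the toy cell are hypothetical.

```python
def cpm(cell_counts):
    """Counts per million: scale one cell's counts to a library size of 1e6."""
    total = sum(cell_counts)
    if total == 0:
        raise ValueError("cell has no counts")
    return [c * 1e6 / total for c in cell_counts]


def detection_rate(values, threshold):
    """Fraction of genes whose value exceeds `threshold` in one cell."""
    return sum(1 for v in values if v > threshold) / len(values)


# Toy deeply sequenced cell with a library size of 2,000,000 (made-up values)
cell = [1, 0, 10, 1_999_989]

cdr_counts = detection_rate(cell, 0)      # detected if count > 0 (assumption)
cdr_cpm = detection_rate(cpm(cell), 1.0)  # detected if CPM > 1 (as in the caption)
print(cdr_counts, cdr_cpm)  # 0.75 0.5
```

In this toy cell, the gene with a single read counts as detected on the raw scale but falls below CPM = 1 once the library size exceeds one million, which shows how CPM normalization can change the relation between CDR and library size in deeply sequenced data.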
Supplementary Figures 1–8 and Supplementary Notes 1–7
k-nearest-neighbor batch-effect test (kBET) is available as an R package
Overview of batch-regression and normalization approaches
Top 20 best-performing batch-correction strategies
About this article
Cite this article
Büttner, M., Miao, Z., Wolf, F.A. et al. A test metric for assessing single-cell RNA-seq batch correction. Nat Methods 16, 43–49 (2019). https://doi.org/10.1038/s41592-018-0254-1