  • Analysis

A test metric for assessing single-cell RNA-seq batch correction

Abstract

Single-cell transcriptomics is a versatile tool for exploring heterogeneous cell populations, but as with all genomics experiments, batch effects can hamper data integration and interpretation. The success of batch-effect correction is often evaluated by visual inspection of low-dimensional embeddings, which are inherently imprecise. Here we present a user-friendly, robust and sensitive k-nearest-neighbor batch-effect test (kBET; https://github.com/theislab/kBET) for quantification of batch effects. We used kBET to assess commonly used batch-regression and normalization approaches, and to quantify the extent to which they remove batch effects while preserving biological variability. We also demonstrate the application of kBET to data from peripheral blood mononuclear cells (PBMCs) from healthy donors to distinguish cell-type-specific inter-individual variability from changes in relative proportions of cell populations. This has important implications for future data-integration efforts, central to projects such as the Human Cell Atlas.

Fig. 1: Batch types and the concept of kBET.
Fig. 2: kBET is more responsive than other batch tests on simulated data.
Fig. 3: ComBat provides the best correction on mESC inDrop technical replicates.
Fig. 4: kBET assesses data-integration quality and inter-individual variability.

Data availability

We applied the batch estimates to several scRNA-seq datasets. In the inDrop publication, the droplet-based sequencing was demonstrated on mESCs growing on LIF+ medium and two additional technical replicates12. In our analysis, we used two replicates that consisted of 5,952 cells from two batches and 11,308 genes with at least 2 cells having more than 4 unique molecular identifier (UMI) reads per cell. Data were downloaded as UMI-filtered read count matrices from accession GSE65525.
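
The gene filter described above can be sketched as follows. This is only one reading of the sentence (a gene is kept when at least 2 cells each have more than 4 UMI counts for it); the function name `filter_genes`, its parameters and the toy matrix are hypothetical and not taken from the original analysis.

```python
import numpy as np

def filter_genes(counts, min_cells=2, min_umi=4):
    """Keep genes for which at least `min_cells` cells exceed `min_umi` UMI counts.

    counts: array of shape (n_cells, n_genes) holding UMI counts.
    Returns the filtered matrix and the boolean mask of retained genes.
    """
    counts = np.asarray(counts)
    # Per gene: number of cells whose count is strictly above the threshold.
    cells_above = (counts > min_umi).sum(axis=0)
    keep = cells_above >= min_cells
    return counts[:, keep], keep

# Toy matrix (assumed data): 4 cells x 3 genes.
toy = np.array([[5, 0, 9],
                [6, 1, 0],
                [0, 7, 0],
                [0, 0, 0]])
filtered, keep = filter_genes(toy)
```

In the toy matrix only the first gene passes, because it is the only one with two cells above the threshold.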

Kolodziejczyk et al.14 explored heterogeneity in mESCs cultured with three different media (2i, a2i and LIF+) on full-length sequenced transcripts (Smart-seq). The three conditions included 219, 123 and 207 cells in 4, 2 and 3 batches, respectively. The mESC data sequenced with full-length Smart-seq14 were downloaded from ENA (project ID PRJEB6455) as FASTQ files and mapped to an Ensembl52 mouse transcriptome (GRCm38.p5.87, equivalent to UCSC mm10) with Salmon24. Cells were quality-controlled according to data derived from the Espresso database (http://www.ebi.ac.uk/teichmann-srv/espresso/).

Further, scRNA-seq has been widely applied in explorations of mouse embryonic development. To test the performance of batch correction for data integration, we collected single-cell RNA-seq data of mouse early embryonic development from eight different studies16,17,18,19,20,21,22,23, consisting of 56, 49, 124, 65, 15, 294, 17 and 15 cells, respectively. The early embryonic development data used have the following accession IDs: E-GEOD-57249, E-GEOD-70605, E-MTAB-3321, GSE53386, E-MTAB-2958, E-GEOD-45719, E-GEOD-44183 and E-GEOD-66582. All studies applied Smart-seq-based protocols for scRNA-seq. All FASTQ files were mapped to an Ensembl52 mouse transcriptome (version GRCm38.p5.87) with Salmon24 (version 0.8.2; k-mer = 21 to tolerate different read lengths). Here we considered the studies as batches while omitting the flowcell batches. We continued our analysis without further gene filtering or quality control.

Kang et al.26 studied genetic variation among PBMCs from eight individuals as a replacement for cell barcoding in droplet-based sequencing (10x Genomics). From that study, we used three experimental runs: 3,514 and 4,106 cells from four healthy donors each, and 5,832 cells from these eight healthy donors. Human PBMC data26 can be provided by the authors upon request. Count matrices are available under accession number GSE96583.

References

  1. Tung, P.-Y. et al. Batch effects and the effective design of single-cell gene expression studies. Sci. Rep. 7, 39921 (2017).

  2. Lun, A. T. L., McCarthy, D. J. & Marioni, J. C. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res. 5, 2122 (2016).

  3. Heimberg, G., Bhatnagar, R., El-Samad, H. & Thomson, M. Low dimensionality in gene expression data enables the accurate extraction of transcriptional programs from shallow sequencing. Cell Syst. 2, 239–250 (2016).

  4. Finak, G. et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 16, 278 (2015).

  5. Hicks, S. C., Townes, F. W., Teng, M. & Irizarry, R. A. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 19, 562–578 (2018).

  6. Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).

  7. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).

  8. Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).

  9. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).

  10. Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).

  11. Cressie, N. & Read, T. R. C. Pearson’s χ² and the loglikelihood ratio statistic G²: a comparative review. Int. Stat. Rev. 57, 19–43 (1989).

  12. Klein, A. M. et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201 (2015).

  13. Brennecke, P. et al. Accounting for technical noise in single-cell RNA-seq experiments. Nat. Methods 10, 1093–1095 (2013).

  14. Kolodziejczyk, A. A. et al. Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation. Cell Stem Cell 17, 471–485 (2015).

  15. Angerer, P. et al. Single cells make big data: new challenges and opportunities in transcriptomics. Curr. Opin. Syst. Biol. 4, 85–91 (2017).

  16. Biase, F. H., Cao, X. & Zhong, S. Cell fate inclination within 2-cell and 4-cell mouse embryos revealed by single-cell RNA sequencing. Genome Res. 24, 1787–1796 (2014).

  17. Liu, W. et al. Identification of key factors conquering developmental arrest of somatic cell cloned embryos by combining embryo biopsy and single-cell sequencing. Cell Discov. 2, 16010 (2016).

  18. Goolam, M. et al. Heterogeneity in Oct4 and Sox2 targets biases cell fate in 4-cell mouse embryos. Cell 165, 61–74 (2016).

  19. Fan, X. et al. Single-cell RNA-seq transcriptome analysis of linear and circular RNAs in mouse preimplantation embryos. Genome Biol. 16, 148 (2015).

  20. Boroviak, T. et al. Lineage-specific profiling delineates the emergence and progression of naive pluripotency in mammalian embryogenesis. Dev. Cell 35, 366–382 (2015).

  21. Deng, Q., Ramsköld, D., Reinius, B. & Sandberg, R. Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science 343, 193–196 (2014).

  22. Xue, Z. et al. Genetic programs in human and mouse early embryos revealed by single-cell RNA sequencing. Nature 500, 593–597 (2013).

  23. Wu, J. et al. The landscape of accessible chromatin in mammalian preimplantation embryos. Nature 534, 652–657 (2016).

  24. Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).

  25. Teng, M. et al. A benchmark for RNA-seq quantification pipelines. Genome Biol. 17, 74 (2016).

  26. Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 36, 89–94 (2018).

  27. Liu, Q. et al. Quantitative assessment of cell population diversity in single-cell landscapes. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/05/30/333393 (2018).

  28. Risso, D., Ngai, J., Speed, T. P. & Dudoit, S. Normalization of RNA-seq data using factor analysis of control genes or samples. Nat. Biotechnol. 32, 896–902 (2014).

  29. Bacher, R. et al. SCnorm: robust normalization of single-cell RNA-seq data. Nat. Methods 14, 584–586 (2017).

  30. Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J.-P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun. 9, 284 (2018).

  31. Buettner, F., Pratanwanich, N., McCarthy, D. J., Marioni, J. C. & Stegle, O. f-scLVM: scalable and versatile factor analysis for single-cell RNA-seq. Genome Biol. 18, 212 (2017).

  32. Li, W. V. & Li, J. J. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nat. Commun. 9, 997 (2018).

  33. Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Single cell RNA-seq denoising using a deep count autoencoder. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/04/13/300681 (2018).

  34. Huang, M. et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat. Methods 15, 539–542 (2018).

  35. van Dijk, D. et al. Recovering gene interactions from single-cell data using data diffusion. Cell 174, 716–729 (2018).

  36. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).

  37. Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods: towards more accurate and robust tools. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/03/05/276907 (2018).

  38. Bhaduri, A., Nowakowski, T. J., Pollen, A. A. & Kriegstein, A. R. Saturating single-cell datasets. bioRxiv Preprint at https://www.biorxiv.org/content/early/2017/11/12/218370 (2017).

  39. Regev, A. et al. The Human Cell Atlas. eLife 6, e27041 (2017).

  40. Tabula Muris Consortium. Single-cell transcriptomic characterization of 20 organs and tissues from individual mice creates a Tabula Muris. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/03/29/237446 (2018).

  41. McCarthy, D. J., Campbell, K. R., Lun, A. T. L. & Wills, Q. F. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33, 1179–1186 (2017).

  42. Baik, J. & Silverstein, J. W. Eigenvalues of large sample covariance matrices of spiked population models. J. Multivariate Anal. 97, 1382–1408 (2006).

  43. Rousseeuw, P. J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).

  44. Andrews, T. S. & Hemberg, M. Dropout-based feature selection for scRNASeq. bioRxiv Preprint at https://www.biorxiv.org/content/early/2018/05/17/065094 (2018).

  45. Vallejos, C. A., Risso, D., Scialdone, A., Dudoit, S. & Marioni, J. C. Normalizing single-cell RNA sequencing data: challenges and opportunities. Nat. Methods 14, 565–571 (2017).

  46. Lun, A. T. L., Bach, K. & Marioni, J. C. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17, 75 (2016).

  47. Paulson, J. N. et al. Tissue-aware RNA-Seq processing and normalization for heterogeneous and sparse data. BMC Bioinformatics 18, 437 (2017).

  48. Leek, J. T., Johnson, W. E., Parker, H. S., Jaffe, A. E. & Storey, J. D. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28, 882–883 (2012).

  49. Stegle, O., Parts, L., Piipari, M., Winn, J. & Durbin, R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nat. Protoc. 7, 500–507 (2012).

  50. Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).

  51. Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).

  52. Aken, B. L. et al. Ensembl 2017. Nucleic Acids Res. 45, D635–D642 (2017).

Acknowledgements

We thank A. Böttcher for motivating this study, and T. Illicic for carrying out pilot analyses. We thank in particular M. Subramaniam and J. Ye (UCSF) for the PBMC data. We are grateful to the members of the Teichmann and Theis labs for valuable discussions and comments on the manuscript. M.B. is supported by a DFG Fellowship through the Graduate School of Quantitative Biosciences Munich (QBM). Z.M. is supported by a Single Cell Gene Expression Atlas grant from the Wellcome Trust (nr. 108437/Z/15/Z). F.A.W. acknowledges support by the Helmholtz Postdoc Programme, Initiative and Networking Fund of the Helmholtz Association. F.J.T. acknowledges financial support by the German Science Foundation (SFB 1243 and Graduate School QBM) and by the Bavarian government (BioSysNet). This collaboration was supported by a Helmholtz International Fellow Award to S.A.T.

Author information

Contributions

M.B. developed, tested and validated the method; prepared and analyzed the data; and wrote the paper. Z.M. prepared and analyzed the data and wrote the paper. F.A.W. assisted with method development and manuscript writing. S.A.T. and F.J.T. oversaw the research, designed the method validation and wrote the paper.

Corresponding authors

Correspondence to Sarah A. Teichmann or Fabian J. Theis.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Assessing neighborhood size effect with different flavors of kBET.

Neighborhood size effect for two simulated datasets (1,000 genes, 500 cells, 2 batches of equal size). Dashed vertical line shows the optimal neighborhood size for batch effect detection, that is, where the rejection rate is maximal. Shaded areas represent the 95th percentile of 100 repeated kBET runs. In each run, the number of tested neighborhoods is 10% of the sample size. (a) For 1% of genes, the mean expression levels are varied across the batches. The observed rejection rate is low overall, and decreases with increasing neighborhood size. (b) For 20% of genes, the mean expression levels are varied across the batches. The observed rejection rate is almost 100% and decreases for neighborhood sizes larger than 75%. The vertical dashed line marks the optimal neighborhood size. All flavors of kBET return almost identical results. The exact test has slightly lower rejection rates than Pearson’s χ² test for small neighborhoods (<10% sample size).
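
The test behind these rejection rates can be illustrated with a minimal sketch. It is not the reference implementation from the kBET R package. It assumes a fixed neighborhood size `k`, brute-force Euclidean neighbor search, and Pearson's χ² statistic comparing the batch-label counts in each tested neighborhood with the counts expected from the global batch proportions. All function names, parameter names and defaults (`kbet_rejection_rate`, `k=25`, `alpha=0.05`) are hypothetical; only the 10% fraction of tested neighborhoods comes from the caption.

```python
import math
import numpy as np

def chi2_sf(x, df):
    """Survival function of the chi-squared distribution for integer df (closed form)."""
    half = x / 2.0
    if df % 2 == 0:
        return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))
    tail = sum(half ** (j - 0.5) / math.gamma(j + 0.5) for j in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * tail

def kbet_rejection_rate(X, batch, k=25, test_fraction=0.1, alpha=0.05, seed=0):
    """Fraction of tested neighborhoods whose batch composition differs from the global one."""
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    labels, global_counts = np.unique(batch, return_counts=True)
    expected = k * global_counts / batch.size      # expected label counts in a neighborhood
    df = labels.size - 1
    rng = np.random.default_rng(seed)
    n_tests = max(1, int(round(test_fraction * X.shape[0])))
    tested = rng.choice(X.shape[0], size=n_tests, replace=False)
    rejected = 0
    for i in tested:
        dist = np.linalg.norm(X - X[i], axis=1)
        neighbors = np.argsort(dist)[:k]           # k nearest cells, including cell i itself
        observed = np.array([(batch[neighbors] == lab).sum() for lab in labels])
        stat = ((observed - expected) ** 2 / expected).sum()   # Pearson's chi-squared statistic
        if chi2_sf(stat, df) < alpha:
            rejected += 1
    return rejected / n_tests

# Simulated example (assumed data): 500 cells, 2 equally sized batches.
rng = np.random.default_rng(1)
batch = np.repeat([0, 1], 250)
mixed = rng.normal(size=(500, 10))                 # no batch effect
shifted = mixed + 5.0 * batch[:, None]             # strong mean shift in batch 1
rate_mixed = kbet_rejection_rate(mixed, batch)
rate_shifted = kbet_rejection_rate(shifted, batch)
```

With a strong shift every tested neighborhood contains cells from a single batch, so all tests reject. For well-mixed data the rejection rate stays close to the significance level.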

Supplementary Figure 2 Simulation of different dropout effects in single-cell RNA-seq data.

Dropout is controlled by the shape parameter k, where batch 1 has k₁ = –1 and k₂ in batch 2 ranges from –0.9 to –3. Batches are always equally sized; sample sizes refer to the total number of samples in the dataset. (a) kBET rejection rates, variance explained and silhouette coefficient for each simulation (from top to bottom). (b) PCA plots for a large difference (left, Δk = 2) and a small difference (right, Δk = 0.04). (c-d) Mean relation (top), dropout relation (center), and cellular detection rate (CDR) effect (bottom) for a large difference (c) and a small difference (d). Blue lines indicate the linear fit with parameters depicted in the top left corner. R² values denote the variance explained by the fit.

Supplementary Figure 3 Simulation of equally sized batches with one having additional noise.

Additional noise is a factor multiplied on final gene expression means and is drawn from a log-normal distribution (LN) parameterized by a batch factor and a batch scale. We simulated 1,000 samples and drew subsets from this dataset. Batches were always equally sized. (a-b) Batch-effect analysis for several batch factors (a) and batch scales (b) using kBET, PC regression (variance explained by batch effect) and silhouette coefficients (from top to bottom). kBET rejection rates were computed for both the original data space and the default 50-dimensional PC space (top two plots). (c-d) Mean relation (top), dropout relation (center), PCA plot for batch factor = 0.1 (c) and a batch scale = 0.5 (d). Blue lines indicate the linear fit with parameters depicted in the top left corner. R² values denote the variance explained by the fit and correspond to high correlation of mean and dropout of both batches.
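
One way to compute a "variance explained by batch" score from PC regression is sketched below. It makes two assumptions that the caption does not confirm: each principal component is regressed on the batch label as a categorical factor, and the resulting R² values are weighted by the fraction of variance each component carries. The function name `pc_regression` and its parameters are hypothetical.

```python
import numpy as np

def pc_regression(X, batch, n_pcs=50):
    """Variance-weighted R^2 of principal components regressed on a categorical batch label."""
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    Xc = X - X.mean(axis=0)                          # center each gene
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    n_pcs = min(n_pcs, S.size)
    scores = U[:, :n_pcs] * S[:n_pcs]                # PC coordinates of each cell
    var_frac = S[:n_pcs] ** 2 / (S ** 2).sum()       # share of total variance per PC
    r2 = np.zeros(n_pcs)
    for j in range(n_pcs):
        pc = scores[:, j]
        total = ((pc - pc.mean()) ** 2).sum()
        # Between-batch sum of squares (one-way ANOVA decomposition).
        between = sum((batch == b).sum() * (pc[batch == b].mean() - pc.mean()) ** 2
                      for b in np.unique(batch))
        r2[j] = between / total if total > 0 else 0.0
    return float((r2 * var_frac).sum())

# Simulated example (assumed data): 500 cells, 2 equally sized batches.
rng = np.random.default_rng(1)
batch = np.repeat([0, 1], 250)
mixed = rng.normal(size=(500, 10))                   # no batch effect
shifted = mixed + 5.0 * batch[:, None]               # mean shift in batch 1
explained_mixed = pc_regression(mixed, batch)
explained_shifted = pc_regression(shifted, batch)
```

In this toy setting the score is near zero for the well-mixed data and large for the shifted data, where most of the variance lies along the batch axis.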

Supplementary Figure 4 Highly variable genes and kBET results after batch regression (Klein et al.).

(a) Number of retained highly variable genes before and after batch correction. Reference: intersect of highly variable genes per batch with log(counts + 1) normalization. (b) Total number of highly variable genes after batch correction. (c) False positive rates on highly variable genes for all combinations of normalization and batch-correction methods. (d) Comparison of silhouette coefficient and kBET mean ‘acceptance rate’ (1 – rejection rate) from 100 kBET runs. (e) Comparison of PC regression and kBET mean ‘acceptance rate’ (1 – rejection rate) from 100 kBET runs.

Supplementary Figure 5 Deeply sequenced SMART-seq2/C1 mESC data have similar characteristics for batch correction (Kolodziejczyk et al.).

(a) Illustration of two full-length read datasets with replicates in 2i, LIF and a2i culture (219, 207 and 123 cells, respectively). (b) PCA plots for log(CPM + 1) ComBat-corrected data. (c) Percentage of retained highly variable genes versus kBET acceptance rate (equals 1 – rejection rate) for all combinations of normalization and batch-correction approaches. Best-performing normalization-regression strategies cluster in the top right corner, such as ComBat on log(CPM + 1) data. Isolated cells do not have mutual nearest neighbors and appear in some correction models. Seurat’s CCA alignment batch-corrects data only in a latent space as done in manifold learning, and we therefore could not compute highly variable genes and show only kBET values.

Supplementary Figure 6 Sequencing depth in mouse early development data varies by study rather than cell type.

Sequencing depth (library size) per developmental stage (shape) in eight different studies (color-coded) of mouse embryonic development.

Supplementary Figure 7 kBET detects inter-individual variability in PBMC data (Kang et al.).

PBMC data from eight unrelated individuals processed in three experiments (batches) (Chromium 10X Genomics device), with donor cell identity assigned with demuxlet. Note that with pooling of cells from multiple donors, between-donor processing batch effects are effectively excluded. We applied kBET as a sensitive measure of inter-individual variability. (a) t-SNE plot of all data. Cell types are annotated as in the original publication. (b) Cell-type frequencies per individual and batch. Cells were prepared once and processed in two separate runs. Inter-individual variability is stronger than preparation bias. (c) kBET acceptance rates (1 – rejection rate) for several subsets of the complete dataset. Subsample sizes were chosen from 10% to 100% of the data sample size. Subsampling was repeated threefold and kBET rejection rates were averaged across these replicates to reduce bias from subsampling. With decreasing sample size, we find decreasing rejection rates. This occurs because each tested neighborhood carries less certainty at smaller sample sizes, so the test more often fails to reject the null hypothesis.

Supplementary Figure 8 Normalization affects the CDR–library size relation in SMART-seq data.

(a-d) Comparison of cellular detection rate (CDR) effect and library size for inDrop UMI data (Klein et al.) (a-b) and C1 SMART-seq data (Kolodziejczyk et al.) (c-d). (a,c) CDR for count data. (b,d) CDR for CPM-normalized data where any gene was counted as detected if CPM > 1. Blue line shows a linear fit of CDR (y) to library size (x) and batch variable (z), and R² denotes the variance explained by the model. Gray shaded areas indicate s.e. of the fit. In UMI data, the library size largely explains the CDR effect independent of normalization (a,b). In SMART-seq data, we observe a significant contribution of library size to CDR effect (c) that is lost in CPM-normalized data (d).
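
The quantities in this caption can be sketched as follows. The CPM > 1 detection rule and the linear fit of CDR to library size and batch come from the caption. Treating a gene as detected in count data when its count is above zero, encoding the batch as a numeric 0/1 column, and all function names (`cellular_detection_rate`, `fit_r2`) are assumptions made for illustration.

```python
import numpy as np

def cellular_detection_rate(counts, use_cpm=False):
    """Fraction of detected genes per cell, plus the library size of each cell.

    counts: array of shape (n_cells, n_genes); every cell needs a non-zero library size.
    """
    counts = np.asarray(counts, dtype=float)
    lib = counts.sum(axis=1)                         # library size per cell
    if use_cpm:
        cpm = counts / lib[:, None] * 1e6            # counts per million
        detected = cpm > 1                           # detection rule from the caption
    else:
        detected = counts > 0                        # assumed rule for raw counts
    return detected.mean(axis=1), lib

def fit_r2(cdr, lib, batch):
    """R^2 of the least-squares fit cdr ~ intercept + library size + batch."""
    design = np.column_stack([np.ones_like(lib), lib, batch])
    coef, *_ = np.linalg.lstsq(design, cdr, rcond=None)
    resid = cdr - design @ coef
    total = ((cdr - cdr.mean()) ** 2).sum()
    return 1.0 - (resid ** 2).sum() / total

# Toy data (assumed): 4 cells x 4 genes, two batches encoded as 0/1.
toy = np.array([[1, 0, 0, 0],
                [2, 1, 0, 0],
                [3, 2, 1, 0],
                [4, 3, 2, 1]])
toy_batch = np.array([0, 1, 0, 1])
cdr_counts, lib_size = cellular_detection_rate(toy)
r2 = fit_r2(cdr_counts, lib_size, toy_batch)
```

In the toy data the detection rate rises with library size, so the fit explains most of the variance in CDR.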

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–8 and Supplementary Notes 1–7

Reporting Summary

Supplementary Software

k-nearest-neighbor batch-effect test (kBET) is available as an R package

Supplementary Table 1

Overview of batch-regression and normalization approaches

Supplementary Table 2

Top 20 best-performing batch-correction strategies

About this article

Cite this article

Büttner, M., Miao, Z., Wolf, F.A. et al. A test metric for assessing single-cell RNA-seq batch correction. Nat Methods 16, 43–49 (2019). https://doi.org/10.1038/s41592-018-0254-1

  • DOI: https://doi.org/10.1038/s41592-018-0254-1
