Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors

Abstract

Large-scale single-cell RNA sequencing (scRNA-seq) data sets that are produced in different laboratories and at different times contain batch effects that may compromise the integration and interpretation of the data. Existing scRNA-seq analysis methods incorrectly assume that the composition of cell populations is either known or identical across batches. We present a strategy for batch correction based on the detection of mutual nearest neighbors (MNNs) in the high-dimensional expression space. Our approach does not rely on predefined or equal population compositions across batches; instead, it requires only that a subset of the population be shared between batches. We demonstrate the superiority of our approach compared with existing methods by using both simulated and real scRNA-seq data sets. Using multiple droplet-based scRNA-seq data sets, we demonstrate that our MNN batch-effect-correction method can be scaled to large numbers of cells.

This is a preview of subscription content

Access options

Rent or Buy article

Get time limited or full article access on ReadCube.

from$8.99

All prices are NET prices.

Figure 1: Schematics of batch-effect correction by MNN.
Figure 2: t-SNE plots of simulated scRNA-seq data containing two batches of different cell types (with each batch containing n = 1,000 cells).
Figure 3: t-SNE plots of scRNA-seq count data for cells from the hematopoietic lineage, prepared in two batches by using different technologies (SMART-seq2 with n = 1,920 cells, closed circles; MARS-seq with n = 2,729 cells, open circles).
Figure 4: Application of MNN batch correction to pancreas cells by using four data sets (GSE81076 with n = 1,007 cells, GSE86473 with n = 2,331 cells, GSE85241 with n = 1,595 cells and E-MTAB-5061 with n = 2,163 cells) measured on two different platforms, CEL-seq2 and SMART-seq2.
Figure 5: MNN batch correction can be scaled to tens of thousands of cells.

Accession codes

Primary accessions

ArrayExpress

Gene Expression Omnibus

Referenced accessions

ArrayExpress

References

  1. 1

    Jaitin, D.A. et al. Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science 343, 776–779 (2014).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  2. 2

    Klein, A.M. et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201 (2015).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  3. 3

    Macosko, E.Z. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  4. 4

    Gierahn, T.M. et al. Seq-Well: portable, low-cost RNA sequencing of single cells at high throughput. Nat. Methods 14, 395–398 (2017).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  5. 5

    Hicks, S.C., Townes, F.W., Teng, M. & Irizarry, R.A. Missing data and technical variability in single-cell RNA-sequencing experiments. Preprint at https://www.biorxiv.org/content/early/2017/05/08/025528/ (2017).

  6. 6

    Tung, P.Y. et al. Batch effects and the effective design of single-cell gene expression studies. Sci. Rep. 7, 39921 (2017).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  7. 7

    Ritchie, M.E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).

    Article  PubMed  PubMed Central  Google Scholar 

  8. 8

    Johnson, W.E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).

    Article  PubMed  PubMed Central  Google Scholar 

  9. 9

    Risso, D., Ngai, J., Speed, T.P. & Dudoit, S. Normalization of RNA-seq data using factor analysis of control genes or samples. Nat. Biotechnol. 32, 896–902 (2014).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  10. 10

    Leek, J.T. svaseq: removing batch effects and other unwanted noise from sequencing data. Nucleic Acids Res. 42, e161 (2014).

    Article  PubMed Central  Google Scholar 

  11. 11

    Spitzer, M.H. et al. An interactive reference framework for modeling a dynamic immune system. Science 349, 1259425 (2015).

    Article  PubMed  PubMed Central  Google Scholar 

  12. 12

    Nestorowa, S. et al. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood 128, e20–e31 (2016).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  13. 13

    Scialdone, A. et al. Resolving early mesoderm diversification through single-cell expression profiling. Nature 535, 289–293 (2016).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  14. 14

    Baron, M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter-and intra-cell population structure. Cell Syst. 3, 346–360.e4 (2016).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  15. 15

    Bendall, S.C. et al. Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development. Cell 157, 714–725 (2014).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  16. 16

    van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).

    Google Scholar 

  17. 17

    Picelli, S. et al. Smart-seq2 for sensitive full-length transcriptome profiling in single cells. Nat. Methods 10, 1096–1098 (2013).

    CAS  Article  PubMed  Google Scholar 

  18. 18

    Paul, F. et al. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell 163, 1663–1677 (2015).

    CAS  Article  PubMed  Google Scholar 

  19. 19

    Angerer, P. et al. destiny: diffusion maps for large-scale single-cell data in R. Bioinformatics 32, 1241–1243 (2016).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  20. 20

    Grün, D. et al. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell 19, 266–277 (2016).

    Article  PubMed  PubMed Central  Google Scholar 

  21. 21

    Muraro, M.J. et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst. 3, 385–394.e3 (2016).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  22. 22

    Lawlor, N. et al. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes. Genome Res. 27, 208–222 (2017).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  23. 23

    Segerstolpe, Å. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 24, 593–607 (2016).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  24. 24

    Zheng, G.X.Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  25. 25

    Brennecke, P. et al. Accounting for technical noise in single-cell RNA-seq experiments. Nat. Methods 10, 1093–1095 (2013).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  26. 26

    Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  27. 27

    Liao, Y., Smyth, G.K. & Shi, W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923–930 (2014).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  28. 28

    Lun, A.T., Bach, K. & Marioni, J.C. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17, 75 (2016).

    Article  Google Scholar 

  29. 29

    Xu, C. & Su, Z. Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics 31, 1974–1980 (2015).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  30. 30

    Pons, P. & Latapy, M. Computing communities in large networks using random walks. ISCIS 3733, 284–293 (2005).

    Google Scholar 

  31. 31

    Buttner, M., Miao, Z., Wolf, A., Teichmann, S.A. & Theis, F.J. Assessment of batch-correction methods for scRNA-seq data with a new test metric. Preprint at https://www.biorxiv.org/content/early/2017/10/09/200345/ (2017).

  32. 32

    Brandani, G.B. et al. Quantifying disorder through conditional entropy: an application to fluid mixing. PloS One 6, e65617 (2013).

    Article  Google Scholar 

Download references

Acknowledgements

We are grateful to F.K. Hamey, J.P. Munro, J. Griffiths and M. Büttner for helpful discussions. L.H. was supported by Wellcome Trust Grant 108437/Z/15 to J.C.M. A.T.L.L. was supported by core funding from CRUK (award number 17197 to J.C.M.). M.D.M. was supported by Wellcome Trust Grant 105045/Z/14/Z to J.C.M. J.C.M. was supported by core funding from EMBL and from CRUK (award number 17197).

Author information

Affiliations

Authors

Contributions

L.H. developed the method and the computational tools, performed the analysis and wrote the paper. A.T.L.L. developed the method and the computational tools and wrote the paper. M.D.M. developed the method, performed the analysis and wrote the paper. J.C.M. developed the method, wrote the paper and supervised the study.

Corresponding author

Correspondence to John C Marioni.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 MNN corrects nonconstant batch effects.

By using locally linear corrections, MNN can handle non-constant batch effects, here simulated as a small angle rotation of data on two-dimensional x-y coordinates. Each shown batch contains 400 cells (points). The reference batch is shown in red and the second (rotated) batch is shown in black for (a) raw (uncorrected) data and (b) data after MNN correction.

Supplementary Figure 2 Simulation of batch effect in two batches with identical cell-type composition.

t-SNE plots of (a) the raw (uncorrected) simulated data, and the simulation data corrected by (b) our MNN approach, (c) limma and (d) ComBat. The filled circles and open triangles represent cells from the first and second batch respectively. The three different cell types are shown by different colours. While there is a split between cells of the same cell type in the uncorrected data, all batch correction methods remove the batch effect successfully for this simple example and yield clusters consistent with the original simulated cell types. The data were simulated to have identical cell type compositions (0.2/0.3/0.5) in both batches, with each batch containing 1000 cells.

Supplementary Figure 3 Analysis of the hematopoietic data by using all 3,904 highly variable genes.

t-SNE plots for (a) uncorrected data and data after correction by (b) our MNN approach, (c) limma and (d) ComBat. Cells are coloured according to their batch labels. (e) Histogram of the angle (f) between the first two SVD components of the reference data (SMART-seq2) and the correction vectors of the MARS-seq data calculated by MNN. Diffusion maps of the haematopoietic data after MNN correction, shown on the (f) first two diffusion components, (g) first and the third diffusion components, and (h) second and the third diffusion components.

Supplementary Figure 4 Analysis of the hematopoietic data by using 1,500 genes randomly subsampled from the highly variable gene set.

t-SNE plots for (a) uncorrected data and data after correction by (b) MNN, (c) limma and (d) ComBat, coloured according to cell types. The same t -SNE plots are coloured according to batch for (e) uncorrected and batch-corrected data from (f) MNN, (g) limma and (h) ComBat. PCA plots for shared cell types (the SMART-seq2 batch with n=791 cells and the MARS-seq batch with n=2729 cells) between the two batches for (i) uncorrected data and batch-corrected data from (j) MNN, (k) limma and (l) ComBat.

Supplementary Figure 5 Analysis of the pancreas data by using all 2,507 highly variable genes.

t-SNE plots of (a) uncorrected data and data after correction by (b) MNN, (c) limma and (d) ComBat, coloured according to cell type labels. t-SNE plots were also generated for (e) uncorrected data and batch-corrected data from (f) MNN, (g) limma and (h) ComBat, coloured according to batch. PCA plots were also generated for (i) uncorrected and batch-corrected data from (j) MNN, (k) limma and (l) ComBat, coloured according to batch. Histograms of the angle between the batch effect vectors and the first two SVDs for the (m) reference (GSE85241) and the E-MTAB-5061 batch, (n) reference and the GSE86473 batch, and the (o) reference and the GSE81076 batch. (p) Silhouette coefficients according to cell type labels, with n=7096 (i.e. integrated number of cell from all four batches) observations for each boxplot. (q) Boxplots of the entropy of batch mixing on the first two PCs, with n=100 (i.e. number of bootstraps) observations for each boxplot. Boxes indicate median and first and third quartile, and whiskers extend to +/-1.5 times the interquartile ratio divided by the square root of the number of observations, and single points denote values outside this range.

Supplementary Figure 6 Analysis of pancreas data on 1,500 genes randomly subsampled from the highly variable gene set.

t-SNE plots of (a) uncorrected data and data corrected by (b) our MNN approach, (c) limma and (d) ComBat, coloured according to cell type labels. t-SNE plots of (e) uncorrected data and batch-corrected data from (f) MNN, (g) limma and (h) ComBat, coloured according to batch. PCA plots of (i) uncorrected data and batch-corrected data from (j) MNN, (k) limma and (l) ComBat, coloured according to batch. Histogram of the angle between the batch effect vectors and the first two SVD components for the (m) reference (GSE85241) and the E-MTAB-5061 batch, (n) reference and the GSE86473 batch, and the (o) reference and the GSE81076 batch. (p) Silhouette coefficients according to cell type labels, with n=7096 (i.e. integrated number of cell from all four batches) observations for each boxplot. (q) Boxplots of the entropy of batch mixing on the first two PCs, with n=100 (i.e. number of bootstraps) observations for each boxplot. Boxes indicate median and first and third quartile, and whiskers extend to +/-1.5 times the interquartile ratio divided by the square root of the number of observations, and single points denote values outside this range.

Supplementary Figure 7 Locally varying batch correction versus global (i.e., constant vector) batch correction.

t-SNE plots of pancreas data (GSE81076 with n=1007, GSE86473 with n= 2331, GSE85241 with n=1595 and E-MTAB-5061 with n=2163 cells) after batch correction with (a,c) MNN allowing for local batch vectors (default) or (b,d) MNN with a single global batch vector for all cells, coloured according to cell type labels (a,b) or batch (c,d). PCA plots of pancreas data after batch correction with (e) MNN allowing for local batch vectors (default) or (f) MNN with a single global batch vector for all cells, coloured according to batch. (g) Silhouette coefficients for clustering according to cell types after correction with two alternative settings of MNN, with n=7096 (i.e. integrated number of cell from all four batches) observations for each boxplot. The difference between the Silhouette coefficients is not significant (two-sided Welch's test p-value=0.97). (h) Entropy of batch mixing on the first two PCs for batch-corrected data with the two alternative settings of MNN, with n=100 (i.e. number of bootstraps) observations for each boxplot Allowing for local batch vectors has significantly (two-sided Welch's test p-value = 0.00001) larger entropy compared to the use of a global batch vector. Boxes indicate median and first and third quartile, and whiskers extend to +/-1.5 times the interquartile ratio divided by the square root of the number of observations, and single points denote values outside this range.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–7, Supplementary Notes 1–5 and Supplementary Table 1 (PDF 2875 kb)

Reporting Summary (PDF 158 kb)

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Haghverdi, L., Lun, A., Morgan, M. et al. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol 36, 421–427 (2018). https://doi.org/10.1038/nbt.4091

Download citation

Further reading

Search

Quick links

Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing