Analysis

Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors

Nature Biotechnology volume 36, pages 421–427 (2018)

Abstract

Large-scale single-cell RNA sequencing (scRNA-seq) data sets that are produced in different laboratories and at different times contain batch effects that may compromise the integration and interpretation of the data. Existing scRNA-seq analysis methods incorrectly assume that the composition of cell populations is either known or identical across batches. We present a strategy for batch correction based on the detection of mutual nearest neighbors (MNNs) in the high-dimensional expression space. Our approach does not rely on predefined or equal population compositions across batches; instead, it requires only that a subset of the population be shared between batches. We demonstrate the superiority of our approach compared with existing methods by using both simulated and real scRNA-seq data sets. Using multiple droplet-based scRNA-seq data sets, we demonstrate that our MNN batch-effect-correction method can be scaled to large numbers of cells.
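The idea of a mutual nearest neighbor pair can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mutual_nearest_neighbors`, the parameter `k`, the use of plain Euclidean distance and the toy data are all assumptions made for illustration, and only the pair-detection step is shown, not the correction itself.

```python
import numpy as np

def mutual_nearest_neighbors(batch_a, batch_b, k=3):
    """Return sorted (i, j) pairs where cell i of batch_a and cell j of
    batch_b are each among the other's k nearest neighbors.

    Hypothetical helper: Euclidean distance and brute-force search are
    illustrative choices, suitable only for small inputs.
    """
    # Pairwise squared Euclidean distances, shape (n_a, n_b).
    diff = batch_a[:, None, :] - batch_b[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    k_a = min(k, batch_b.shape[0])
    k_b = min(k, batch_a.shape[0])
    # For each cell in A, the indices of its k nearest cells in B.
    nn_a_to_b = np.argsort(dist, axis=1)[:, :k_a]
    # For each cell in B, the indices of its k nearest cells in A.
    nn_b_to_a = np.argsort(dist.T, axis=1)[:, :k_b]
    pairs = []
    for i in range(batch_a.shape[0]):
        for j in nn_a_to_b[i]:
            # Keep the pair only if the neighbor relation holds both ways.
            if i in nn_b_to_a[j]:
                pairs.append((i, int(j)))
    return sorted(pairs)

# Toy data (made up): two populations shared between batches, plus one
# population present only in batch B.
a = np.array([[0.0, 0.0], [10.0, 10.0]])
b = np.array([[0.5, 0.5], [10.5, 10.5], [50.0, 50.0]])
pairs = mutual_nearest_neighbors(a, b, k=1)
```

In this toy case the cell that exists only in batch B has a nearest neighbor in batch A, but that relation is not reciprocated, so it is left unpaired. This mirrors the requirement stated in the abstract that only a subset of the population needs to be shared between batches.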

Accessions

Primary accessions

ArrayExpress

Referenced accessions

ArrayExpress

Acknowledgements

We are grateful to F.K. Hamey, J.P. Munro, J. Griffiths and M. Büttner for helpful discussions. L.H. was supported by Wellcome Trust Grant 108437/Z/15 to J.C.M. A.T.L.L. was supported by core funding from CRUK (award number 17197 to J.C.M.). M.D.M. was supported by Wellcome Trust Grant 105045/Z/14/Z to J.C.M. J.C.M. was supported by core funding from EMBL and from CRUK (award number 17197).

Author information

Affiliations

  1. European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK.

    • Laleh Haghverdi
    • John C Marioni
  2. Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany.

    • Laleh Haghverdi
  3. Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK.

    • Aaron T L Lun
    • John C Marioni
  4. Wellcome Trust Sanger Institute, Cambridge, UK.

    • Michael D Morgan
    • John C Marioni

Authors

  1. Laleh Haghverdi

  2. Aaron T L Lun

  3. Michael D Morgan

  4. John C Marioni

Contributions

L.H. developed the method and the computational tools, performed the analysis and wrote the paper. A.T.L.L. developed the method and the computational tools and wrote the paper. M.D.M. developed the method, performed the analysis and wrote the paper. J.C.M. developed the method, wrote the paper and supervised the study.

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to John C Marioni.

Supplementary information

PDF files

  1. Supplementary Text and Figures

    Supplementary Figures 1–7, Supplementary Notes 1–5 and Supplementary Table 1

  2. Reporting Summary

About this article

DOI

https://doi.org/10.1038/nbt.4091
