Robust integration of multiple single-cell RNA sequencing datasets using a single reference space

Abstract

In many biological applications of single-cell RNA sequencing (scRNA-seq), an integrated analysis of data from multiple batches or studies is necessary. Current methods typically achieve integration using shared cell types or covariance correlation between datasets, which can distort biological signals. Here we introduce an algorithm that uses the gene eigenvectors from a reference dataset to establish a global frame for integration. Using simulated and real datasets, we demonstrate that this approach, called Reference Principal Component Integration (RPCI), consistently outperforms other methods by multiple metrics, with clear advantages in preserving genuine cross-sample gene expression differences in matching cell types, such as those present in cells at distinct developmental stages or in perturbed versus control studies. Moreover, RPCI maintains this robust performance when multiple datasets are integrated. Finally, we applied RPCI to scRNA-seq data for mouse gut endoderm development and revealed the temporal emergence of genetic programs that help establish the anterior–posterior axis in visceral endoderm.
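The core idea in the abstract, using gene eigenvectors from one reference dataset as a shared coordinate frame, can be illustrated with a deliberately small sketch. This is not the RISC implementation and does not reproduce RPCI's batch-effect handling; it only shows what "projecting every dataset onto the reference's principal component" might look like. The toy expression matrices, the function names (`center`, `top_gene_eigenvector`, `project`), the use of power iteration, and the choice to center the target with the reference's gene means are all assumptions made for illustration.

```python
import math
import random


def center(rows, means=None):
    """Subtract per-gene (column) means; return centered rows and the means used."""
    n_genes = len(rows[0])
    if means is None:
        means = [sum(r[g] for r in rows) / len(rows) for g in range(n_genes)]
    return [[r[g] - means[g] for g in range(n_genes)] for r in rows], means


def top_gene_eigenvector(centered, n_iter=100, seed=0):
    """Leading eigenvector of X^T X (gene space), found by power iteration."""
    rng = random.Random(seed)
    n_genes = len(centered[0])
    v = [rng.gauss(0, 1) for _ in range(n_genes)]
    for _ in range(n_iter):
        # scores = X v (one value per cell), then w = X^T scores
        scores = [sum(x * y for x, y in zip(row, v)) for row in centered]
        w = [sum(s * row[g] for s, row in zip(scores, centered))
             for g in range(n_genes)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(rows, means, v):
    """Coordinates of each cell along v, after centering with the given means."""
    centered, _ = center(rows, means)
    return [sum(x * y for x, y in zip(row, v)) for row in centered]


# Hypothetical toy data: rows are cells, columns are three genes.
# Cells 0-2 resemble one cell type, cells 3-5 another.
reference = [[5, 1, 0], [6, 1, 0], [5, 2, 0],
             [1, 5, 0], [1, 6, 0], [2, 5, 0]]
# A second dataset with one cell of each type.
target = [[5.5, 1.5, 0.2], [1.5, 5.5, 0.2]]

ref_centered, ref_means = center(reference)
v = top_gene_eigenvector(ref_centered)

# Both datasets are expressed in the frame defined by the reference alone.
ref_scores = project(reference, ref_means, v)
tgt_scores = project(target, ref_means, v)

for label, scores in (("reference", ref_scores), ("target", tgt_scores)):
    print(label, [round(s, 2) for s in scores])
```

In this toy case the two target cells land on the same sides of the reference axis as the matching reference cell types. The sign of an eigenvector is arbitrary, so only relative positions along the axis are meaningful.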

Fig. 1: Schematic illustration of data integration.
Fig. 2: Performance of RPCI in integrating simulated heterogeneous datasets.
Fig. 3: RPCI integration of semi-simulated datasets with diverse cell types and multiple perturbations.
Fig. 4: Performance of RPCI in integrating WT and ERR KO snRNA-seq datasets.
Fig. 5: Maintenance of correct cell type relationship in endoderm developmental trajectory in RPCI integration.

Data availability

All scRNA-seq datasets used in this study were published previously, and their availability is described in Supplementary Table 2.

Code availability

RISC is implemented as an R package and is freely available on GitHub (https://github.com/bioinfoDZ/RISC). Code for the analyses (and related source data) is provided on Code Ocean (https://codeocean.com/capsule/9098032).

Acknowledgements

We thank all the research groups that generated and shared the scRNA-seq data used in this study. We thank the members of the Zheng lab for valuable discussions, software testing and comments on the manuscript. We also acknowledge funding support from the National Institutes of Health (grants HL133120 to D.Z. and B.Z., HL153920 to D.Z., HD092944 to D.Z. and B.Z., and HD070454 to D.Z.).

Author information

Contributions

Y.L. and D.Z. conceived the algorithm and analysis. Y.L. performed the analyses. T.W. and B.Z. contributed to the methods or discussions. D.Z. supervised the study. All authors wrote the manuscript.

Corresponding author

Correspondence to Deyou Zheng.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Notes, Supplementary Methods and Supplementary Figs. 1–21.

Reporting Summary

Supplementary Tables 1 and 2

Supplementary Table 1. Metric scores from pairwise integration and full integration of the semi-simulated data. Supplementary Table 2. List of real scRNA-seq datasets.

About this article

Cite this article

Liu, Y., Wang, T., Zhou, B. et al. Robust integration of multiple single-cell RNA sequencing datasets using a single reference space. Nat Biotechnol 39, 877–884 (2021). https://doi.org/10.1038/s41587-021-00859-x
