Iterative single-cell multi-omic integration using online learning

Abstract

Integrating large single-cell gene expression, chromatin accessibility and DNA methylation datasets requires general and scalable computational approaches. Here we describe online integrative non-negative matrix factorization (iNMF), an algorithm for integrating large, diverse and continually arriving single-cell datasets. Our approach scales to arbitrarily large numbers of cells using fixed memory, iteratively incorporates new datasets as they are generated and allows many users to simultaneously analyze a single copy of a large dataset by streaming it over the internet. Iterative data addition can also be used to map new data to a reference dataset. Comparisons with previous methods indicate that the improvements in efficiency do not sacrifice dataset alignment and cluster preservation performance. We demonstrate the effectiveness of online iNMF by integrating more than 1 million cells on a standard laptop, integrating large single-cell RNA sequencing and spatial transcriptomic datasets, and iteratively constructing a single-cell multi-omic atlas of the mouse motor cortex.
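
The fixed-memory claim in the abstract can be illustrated with a minimal sketch of mini-batch matrix factorization in the spirit of the online learning approach of Mairal et al. (ref. 6). This is not the authors' online iNMF implementation: it is plain online NMF with a single shared factor matrix, and it omits the dataset-specific factors and penalty term that iNMF adds. The function names `solve_h` and `online_nmf`, the multiplicative-update solver and all parameter defaults are assumptions made for illustration. The point it shows is that only two small summary matrices are kept between mini-batches, so memory use does not grow with the number of cells.

```python
import numpy as np


def solve_h(x, w, n_iter=50, eps=1e-10):
    """Nonnegative cell loadings H for one mini-batch, with W held fixed.

    Uses standard multiplicative updates (an illustrative choice of solver).
    x: (n_features, n_cells_in_batch), w: (n_features, k).
    """
    k = w.shape[1]
    h = np.full((k, x.shape[1]), 1.0 / k)  # strictly positive start
    wtx = w.T @ x
    wtw = w.T @ w
    for _ in range(n_iter):
        h *= wtx / (wtw @ h + eps)
    return h


def online_nmf(batches, n_features, k, seed=0, eps=1e-10):
    """Hypothetical sketch: learn nonnegative W from a stream of mini-batches.

    Only a (k x k) and b (n_features x k) persist between batches, so the
    memory footprint is independent of the total number of cells seen.
    """
    rng = np.random.default_rng(seed)
    w = rng.random((n_features, k))
    a = np.zeros((k, k))           # running sum of H H^T
    b = np.zeros((n_features, k))  # running sum of X H^T
    for x in batches:
        h = solve_h(x, w)
        a += h @ h.T
        b += x @ h.T
        # One pass of column-wise block coordinate descent on W,
        # projected onto the nonnegative orthant.
        for j in range(k):
            step = (b[:, j] - w @ a[:, j]) / (a[j, j] + eps)
            w[:, j] = np.maximum(w[:, j] + step, 0.0)
    return w


# Usage on synthetic nonnegative data streamed in mini-batches of 50 cells.
demo_rng = np.random.default_rng(42)
demo_x = demo_rng.random((30, 4)) @ demo_rng.random((4, 500))
demo_batches = (demo_x[:, i:i + 50] for i in range(0, 500, 50))
demo_w = online_nmf(demo_batches, n_features=30, k=4)
```

Because `batches` can be any iterable, the same loop works when mini-batches are read from disk or streamed over a network. New cells could be projected onto a learned `w` with `solve_h` alone, which is one simple way to picture mapping new data to a reference.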

Fig. 1: Overview of the online iNMF algorithm.
Fig. 2: Online iNMF converges much faster than previously published batch algorithms.
Fig. 3: Benchmark of online iNMF, batch iNMF, Harmony and Seurat.
Fig. 4: Joint analysis of nine regions of the adult mouse brain (n = 691,962 cells) using online iNMF.
Fig. 5: Online iNMF integrates large scRNA-seq and spatial transcriptomic datasets.
Fig. 6: Iterative refinement of cell identity using multiple single-cell modalities from the mouse primary motor cortex.

Data availability

• Human PBMC from Kang et al.⁹ (GSE96583) distributed by SeuratData

• Human pancreatic islet cells from Grün et al.¹⁰ (GSE81076), Muraro et al.¹¹ (GSE85241), Lawlor et al.¹² (GSE86469), Baron et al.¹³ (GSE84133) and Segerstolpe et al.¹⁴ (E-MTAB-5061) distributed by SeuratData

• Adult mouse brain cells from Saunders et al.⁷ (http://dropviz.org/)

• Mouse Organogenesis Cell Atlas from Cao et al.¹⁸ (https://oncoscape.v3.sttrcancer.org/atlas.gs.washington.edu.mouse.rna/downloads)

• Mouse hippocampus cells from Rodriques et al.¹⁹ (https://singlecell.broadinstitute.org/single_cell/study/SCP354/slide-seq-study#study-download)

• Mouse hippocampus cells from Yao et al.²² (http://data.nemoarchive.org/biccn/grant/zeng/zeng/transcriptome/scell/10X/processed/YaoHippo2020/)

• Mouse hypothalamic pre-optic region data from Moffitt et al.²³ (https://datadryad.org/stash/dataset/doi:10.5061/dryad.8t8s248 and GSE113576)

• Mouse primary motor cortex cells from Yao et al.²⁷ (https://assets.nemoarchive.org/dat-ch1nqb7)

Code availability

An R implementation of LIGER is available from the Comprehensive R Archive Network at https://cran.r-project.org/package=rliger and on GitHub at https://github.com/welch-lab/liger, along with detailed installation instructions. Tutorials demonstrating package functionality, including online learning for Scenario 1, Scenario 2 and Scenario 3, are available on the GitHub page.

References

1. Ye, Z. & Sarkar, C. A. Towards a quantitative understanding of cell identity. Trends Cell Biol. 28, 1030–1048 (2018).

2. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).

3. Stuart, T. & Satija, R. Integrative single-cell analysis. Nat. Rev. Genet. 20, 257–272 (2019).

4. Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).

5. Welch, J. D. et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887 (2019).

6. Mairal, J., Bach, F., Ponce, J. & Sapiro, G. Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res. 11, 19–60 (2010).

7. Saunders, A. et al. Molecular diversity and specializations among the cells of the adult mouse brain. Cell 174, 1015–1030 (2018).

8. Tran, H. T. N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 12 (2020).

9. Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 36, 89–94 (2018).

10. Grün, D. et al. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell 19, 266–277 (2016).

11. Muraro, M. J. et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst. 3, 385–394 (2016).

12. Lawlor, N. et al. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes. Genome Res. 27, 208–222 (2017).

13. Baron, M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 3, 346–360 (2016).

14. Segerstolpe, Å. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 24, 593–607 (2016).

15. Toda, T., Parylak, S. L., Linker, S. B. & Gage, F. H. The role of adult hippocampal neurogenesis in brain health and disease. Mol. Psychiatry 24, 67–87 (2019).

16. Ernst, A. et al. Neurogenesis in the striatum of the adult human brain. Cell 156, 1072–1083 (2014).

17. Zeisel, A. et al. Molecular architecture of the mouse nervous system. Cell 174, 999–1014 (2018).

18. Cao, J. et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature 566, 496–502 (2019).

19. Rodriques, S. G. et al. Slide-seq: a scalable technology for measuring genome-wide expression at high spatial resolution. Science 363, 1463–1467 (2019).

20. Stickels, R. R. et al. Highly sensitive spatial transcriptomics at near-cellular resolution with Slide-seqV2. Nat. Biotechnol. 39, 313–319 (2021).

21. Chen, K. H., Boettiger, A. N., Moffitt, J. R., Wang, S. & Zhuang, X. RNA imaging. Spatially resolved, highly multiplexed RNA profiling in single cells. Science 348, aaa6090 (2015).

22. Yao, Z. et al. A taxonomy of transcriptomic cell types across the isocortex and hippocampal formation. Preprint at bioRxiv https://doi.org/10.1101/2020.03.30.015214 (2020).

23. Moffitt, J. R. et al. Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Science 362, eaau5324 (2018).

24. Ecker, J. R. et al. The BRAIN Initiative Cell Census Consortium: lessons learned toward generating a comprehensive brain cell atlas. Neuron 96, 542–557 (2017).

25. HuBMAP Consortium. The human body at cellular resolution: the NIH Human Biomolecular Atlas Program. Nature 574, 187–192 (2019).

26. Regev, A. et al. The Human Cell Atlas. eLife 6, e27041 (2017).

27. Yao, Z. et al. An integrated transcriptomic and epigenomic atlas of mouse primary motor cortex cell types. Preprint at bioRxiv https://doi.org/10.1101/2020.02.29.970558 (2020).

28. Tasic, B. et al. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat. Neurosci. 19, 335–346 (2016).

29. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).

30. Büttner, M., Miao, Z., Wolf, F. A., Teichmann, S. A. & Theis, F. J. A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16, 43–49 (2019).

31. Hubert, L. & Arabie, P. Comparing partitions. J. Classification 2, 193–218 (1985).

32. Rand, W. M. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66, 846–850 (1971).

33. Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).

Acknowledgements

This work was supported by National Institutes of Health grants R01AI149669-01, R01HG010883-01 and RF1MH123199 (to J.D.W.) and 5U19MH114831 (to J.R.E.). J.R.E. is an Investigator of the Howard Hughes Medical Institute.

Author information

Contributions

S.P., C.L., R.C., J.S., A.R., J.R.N., M.M.B., J.R.E. and B.R. generated the snATAC-seq and snmC-seq data. J.D.W. conceived the idea of online iNMF. C.G. and J.D.W. developed and implemented the online iNMF algorithm. C.G., J.L., A.R.K. and J.D.W. carried out data analyses. C.G., J.L., A.R.K. and J.D.W. wrote the paper. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Joshua D. Welch.

Ethics declarations

Competing interests

A patent application on LIGER has been submitted by the Broad Institute and the General Hospital Corporation with J.D.W. listed as an inventor. The remaining authors declare no competing financial interests.

Additional information

Peer review information: Nature Biotechnology thanks Samantha A. Morris and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Convergence behavior for online iNMF and batch iNMF algorithms on scRNA-seq data from the adult mouse brain, human PBMC and human pancreas.

The online iNMF algorithm exhibits faster convergence and better objective minimization after a fixed amount of training time. The advantage of the online algorithm in convergence speed is more apparent for larger datasets. a–c, Adult mouse brain (n = 691,962 cells, nine individual datasets). d–f, Human PBMC (n = 13,999 cells, two individual datasets). g–i, Human pancreas (n = 14,890 cells, eight individual datasets). Center lines of box plots show the median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range; and points are outliers.
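
The box plot elements named in this legend can be computed as in the short sketch below. It assumes the common Tukey convention, in which each whisker ends at the most extreme observation lying within 1.5× the interquartile range of the box; the legend does not state which variant the authors' plotting software used. The helper name `box_plot_stats` is hypothetical.

```python
import numpy as np


def box_plot_stats(values):
    """Median, quartiles, whisker ends and outliers for one box (Tukey convention)."""
    v = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(v, [25, 50, 75])
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = v[(v >= low_fence) & (v <= high_fence)]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": inside.min(),    # lowest point within the fence
        "whisker_high": inside.max(),   # highest point within the fence
        "outliers": v[(v < low_fence) | (v > high_fence)],
    }


# Usage: nine ordinary values and one extreme value.
stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

In this usage example the value 100 falls outside the upper fence and is reported as an outlier, while the upper whisker stops at 9.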

Extended Data Fig. 2 Online and batch iNMF yield highly similar UMAP visualizations.

We performed online iNMF and batch iNMF on data from mouse cortex (n = 255,353 cells), human PBMC (n = 13,999 cells), and human pancreas (n = 14,890 cells). Online iNMF and batch iNMF produce very similar visualizations, suggesting that the approaches give very similar dataset alignment and cluster preservation. We subsequently confirmed this qualitative observation using quantitative metrics.

Extended Data Fig. 3 Benchmarking integration across data modalities (RNA+ATAC).

We integrated 5,000 cells from the snRNA-seq dataset and 5,000 cells from the snATAC-seq dataset of the MOp data collection using four different methods. The cells are shown in two-dimensional UMAP space and colored by dataset.

Extended Data Fig. 4 Performing online iNMF in three scenarios produces similar results.

These analyses were carried out separately to integrate eight MOp datasets (scRNA-seq, snRNA-seq, snATAC-seq and snmC-seq, n = 408,885) using online iNMF in scenario 1 (a), scenario 2 (b), and scenario 3 (c). The results are visualized in UMAP coordinates and the cells are colored by the cell type annotations from Fig. 6.

Supplementary information

Supplementary Information

Supplementary Notes, Figs. 1–15 and Table 1

Reporting Summary

Cite this article

Gao, C., Liu, J., Kriebel, A.R. et al. Iterative single-cell multi-omic integration using online learning. Nat Biotechnol 39, 1000–1007 (2021). https://doi.org/10.1038/s41587-021-00867-x
