Integrating large single-cell gene expression, chromatin accessibility and DNA methylation datasets requires general and scalable computational approaches. Here we describe online integrative non-negative matrix factorization (iNMF), an algorithm for integrating large, diverse and continually arriving single-cell datasets. Our approach scales to arbitrarily large numbers of cells using fixed memory, iteratively incorporates new datasets as they are generated and allows many users to simultaneously analyze a single copy of a large dataset by streaming it over the internet. Iterative data addition can also be used to map new data to a reference dataset. Comparisons with previous methods indicate that the improvements in efficiency do not sacrifice dataset alignment and cluster preservation performance. We demonstrate the effectiveness of online iNMF by integrating more than 1 million cells on a standard laptop, integrating large single-cell RNA sequencing and spatial transcriptomic datasets, and iteratively constructing a single-cell multi-omic atlas of the mouse motor cortex.
Data availability
• Human PBMC from Kang et al.9 (GSE96583) distributed by SeuratData
• Human pancreatic islet cells from Grün et al.10 (GSE81076), Muraro et al.11 (GSE85241), Lawlor et al.12 (GSE86469), Baron et al.13 (GSE84133) and Segerstolpe et al.14 (E-MTAB-5061) distributed by SeuratData
• Adult mouse brain cells from Saunders et al.7 (http://dropviz.org/)
• Mouse Organogenesis Cell Atlas from Cao et al.18 (https://oncoscape.v3.sttrcancer.org/atlas.gs.washington.edu.mouse.rna/downloads)
• Mouse hippocampus cells from Rodriques et al.19 (https://singlecell.broadinstitute.org/single_cell/study/SCP354/slide-seq-study#study-download)
• Mouse hippocampus cells from Yao et al.22 (http://data.nemoarchive.org/biccn/grant/zeng/zeng/transcriptome/scell/10X/processed/YaoHippo2020/)
• Mouse hypothalamic pre-optic region data from Moffitt et al.23 (https://datadryad.org/stash/dataset/doi:10.5061/dryad.8t8s248 and GSE113576)
• Mouse primary motor cortex cells from Yao et al.27 (https://assets.nemoarchive.org/dat-ch1nqb7)
Code availability
An R implementation of LIGER is available from the Comprehensive R Archive Network at https://cran.r-project.org/package=rliger and on GitHub at https://github.com/welch-lab/liger, along with detailed installation instructions. Tutorials demonstrating package functionality, including online learning for Scenario 1, Scenario 2 and Scenario 3, are available on the GitHub page.
Acknowledgements
This work was supported by National Institutes of Health grants R01AI149669-01, R01HG010883-01 and RF1MH123199 (to J.D.W.) and 5U19MH114831 (to J.R.E.). J.R.E. is an Investigator of the Howard Hughes Medical Institute.
Ethics declarations
Competing interests
A patent application on LIGER has been submitted by the Broad Institute and the General Hospital Corporation with J.D.W. listed as an inventor. The remaining authors declare no competing financial interests.
Extended data
Extended Data Fig. 1 Convergence behavior for online iNMF and batch iNMF algorithms on scRNA-seq data from the adult mouse brain, human PBMC and human pancreas.
The online iNMF algorithm exhibits faster convergence and better objective minimization after a fixed amount of training time. The advantage of the online algorithm in convergence speed is more apparent for larger datasets. a–c, Adult mouse brain (n = 691,962 cells, nine individual datasets). d–f, Human PBMC (n = 13,999 cells, two individual datasets). g–i, Human pancreas (n = 14,890 cells, eight individual datasets). Center lines of box plots show the median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range; points, outliers.
Extended Data Fig. 2 Online and batch iNMF yield highly similar UMAP visualizations.
We performed online iNMF and batch iNMF on data from the mouse cortex (n = 255,353 cells), human PBMC (n = 13,999 cells) and human pancreas (n = 14,890 cells). The two algorithms produce very similar visualizations, suggesting that they achieve comparable dataset alignment and cluster preservation. We subsequently confirmed this qualitative observation using quantitative metrics.
Extended Data Fig. 3 Benchmarking integration across data modalities (RNA+ATAC).
We integrated 5,000 cells from the snRNA-seq dataset and 5,000 cells from the snATAC-seq dataset of the MOp data collection using four different methods. The cells are shown in two-dimensional UMAP space and colored by dataset.
Extended Data Fig. 4 Performing online iNMF in three scenarios produces similar results.
We integrated eight MOp datasets (scRNA-seq, snRNA-seq, snATAC-seq and snmC-seq, n = 408,885) using online iNMF in three separate analyses: scenario 1 (a), scenario 2 (b) and scenario 3 (c). The results are visualized in UMAP coordinates, and the cells are colored by the cell type annotations from Fig. 6.
