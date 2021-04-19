

Iterative single-cell multi-omic integration using online learning

Nature Biotechnology (2021)

Abstract

Integrating large single-cell gene expression, chromatin accessibility and DNA methylation datasets requires general and scalable computational approaches. Here we describe online integrative non-negative matrix factorization (iNMF), an algorithm for integrating large, diverse and continually arriving single-cell datasets. Our approach scales to arbitrarily large numbers of cells using fixed memory, iteratively incorporates new datasets as they are generated and allows many users to simultaneously analyze a single copy of a large dataset by streaming it over the internet. Iterative data addition can also be used to map new data to a reference dataset. Comparisons with previous methods indicate that the improvements in efficiency do not sacrifice dataset alignment and cluster preservation performance. We demonstrate the effectiveness of online iNMF by integrating more than 1 million cells on a standard laptop, integrating large single-cell RNA sequencing and spatial transcriptomic datasets, and iteratively constructing a single-cell multi-omic atlas of the mouse motor cortex.
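The fixed-memory property described in the abstract can be illustrated with a minimal sketch of mini-batch non-negative matrix factorization. This is not the LIGER implementation and makes several simplifying assumptions: it factorizes a single dataset, omits the dataset-specific factors and penalty term of iNMF, and uses multiplicative updates, which may differ from the solver used by the authors. The function names `online_nmf` and `solve_h` and all parameters are hypothetical. The point is only that the shared factor matrix can be updated from two running summary matrices whose size does not depend on the number of cells seen so far.

```python
import numpy as np

EPS = 1e-10  # guards against division by zero in the multiplicative updates


def solve_h(X, W, n_iters=50):
    """Cell loadings H >= 0 for one mini-batch X (cells x genes), with W fixed."""
    H = np.ones((X.shape[0], W.shape[0]))
    XWt = X @ W.T
    WWt = W @ W.T
    for _ in range(n_iters):
        H *= XWt / (H @ WWt + EPS)
    return H


def online_nmf(batches, n_genes, k, n_w_iters=5, seed=0):
    """Learn W (k x genes) so that each mini-batch X is approximately H @ W.

    `batches` can be any iterable, including a generator that streams
    mini-batches from disk, so only one mini-batch is held in memory at a time.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((k, n_genes)) + 0.1
    A = np.zeros((k, k))        # running sum of H^T H
    B = np.zeros((k, n_genes))  # running sum of H^T X
    for X in batches:
        H = solve_h(X, W)
        # Fold this mini-batch into the summaries, then discard X and H.
        A += H.T @ H
        B += H.T @ X
        # Update W from the summaries alone; the cost is independent of
        # how many cells have been processed.
        for _ in range(n_w_iters):
            W *= B / (A @ W + EPS)
    return W
```

In this sketch, memory use is determined by the mini-batch size, the number of factors `k` and the number of genes. Continuing to feed mini-batches from a newly arrived dataset into the same `A`, `B` and `W` corresponds loosely to the iterative data addition mentioned above.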

Fig. 1: Overview of the online iNMF algorithm.
Fig. 2: Online iNMF converges much faster than previously published batch algorithms.
Fig. 3: Benchmark of online iNMF, batch iNMF, Harmony and Seurat.
Fig. 4: Joint analysis of nine regions of the adult mouse brain (n = 691,962 cells) using online iNMF.
Fig. 5: Online iNMF integrates large scRNA-seq and spatial transcriptomic datasets.
Fig. 6: Iterative refinement of cell identity using multiple single-cell modalities from the mouse primary motor cortex.

Data availability

• Human PBMC from Kang et al.9 (GSE96583) distributed by SeuratData

• Human pancreatic islet cells from Grün et al.10 (GSE81076), Muraro et al.11 (GSE85241), Lawlor et al.12 (GSE86469), Baron et al.13 (GSE84133) and Segerstolpe et al.14 (E-MTAB-5061) distributed by SeuratData

• Adult mouse brain cells from Saunders et al.7 (http://dropviz.org/)

• Mouse Organogenesis Cell Atlas from Cao et al.18 (https://oncoscape.v3.sttrcancer.org/atlas.gs.washington.edu.mouse.rna/downloads)

• Mouse hippocampus cells from Rodriques et al.19 (https://singlecell.broadinstitute.org/single_cell/study/SCP354/slide-seq-study#study-download)

• Mouse hippocampus cells from Yao et al.22 (http://data.nemoarchive.org/biccn/grant/zeng/zeng/transcriptome/scell/10X/processed/YaoHippo2020/)

• Mouse hypothalamic pre-optic region data from Moffitt et al.23 (https://datadryad.org/stash/dataset/doi:10.5061/dryad.8t8s248 and GSE113576)

• Mouse primary motor cortex cells from Yao et al.27 (https://assets.nemoarchive.org/dat-ch1nqb7)

Code availability

An R implementation of LIGER is available from the Comprehensive R Archive Network at https://cran.r-project.org/package=rliger and on GitHub at https://github.com/welch-lab/liger, along with detailed installation instructions. Tutorials demonstrating package functionality, including online learning for Scenario 1, Scenario 2 and Scenario 3, are available on the GitHub page.

Acknowledgements

This work was supported by National Institutes of Health grants R01AI149669-01, R01HG010883-01 and RF1MH123199 (to J.D.W.) and 5U19MH114831 (to J.R.E.). J.R.E. is an Investigator of the Howard Hughes Medical Institute.

Author information

  1. Chongyuan Luo

    Present address: Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA, USA

Affiliations

  1. Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA

    Chao Gao, Jialin Liu, April R. Kriebel & Joshua D. Welch

  2. Center for Epigenomics, Department of Cellular and Molecular Medicine, University of California, San Diego, School of Medicine, La Jolla, CA, USA

    Sebastian Preissl & Bing Ren

  3. Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA, USA

    Chongyuan Luo, Rosa Castanon, Justin Sandoval, Angeline Rivkin, Joseph R. Nery & Joseph R. Ecker

  4. Howard Hughes Medical Institute, The Salk Institute for Biological Studies, La Jolla, CA, USA

    Chongyuan Luo & Joseph R. Ecker

  5. Computational Neurobiology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA, USA

    Margarita M. Behrens

  6. Department of Computer Science and Engineering, University of Michigan, Ann Arbor, MI, USA

    Joshua D. Welch

  1. Chao Gao
  2. Jialin Liu
  3. April R. Kriebel
  4. Sebastian Preissl
  5. Chongyuan Luo
  6. Rosa Castanon
  7. Justin Sandoval
  8. Angeline Rivkin
  9. Joseph R. Nery
  10. Margarita M. Behrens
  11. Joseph R. Ecker
  12. Bing Ren
  13. Joshua D. Welch

Contributions

S.P., C.L., R.C., J.S., A.R., J.R.N., M.M.B., J.R.E. and B.R. generated the snATAC-seq and snmC-seq data. J.D.W. conceived the idea of online iNMF. C.G. and J.D.W. developed and implemented the online iNMF algorithm. C.G., J.L., A.R.K. and J.D.W. carried out data analyses. C.G., J.L., A.R.K. and J.D.W. wrote the paper. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Joshua D. Welch.

Ethics declarations

Competing interests

A patent application on LIGER has been submitted by the Broad Institute and the General Hospital Corporation with J.D.W. listed as an inventor. The remaining authors declare no competing financial interests.

Additional information

Peer review information: Nature Biotechnology thanks Samantha A. Morris and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Convergence behavior for online iNMF and batch iNMF algorithms on scRNA-seq data from the adult mouse brain, human PBMC and human pancreas.

The online iNMF algorithm exhibits faster convergence and better objective minimization after a fixed amount of training time. The advantage of the online algorithm in convergence speed is more apparent for larger datasets. a–c, Adult mouse brain (n = 691,962 cells, nine individual datasets). d–f, Human PBMC (n = 13,999 cells, two individual datasets). g–i, Human pancreas (n = 14,890 cells, eight individual datasets). Center lines of box plots show the median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range; and points are outliers.

Extended Data Fig. 2 Online and batch iNMF yield highly similar UMAP visualizations.

We performed online iNMF and batch iNMF on data from mouse cortex (n = 255,353 cells), human PBMC (n = 13,999 cells), and human pancreas (n = 14,890 cells). Online iNMF and batch iNMF produce very similar visualizations, suggesting that the approaches give very similar dataset alignment and cluster preservation. We subsequently confirmed this qualitative observation using quantitative metrics.

Extended Data Fig. 3 Benchmarking integration across data modalities (RNA+ATAC).

5,000 cells from the snRNA-seq dataset and 5,000 cells from the snATAC-seq dataset from the MOp data collection were integrated using four different methods. The cells are shown in two-dimensional UMAP space and colored by dataset.

Extended Data Fig. 4 Performing online iNMF in three scenarios produces similar results.

These analyses were carried out separately to integrate eight MOp datasets (scRNA-seq, snRNA-seq, snATAC-seq and snmC-seq, n = 408,885) using online iNMF in scenario 1 (a), scenario 2 (b), and scenario 3 (c). The results are visualized in UMAP coordinates and the cells are colored by the cell type annotations from Fig. 6.

Supplementary information

Supplementary Information

Supplementary Notes, Figs. 1–15 and Table 1

Reporting Summary

