Efficient integration of heterogeneous single-cell transcriptomes using Scanorama

Hie, Brian; Bryson, Bryan; Berger, Bonnie

doi:10.1038/s41587-019-0113-3

Analysis
Published: 06 May 2019

Nature Biotechnology volume 37, pages 685–691 (2019)

Abstract

Integration of single-cell RNA sequencing (scRNA-seq) data from multiple experiments, laboratories and technologies can uncover biological insights, but current methods for scRNA-seq data integration are limited by a requirement for datasets to derive from functionally similar cells. We present Scanorama, an algorithm that identifies and merges the shared cell types among all pairs of datasets and accurately integrates heterogeneous collections of scRNA-seq data. We applied Scanorama to integrate and remove batch effects across 105,476 cells from 26 diverse scRNA-seq experiments representing 9 different technologies. Scanorama is sensitive to subtle temporal changes within the same cell lineage, successfully integrating functionally similar cells across time series data of CD14⁺ monocytes at different stages of differentiation into macrophages. Finally, we show that Scanorama is orders of magnitude faster than existing techniques and can integrate a collection of 1,095,538 cells in just ~9 h.

**Fig. 1: Illustration of ‘panoramic’ dataset integration.**

**Fig. 2: Scanorama correctly integrates a simple collection of datasets where other methods fail.**

**Fig. 3: Panoramic integration of 26 single-cell datasets across nine different technologies.**

**Fig. 4: Scanorama scales to collections of datasets with more than a million cells.**

**Fig. 5: Scanorama is sensitive to subtle transcriptional changes in cellular state over time.**

Data availability

All datasets are available for download at http://scanorama.csail.mit.edu/data.tar.gz. scRNA-seq read data and expression matrices generated in this study have been deposited to the Gene Expression Omnibus (GEO) under accession GSE126085. We used the following publicly available datasets:

• 293T cells from Zheng et al.¹⁷ (https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/293t)

• Jurkat cells from Zheng et al.¹⁷ (https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/jurkat)

• 50:50 Jurkat:293T cell mixture from Zheng et al.¹⁷ (https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/jurkat:293t_50:50)

• 99:1 Jurkat:293T cell mixture from Zheng et al.¹⁷ (https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/jurkat_293t_99_1)

• Mouse neurons from 10x Genomics (https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/neuron_9k)

• Macrophages (Mtb exposed) from Gierahn et al.⁴⁸ (GSE92495)

• Macrophages (unexposed) from Gierahn et al.⁴⁸ (GSE92495)

• Mouse HSCs from Paul et al.¹⁸ (GSE72857)

• Mouse HSCs from Nestorowa et al.¹⁹ (GSE81682)

• Human pancreatic islet cells from Baron et al.²⁰ (GSE84133)

• Human pancreatic islet cells from Muraro et al.²¹ (GSE85241)

• Human pancreatic islet cells from Grün et al.²² (GSE81076)

• Human pancreatic islet cells from Lawlor et al.²³ (GSE86469)

• Human pancreatic islet cells from Segerstolpe et al.²⁴ (E-MTAB-5061)

• Human PBMCs from Zheng et al.¹⁷ (https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/fresh_68k_pbmc_donor_a)

• Human CD19⁺ B cells from Zheng et al.¹⁷ (https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/b_cells)

• Human CD14⁺ monocytes from Zheng et al.¹⁷ (https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/cd14_monocytes)

• Human CD4⁺ helper T cells from Zheng et al.¹⁷ (https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/cd4_t_helper)

• Human CD56⁺ natural killer cells from Zheng et al.¹⁷ (https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/cd56_nk)

• Human CD8⁺ cytotoxic T cells from Zheng et al.¹⁷ (https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/cytotoxic_t)

• Human CD4⁺CD45RO⁺ memory T cells from Zheng et al.¹⁷ (https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/memory_t)

• Human CD4⁺CD25⁺ regulatory T cells from Zheng et al.¹⁷ (https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/regulatory_t)

• Human PBMCs from Kang et al.⁴⁹ (GSE96583)

• Human PBMCs from 10x Genomics (https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k)

• Mouse bone marrow derived dendritic cells with LPS stimulation from Shalek et al.²⁸ (GSE48968)

• D. melanogaster brain cells from Davie et al.²⁹ (GSE107451)

Code availability

Scanorama code is available as Supplementary Code and at https://github.com/brianhie/scanorama.

References

1. Grün, D. et al. Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature 525, 251–255 (2015).
2. Villani, A.-C. et al. Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors. Science 356, eaah4573 (2017).
3. Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381–386 (2014).
4. Treutlein, B. et al. Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature 509, 371–375 (2014).
5. Aibar, S. et al. SCENIC: single-cell regulatory network inference and clustering. Nat. Methods 14, 1083–1086 (2017).
6. Qiu, X. et al. Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods 14, 979–982 (2017).
7. Chen, X., Teichmann, S. A. & Meyer, K. B. From tissues to cell types and back: single-cell gene expression analysis of tissue architecture. Annu. Rev. Biomed. Data Sci. 1, 29–51 (2018).
8. Rozenblatt-Rosen, O., Stubbington, M. J. T., Regev, A. & Teichmann, S. A. The Human Cell Atlas: from vision to reality. Nature 550, 451–453 (2017).
9. Haghverdi, L., Lun, A., Morgan, M. & Marioni, J. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
10. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
11. Brown, M. & Lowe, D. G. Automatic panoramic image stitching using invariant features. Int. J. Comput. Vis. 74, 59–73 (2007).
12. Dekel, T., Oron, S., Rubinstein, M., Avidan, S. & Freeman, W. T. Best-Buddies Similarity for robust template matching. in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (eds. Grauman, K. et al.) 2021–2029 (IEEE, 2015).
13. Halko, N., Martinsson, P.-G. & Tropp, J. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 53, 217–288 (2011).
14. Charikar, M. S. Similarity estimation techniques from rounding algorithms. in Proc. Thirty-Fourth Annual ACM Symposium on Theory of Computing (ed. Reif, J.) 380–388 (ACM, 2002).
15. Dasgupta, S. & Freund, Y. Random projection trees and low dimensional manifolds. in Proc. Fortieth Annual ACM Symposium on Theory of Computing (eds. Ladner, R. & Dwork, C.) 537–546 (ACM, 2008).
16. Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 147 (2017).
17. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
18. Paul, F. et al. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell 163, 1663–1677 (2015).
19. Nestorowa, S. et al. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood 128, e20–e31 (2016).
20. Baron, M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 3, 346–360 (2016).
21. Muraro, M. J. et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst. 3, 385–394 (2016).
22. Grün, D. et al. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell 19, 266–277 (2016).
23. Lawlor, N. et al. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes. Genome Res. 27, 208–222 (2017).
24. Segerstolpe, Å. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 24, 593–607 (2016).
25. Rousseeuw, P. J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).
26. Saunders, A. et al. Molecular diversity and specializations among the cells of the adult mouse brain. Cell 174, 1015–1030.e16 (2018).
27. Rosenberg, A. B. et al. Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science 360, 176–182 (2018).
28. Shalek, A. K. et al. Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. Nature 510, 363–369 (2014).
29. Davie, K. et al. A single-cell transcriptome atlas of the aging Drosophila brain. Cell 174, 982–998.e20 (2018).
30. Li, W. V. & Li, J. J. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nat. Commun. 9, 997 (2018).
31. Ronen, J. & Akalin, A. netSmooth: Network-smoothing based imputation for single cell RNA-seq. F1000Research 7, 8 (2018).
32. Yip, S. H., Sham, P. C. & Wang, J. Evaluation of tools for highly variable gene discovery from single-cell RNA-seq data. Brief. Bioinform. https://doi.org/10.1093/bib/bby011 (2018).
33. Tung, P. Y. et al. Batch effects and the effective design of single-cell gene expression studies. Sci. Rep. 7, 39921 (2017).
34. Stegle, O., Teichmann, S. A. & Marioni, J. C. Computational and analytical challenges in single-cell transcriptomics. Nat. Rev. Genet. 16, 133–145 (2015).
35. Kiselev, V. Y., Yiu, A. & Hemberg, M. scmap: projection of single-cell RNA-seq data across datasets. Nat. Methods 15, 359–362 (2018).
36. Kiselev, V. Y. et al. SC3: consensus clustering of single-cell RNA-seq data. Nat. Methods 14, 483–486 (2017).
37. Zhang, J. M., Fan, J., Fan, H. C., Rosenfeld, D. & Tse, D. N. An interpretable framework for clustering single-cell RNA-Seq datasets. BMC Bioinformatics 19, 93 (2018).
38. Cho, H., Berger, B. & Peng, J. Generalizable and scalable visualization of single-cell data using neural networks. Cell Syst. 7, 185–191 (2018).
39. Van Dijk, D. et al. Recovering gene interactions from single-cell data using data diffusion. Cell 174, 716–729.e27 (2018).
40. Ding, J., Condon, A. & Shah, S. P. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat. Commun. 9, 2002 (2018).
41. Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol. 33, 495–502 (2015).
42. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
43. Soneson, C. & Robinson, M. D. Bias, robustness and scalability in single-cell differential expression analysis. Nat. Methods 15, 255–261 (2018).
44. Cleary, B., Cong, L., Cheung, A., Lander, E. S. & Regev, A. Efficient generation of transcriptomic profiles by random composite measurements. Cell 171, 1424–1436.e18 (2017).
45. Crow, M., Paul, A., Ballouz, S., Huang, Z. J. & Gillis, J. Characterizing the replicability of cell types defined by single cell RNA-sequencing data using MetaNeighbor. Nat. Commun. 9, 884 (2018).
46. Hie, B., Cho, H., DeMeo, B., Bryson, B. & Berger, B. Geometric sketching compactly summarizes the single-cell transcriptomic landscape. Cell Syst. (in the press); preprint at https://doi.org/10.1101/536730
47. Allaire, J., Ushey, K., Tang, Y. & Eddelbuettel, D. Reticulate: R interface to Python (RStudio, 2017).
48. Gierahn, T. M. et al. Seq-Well: portable, low-cost RNA sequencing of single cells at high throughput. Nat. Methods 14, 395–398 (2017).
49. Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 36, 89–94 (2018).
50. Oliphant, T. E. SciPy: open source scientific tools for Python. Comput. Sci. Eng. 9, 10–20 (2007).
51. Loh, P. R., Baym, M. & Berger, B. Compressive genomics. Nat. Biotechnol. 30, 627–630 (2012).
52. Van Der Maaten, L. J. P. & Hinton, G. E. Visualizing high-dimensional data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
53. Pedregosa, F. & Varoquaux, G. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
54. Buttner, M., Miao, Z., Wolf, A., Teichmann, S. A. & Theis, F. J. A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16, 43–49 (2017).
55. Macosko, E. Z. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
56. Hagberg, A. A., Schult, D. A. & Swart, P. J. Exploring network structure, dynamics, and function using NetworkX. in Proc. 7th Python Sci. Conf. (eds. Varoquaux, G. et al.) 11–15 (SciPy, 2008).
57. Eden, E., Navon, R., Steinfeld, I., Lipson, D. & Yakhini, Z. GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics 10, 48 (2009).
58. Skipper, S. & Perktold, J. Statsmodels: econometric and statistical modeling with Python. in Proc. 9th Python Sci. Conf. (eds. van der Walt, S. & Millman, J.) 57–61 (SciPy, 2010).
59. Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).

Acknowledgements

B.H. is partially supported by NIH grant R01GM081871 (to B. Berger). We thank H. Cho, S. Nyquist and L. Schaeffer for valuable discussions and feedback. We thank S. Tovmasian for assistance in preparing the manuscript. We thank R. Amezquita, G. Sturm, I. Virshup, A. Wenzel and others for their helpful questions, comments and improvements to the Scanorama package throughout the prerelease process.

Author information

Authors and Affiliations

Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA, USA
Brian Hie & Bonnie Berger
Department of Biological Engineering, MIT, Cambridge, MA, USA
Bryan Bryson
Department of Mathematics, MIT, Cambridge, MA, USA
Bonnie Berger

Contributions

All authors conceived the problem. B.H. and B. Berger conceived the algorithm. B.H. developed and performed the computational experiments. B. Bryson performed the scRNA-seq experiments. B. Berger led the research. All authors wrote the manuscript.

Corresponding authors

Correspondence to Bryan Bryson or Bonnie Berger.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Fig. 1 Integration of a 293T/Jurkat mixture using scran MNN and Seurat CCA is sensitive to the order in which the datasets are considered.

(a) When a mixture dataset of 293T cells and Jurkat cells is chosen as the first reference dataset (n = 3388 cells), scran MNN correctly integrates a second dataset of Jurkat cells (n = 3257) and a third dataset of 293T cells (n = 2885 cells). (b) When given the two datasets of 293T cells and Jurkat cells first, scran MNN incorrectly merges the two cell types together into a single cluster. Integration by scran MNN requires its first dataset to share at least one cell type with all other datasets that are successively integrated, which may not be a reasonable assumption. Seurat CCA was unsuccessful at integrating these three datasets in both cases (a,b). (c) Without correction, Jurkat cells cluster by batch instead of by cell type.

Supplementary Fig. 2 Comparison of scRNA-seq integration methods on simulated data.

(a-h) We use the Splatter package to simulate three datasets with four cell types in total, where dataset 1 has cell types A and B, dataset 2 has cell types B and C, and dataset 3 has cell types C and D. In each dataset, we assign cells to a cell type with a 50/50 probability. Each dataset contains 1,000 cells. The Splatter simulation also generates batch effects between datasets such that without batch correction cells cluster by both dataset and batch (a, e). For Seurat CCA and scran MNN, datasets are aligned in numerical order. Scanorama correctly aligns the same cell types together (b, f), whereas scran MNN incorrectly merges cell types A and D and does not merge cell type C across batches (d, h). Seurat CCA is unable to merge the datasets together (c, g). (i) Scanorama alignment scores find the correct pairwise matches between the simulated cell types. (j) Scanorama has significantly higher Silhouette Coefficients (median of 0.28) than the uncorrected data (median of 0.00; independent, two-sided t-test P < 5e-324; n = 3,000 cells), scran MNN (median of 0.16; P = 1.1e-40), and Seurat CCA (median of 0.18; P = 2.7e-37). An asterisk (*) indicates a significantly higher Silhouette Coefficient distribution (Bonferroni corrected P < 0.05) between Scanorama and no correction, a dagger (†) indicates significance over scran MNN, and a double dagger (‡) indicates significance over Seurat CCA. t-SNE visualizations use a learning rate of 200 and a perplexity of 100. Box plot boxes extend from lower to upper quartiles with an orange line at the median and green triangle at the mean; whiskers show the range.
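The Bonferroni correction used for the significance markers in panel (j) can be illustrated with a minimal sketch. It assumes the family of tests is the three comparisons listed in the caption (the caption does not state the family size), and the function name `bonferroni` is ours rather than part of any Scanorama code.

```python
def bonferroni(p_values):
    """Multiply each P value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# P values reported in panel (j). The comparison against no correction is
# reported only as a bound (P < 5e-324), so the bound itself is used here.
reported = {"no correction": 5e-324, "scran MNN": 1.1e-40, "Seurat CCA": 2.7e-37}

# ASSUMPTION: the three comparisons above form the whole family of tests.
adjusted = dict(zip(reported, bonferroni(list(reported.values()))))
significant = {name: p < 0.05 for name, p in adjusted.items()}

for name, p in adjusted.items():
    print(f"{name}: adjusted P = {p:.3g}, significant = {significant[name]}")
```

Because the reported P values are many orders of magnitude below 0.05, any plausible family size would leave all three comparisons significant.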

Supplementary Fig. 3 Visualizing Scanorama alignment scores across 26 scRNA-seq datasets.

Scanorama alignment scores from aligning 26 heterogeneous scRNA-seq datasets reveal high amounts of alignment among biologically similar datasets and alignment scores close to or at zero for datasets that are not biologically similar. Heatmap rows and columns correspond to different datasets and diagonal entries are set to 1.
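As a rough illustration of how such a score might be computed for one pair of datasets, the sketch below assumes that the alignment score is the larger of the two fractions of cells (one fraction per dataset) that take part in at least one cross-dataset match. This caption does not define the score, so that definition is an assumption; the function name `alignment_score` and the `(index_in_a, index_in_b)` match format are hypothetical.

```python
def alignment_score(matches, n_cells_a, n_cells_b):
    """Score one dataset pair from its cross-dataset matches.

    matches: iterable of (index_in_a, index_in_b) pairs (hypothetical format).
    Returns the larger of the two per-dataset fractions of matched cells.
    """
    if n_cells_a <= 0 or n_cells_b <= 0:
        raise ValueError("both datasets must contain at least one cell")
    matches = list(matches)
    matched_a = {i for i, _ in matches}  # distinct matched cells in dataset A
    matched_b = {j for _, j in matches}  # distinct matched cells in dataset B
    return max(len(matched_a) / n_cells_a, len(matched_b) / n_cells_b)


# Toy example: 2 of 4 cells in A and 2 of 10 cells in B are matched.
toy_matches = [(0, 0), (0, 1), (2, 1)]
print(alignment_score(toy_matches, n_cells_a=4, n_cells_b=10))
```

Under this reading, a small dataset that is fully contained in a large one still scores highly, and a pair with no matches scores zero.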

Supplementary Fig. 4 Comparison of scRNA-seq integration methods on HSCs.

Integration of 2401 hematopoietic stem cells (HSCs) from MARS-seq and 774 HSCs from Smart-seq2. (a, e) Two datasets of HSCs plotted on the first two principal components (PCs) show cells separated by batch effects along the second PC. Cells are visualized using PCs, instead of t-SNE embeddings, since they organize according to their pseudo-temporal relationships when visualized with PCA; granulocyte-macrophage progenitors (GMP) and megakaryocyte-erythrocytes (MEP) are derived from common myeloid progenitors (CMP). (b, f) Scanorama removes any significant difference due to experimental batch (natural log likelihood-ratio = -902; n = 3,175 cells). (c, g) Seurat CCA overcorrects and places all cell types into a single cluster. (d, h) scran MNN obtains a similar result to that of Scanorama. (i) Scanorama alignments consist of a substantial percentage of the cells in both datasets, as expected. (j) Scanorama and scran MNN have similar performance and the same median Silhouette Coefficient (median of 0.28; independent, two-sided t-test P = 0.14; n = 3,175 cells), but Scanorama has significantly better performance than no correction (median of 0.22; P = 8e-10) and Seurat CCA (median of 0.07; P = 2e-132). An asterisk (*) indicates a significantly higher Silhouette Coefficient distribution (Bonferroni corrected P < 0.05) between Scanorama and no correction and a double dagger (‡) indicates significance over Seurat CCA. Box plot boxes extend from lower to upper quartiles, whiskers indicate range, an orange line indicates the median, and a green triangle indicates the mean (n = 3,175 cells). (k-m) Expression of marker genes indicating different stages of erythropoiesis. APOE and GATA2 are more highly expressed in the erythropoietic transition from common myeloid progenitors (CMPs) to megakaryocyte-erythrocytes (MEPs) (k, l) and CTSE is more highly expressed in MEPs (m).

Supplementary Fig. 5 Comparison of scRNA-seq integration methods on pancreatic islet cells.

Integration of 8569 pancreatic islet cells from inDrop, 2449 cells from CEL-Seq2, 1276 cells from CEL-Seq, 638 cells from Fluidigm C1, and 2989 cells from Smart-seq2. (a, e) Pancreatic islets cluster by cell type and batch in the uncorrected setting. (b-d, f-h) Visually, Scanorama, Seurat CCA, and scran MNN have similar performance in merging cell-type specific clusters together across datasets. (i) Scanorama finds substantial overlap among all five pancreatic islet datasets. (j) All methods have relatively similar performance, but Seurat CCA has a higher Silhouette Coefficient distribution (median of 0.30; compared to Scanorama, independent, two-sided t-test P = 4.8e-3; n = 15,921 cells) followed by Scanorama (median of 0.28), scran MNN (median of 0.25; P = 5.1e-4), and the uncorrected data (median of 0.23; P = 9.7e-5). An asterisk (*) indicates a significantly higher Silhouette Coefficient distribution (Bonferroni corrected P < 0.05) between Scanorama and no correction and a dagger (†) indicates significance over scran MNN. Box plot boxes extend from lower to upper quartiles, whiskers indicate range, an orange line indicates the median, and a green triangle indicates the mean (n = 15,921 cells).

Supplementary Fig. 6 Marker gene expression of Scanorama-integrated and batch-corrected pancreatic islet datasets.

Gene expression after integration of 8569 pancreatic islet cells from inDrop, 2449 cells from CEL-Seq2, 1276 cells from CEL-Seq, 638 cells from Fluidigm C1, and 2989 cells from Smart-seq2. (a-f) Marker gene expression heatmaps of the t-SNE embedded panorama of pancreatic islet cells. We observe higher expression of TTR in alpha cells (a), HADH and PCSK1 in beta cells (b, c), KRT19 in ductal cells (d), SST in delta cells (e), and PPY in gamma cells (f). (g, h) Marker genes GADD45A and HERPUD1 related to ER stress are significantly elevated among a subpopulation of beta cells (n = 320 cells) compared to other beta cells (n = 4765 cells), consistent with a rare subpopulation of beta cells marked by ER stress that was previously identified in one of the datasets. The P-values for increased expression of GADD45A and HERPUD1 are also much stronger after integrating five pancreas datasets (P = 6.07e-14 for GADD45A and P = 2.42e-22 for HERPUD1) than for the initial findings in a single dataset (P = 5.21e-3 for GADD45A and P = 2.98e-5 for HERPUD1; 102 ER stress beta cells and 1,114 other beta cells). We computed P-values using a two-sided, Welch’s t-test for comparing populations with unequal variances. t-SNE visualizations use a learning rate of 200 and a perplexity of 400. Box plot boxes extend from lower to upper quartiles, upper whisker extends to last point less than the third quartile plus 1.5 times the interquartile range (IQR), lower whisker extends to first point greater than the first quartile minus 1.5 times the IQR, points indicate remaining cells, an orange line indicates the median, and a green triangle indicates the mean.
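The Welch's t-test mentioned above compares two populations without assuming equal variances. A minimal sketch of its test statistic and Welch–Satterthwaite degrees of freedom follows; the expression values are made up for illustration and are not data from the study, and `welch_t` is a hypothetical helper name. Turning the statistic into a P value needs the t-distribution CDF, which the Python standard library lacks (SciPy's `scipy.stats.ttest_ind` with `equal_var=False` provides it).

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Return (t statistic, degrees of freedom) for Welch's unequal-variance t-test."""
    nx, ny = len(x), len(y)
    if nx < 2 or ny < 2:
        raise ValueError("each group needs at least two observations")
    # Squared standard error contributed by each group (sample variance / n).
    vx, vy = variance(x) / nx, variance(y) / ny
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


# Made-up expression values for a small and a larger group of cells.
stressed = [2.9, 3.4, 3.1, 3.8]
others = [1.2, 1.9, 1.5, 1.1, 1.7, 1.4]
t_stat, dof = welch_t(stressed, others)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

The unequal-variance form suits this comparison because the two groups differ greatly in size (320 versus 4,765 beta cells in the integrated data).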

Supplementary Fig. 7 Clustering of Scanorama-integrated pancreatic islet datasets and batch correction quality.

Batch correction performance after applying Scanorama to 8569 pancreatic islet cells from inDrop, 2449 cells from CEL-Seq2, 1276 cells from CEL-Seq, 638 cells from Fluidigm C1, and 2989 cells from Smart-seq2. (a-c) k-means clustering of datasets integrated with Scanorama results in clusters that are orthogonal to differences due to batch, noting that even smaller sub-clusters do not find dataset-specific structure. (d, e) Scanorama batch correction of five pancreas datasets results in lower one-way ANOVA F-values compared to scran MNN (we note that this analysis is not applicable to Seurat CCA, which finds integrated embeddings and does not modify gene expression values). Each point represents a gene; results are for 15,369 genes. Closer to the left is better, indicating more similar gene expression distributions after batch correction. The red dashed line indicates equal F-values between uncorrected and corrected datasets.
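The per-gene F-value in panels (d, e) can be sketched as a standard one-way ANOVA across batches: the ratio of between-batch to within-batch mean squares for one gene's expression. The code below is a minimal standard-library version with made-up values; `one_way_anova_f` is a hypothetical name, and the exact preprocessing used for the figure is not described in this caption.

```python
def one_way_anova_f(groups):
    """One-way ANOVA F statistic for one gene; each group holds one batch's values."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups and n > k")
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    if ss_within == 0:
        # Degenerate case: no within-batch spread at all.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Made-up expression of one gene in three batches, before and after correction.
before = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
after = [[2.0, 3.0, 4.0], [2.5, 3.0, 3.5], [2.0, 3.5, 3.5]]
print(one_way_anova_f(before), one_way_anova_f(after))
```

A lower F-value after correction means the gene's expression distribution differs less between batches, which is the sense in which "closer to the left is better" in the scatter plots.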

Supplementary Fig. 8 Comparison of scRNA-seq integration methods on PBMCs.

Integration of 18018 PBMCs from 10x Genomics (donor 1), 2261 CD19+ B cells from 10x, 295 CD14+ monocytes from 10x, 3713 CD4+ helper T cells from 10x, 6657 CD56+ NK cells from 10x, 3990 CD8+ cytotoxic T cells from 10x, 3628 CD4+/CD45RO+ memory T cells from 10x, 3365 CD4+/CD25+ regulatory T cells from 10x, 3774 PBMCs using Drop-seq, and 2293 PBMCs from 10x Genomics (donor 2). (a, e) Without batch correction, PBMC datasets cluster by both cell type and dataset. (b, f) Scanorama integration results in cells clustering by cell type. (c, g) Seurat CCA integration results in overcorrection. (d, h) scran MNN obtains a similar result to that of Scanorama because a large dataset of PBMCs was chosen as the first dataset. We expect performance to degrade if the large dataset were not chosen first. (i) Scanorama alignment scores capture relationships between the datasets. (j) Scanorama has the highest distribution of Silhouette Coefficients (median of 0.05) compared to scran MNN (median of 0.03; independent, two-sided t-test P = 0.0011; n = 47,994 cells), the uncorrected data (median of -0.08; P = 1e-51), and Seurat CCA (median of -0.18; P = 9e-194). An asterisk (*) indicates a significantly higher Silhouette Coefficient distribution (Bonferroni corrected P < 0.05) between Scanorama and no correction and a double dagger (‡) indicates significance over Seurat CCA. Box plot boxes extend from lower to upper quartiles, whiskers indicate range, an orange line indicates the median, and a green triangle indicates the mean.

Supplementary Fig. 9 Marker gene expression in Scanorama-integrated and batch-corrected PBMC datasets.

Gene expression after integration of 18018 PBMCs from 10x Genomics (donor 1), 2261 CD19+ B cells from 10x, 295 CD14+ monocytes from 10x, 3713 CD4+ helper T cells from 10x, 6657 CD56+ NK cells from 10x, 3990 CD8+ cytotoxic T cells from 10x, 3628 CD4+/CD45RO+ memory T cells from 10x, 3365 CD4+/CD25+ regulatory T cells from 10x, 3774 PBMCs using Drop-seq, and 2293 PBMCs from 10x Genomics (donor 2). (a-f) Marker gene expression heatmaps of the t-SNE embedded panorama of PBMCs. We observe higher expression of MS4A1 in B cells (a), CD8A in NK cells (b), CD3E and CD4 in T cells (c, d), and CD14 and S100A8 in monocytes (e, f). t-SNE visualizations use a learning rate of 200 and a perplexity of 400.

Supplementary Fig. 10 Twenty-six-dataset quality control.

(a) Cells in our experiment integrating 26 diverse datasets (n = 105,476 cells) cluster according to cell type instead of by relative differences in the number of unique genes. For example, the two HSC datasets are aligned despite different dataset-specific gene percentages (the MARS-Seq dataset has a relatively low average percentage of nonzero genes at 30% versus the Smart-seq2 dataset with an average of 79% nonzero genes), as are the pancreas datasets. (b) In our analysis of 26 datasets, cells were included if they contained greater than 600 unique genes. We observe a bimodal distribution of cells according to their number of unique genes and we filter out the mode of cells that have lower numbers of unique genes due to either transcriptional quiescence, high amounts of dropout, or other technical artefacts. (c) We compute the SVD of the concatenation of the 26 datasets and visualize the top 300 singular values in a bar plot. To preserve most of the variation in the data, indicated by the ‘elbow’ in the bar plot, we use a conservative cutoff of the top 100 components from the SVD. (d) Integrating datasets (n = 105,476 cells) based on the union of all genes (setting unobserved gene expression values to zero) yields results similar to those from taking the intersection (although interestingly, a small portion of CD14+ monocytes align with macrophages, which may have some biological basis); however, we caution against a union-based approach since this could introduce variability that is not reflective of the underlying biology.
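The filtering rule in (b) and the intersection versus union choice in (d) can be sketched on toy data as follows. This is a minimal illustration under our own assumptions: datasets are represented as `(gene_names, counts)` pairs with one row per cell, and the helper names `filter_cells` and `merge_gene_sets` are hypothetical rather than functions of the Scanorama package.

```python
def filter_cells(counts, min_genes=600):
    """Keep cells (rows) in which more than `min_genes` genes have nonzero counts."""
    return [row for row in counts if sum(1 for v in row if v > 0) > min_genes]


def merge_gene_sets(datasets, mode="intersection"):
    """Re-index every dataset onto a common, sorted gene list.

    datasets: list of (gene_names, counts) pairs, counts with one row per cell.
    mode: "intersection" keeps genes measured in all datasets; "union" keeps
    every gene and sets unobserved expression values to zero.
    """
    gene_sets = [set(genes) for genes, _ in datasets]
    if mode == "intersection":
        keep = set.intersection(*gene_sets)
    elif mode == "union":
        keep = set.union(*gene_sets)
    else:
        raise ValueError("mode must be 'intersection' or 'union'")
    genes_out = sorted(keep)
    merged = []
    for genes, counts in datasets:
        column = {g: i for i, g in enumerate(genes)}
        merged.append(
            [[row[column[g]] if g in column else 0 for g in genes_out] for row in counts]
        )
    return genes_out, merged


# Toy data: two datasets measuring overlapping gene panels.
ds1 = (["A", "B", "C"], [[1, 0, 2]])
ds2 = (["B", "C", "D"], [[3, 4, 5]])
print(merge_gene_sets([ds1, ds2], mode="intersection"))
print(merge_gene_sets([ds1, ds2], mode="union"))
```

In the union case, the zeros filled in for unobserved genes are indistinguishable from genuinely unexpressed genes, which is one way the union can introduce variability unrelated to biology.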

Supplementary Fig. 11 Silhouette coefficient distributions across 26 scRNA-seq datasets.

In addition to visually inspecting the clusters produced by a method like t-SNE, we can quantify the integrative performance of our method by computing a Silhouette Coefficient for each cell (Methods). Higher values indicate that samples from the same cell type also cluster together, indicating better clustering performance. For our experiment in which we integrate 26 diverse scRNA-seq datasets, we compute Silhouette Coefficients using low dimensional embeddings as described in Methods. Scanorama has a significantly higher Silhouette Coefficient distribution (median of 0.17) compared to scran MNN (median of -0.03; P < 5e-324), Seurat CCA (median of -0.18; P < 5e-324), and no correction (median of 0.14; P = 4e-6) when integrating our collection of 26 datasets containing 105,476 cells (Fig. 2a-c). Notably, scran MNN and Seurat CCA have lower median Silhouette Coefficients than if no correction had been applied, indicating large amounts of overcorrection. Box plot boxes extend from lower to upper quartiles with an orange line at the median and green triangle at the mean; whiskers show the range. P-values are determined using an independent, two-sided t-test (n = 105,476 cells). An asterisk (*) indicates a significantly higher Silhouette Coefficient distribution (Bonferroni corrected P < 0.05) between Scanorama and no correction, a dagger (†) indicates significance over scran MNN, and a double dagger (‡) indicates significance over Seurat CCA.
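For readers unfamiliar with the metric, the Silhouette Coefficient of a cell is (b − a) / max(a, b), where a is its mean distance to other cells of the same type and b is its mean distance to cells of the nearest other type. The sketch below is a brute-force standard-library version on toy 2-D points. The function name is hypothetical, Euclidean distance is assumed, and scoring singleton or single-cluster cases as 0 is a convention we adopt here, not something stated in the caption.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def silhouette_coefficients(points, labels):
    """Per-point Silhouette Coefficient, (b - a) / max(a, b), by brute force."""
    scores = []
    for i, p in enumerate(points):
        # Collect distances from point i to every other point, grouped by label.
        by_label = {}
        for j, q in enumerate(points):
            if j != i:
                by_label.setdefault(labels[j], []).append(dist(p, q))
        own = by_label.pop(labels[i], None)
        if not own or not by_label:
            scores.append(0.0)  # convention: singleton cluster or only one cluster
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(sum(d) / len(d) for d in by_label.values())  # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Toy embedding: two well-separated groups of two cells each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_coefficients(points, ["T", "T", "B", "B"])  # labels follow geometry
bad = silhouette_coefficients(points, ["T", "B", "T", "B"])   # labels cut across it
print(good, bad)
```

When the labels follow the geometry the scores are close to 1, and when they cut across it the scores turn negative, which matches the interpretation of negative medians as overcorrection given above.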

Supplementary Fig. 12 Silhouette coefficient distributions for 26-dataset integration at different parameters.

Sensitivity analysis of Scanorama alignment parameters and t-SNE visualization parameters for the integration of 26 diverse scRNA-seq datasets. Box plots show distributions of Silhouette Coefficients at different parameter settings. Box plot boxes extend from lower to upper quartiles with an orange line at the median and green triangle at the mean; whiskers show the range. All distributions are over the same 105,476 cells across 26 heterogeneous scRNA-seq datasets. An asterisk (*) indicates a significantly higher Silhouette Coefficient distribution (two-sided independent t-test, Bonferroni corrected P < 0.05) between Scanorama and no correction, a dagger (†) indicates significance over scran MNN, and a double dagger (‡) indicates significance over Seurat CCA. Importantly, in the analysis for alignment parameters (a-d), Silhouette Coefficients are calculated for the integrated, low-dimensional embeddings. When assessing the sensitivity of t-SNE visualization parameters (e, f), we calculate the Silhouette Coefficients on the 2-dimensional t-SNE embeddings (which are computed from the low dimensional embeddings). All plots also include the Silhouette Coefficient distributions for uncorrected data, Seurat CCA integration, and scran MNN correction on low dimensional embeddings as described in Methods. (a) Performance is largely insensitive to the k nearest neighbor parameter around the default value of 20, and k can go as low as 5 without affecting performance. At larger values of k, the matches become more permissive and the Silhouette Coefficients start to drop; at k = 100 the median Silhouette Coefficient (0.091) is below that of the uncorrected case. (b) There is no significant change in the distribution of Silhouette Coefficients between the approximate and exact nearest neighbors settings (independent, two-sided t-test P = 0.39; n = 105,476 cells), although the integration runtime increases to more than 60 minutes without the approximation algorithm.
(c) We recommend keeping α at a low value greater than zero, which can be learned from the data if some of the cell types being integrated are known. Lower values may introduce overcorrection, while higher values approach the uncorrected case. (d) The median Silhouette Coefficient is largely insensitive to different values of the smoothing parameter σ for the Gaussian kernel function. (e) Visualizing the integration of 26 datasets requires a high perplexity (around 500 or greater) to obtain a median Silhouette Coefficient comparable to that for the low dimensional embeddings. We set the perplexity to 1,200 for visualizing the 26 datasets (Fig. 3a). (f) When visualizing the 26 datasets, a higher t-SNE learning rate improves the median Silhouette Coefficient to be comparable to that for the low dimensional embeddings. The Silhouette Coefficient distributions for the t-SNE embeddings are generally wider than those for the low dimensional embeddings since it is harder to obtain large separations between clusters in two dimensions.
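
To see why larger k makes matches more permissive, as noted in (a), consider a minimal brute-force sketch of mutual k-nearest-neighbor matching between two datasets. This is an illustration under stated assumptions, not Scanorama's implementation: it uses exact Euclidean search rather than the approximate search compared in (b), and the function names and toy coordinates are hypothetical.

```python
import math


def k_nearest(query, reference, k):
    """For each query point, the set of indices of its k nearest reference points."""
    neighbors = []
    for q in query:
        order = sorted(range(len(reference)), key=lambda j: math.dist(q, reference[j]))
        neighbors.append(set(order[:k]))
    return neighbors


def mutual_nearest_neighbors(a, b, k):
    """Pairs (i, j) where b[j] is among the k nearest to a[i] and vice versa."""
    a_to_b = k_nearest(a, b, k)
    b_to_a = k_nearest(b, a, k)
    return {(i, j) for i, nbrs in enumerate(a_to_b) for j in nbrs if i in b_to_a[j]}


# Toy data: two cells in each dataset are close; b[2] is a distant outlier.
a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
b = [(0.0, 0.1), (1.0, 0.1), (50.0, 50.0)]

strict = mutual_nearest_neighbors(a, b, k=1)      # only the two close pairs
permissive = mutual_nearest_neighbors(a, b, k=3)  # every pair, including the outlier
```

With k = 1 only the two genuinely close pairs match. With k = 3 every cell in this toy example is matched to every other, including the outlier, which mirrors the drop in Silhouette Coefficients reported at large k.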

Supplementary Fig. 13 Scanorama integration of different regions of the mouse CNS.

Visualization of 10% (n = 109,553 cells) of 1,095,538 mouse CNS cells after Scanorama integration colored by dataset. Corresponding cell type labels and marker genes are given in Fig. 4.

Supplementary Fig. 14 Comparison of scRNA-seq integration methods on datasets with no overlapping cell types.

(a) A collection of three diverse datasets (9,032 mouse neurons, 2,401 mouse HSCs, and 4,510 human macrophages) clusters separately without correction, as expected. (b) When given a collection of three diverse datasets with no overlapping cell types (mouse neurons, HSCs, and unstimulated macrophages), Scanorama finds a few spurious alignments between datasets, but none of the alignment scores pass the cutoff threshold of 10% (e). (c, d) scran MNN and Seurat CCA are more prone to overcorrection. (f) Both the uncorrected and Scanorama corrected data have the highest Silhouette Coefficients (both have a median of 0.37) compared to scran MNN (median of 0.20; independent, two-sided t-test P = 7e-252; n = 15,794 cells) and Seurat CCA (median of -0.12; P < 5e-324). Box plot boxes extend from lower to upper quartiles with an orange line at the median and green triangle at the mean; whiskers show the range. (g) A collection of six diverse datasets (2,885 293T cells, 9,032 mouse neurons, 2,401 mouse HSCs, 4,510 human macrophages, 8,569 human pancreatic islet cells, and 18,018 human PBMCs) clusters separately without correction, as expected. (h) When given the same collection of six diverse datasets with no overlapping cell types, Scanorama keeps disparate cell types separate with only a small amount of overcorrection in matching a small portion of 293T cells with PBMCs. (i, j) Because they are not designed for heterogeneous dataset integration, both scran MNN and Seurat CCA integrate biologically disparate cell types among the same collection of datasets. (k) Scanorama alignment scores are at or very close to zero between the different datasets.
(l) While the highest Silhouette Coefficient distribution belongs to the data without batch correction (median of 0.35), Scanorama has the least overcorrection among the datasets and has higher Silhouette Coefficients (median of 0.20) than scran MNN (median of 0.10; two-sided independent t-test P = 5.3e-98; n = 36,755 cells) and Seurat CCA (median of -0.18; P < 5e-324). A dagger (†) indicates a significantly higher Silhouette Coefficient distribution (Bonferroni corrected P < 0.05) between Scanorama and scran MNN, and a double dagger (‡) indicates significance over Seurat CCA. t-SNE visualizations use a learning rate of 200 and a perplexity of 400. Box plot boxes extend from lower to upper quartiles with an orange line at the median and green triangle at the mean; whiskers indicate the range.
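
The 10% cutoff in (b) can be sketched as follows. The caption does not define the alignment score, so the definition used here (the larger of the two fractions of cells that take part in at least one match) is an assumption, as are the strict comparison against the cutoff, the function names and the toy numbers.

```python
def alignment_score(matches, n_cells_a, n_cells_b):
    """Assumed score: largest fraction of cells in either dataset that is matched.

    `matches` is a set of (i, j) index pairs between dataset A and dataset B.
    """
    matched_a = {i for i, _ in matches}
    matched_b = {j for _, j in matches}
    return max(len(matched_a) / n_cells_a, len(matched_b) / n_cells_b)


def passes_cutoff(score, cutoff=0.10):
    """True if a pair of datasets would be merged (strict '>' is an assumption)."""
    return score > cutoff


# A few spurious matches between unrelated datasets: 2 of 100 and 1 of 50 cells.
spurious = {(0, 0), (1, 0)}
spurious_score = alignment_score(spurious, n_cells_a=100, n_cells_b=50)

# Many matches between related datasets: 15 of 100 and 15 of 50 cells.
shared = {(i, i) for i in range(15)}
shared_score = alignment_score(shared, n_cells_a=100, n_cells_b=50)
```

Under these assumptions the spurious pair scores 0.02 and is left unmerged, while the related pair scores 0.30 and would be merged.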

Supplementary Fig. 15 Scanorama alignment scores reconstruct temporal relationships between datasets.

Blue nodes indicate datasets and gray edges make up the maximum spanning tree (MST) on the graph with Scanorama alignment scores as the edge weights. In (a) mouse dendritic cells stimulated with LPS over 6 hours and (c) human CD14+ monocytes stimulated with M-CSF over 6 days, MST edges perfectly correspond to the temporal ordering of the datasets and only connect replicate timepoints or adjacent timepoints. In (b) D. melanogaster brain cells over 50 days, most edges connect replicate or adjacent timepoints except for edges between 3 and 9 days, between 1 and 6 days, and between 6 and 15 days, possibly indicating greater transcriptional similarity at the midpoint of the time series.
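
A maximum spanning tree of this kind can be built with Kruskal's algorithm by visiting edges from the highest weight to the lowest. The sketch below is a generic implementation, not the authors' code, and the alignment scores in the example are invented purely to illustrate the idea.

```python
def maximum_spanning_tree(n_nodes, weighted_edges):
    """Kruskal's algorithm on edges given as (weight, u, v) tuples.

    Returns the tree as a list of (u, v, weight), heaviest edges first.
    """
    parent = list(range(n_nodes))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(weighted_edges, reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # adding this edge does not create a cycle
            parent[root_u] = root_v
            tree.append((u, v, weight))
    return tree


# Hypothetical alignment scores between four timepoints (nodes 0 to 3):
# adjacent timepoints score high, distant timepoints score low.
scores = [
    (0.9, 0, 1), (0.8, 1, 2), (0.7, 2, 3),
    (0.3, 0, 2), (0.2, 1, 3), (0.1, 0, 3),
]
mst = maximum_spanning_tree(4, scores)
mst_edges = {(u, v) for u, v, _ in mst}
```

With these made-up scores the tree contains only the edges between adjacent timepoints, which is the pattern described for panels (a) and (c).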

Cite this article

Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat Biotechnol 37, 685–691 (2019). https://doi.org/10.1038/s41587-019-0113-3

Received: 09 May 2018
Accepted: 08 March 2019
Published: 06 May 2019
Issue Date: June 2019
DOI: https://doi.org/10.1038/s41587-019-0113-3
