Computational single-cell RNA-seq (scRNA-seq) methods have been successfully applied to experiments representing a single condition, technology, or species to discover and define cellular phenotypes. However, identifying subpopulations of cells that are present across multiple data sets remains challenging. Here, we introduce an analytical strategy for integrating scRNA-seq data sets based on common sources of variation, enabling the identification of shared populations across data sets and downstream comparative analysis. We apply this approach, implemented in our R toolkit Seurat (http://satijalab.org/seurat/), to align scRNA-seq data sets of peripheral blood mononuclear cells under resting and stimulated conditions, hematopoietic progenitors sequenced using two profiling technologies, and pancreatic cell 'atlases' generated from human and mouse islets. In each case, we learn distinct or transitional cell states jointly across data sets, while boosting statistical power through integrated analysis. Our approach facilitates general comparisons of scRNA-seq data sets, potentially deepening our understanding of how distinct cell states respond to perturbation, disease, and evolution.
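The abstract does not spell out how "common sources of variation" are found. As a rough illustration only, the sketch below assumes a simplified form of canonical correlation analysis: two gene-by-cell matrices that share the same genes are standardized per cell, and the leading pair of singular vectors of their cross-product is taken as a shared axis. The function names, the toy expression profiles, and the power-iteration solver are all hypothetical choices made for this example. It is not the Seurat implementation.

```python
import math
import random


def standardize_columns(matrix):
    """Center each column (cell) to mean 0 and scale it to unit variance."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        col = [matrix[i][j] for i in range(n_rows)]
        mean = sum(col) / n_rows
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (n_rows - 1)) or 1.0
        for i in range(n_rows):
            out[i][j] = (col[i] - mean) / sd
    return out


def normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def first_canonical_pair(x, y, n_iter=200, seed=0):
    """Leading singular vectors (u, v) of X^T Y, found by power iteration.

    x is genes x cells_1 and y is genes x cells_2, with identical gene order.
    Returns one weight per cell for each data set.
    """
    rng = random.Random(seed)
    n_genes, n1, n2 = len(x), len(x[0]), len(y[0])
    v = normalize([rng.gauss(0, 1) for _ in range(n2)])
    u = [0.0] * n1
    for _ in range(n_iter):
        # u <- X^T (Y v)
        yv = [sum(y[g][j] * v[j] for j in range(n2)) for g in range(n_genes)]
        u = normalize([sum(x[g][i] * yv[g] for g in range(n_genes)) for i in range(n1)])
        # v <- Y^T (X u)
        xu = [sum(x[g][i] * u[i] for i in range(n1)) for g in range(n_genes)]
        v = normalize([sum(y[g][j] * xu[g] for g in range(n_genes)) for j in range(n2)])
    return u, v


def make_cells(labels, scale, offset, rng):
    """Toy genes x cells matrix: two made-up cell types with opposite programs."""
    profiles = {"A": [5, 5, 5, 1, 1, 1], "B": [1, 1, 1, 5, 5, 5]}
    return [[scale * (profiles[lab][g] + rng.gauss(0, 0.2)) + offset for lab in labels]
            for g in range(6)]


rng = random.Random(1)
labels_1 = ["A", "A", "B", "B"]
labels_2 = ["B", "A", "A", "B", "A"]
# The second data set gets a different scale and offset, a crude stand-in
# for a change of technology or condition (an assumption of this toy setup).
data_1 = standardize_columns(make_cells(labels_1, 1.0, 0.0, rng))
data_2 = standardize_columns(make_cells(labels_2, 3.0, 5.0, rng))

u, v = first_canonical_pair(data_1, data_2)
for name, labels, weights in (("set 1", labels_1, u), ("set 2", labels_2, v)):
    print(name, [(lab, round(w, 2)) for lab, w in zip(labels, weights)])
```

In this toy setup, cells of the same type receive weights of the same sign in both data sets even though the second set is shifted and rescaled, which is the sense in which a shared axis can line up populations across data sets. Real data would need gene selection, more than one component, and a proper solver.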
We thank members of the Satija laboratory, as well as P. Roelli, M. Stoeckius, G. Fishell, C. Desplan, R. Bonneau, E. Macosko, and A. Corvelo for their valuable feedback, and F. Hamey, H.M. Kang, and J. Ye for assistance with published data sets. This work was supported by an NIH New Innovator Award (1DP2HG009623-01) and R01 (5R01MH071679-12) to R.S. and an NSF Graduate Fellowship (DGE1342536) to A.B.
The authors declare no competing financial interests.
Supplementary Figures 1–15 (PDF 8713 kb)
Cell metadata for IFNB response analysis (TXT 681 kb)
Cell metadata for murine hematopoiesis analysis (TXT 83 kb)
Cell metadata for cross-species pancreatic islet analysis (TXT 545 kb)
This table contains a summary of the data distributions and statistical details related to the manuscript figures (XLSX 32 kb)
Source code and installation instructions for software used in described analyses (ZIP 810 kb)
About this article
Cite this article
Butler, A., Hoffman, P., Smibert, P. et al. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol 36, 411–420 (2018). https://doi.org/10.1038/nbt.4096