Analysis

Integrating single-cell transcriptomic data across different conditions, technologies, and species

Abstract

Computational single-cell RNA-seq (scRNA-seq) methods have been successfully applied to experiments representing a single condition, technology, or species to discover and define cellular phenotypes. However, identifying subpopulations of cells that are present across multiple data sets remains challenging. Here, we introduce an analytical strategy for integrating scRNA-seq data sets based on common sources of variation, enabling the identification of shared populations across data sets and downstream comparative analysis. We apply this approach, implemented in our R toolkit Seurat (http://satijalab.org/seurat/), to align scRNA-seq data sets of peripheral blood mononuclear cells under resting and stimulated conditions, hematopoietic progenitors sequenced using two profiling technologies, and pancreatic cell 'atlases' generated from human and mouse islets. In each case, we learn distinct or transitional cell states jointly across data sets, while boosting statistical power through integrated analysis. Our approach facilitates general comparisons of scRNA-seq data sets, potentially deepening our understanding of how distinct cell states respond to perturbation, disease, and evolution.

Accessions

Primary accessions

ArrayExpress

Acknowledgements

We thank members of the Satija laboratory, as well as P. Roelli, M. Stoeckius, G. Fishell, C. Desplan, R. Bonneau, E. Macosko, and A. Corvelo for their valuable feedback, and F. Hamey, H.M. Kang, and J. Ye for assistance with published data sets. This work was supported by an NIH New Innovator Award (1DP2HG009623-01) and R01 (5R01MH071679-12) to R.S. and an NSF Graduate Fellowship (DGE1342536) to A.B.

Author information

Affiliations

  1. New York Genome Center, New York, New York, USA.

    • Andrew Butler
    • Paul Hoffman
    • Peter Smibert
    • Efthymia Papalexi
    • Rahul Satija
  2. Center for Genomics and Systems Biology, New York University, New York, New York, USA.

    • Andrew Butler
    • Efthymia Papalexi
    • Rahul Satija

Authors

  1. Andrew Butler

  2. Paul Hoffman

  3. Peter Smibert

  4. Efthymia Papalexi

  5. Rahul Satija

Contributions

A.B. and R.S. conceived the research. A.B., P.H., and R.S. implemented the alignment procedure, performed all data analysis, and wrote the manuscript. E.P. performed the PBMC validation experiments, and P.S. performed the ddSeq experiments.

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Rahul Satija.

Supplementary information

PDF files

  1. Supplementary Text and Figures

    Supplementary Figures 1–15

  2. Life Sciences Reporting Summary

Text files

  1. Supplementary Dataset 1

    Cell metadata for IFNB response analysis

  2. Supplementary Dataset 2

    Cell metadata for murine hematopoiesis analysis

  3. Supplementary Dataset 3

    Cell metadata for cross-species pancreatic islet analysis

Excel files

  1. Supplementary Dataset 4

    Summary of the data distributions and statistical details for the manuscript figures

Zip files

  1. Supplementary Software

    Source code and installation instructions for the software used in the described analyses