Fast, sensitive and accurate integration of single-cell data with Harmony

Abstract

The emerging diversity of single-cell RNA-seq datasets allows for the full transcriptional characterization of cell types across a wide variety of biological and clinical conditions. However, it is challenging to analyze them together, particularly when datasets are assayed with different technologies, because biological and technical differences are interspersed. We present Harmony (https://github.com/immunogenomics/harmony), an algorithm that projects cells into a shared embedding in which cells group by cell type rather than dataset-specific conditions. Harmony simultaneously accounts for multiple experimental and biological factors. In six analyses, we demonstrate that Harmony outperforms previously published algorithms while requiring fewer computational resources. Harmony enables the integration of ~10^6 cells on a personal computer. We apply Harmony to peripheral blood mononuclear cells from datasets with large experimental differences, five studies of pancreatic islet cells, mouse embryogenesis datasets and the integration of scRNA-seq with spatial transcriptomics data.

Fig. 1: Overview of Harmony algorithm.
Fig. 2: Quantitative assessment of dataset mixing and cell-type accuracy with cell-line datasets.
Fig. 3: Computational efficiency benchmarks. BBKNN, Scanorama, MNN Correct and MultiCCA are compared on five downsampled HCA datasets of increasing sizes.
Fig. 4: Fine-grained subpopulation identification in PBMCs across technologies.
Fig. 5: Integration of pancreatic islet cells by both donor and technology.
Fig. 6: Harmony integrates spatially resolved transcriptomic with dissociated scRNAseq datasets.

Data availability

All data analyzed in this article are publicly available through online sources. We included links to all data sources in Supplementary Table 8.

Code availability

Harmony and LISI are available as R packages on https://github.com/immunogenomics/harmony and https://github.com/immunogenomics/lisi. Scripts to reproduce results of the primary analyses will be made available on https://github.com/immunogenomics/harmony2019. Additionally, vignettes are included as Supplementary Notes. Supplementary Note 1 provides a detailed walkthrough of Harmony, connecting theoretical algorithm components to their code implementations. Supplementary Note 2 demonstrates the LISI metric and how to evaluate its statistical significance. Supplementary Note 1 uses Harmony with simulated datasets.
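
As a rough illustration of the idea behind LISI (the Local Inverse Simpson's Index), the sketch below computes an inverse Simpson's index, 1 / sum(p_i^2), over the dataset labels found in each cell's nearest-neighbour set. It is a simplified toy under stated assumptions and not the LISI package's implementation: the function names `inverse_simpson` and `local_inverse_simpson` are hypothetical, neighbours are found by brute-force Euclidean distance, every neighbour is weighted equally, and each cell counts as its own nearest neighbour.

```python
from collections import Counter
import math


def inverse_simpson(labels):
    """Inverse Simpson's index of a list of labels: 1 / sum_i p_i^2."""
    counts = Counter(labels)
    n = len(labels)
    return 1.0 / sum((c / n) ** 2 for c in counts.values())


def local_inverse_simpson(points, labels, k):
    """Per-point inverse Simpson's index over the k nearest points.

    Assumptions (illustrative only): brute-force Euclidean neighbours,
    equal weight per neighbour, and the point itself is included
    in its own neighbourhood (distance 0).
    """
    scores = []
    for p in points:
        # Indices of all points, ordered by distance to p.
        order = sorted(range(len(points)), key=lambda j: math.dist(p, points[j]))
        scores.append(inverse_simpson([labels[j] for j in order[:k]]))
    return scores


# Toy embeddings: two datasets ("A", "B") that are either separated or interleaved.
separated = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
separated_labels = ["A", "A", "B", "B"]
mixed = [(0.0, 0.0), (0.0, 0.1), (5.0, 5.0), (5.0, 5.1)]
mixed_labels = ["A", "B", "A", "B"]

print(local_inverse_simpson(separated, separated_labels, k=2))
print(local_inverse_simpson(mixed, mixed_labels, k=2))
```

Under these assumptions, a score of 1 means a neighbourhood contains a single dataset, and the score rises toward the number of datasets as neighbourhoods become evenly mixed. The same function applied to cell-type labels instead of dataset labels would reward the opposite behaviour, with values near 1 indicating that neighbourhoods stay within one cell type.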

Acknowledgements

This work was supported in part by funding from the National Institutes of Health (grant nos. UH2AR067677, U19AI111224 and 1R01AR063759 (to S.R.) and T32 AR007530-31 (to I.K.)). We thank members of the Raychaudhuri and Brenner labs for comments and discussion. I.K. and K.W. were funded as part of a collaborative research agreement with F. Hoffmann-La Roche Ltd (Basel, Switzerland), to S.R. and M.B.B.

Author information

S.R. and I.K. conceived the research. I.K. led computational work under the guidance of S.R., assisted by N.M., P.L., J.F. and K.S. All authors participated in interpretation and writing the manuscript.

Correspondence to Soumya Raychaudhuri.

Ethics declarations

Competing interests

I.K. does paid bioinformatics consulting through Brilyant LLC.

Additional information

Peer review information: Nicole Rusk was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–19.

Reporting Summary

Supplementary Software 1

Harmony R package. Software to perform Harmony integration analysis.

Supplementary Software 2

LISI R package. Software to compute the Local Inverse Simpson’s Index.

Supplementary Tables 1–8

Jurkat LISI, Time benchmark, Memory Benchmark, HCA LISI, PBMC LISI, Inhibitory, Excitatory, Data Sources.

Cite this article

Korsunsky, I., Millard, N., Fan, J. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods 16, 1289–1296 (2019) doi:10.1038/s41592-019-0619-0
