MARS: discovering novel cell types across heterogeneous single-cell experiments

Brbić, Maria; Zitnik, Marinka; Wang, Sheng; Pisco, Angela O.; Altman, Russ B.; Darmanis, Spyros; Leskovec, Jure

doi:10.1038/s41592-020-00979-3

Article
Published: 19 October 2020

Nature Methods volume 17, pages 1200–1206 (2020)

Abstract

Although tremendous effort has been put into cell-type annotation, identification of previously uncharacterized cell types in heterogeneous single-cell RNA-seq data remains a challenge. Here we present MARS, a meta-learning approach for identifying and annotating known as well as new cell types. MARS overcomes the heterogeneity of cell types by transferring latent cell representations across multiple datasets. MARS uses deep learning to learn a cell embedding function as well as a set of landmarks in the cell embedding space. The method has a unique ability to discover cell types that have never been seen before and annotate experiments that are as yet unannotated. We apply MARS to a large mouse cell atlas and show its ability to accurately identify cell types, even when it has never seen them before. Further, MARS automatically generates interpretable names for new cell types by probabilistically defining a cell type in the embedding space.
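The landmark idea in the abstract can be illustrated with a minimal sketch. It assumes each cell already has an embedding vector and each candidate cell type has one landmark, and it converts distances to the landmarks into soft assignment probabilities with a softmax over negative squared Euclidean distances. The helper name `assign_to_landmarks`, the toy coordinates and the softmax form are illustrative assumptions, not the published MARS implementation.

```python
import math


def assign_to_landmarks(embeddings, landmarks):
    """Return (label, probabilities) per cell. Hypothetical helper, not MARS code."""
    results = []
    for cell in embeddings:
        # Squared Euclidean distance from this cell to every landmark.
        d2 = [sum((x - y) ** 2 for x, y in zip(cell, lm)) for lm in landmarks]
        # Softmax over negative distances, shifted by the minimum for stability.
        nearest = min(d2)
        weights = [math.exp(-(d - nearest)) for d in d2]
        total = sum(weights)
        probs = [w / total for w in weights]
        # Hard label: index of the nearest landmark.
        results.append((d2.index(nearest), probs))
    return results


# Toy 2-D embeddings and two landmarks (made-up numbers for illustration).
cells = [[0.1, 0.0], [0.9, 1.1], [0.2, 0.3]]
landmarks = [[0.0, 0.0], [1.0, 1.0]]
assignments = assign_to_landmarks(cells, landmarks)
labels = [label for label, _ in assignments]
```

In this toy setting the first and third cells fall to landmark 0 and the second to landmark 1; the per-cell probabilities give a rough sense of how confidently a cell sits near its landmark.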

**Fig. 1: MARS is a meta-learning approach for discovery of new cell types across heterogeneous single-cell experiments.**

**Fig. 2: MARS achieves positive learning transfer and accurately annotates cells.**

**Fig. 3: MARS accurately identifies cell types, even when tissues have no cell types in common, and automatically generates interpretable names for new cell types.**

Data availability

The Tabula Muris Senis dataset is publicly available at https://figshare.com/projects/Tabula_Muris_Senis/64982. The Tabula Muris dataset is publicly available at https://doi.org/10.6084/m9.figshare.5829687.v8. We retrieved data from the website on 2 November 2019. We made Tabula Muris and Tabula Muris Senis datasets in h5ad format available at https://snap.stanford.edu/mars/data/tms-facs-mars.tar.gz. The Pollen dataset⁴⁰ is available in the NCBI Sequence Read Archive under accession number SRP041736. Kolodziejczyk⁴¹ sequencing data are available in the ArrayExpress database under accession number E-MTAB-2600. CellBench³⁷ and Allen Brain datasets³⁹ were downloaded from https://doi.org/10.5281/zenodo.3357167. Originally, the three brain datasets, Allen Mouse Brain (AMB), VISp and ALM, were from the Allen Institute Brain Atlas (http://celltypes.brain-map.org/rnaseq) and are available under accession number GSE115746. The CellBench 10X dataset is available under accession number GSM3618014, while the CellBench CEL-Seq2 dataset is from three datasets (GSM3618022, GSM3618023, GSM3618024). The project website with links to data and code can be accessed at http://snap.stanford.edu/mars/.

Code availability

MARS is written in Python using the PyTorch library. The source code is available on GitHub at https://github.com/snap-stanford/mars.

References

Park, J. et al. Single-cell transcriptomics of the mouse kidney reveals potential cellular targets of kidney disease. Science 360, 758–763 (2018).
McKenna, A. & Gagnon, J. A. Recording development with single cell dynamic lineage tracing. Development 146, dev169730 (2019).
Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Spatial reconstruction of single-cell gene expression data. Nat. Biotech. 33, 495–502 (2015).
Montoro, D. T. et al. A revised airway epithelial hierarchy includes CFTR-expressing ionocytes. Nature 560, 319–324 (2018).
Han, X. et al. Mapping the mouse cell atlas by Microwell-seq. Cell 172, 1091–1107 (2018).
Schaum, N. et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris: the Tabula Muris consortium. Nature 562, 367–372 (2018).
Regev, A. et al. Science forum: the Human Cell Atlas. eLife 6, e27041 (2017).
Plasschaert, L. W. et al. A single-cell atlas of the airway epithelium reveals the CFTR-rich pulmonary ionocyte. Nature 560, 377–381 (2018).
Suo, S. et al. Revealing the critical regulators of cell identity in the mouse cell atlas. Cell Rep. 25, 1436–1445 (2018).
Aevermann, B. D. et al. Cell type discovery using single-cell transcriptomics: implications for ontological representation. Hum. Mol. Genet. 27, R40–R47 (2018).
Wu, H., Kirita, Y., Donnelly, E. L. & Humphreys, B. D. Advantages of single-nucleus over single-cell RNA sequencing of adult kidney: rare cell types and novel cell states revealed in fibrosis. J. Am. Soc. Nephrol. 30, 23–32 (2019).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Ding, J., Condon, A. & Shah, S. P. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat. Commun. 9, 2002 (2018).
Wang, J. et al. Data denoising with transfer learning in single-cell transcriptomics. Nat. Methods 16, 875–878 (2019).
Amodio, M. et al. Exploring single-cell data with deep multitasking neural networks. Nat. Methods 16, 1139–1145 (2019).
Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Single-cell RNA-seq denoising using a deep count autoencoder. Nat. Commun. 10, 390 (2019).
Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
Tyssowski, K. M. et al. Different neuronal activity patterns induce different gene expression programs. Neuron 98, 530–546 (2018).
Kotliar, D. et al. Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-Seq. eLife 8, e43803 (2019).
Pliner, H. A., Shendure, J. & Trapnell, C. Supervised classification enables rapid annotation of cell atlases. Nat. Methods 16, 983–986 (2019).
Xu, C. et al. Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Preprint at bioRxiv https://doi.org/10.1101/532895 (2020).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).
Wang, T. et al. BERMUDA: a novel deep transfer learning method for single-cell RNA sequencing batch correction reveals hidden high-resolution cellular subtypes. Genome Biol. 20, 1–15 (2019).
Schmidhuber, J. Evolutionary Principles in Self-referential Learning. Diploma thesis, Technische Univ. München (1987).
Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D. & Lillicrap, T. Meta-learning with memory-augmented neural networks. In Proc. International Conference on Machine Learning 33 (eds Balcan, M. F. et al.), 1842–1850 (PMLR, 2016).
Snell, J., Swersky, K. & Zemel, R. Prototypical networks for few-shot learning. Proc Adv. Neural Inform. Proc. Syst. 31 (eds Guyon, I. et al.), 4077–4087 (Curran Associates, 2017).
Finn, C., Abbeel, P. & Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proc. International Conference on Machine Learning 34 (eds Precup, D. et al.) 1126–1135 (PMLR, 2017).
The Tabula Muris Consortium. A single cell transcriptomic atlas characterizes aging tissues in the mouse. Nature 583, 590–595 (2020).
Albright, J. W. & Albright, J. F. Age-associated impairment of murine natural killer activity. Proc. Natl Acad. Sci. USA 80, 6371–6375 (1983).
Nogusa, S., Ritz, B. W., Kassim, S. H., Jennings, S. R. & Gardner, E. M. Characterization of age-related changes in natural killer cells during primary influenza infection in mice. Mech. Ageing Dev. 129, 223–230 (2008).
Nair, S., Fang, M. & Sigal, L. J. The natural killer cell dysfunction of aged mice is due to the bone marrow stroma and is not restored by IL-15/IL-15Rα treatment. Aging Cell 14, 180–190 (2015).
Wang, B., Zhu, J., Pierson, E., Ramazzotti, D. & Batzoglou, S. Visualization and analysis of single-cell RNA-Seq data by kernel-based similarity learning. Nat. Methods 14, 414–416 (2017).
Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 5233 (2019).
Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech.: Theory Exp. 2008, P10008 (2008).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotech. 36, 411–420 (2018).
Tian, L. et al. Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nat. Methods 16, 479–487 (2019).
Abdelaal, T. et al. A comparison of automatic cell identification methods for single-cell RNA sequencing data. Genome Biol. 20, 194 (2019).
Tasic, B. et al. Shared and distinct transcriptomic cell types across neocortical areas. Nature 563, 72–78 (2018).
Pollen, A. A. et al. Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat. Biotech. 32, 1053–1058 (2014).
Kolodziejczyk, A. A. et al. Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation. Cell Stem Cell 17, 471–485 (2015).
McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection. J. Open Source Softw. 3, 861 (2018).
Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000).
Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat. Biotech. 37, 685–691 (2019).
Haniffa, M. A., Collin, M. P., Buckley, C. D. & Dazzi, F. Mesenchymal stem cells: the fibroblasts’ new clothes? Haematologica 94, 258–263 (2009).
Hematti, P. Mesenchymal stromal cells and fibroblasts: a case of mistaken identity? Cytotherapy 14, 516–521 (2012).
Klopfenstein, D. et al. GOATOOLS: a Python library for Gene Ontology analyses. Sci. Rep. 8, 10872 (2018).

Acknowledgements

We gratefully acknowledge the support of DARPA under nos. FA865018C7880 (ASED) and N660011924033 (MCS); ARO under nos. W911NF-16-1-0342 (MURI) and W911NF-16-1-0171 (DURIP); the National Science Foundation (NSF) under nos. OAC-1835598 (CINES), OAC-1934578 (HDR), CCF-1918940 (Expeditions) and IIS-2030477 (RAPID); the Stanford Data Science Initiative, Wu Tsai Neurosciences Institute, Chan Zuckerberg Biohub, Amazon, Boeing, Chase, Docomo, Hitachi, JD.com, NVIDIA, Dell. J.L. is a Chan Zuckerberg Biohub investigator. M.Z. is supported, in part, by NSF grant nos. IIS-2030459 and IIS-2033384, and by the Harvard Data Science Initiative. A.O.P. is supported by CZ Biohub. R.B.A. is supported by CZ Biohub and grant no. NIH GM102365.

Author information

Authors and Affiliations

Department of Computer Science, Stanford University, Stanford, CA, USA
Maria Brbić & Jure Leskovec
Department of Biomedical Informatics, Harvard University, Boston, MA, USA
Marinka Zitnik
Department of Bioengineering, Stanford University, Stanford, CA, USA
Sheng Wang & Russ B. Altman
Chan Zuckerberg Biohub, San Francisco, CA, USA
Angela O. Pisco, Russ B. Altman, Spyros Darmanis & Jure Leskovec

Contributions

M.B., M.Z. and J.L. conceived the study, designed and performed research, contributed new analytical tools, analyzed data and wrote the manuscript. M.B. also developed the software, performed experiments and developed the metrics. S.W. discussed the results and contributed to the writing. A.O.P. helped procure and interpret the datasets. S.D. and R.B.A. supervised research and contributed to the writing. J.L. supervised the research and the entire project.

Corresponding author

Correspondence to Jure Leskovec.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Lin Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Brown adipose tissue embedding using PCA.

Joint low-dimensional embedding of brown adipose tissue (BAT) cell types during the life span of a mouse obtained using PCA. We performed PCA using 100 components, corresponding to the dimensionality of MARS’s low-dimensional embeddings. In contrast to the MARS embedding space, NK cells do not change their position across different time points and are joined with the T-cells.

Extended Data Fig. 2 MARS outperforms other baselines and it is robust to embedding dimension.

Median performance of MARS and baseline methods evaluated using (a) adjusted mutual information, (b) accuracy, (c) macro-F1-score, (d) macro-precision, and (e) macro-recall. For Leiden³³ and Louvain³⁴ we report adjusted mutual information and accuracy (Supplementary Note 3). Median is calculated across 21 different tissues. Error bars are standard errors estimated as a standard deviation of the mean by bootstrapping cells within tissue with n = 20 iterations. f, Median performance of MARS and K-means clustering applied in the latent space of the autoencoder at the end of the MARS pretraining. ARI stands for adjusted Rand index, F1 for macro-F1 score, and AMI for adjusted mutual information. Median is calculated across 21 different tissues. Error bars are standard errors estimated as a standard deviation of the mean by bootstrapping cells within tissue with n = 20 iterations. g, Performance of MARS when varying the number of neurons in the last layer of the neural network, which corresponds to the dimension of the learned low-dimensional cell representation. Distribution is estimated with n = 20 runs of the method with different initial random seeds.
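The bootstrapped error bars described in this caption can be sketched as follows. This is a minimal illustration that assumes accuracy as the metric, resamples cells with replacement 20 times, and takes the standard deviation of the resampled scores as the standard error. The function names, the toy labels and the fixed seed are assumptions for illustration and are not taken from the MARS code base.

```python
import random
import statistics


def accuracy(true_labels, predicted_labels):
    """Fraction of cells whose predicted label matches the annotation."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


def bootstrap_standard_error(true_labels, predicted_labels, n_iterations=20, seed=0):
    """Standard deviation of the metric over bootstrap resamples of the cells."""
    rng = random.Random(seed)
    n_cells = len(true_labels)
    scores = []
    for _ in range(n_iterations):
        # Resample cell indices with replacement.
        idx = [rng.randrange(n_cells) for _ in range(n_cells)]
        scores.append(accuracy([true_labels[i] for i in idx],
                               [predicted_labels[i] for i in idx]))
    return statistics.stdev(scores)


# Toy annotations for eight cells (made-up labels).
true_labels = ["B", "B", "T", "T", "NK", "NK", "T", "B"]
predicted = ["B", "B", "T", "NK", "NK", "NK", "T", "T"]
point_estimate = accuracy(true_labels, predicted)
standard_error = bootstrap_standard_error(true_labels, predicted)
```

The same resampling loop would apply to any of the other metrics in the caption by swapping the `accuracy` function.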

Extended Data Fig. 3 Cell-type level performance.

Cell-type level comparison of MARS’s F1-score with SIMLR³² on (a) cell types that appear in only one tissue grouped by the number of differentially expressed genes in a cell type, (b) all cell types grouped by the number of differentially expressed genes in a cell type, and (c) all cell types grouped by the number of cells in a cell type. Standard errors are estimated as a standard deviation of the mean by bootstrapping cells within each tissue with n = 20 iterations. The number of differentially expressed genes is calculated using the Tabula Muris annotations by taking all genes with Benjamini-Hochberg FDR-adjusted p-value < 0.01 (two-tailed t-test).
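The Benjamini-Hochberg adjustment mentioned in this caption can be sketched in a few lines. The code below is the standard step-up procedure applied to made-up p-values; the function name and the example values are illustrative assumptions, and the 0.01 cutoff is the threshold stated in the caption.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up per-gene p-values from a hypothetical two-tailed t-test.
p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
adjusted = benjamini_hochberg(p_values)
# Genes counted as differentially expressed at the 0.01 threshold.
significant = [q < 0.01 for q in adjusted]
```

With these toy values only the first gene passes the threshold, since its adjusted p-value is 0.005 while the second rises to 0.02.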

Extended Data Fig. 4 Tissue-level performance.

Comparison of MARS’s performance on individual tissues with the baseline methods. Performance is measured as adjusted Rand index score. a, Across all tissues, MARS achieves 34.3% higher area under the curve compared to SIMLR, and 44.3% higher compared to the scVI baseline. For each method, tissues are ranked in the decreasing order of the achieved score. b, Comparison with the second best performing method SIMLR³² on individual tissues. MARS significantly outperforms SIMLR (p = 1 × 10⁻⁴; two-tailed Wilcoxon signed-rank test). Tissues are ranked according to MARS’s ARI score. Performance is measured in a single run for both methods.
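Because the adjusted Rand index is the main score in these comparisons, a small reference computation may help. The sketch below implements the standard pair-counting formula from a contingency table of annotated versus predicted labels; the function name and toy labels are illustrative assumptions, and it assumes at least two cells.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells (n >= 2)."""
    n = len(labels_true)
    # Contingency counts and marginal cluster sizes.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())
    # Expected agreement under random labeling with fixed marginals.
    expected = sum_true * sum_pred / comb(n, 2)
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # Degenerate case, e.g. both labelings are one cluster.
    return (sum_pairs - expected) / (maximum - expected)


# Identical partitions with permuted cluster names score 1.0.
perfect = adjusted_rand_index([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0])
# A labeling that splits every true cluster scores below zero.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The score does not depend on how clusters are named, which is why it suits settings where predicted clusters have no predefined correspondence to annotated cell types.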

Extended Data Fig. 5 Performance on other datasets.

Mean performance of MARS and four baseline methods evaluated using adjusted Rand index (Adj-Rand), F1-score (F1) and adjusted mutual information (Adj-MI) on (a) two CellBench datasets³⁷,³⁸, (b) Pollen⁴⁰ and Kolodziejczyk clustering benchmark datasets⁴¹, and (c) three Allen Brain datasets³⁸,³⁹. For all metrics, a higher value indicates better performance. MARS is trained in a leave-one-dataset-out manner, and the held-out dataset was completely unannotated.

Extended Data Fig. 6 Positive knowledge transfer on heart and mesenteric fat tissues.

Effect of the number of annotated tissues in the meta-dataset on MARS’s performance when using (a) heart tissue as the unannotated experiment, and (b) mesenteric fat as the unannotated experiment. Performance is measured as average adjusted Rand index across 20 runs of the method. Error bands are confidence intervals (95%) determined across 20 runs.

Extended Data Fig. 7 Embeddings after pretraining step.

UMAP visualizations of embeddings obtained with the MARS autoencoder (left), and the final MARS model (right) on (a) diaphragm tissue, and (b) liver tissue. Color indicates Tabula Muris cell type annotations. Autoencoder embeddings are obtained at the end of the MARS pretraining. Network parameters of the encoder and cluster centers from the K-means clustering are used to initialize the MARS network and landmarks, respectively. MARS then learns new embeddings and new landmarks. SMS stands for skeletal muscle cell, MS for mesenchymal stem cell, HS for hepatic sinusoid, and MNKTC for mature NK T-cell.

Extended Data Fig. 8 MARS discovers cell subtypes.

a,b, UMAP visualization of MARS’s embedding of mammary gland tissue cells. Colors indicate (a) Tabula Muris cell type annotations according to Cell Ontology class, and (b) free annotations in Tabula Muris that provide additional cell type resolution. Separation of cells labeled as luminal epithelial cells into two different clusters agrees perfectly with the free annotations, and the separate cluster found by MARS is labeled as luminal progenitor cells. MARS also correctly assigns one basal cell misannotated as a luminal epithelial cell by Cell Ontology class annotations. c,d, UMAP visualization of MARS’s embedding of subtypes of (c) basal cells of epidermis, and (d) dendritic cells. Colors indicate free annotations in Tabula Muris. We use all tissues as annotated experiments except the ones in which basal cells of epidermis or dendritic cells appear, and test MARS’s ability to separate subtypes of these cell types. Clusters discovered by MARS perfectly agree with the free annotations.

Extended Data Fig. 9 Alignment of B cells.

Using MARS, B-cells in Tabula Muris data are extremely well aligned across 11 different tissues, including brown adipose tissue, diaphragm, gonadal fat, heart, kidney, limb muscle, lung, liver, mesenteric fat, subcutaneous fat, and spleen. Limb muscle is used as an unannotated tissue.

Extended Data Fig. 10 Robustness to hyperparameters.

MARS’s performance when varying (a) regularizer λ, (b) regularizer τ, and (c) number of epochs. Performance is measured as average adjusted Rand index score. Average is calculated over all tissues by including each tissue as an unannotated dataset and using other tissues as annotated experiments. Error bars are standard errors estimated as a standard deviation of the mean by bootstrapping cells within tissue with n = 20 iterations. For each value, we train MARS with all other parameters fixed.

Supplementary information

Supplementary Information

Supplementary Notes 1–4.

Reporting Summary

About this article

Cite this article

Brbić, M., Zitnik, M., Wang, S. et al. MARS: discovering novel cell types across heterogeneous single-cell experiments. Nat Methods 17, 1200–1206 (2020). https://doi.org/10.1038/s41592-020-00979-3

Received: 21 February 2020
Accepted: 15 September 2020
Published: 19 October 2020
Issue Date: December 2020
DOI: https://doi.org/10.1038/s41592-020-00979-3
