Single-cell technologies have made it possible to profile millions of cells, but for these resources to be useful they must be easy to query and access. To provide interactive and intuitive access to single-cell data, we have developed scfind, a single-cell analysis tool that enables fast searches for biologically or clinically relevant marker genes in cell atlases. Using transcriptome data from six mouse cell atlases, we show how scfind can be used to evaluate marker genes, perform in silico gating, and identify both cell-type-specific and housekeeping genes. Moreover, we have developed a subquery optimization routine to ensure that long and complex queries return meaningful results. To make scfind more user-friendly, we use indices of PubMed abstracts and techniques from natural language processing to allow for arbitrary queries. Finally, we show how scfind can be used for multi-omics analyses by combining single-cell ATAC-seq data with transcriptome data.
The index LinnarssonAtlas.rds for the BCA data from http://linnarssonlab.org/data/ can be downloaded from https://scfind.cog.sanger.ac.uk/indexes/LinnarssonAtlas.rds. The index mca.rds for the MCA data from https://figshare.com/s/865e694ad06d5857db4b can be downloaded from https://scfind.cog.sanger.ac.uk/indexes/mca.rds. The indices tm_10X.rds and tm_facs.rds for the TM, 10X and TM, FACS data from https://figshare.com/projects/Tabula_Muris_Transcriptomic_characterization_of_20_organs_and_tissues_from_Mus_musculus_at_single_cell_resolution/27733 can be downloaded from https://scfind.cog.sanger.ac.uk/indexes/tm_10X.rds and https://scfind.cog.sanger.ac.uk/indexes/tm_facs.rds, respectively. The index atacseq.rds for the sciATAC-seq data from http://atlas.gs.washington.edu/mouse-atac/data/ can be downloaded from https://scfind.cog.sanger.ac.uk/indexes/atacseq.rds. The source data underlying Figs. 1–3 and 5 are provided with this paper as a Source Data file.
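As a convenience, the index locations listed above can be collected in a small lookup table. The following is a minimal Python sketch, not part of scfind itself: the names INDEX_BASE, INDEX_FILES, index_url and download_index are hypothetical helpers introduced here for illustration, and only the URLs are taken from the statement above. The downloaded .rds files are serialized R objects, so this sketch only fetches them; reading them is left to R (for example with readRDS).

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL and file names copied from the data availability statement.
# The dictionary keys are the dataset labels used in that statement.
INDEX_BASE = "https://scfind.cog.sanger.ac.uk/indexes/"
INDEX_FILES = {
    "BCA": "LinnarssonAtlas.rds",
    "MCA": "mca.rds",
    "TM, 10X": "tm_10X.rds",
    "TM, FACS": "tm_facs.rds",
    "sciATAC-seq": "atacseq.rds",
}


def index_url(dataset: str) -> str:
    """Return the download URL of the index for a dataset label."""
    return INDEX_BASE + INDEX_FILES[dataset]


def download_index(dataset: str, dest_dir: str = "indexes") -> Path:
    """Hypothetical helper: download one index unless it is already on disk."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / INDEX_FILES[dataset]
    if not target.exists():
        # Fetch the file over HTTPS and write it to the target path.
        urlretrieve(index_url(dataset), str(target))
    return target


# Example (requires network access):
# path = download_index("MCA")
```

Skipping files that already exist keeps repeated runs cheap, which may matter because atlas-scale indices can be large.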
The code for scfind is available at https://github.com/hemberg-lab/scfind and the code for generating the figures in this manuscript is available at https://github.com/hemberg-lab/scfind-paper-figures. A Code Ocean capsule of the tool is provided (https://doi.org/10.24433/CO.2453077.v1; ref. 59).
Tabula Muris Consortium et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 562, 367–372 (2018).
Han, X. et al. Mapping the Mouse Cell Atlas by microwell-seq. Cell 172, 1091–1107.e17 (2018).
Saunders, A. et al. Molecular diversity and specializations among the cells of the adult mouse brain. Cell 174, 1015–1030.e16 (2018).
Zeisel, A. et al. Molecular architecture of the mouse nervous system. Cell 174, 999–1014.e22 (2018).
Cusanovich, D. A. et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell 174, 1309–1324.e18 (2018).
Cao, J. et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature 566, 496–502 (2019).
Regev, A. et al. The Human Cell Atlas. eLife 6, e27041 (2017).
Howick, V. M. et al. The Malaria Cell Atlas: single parasite transcriptomes across the complete Plasmodium life cycle. Science 365, eaaw2619 (2019).
MacArthur, J. et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res. 45, D896–D901 (2017).
The Gene Ontology Consortium. Expansion of the Gene Ontology knowledgebase and resources. Nucleic Acids Res. 45, D331–D338 (2017).
Sewell, W. Medical Subject Headings in MEDLARS. Bull. Med. Libr. Assoc. 52, 164–170 (1964).
Landrum, M. J. et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res. 44, D862–D868 (2016).
Cariaso, M. & Lennon, G. SNPedia: a wiki supporting personal genome annotation, interpretation and analysis. Nucleic Acids Res. 40, D1308–D1312 (2012).
Franzén, O., Gan, L.-M. & Björkegren, J. L. M. PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database 2019, baz046 (2019).
Haeussler, M. et al. The UCSC Genome Browser database: 2019 update. Nucleic Acids Res. 47, D853–D858 (2019).
Athar, A. et al. ArrayExpress update—from bulk to single-cell expression data. Nucleic Acids Res. 47, D711–D715 (2019).
Srivastava, D., Iyer, A., Kumar, V. & Sengupta, D. CellAtlasSearch: a scalable search engine for single cells. Nucleic Acids Res. 46, W141–W147 (2018).
Sato, K., Tsuyuzaki, K., Shimizu, K. & Nikaido, I. CellFishing.jl: an ultrafast and scalable cell search method for single-cell RNA sequencing. Genome Biol. 20, 31 (2019).
Vigna, S. Quasi-succinct indices. in Proc. Sixth ACM International Conference on Web Search and Data Mining—WSDM ’13 https://doi.org/10.1145/2433396.2433409 (ACM Press, 2013).
Tabula Muris Consortium et al. Single-cell transcriptomic characterization of 20 organs and tissues from individual mice creates a Tabula Muris. Nature 562, 367–372 (2018).
Golubovskaya, V. & Wu, L. Different subsets of T cells, memory, effector functions, and CAR-T immunotherapy. Cancers 8, 36 (2016).
Zhang, X. et al. CellMarker: a manually curated resource of cell markers in human and mouse. Nucleic Acids Res. 47, D721–D728 (2019).
Bausch-Fluck, D. et al. A mass spectrometric-derived cell surface protein atlas. PLoS ONE 10, e0121314 (2015).
Lambert, S. A. et al. The human transcription factors. Cell 172, 650–665 (2018).
Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457–D462 (2016).
Ju, W. et al. Defining cell-type specificity at the transcriptional level in human disease. Genome Res. 23, 1862–1873 (2013).
Eisenberg, E. & Levanon, E. Y. Human housekeeping genes, revisited. Trends Genet. 29, 569–574 (2013).
Han, J., Pei, J., Yin, Y. & Mao, R. Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Min. Knowl. Discov. 8, 53–87 (2004).
Sparck Jones, K. A statistical interpretation of term specificity and its application in retrieval. J. Doc. 28, 11–21 (1972).
Piccini, I., Rao, J., Seebohm, G. & Greber, B. Human pluripotent stem cell-derived cardiomyocytes: genome-wide expression profiling of long-term in vitro maturation in comparison to human heart tissue. Genom. Data 4, 69–72 (2015).
Sayers, E. W. et al. Database resources of the national center for biotechnology information. Nucleic Acids Res. 47, D23–D28 (2019).
Wei, C.-H., Kao, H.-Y. & Lu, Z. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Res. 41, W518–W522 (2013).
Pers, T. H. et al. Biological interpretation of genome-wide association studies using predicted gene functions. Nat. Commun. 6, 5890 (2015).
Manica, M., Mathis, R., Cadow, J. & Rodríguez Martínez, M. Context-specific interaction networks from vector representation of words. Nat. Mach. Intell. 1, 181–190 (2019).
Hastings, J. et al. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res. 41, D456–D463 (2013).
Kim, S., Yeganova, L., Comeau, D. C., Wilbur, W. J. & Lu, Z. PubMed phrases, an open set of coherent phrases for searching biomedical literature. Sci. Data 5, 180104 (2018).
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 26, 3111–3119 (2013).
Pyysalo, S., Ginter, F., Moen, H., Salakoski, T. & Ananiadou, S. Distributional semantics resources for biomedical text processing. In Proc. Languages in Biology and Medicine (LBM) 39–44 (2013).
Alfares, A. A. et al. Results of clinical genetic testing of 2,912 probands with hypertrophic cardiomyopathy: expanded panels offer limited additional sensitivity. Genet. Med. 17, 880–888 (2015).
Flavigny, J. et al. Identification of two novel mutations in the ventricular regulatory myosin light chain gene (MYL2) associated with familial and classical forms of hypertrophic cardiomyopathy. J. Mol. Med. 76, 208–214 (1998).
Hnisz, D. et al. Super-enhancers in the control of cell identity and disease. Cell 155, 934–947 (2013).
Parker, S. C. J. et al. Chromatin stretch enhancer states drive cell-specific gene regulation and harbor human disease risk variants. Proc. Natl Acad. Sci. USA 110, 17921–17926 (2013).
Khan, A. & Zhang, X. dbSUPER: a database of super-enhancers in mouse and human genome. Nucleic Acids Res. 44, D164–D171 (2016).
Joo, M. S., Koo, J. H., Kim, T. H., Kim, Y. S. & Kim, S. G. LRH1-driven transcription factor circuitry for hepatocyte identity: super-enhancer cistromic analysis. EBioMedicine 40, 488–503 (2019).
Thomas, G. D. et al. Deleting an Nr4a1 super-enhancer subdomain ablates Ly6Clow monocytes while preserving macrophage gene function. Immunity 45, 975–987 (2016).
Kleftogiannis, D., Kalnis, P. & Bajic, V. B. Progress and challenges in bioinformatics approaches for enhancer identification. Brief. Bioinforma. 17, 967–979 (2016).
Chen, S., Lake, B. B. & Zhang, K. High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 37, 1452–1457 (2019).
Bradshaw, A. D. & Sage, E. H. SPARC, a matricellular protein that functions in cellular differentiation and tissue response to injury. J. Clin. Invest. 107, 1049–1054 (2001).
Callaham, M. L., Wears, R. L., Weber, E. J., Barton, C. & Young, G. Positive-outcome bias and other limitations in the outcome of research abstracts submitted to a scientific meeting. JAMA 280, 254–257 (1998).
Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 273–282 (2019).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B. 57, 289–300 (1995).
McInnes, L., Healy, J., Saul, N. & Großberger, L. UMAP: uniform manifold approximation and projection. JOSS 3, 861 (2018).
Tian, L. et al. Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nat. Methods 16, 479–487 (2019).
Chazarra-Gil, R., Hemberg, M., Kiselev, V. Y. & van Dongen, S. Flexible comparison of batch correction methods for single-cell RNA-seq using BatchBench. Nucleic Acids Res. https://doi.org/10.1093/nar/gkab004 (2021).
Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e21 (2019).
Khan, A. et al. JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res. 46, D260–D266 (2018).
Tan, G. & Lenhard, B. TFBSTools: an R/bioconductor package for transcription factor binding site analysis. Bioinformatics 32, 1555–1556 (2016).
Lee, J. T. H., Patikas, N., Kiselev, V. Y. & Hemberg, M. Fast Searches of Large Collections of Single Cell Data Using scfind (Code Ocean, 2021); https://doi.org/10.24433/CO.2453077.v1
J.T.H.L., N.P., V.Y.K. and M.H. were supported by a core grant from the Wellcome Trust. J.T.H.L. was also supported by a grant ‘Search tools for scRNA-seq data’ (RR-4145) from the Chan Zuckerberg Initiative and N.P. was supported by UK Dementia Research Institute (DRI) grant RRZA/175. We thank members of the Hemberg group, Y. Liu, T. Bergmann and A. Meziani for assisting with beta testing of the software and L. Garcia-Alonso, V.J. Hall, A. Ori, S. Teichmann and R. Vento for feedback on the manuscript. We thank J. Eliasova for assistance with Fig. 1a.
The authors declare no competing interests.
Peer review information: Nature Methods thanks Qi Liu and Itoshi Nikaido for their contribution to the peer review of this work. Lin Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Discussion, Tables 1–12 and Supplementary Figs. 1–18.
Precision, recall and F1 scores for all cell types in the atlases considered.
Information about the total number of marker genes and the precision and F1 scores that they provide for each cell type.
Cell-type specificity for the genes found in the MCA and the two TM datasets.
Number of maximal marker genes for each cell type in the MCA and the two TM datasets.
Number of cell-type-specific genes for each cell type in the MCA and the two TM datasets.
Best matches of the TM, FACS dataset from queries generated by sample variants from the index created from PubTator.
Best matches of the TM, FACS dataset from queries generated by sample disease names and MeSH/OMIM IDs.
Best matches of the TM, FACS dataset from queries generated by sample chemical names and their corresponding IDs.
Best matches of the TM, FACS dataset from queries generated by sample phrases from the PubMed phrase dictionary.
Cell-type specificity of super enhancers.
Cell-type-specific enhancer-gene pairs.
Top 20 and 30 marker genes in the three batch correction methods.
About this article
Cite this article
Lee, J.T.H., Patikas, N., Kiselev, V.Y. et al. Fast searches of large collections of single-cell data using scfind. Nat Methods 18, 262–271 (2021). https://doi.org/10.1038/s41592-021-01076-9