Fast searches of large collections of single-cell data using scfind

Abstract

Single-cell technologies have made it possible to profile millions of cells, but for these resources to be useful they must be easy to query and access. To facilitate interactive and intuitive access to single-cell data we have developed scfind, a single-cell analysis tool that facilitates fast search of biologically or clinically relevant marker genes in cell atlases. Using transcriptome data from six mouse cell atlases, we show how scfind can be used to evaluate marker genes, perform in silico gating, and identify both cell-type-specific and housekeeping genes. Moreover, we have developed a subquery optimization routine to ensure that long and complex queries return meaningful results. To make scfind more user friendly, we use indices of PubMed abstracts and techniques from natural language processing to allow for arbitrary queries. Finally, we show how scfind can be used for multi-omics analyses by combining single-cell ATAC-seq data with transcriptome data.
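As a rough illustration of the kind of marker-gene query described above, the sketch below assumes a plain, uncompressed inverted index that maps each gene to the set of cells in which it is detected. It is not scfind's implementation or API: the function names (`build_index`, `find_cells`, `count_by_cell_type`) and the toy data are hypothetical, and the real tool works on compressed indices of full atlases.

```python
from collections import defaultdict


def build_index(expression):
    """Map each gene to the set of cell IDs in which it is detected.

    `expression` maps cell ID -> (cell type, set of expressed genes).
    """
    index = defaultdict(set)
    for cell_id, (_cell_type, genes) in expression.items():
        for gene in genes:
            index[gene].add(cell_id)
    return index


def find_cells(index, genes):
    """Return the cells that express every gene in the query (logical AND)."""
    if not genes:
        return set()
    # .get avoids creating empty entries for genes absent from the index.
    hits = [index.get(gene, set()) for gene in genes]
    return set.intersection(*hits)


def count_by_cell_type(expression, cells):
    """Summarize a set of matching cells by their annotated cell type."""
    counts = defaultdict(int)
    for cell_id in cells:
        counts[expression[cell_id][0]] += 1
    return dict(counts)


# Toy data (hypothetical): three cells with annotated types and detected genes.
cells = {
    "c1": ("T cell", {"Cd3e", "Cd4", "Actb"}),
    "c2": ("T cell", {"Cd3e", "Cd8a", "Actb"}),
    "c3": ("B cell", {"Cd19", "Actb"}),
}
index = build_index(cells)
hits = find_cells(index, ["Cd3e", "Cd4"])
```

In this toy setting, a gene detected in every cell (here `Actb`) behaves like a housekeeping gene, while a query such as `["Cd3e", "Cd4"]` narrows the result to a subset of one cell type, loosely analogous to in silico gating.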

Fig. 1: Compression and search times.
Fig. 2: Basic search, identification of cell-type-specific and housekeeping genes.
Fig. 3: Subquery optimization.
Fig. 4: Free text searches and visualization of results.
Fig. 5: Determining cell-type specificity of distal enhancers using RNA-seq and ATAC-seq data.

Data availability

The index LinnarssonAtlas.rds for the BCA data from http://linnarssonlab.org/data/ can be downloaded from https://scfind.cog.sanger.ac.uk/indexes/LinnarssonAtlas.rds. The index mca.rds for the MCA data from https://figshare.com/s/865e694ad06d5857db4b can be downloaded from https://scfind.cog.sanger.ac.uk/indexes/mca.rds. The indexes for the TM, 10X and TM, FACS data from https://figshare.com/projects/Tabula_Muris_Transcriptomic_characterization_of_20_organs_and_tissues_from_Mus_musculus_at_single_cell_resolution/27733 can be downloaded from https://scfind.cog.sanger.ac.uk/indexes/tm_10X.rds and https://scfind.cog.sanger.ac.uk/indexes/tm_facs.rds, respectively. The index atacseq.rds for the sciATAC-seq data from http://atlas.gs.washington.edu/mouse-atac/data/ can be downloaded from https://scfind.cog.sanger.ac.uk/indexes/atacseq.rds. The source data underlying Figs. 1–3 and 5 are provided with this paper as Source Data files.

Code availability

The code for scfind is available at https://github.com/hemberg-lab/scfind, and the code for generating the figures in this manuscript is available at https://github.com/hemberg-lab/scfind-paper-figures. A Code Ocean capsule of the tool is provided (https://doi.org/10.24433/CO.2453077.v1; ref. 59).

References

  1. Tabula Muris Consortium et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 562, 367–372 (2018).

  2. Han, X. et al. Mapping the Mouse Cell Atlas by microwell-seq. Cell 172, 1091–1107.e17 (2018).

  3. Saunders, A. et al. Molecular diversity and specializations among the cells of the adult mouse brain. Cell 174, 1015–1030.e16 (2018).

  4. Zeisel, A. et al. Molecular architecture of the mouse nervous system. Cell 174, 999–1014.e22 (2018).

  5. Cusanovich, D. A. et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell 174, 1309–1324.e18 (2018).

  6. Cao, J. et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature 566, 496–502 (2019).

  7. Regev, A. et al. The Human Cell Atlas. eLife 6, e27041 (2017).

  8. Howick, V. M. et al. The Malaria Cell Atlas: single parasite transcriptomes across the complete Plasmodium life cycle. Science 365, eaaw2619 (2019).

  9. MacArthur, J. et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res. 45, D896–D901 (2017).

  10. The Gene Ontology Consortium. Expansion of the Gene Ontology knowledgebase and resources. Nucleic Acids Res. 45, D331–D338 (2017).

  11. Sewell, W. Medical Subject Headings in MEDLARS. Bull. Assoc. Med Libr 52, 164–170 (1964).

  12. Landrum, M. J. et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res. 44, D862–D868 (2016).

  13. Cariaso, M. & Lennon, G. SNPedia: a wiki supporting personal genome annotation, interpretation and analysis. Nucleic Acids Res. 40, D1308–D1312 (2012).

  14. Franzén, O., Gan, L.-M. & Björkegren, J. L. M. PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database 2019, baz046 (2019).

  15. Haeussler, M. et al. The UCSC Genome Browser database: 2019 update. Nucleic Acids Res. 47, D853–D858 (2019).

  16. Athar, A. et al. ArrayExpress update—from bulk to single-cell expression data. Nucleic Acids Res. 47, D711–D715 (2019).

  17. Srivastava, D., Iyer, A., Kumar, V. & Sengupta, D. CellAtlasSearch: a scalable search engine for single cells. Nucleic Acids Res. 46, W141–W147 (2018).

  18. Sato, K., Tsuyuzaki, K., Shimizu, K. & Nikaido, I. CellFishing.jl: an ultrafast and scalable cell search method for single-cell RNA sequencing. Genome Biol. 20, 31 (2019).

  19. Vigna, S. Quasi-succinct indices. in Proc. Sixth ACM International Conference on Web Search and Data Mining—WSDM ’13 https://doi.org/10.1145/2433396.2433409 (ACM Press, 2013).

  20. Tabula Muris Consortium et al. Single-cell transcriptomic characterization of 20 organs and tissues from individual mice creates a Tabula Muris. Nature 562, 367–372 (2018).

  21. Golubovskaya, V. & Wu, L. Different subsets of T cells, memory, effector functions, and CAR-T immunotherapy. Cancers 8, 36 (2016).

  22. Zhang, X. et al. CellMarker: a manually curated resource of cell markers in human and mouse. Nucleic Acids Res. 47, D721–D728 (2019).

  23. Bausch-Fluck, D. et al. A mass spectrometric-derived cell surface protein atlas. PLoS ONE 10, e0121314 (2015).

  24. Lambert, S. A. et al. The human transcription factors. Cell 172, 650–665 (2018).

  25. Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457–D462 (2016).

  26. Ju, W. et al. Defining cell-type specificity at the transcriptional level in human disease. Genome Res. 23, 1862–1873 (2013).

  27. Eisenberg, E. & Levanon, E. Y. Human housekeeping genes, revisited. Trends Genet. 29, 569–574 (2013).

  28. Han, J., Pei, J., Yin, Y. & Mao, R. Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Min. Knowl. Discov. 8, 53–87 (2004).

  29. Sparck Jones, K. A statistical interpretation of term specificity and its application in retrieval. J. Doc. 28, 11–21 (1972).

  30. Piccini, I., Rao, J., Seebohm, G. & Greber, B. Human pluripotent stem cell-derived cardiomyocytes: genome-wide expression profiling of long-term in vitro maturation in comparison to human heart tissue. Genom. Data 4, 69–72 (2015).

  31. Sayers, E. W. et al. Database resources of the national center for biotechnology information. Nucleic Acids Res. 47, D23–D28 (2019).

  32. Wei, C.-H., Kao, H.-Y. & Lu, Z. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Res. 41, W518–W522 (2013).

  33. Pers, T. H. et al. Biological interpretation of genome-wide association studies using predicted gene functions. Nat. Commun. 6, 5890 (2015).

  34. Manica, M., Mathis, R., Cadow, J. & Rodríguez Martínez, M. Context-specific interaction networks from vector representation of words. Nat. Mach. Intell. 1, 181–190 (2019).

  35. Hastings, J. et al. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res. 41, D456–D463 (2013).

  36. Kim, S., Yeganova, L., Comeau, D. C., Wilbur, W. J. & Lu, Z. PubMed phrases, an open set of coherent phrases for searching biomedical literature. Sci. Data 5, 180104 (2018).

  37. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 26, 3111–3119 (2013).

  38. Pyysalo, S., Ginter, F., Moen, H., Salakoski, T. & Ananiadou, S. Distributional semantics resources for biomedical text processing. In Proc. Languages in Biology and Medicine (LBM) 39–44 (2013).

  39. Alfares, A. A. et al. Results of clinical genetic testing of 2,912 probands with hypertrophic cardiomyopathy: expanded panels offer limited additional sensitivity. Genet. Med. 17, 880–888 (2015).

  40. Flavigny, J. et al. Identification of two novel mutations in the ventricular regulatory myosin light chain gene (MYL2) associated with familial and classical forms of hypertrophic cardiomyopathy. J. Mol. Med. 76, 208–214 (1998).

  41. Hnisz, D. et al. Super-enhancers in the control of cell identity and disease. Cell 155, 934–947 (2013).

  42. Parker, S. C. J. et al. Chromatin stretch enhancer states drive cell-specific gene regulation and harbor human disease risk variants. Proc. Natl Acad. Sci. USA 110, 17921–17926 (2013).

  43. Khan, A. & Zhang, X. dbSUPER: a database of super-enhancers in mouse and human genome. Nucleic Acids Res. 44, D164–D171 (2016).

  44. Joo, M. S., Koo, J. H., Kim, T. H., Kim, Y. S. & Kim, S. G. LRH1-driven transcription factor circuitry for hepatocyte identity: super-enhancer cistromic analysis. EBioMedicine 40, 488–503 (2019).

  45. Thomas, G. D. et al. Deleting an Nr4a1 super-enhancer subdomain ablates Ly6Clow monocytes while preserving macrophage gene function. Immunity 45, 975–987 (2016).

  46. Kleftogiannis, D., Kalnis, P. & Bajic, V. B. Progress and challenges in bioinformatics approaches for enhancer identification. Brief. Bioinforma. 17, 967–979 (2016).

  47. Chen, S., Lake, B. B. & Zhang, K. High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 37, 1452–1457 (2019).

  48. Bradshaw, A. D. & Sage, E. H. SPARC, a matricellular protein that functions in cellular differentiation and tissue response to injury. J. Clin. Invest. 107, 1049–1054 (2001).

  49. Callaham, M. L., Wears, R. L., Weber, E. J., Barton, C. & Young, G. Positive-outcome bias and other limitations in the outcome of research abstracts submitted to a scientific meeting. JAMA 280, 254–257 (1998).

  50. Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 273–282 (2019).

  51. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B. 57, 289–300 (1995).

  52. McInnes, L., Healy, J., Saul, N. & Großberger, L. UMAP: uniform manifold approximation and projection. JOSS 3, 861 (2018).

  53. Tian, L. et al. Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nat. Methods 16, 479–487 (2019).

  54. Chazarra-Gil, R., Hemberg, M., Kiselev, V. Y. & van Dongen, S. Flexible comparison of batch correction methods for single-cell RNA-seq using BatchBench. Nucleic Acids Res. https://doi.org/10.1093/nar/gkab004 (2021).

  55. Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).

  56. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e21 (2019).

  57. Khan, A. et al. JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res. 46, D260–D266 (2018).

  58. Tan, G. & Lenhard, B. TFBSTools: an R/bioconductor package for transcription factor binding site analysis. Bioinformatics 32, 1555–1556 (2016).

  59. Lee, J. T. H., Patikas, N., Kiselev, V. Y. & Hemberg, M. Fast Searches of Large Collections of Single Cell Data Using scfind (Code Ocean, 2021); https://doi.org/10.24433/CO.2453077.v1

Acknowledgements

J.T.H.L., N.P., V.Y.K. and M.H. were supported by a core grant from the Wellcome Trust. J.T.H.L. was also supported by a grant ‘Search tools for scRNA-seq data’ (RR-4145) from the Chan Zuckerberg Initiative and N.P. was supported by UK Dementia Research Institute (DRI) grant RRZA/175. We thank members of the Hemberg group, Y. Liu, T. Bergmann and A. Meziani for assisting with beta testing of the software and L. Garcia-Alonso, V.J. Hall, A. Ori, S. Teichmann and R. Vento for feedback on the manuscript. We thank J. Eliasova for assistance with Fig. 1a.

Author information

Contributions

M.H. conceived the project and supervised the research. J.T.H.L., N.P., V.Y.K. and M.H. contributed to the code. J.T.H.L., N.P. and M.H. analyzed the data. J.T.H.L. and M.H. wrote the manuscript with input from N.P. and V.Y.K.

Corresponding author

Correspondence to Martin Hemberg.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Methods thanks Qi Liu and Itoshi Nikaido for their contribution to the peer review of this work. Lin Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Discussion, Tables 1–12 and Supplementary Figs. 1–18.

Reporting Summary

Supplementary Table 1

Precision, recall and F1 scores for all cell types in the atlases considered.

Supplementary Table 2

Information about the total number of marker genes and the precision and F1 scores that they provide for each cell type.

Supplementary Table 3

Cell-type specificity for the genes found in the MCA and the two TM datasets.

Supplementary Table 4

Number of maximal marker genes for each cell type in the MCA and the two TM datasets.

Supplementary Table 5

Number of cell-type-specific genes for each cell type in the MCA and the two TM datasets.

Supplementary Table 6

Best matches of the TM, FACS dataset from queries generated by sample variants from the index created from PubTator.

Supplementary Table 7

Best matches of the TM, FACS dataset from queries generated by sample disease names/MeSH/OMIM IDs.

Supplementary Table 8

Best matches of the TM, FACS dataset from queries generated by sample chemical names and their corresponding IDs.

Supplementary Table 9

Best matches of the TM, FACS dataset from queries generated by sample phrases from the dictionary from PubMed.

Supplementary Table 10

Cell-type specificity of super enhancers.

Supplementary Table 11

Cell-type-specific enhancer-gene pairs.

Supplementary Table 12

Top 20 and 30 marker genes in the three batch correction methods.

Source data

Source Data Fig. 1

Data generated with R package scfind.

Source Data Fig. 2

Data generated with R package scfind.

Source Data Fig. 3

Data generated with R package scfind.

Source Data Fig. 5

Data generated with R package scfind.

About this article

Cite this article

Lee, J.T.H., Patikas, N., Kiselev, V.Y. et al. Fast searches of large collections of single-cell data using scfind. Nat Methods 18, 262–271 (2021). https://doi.org/10.1038/s41592-021-01076-9

  • DOI: https://doi.org/10.1038/s41592-021-01076-9
