  • Brief Communication

scmap: projection of single-cell RNA-seq data across data sets

Abstract

Single-cell RNA-seq (scRNA-seq) allows researchers to define cell types on the basis of unsupervised clustering of the transcriptome. However, differences in experimental methods and computational analyses make it challenging to compare data across experiments. Here we present scmap (http://bioconductor.org/packages/scmap; web version at http://www.sanger.ac.uk/science/tools/scmap), a method for projecting cells from an scRNA-seq data set onto cell types or individual cells from other experiments.
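The abstract does not spell out how a projection is computed, so the following is only a minimal sketch under stated assumptions: each reference cell type is summarised by a centroid over the selected features, the similarity measure is cosine similarity, and a query cell whose best similarity falls below a threshold is left "unassigned". The function names `cosine` and `project_cell`, the toy centroids and the default threshold of 0.7 (the value quoted for scmap-cluster in the caption of Supplementary Figure 4) are illustrative choices, not the package's API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length expression vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def project_cell(cell, centroids, threshold=0.7):
    """Return (label, similarity) for the most similar reference centroid.

    Assumption: cells whose best similarity is below `threshold`
    are reported as "unassigned" rather than forced into a cell type.
    """
    best_label, best_sim = None, -1.0
    for label, centroid in centroids.items():
        sim = cosine(cell, centroid)
        if sim > best_sim:
            best_label, best_sim = label, sim
    if best_sim < threshold:
        return "unassigned", best_sim
    return best_label, best_sim

# Hypothetical reference with two cell types and three selected features.
centroids = {"alpha": [5.0, 0.0, 1.0], "beta": [0.0, 4.0, 1.0]}

label_a, sim_a = project_cell([4.0, 0.5, 1.0], centroids)  # close to "alpha"
label_u, sim_u = project_cell([1.0, 1.0, 0.0], centroids)  # ambiguous cell
```

Leaving ambiguous cells unassigned is what makes it possible to report a percentage of unassigned cells alongside an agreement score.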

Figure 1: scmap use and performance.
Figure 2: scmap for combined references.

Acknowledgements

We thank T. Andrews, K.N. Natarajan, G. Parada, M. Schaub, M. Stubbington, V. Svensson, J. Westoby and F. Wünnermann for helpful discussions, feedback on the manuscript and testing of the cloud implementation of scmap. Amazon Web Services (AWS) Cloud provided credits for running the scmap server for 1 year. V.Y.K., A.Y. and M.H. were supported by core funding to the Wellcome Sanger Institute provided by the Wellcome Trust.

Author information

Contributions

M.H. conceived the study and supervised the research; V.Y.K., A.Y. and M.H. contributed to the computational framework; V.Y.K. and M.H. wrote the manuscript.

Corresponding author

Correspondence to Martin Hemberg.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Feature-selection methods and self-projections.

(a) Dropout-based feature selection (see Methods) for the Pollen [1] (SMARTer protocol), Baron [2] (inDrop [3] protocol) and Macosko [4] (Drop-seq [4] protocol) datasets. The black line represents a linear fit to the distribution of the points; the red points represent the top 500 positive residuals of the fit. (b) Cohen’s κ values of self-projections for dropout-based, HVG [5] and random feature selection. The plot is based on the datasets listed in Table S1. For each dataset, 70% of the cells were sampled to create a reference and to select features, and the remaining 30% of the cells were used as queries. The procedure was repeated n = 100 times per dataset. The center of each boxplot is the median; the hinges correspond to the first and third quartiles, the distance between them being the inter-quartile range; the whiskers extend no more than 1.5 times the inter-quartile range from the hinges; and data beyond this range are plotted as individual points.

1. Pollen, A. A. et al. Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat. Biotechnol. 32, 1053–1058 (2014).

2. Baron, M. et al. A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas Reveals Inter- and Intra-cell Population Structure. Cell Syst 3, 346–360.e4 (2016).

3. Klein, A. M. et al. Droplet Barcoding for Single-Cell Transcriptomics Applied to Embryonic Stem Cells. Cell 161, 1187–1201 (2015).

4. Macosko, E. Z. et al. Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell 161, 1202–1214 (2015).

5. Brennecke, P. et al. Accounting for technical noise in single-cell RNA-seq experiments. Nat. Methods 10, 1093–1095 (2013).
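Panel (a) of Supplementary Figure 1 can be illustrated with a short sketch: fit a straight line to the per-gene points and keep the genes with the largest positive residuals. The caption does not state the axes, so the code assumes log2 mean expression on the x axis and log2 dropout percentage on the y axis; `fit_line`, `select_features` and the toy counts are hypothetical names and data, not taken from the scmap software.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def select_features(counts, n_features=500):
    """counts maps gene -> list of expression values across cells.

    Assumption: both axes are log2-transformed, so genes with zero mean
    expression or no dropouts are skipped (their logs are undefined).
    """
    genes, xs, ys = [], [], []
    for gene, values in counts.items():
        mean_expr = sum(values) / len(values)
        dropout = sum(1 for v in values if v == 0) / len(values)
        if mean_expr > 0 and dropout > 0:
            genes.append(gene)
            xs.append(math.log2(mean_expr))
            ys.append(math.log2(100 * dropout))
    slope, intercept = fit_line(xs, ys)
    residuals = {g: y - (slope * x + intercept) for g, x, y in zip(genes, xs, ys)}
    # Keep genes with more dropouts than the fit predicts, largest residual first.
    positive = [g for g in residuals if residuals[g] > 0]
    positive.sort(key=lambda g: residuals[g], reverse=True)
    return positive[:n_features]

# Toy data: g3 is highly expressed in one cell and absent elsewhere.
toy_counts = {
    "g1": [0, 1, 1, 1],
    "g2": [0, 0, 2, 2],
    "g3": [0, 0, 0, 8],
    "g4": [0, 4, 4, 4],
    "g5": [0, 0, 0, 1],
}
selected = select_features(toy_counts, n_features=2)
```

In this toy example the gene that is strongly expressed in a single cell but absent from the others sits furthest above the fitted line, which is the kind of gene such a criterion is meant to pick up.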

Supplementary Figure 2 scmap performance on positive controls.

scmap performance on positive controls (n = 14, listed in Table S2) measured by (a) Cohen’s κ values and (b) percentage of unassigned cells. Values of the similarity thresholds are shown on the right of the plots. (c) scmap performance on negative controls (n = 18, listed in Table S3) measured by percentage of unassigned cells. Values of the similarity thresholds are shown on the right of the plots. For all three panels, the middle row (for the scmap-cluster, SVM and RF methods) and the top row (for the scmap-cell method) correspond to Fig. 1b–d. The center of each boxplot is the median; the hinges correspond to the first and third quartiles, the distance between them being the inter-quartile range; the whiskers extend no more than 1.5 times the inter-quartile range from the hinges; and data beyond this range are plotted as individual points.
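The two measures used in this figure can be sketched as follows. The code computes the unweighted form of Cohen's κ and, as an assumption not stated in the caption, excludes unassigned cells before computing κ, reporting them separately as a percentage. `cohens_kappa`, `evaluate` and the label vectors are illustrative.

```python
from collections import Counter

def cohens_kappa(truth, predicted):
    """Unweighted Cohen's kappa between two equal-length label lists."""
    n = len(truth)
    observed = sum(t == p for t, p in zip(truth, predicted)) / n
    truth_freq = Counter(truth)
    pred_freq = Counter(predicted)
    # Agreement expected by chance from the two marginal label frequencies.
    expected = sum(truth_freq[l] * pred_freq[l] for l in truth_freq) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)

def evaluate(truth, predicted, unassigned="unassigned"):
    """Return (kappa over assigned cells, percentage of unassigned cells)."""
    kept = [(t, p) for t, p in zip(truth, predicted) if p != unassigned]
    pct_unassigned = 100 * (len(predicted) - len(kept)) / len(predicted)
    if not kept:
        return float("nan"), pct_unassigned
    kappa = cohens_kappa([t for t, _ in kept], [p for _, p in kept])
    return kappa, pct_unassigned

truth = ["a", "a", "a", "b", "b", "b", "a", "b"]
predicted = ["a", "a", "b", "b", "b", "b", "unassigned", "unassigned"]
kappa, pct_unassigned = evaluate(truth, predicted)
```

Here five of the six assigned cells agree with the original labels and chance agreement is 0.5, so κ is 2/3, while 25% of the cells are unassigned.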

Supplementary Figure 3 scmap performance on downsampled (by the number of cells) positive controls.

scmap performance on down-sampled (by the number of cells) positive controls (n = 14, listed in Table S2) measured by (a) Cohen’s κ values and (b) percentage of unassigned cells. The percentage of cells retained after down-sampling is shown on the right of the plots. For each dataset, n = 100 down-samplings were performed. (c) Robustness of feature selection measured by the Jaccard index, calculated by comparing the features selected in the original (listed in Table S1) and down-sampled datasets. The percentage of cells retained after down-sampling is shown on the right of the plots. For each dataset, n = 100 simulations were performed. The center of each boxplot is the median; the hinges correspond to the first and third quartiles, the distance between them being the inter-quartile range; the whiskers extend no more than 1.5 times the inter-quartile range from the hinges; and data beyond this range are plotted as individual points.
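The Jaccard index in panel (c) is the size of the intersection of two feature sets divided by the size of their union. A minimal sketch, with hypothetical gene names standing in for the features selected from an original and a down-sampled dataset:

```python
def jaccard_index(features_a, features_b):
    """Jaccard index of two feature collections (1.0 means identical sets)."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        # Convention chosen here: two empty selections count as identical.
        return 1.0
    return len(a & b) / len(union)

original = ["g1", "g2", "g3", "g4"]
downsampled = ["g3", "g4", "g5"]
overlap = jaccard_index(original, downsampled)  # 2 shared out of 5 distinct genes
```

A value near 1 indicates that down-sampling barely changes which features are selected.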

Supplementary Figure 4 scmap performance on positive controls.

scmap performance on positive controls (n = 12, listed in Table S2, except Shekhar and Macosko) with increased dropout rates measured by (a) Cohen’s κ values and (b) percentage of unassigned cells. The percentage of extra dropouts is shown on the right of the plots. For each dataset, n = 100 random dropout assignments were performed. The Shekhar and Macosko datasets were excluded from this analysis because their dropout rates are already high. The center of each boxplot is the median; the hinges correspond to the first and third quartiles, the distance between them being the inter-quartile range; the whiskers extend no more than 1.5 times the inter-quartile range from the hinges; and data beyond this range are plotted as individual points. (c) Dependence of Cohen’s κ on the percentage of dropouts for all positive controls (n = 14, listed in Table S2). The points were plotted using 200-500 features and similarity thresholds of 0.5 (scmap-cell) and 0.7 (scmap-cluster). Lines are linear regression fits to the points. The slopes of the lines are -0.47 (scmap-cluster) and 0 (scmap-cell).
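The caption does not say how the extra dropouts were assigned. One plausible reading, sketched here purely as an assumption, is that a given fraction of the non-zero entries of the expression matrix is chosen at random and set to zero. `add_dropouts`, `dropout_rate` and the toy matrix are hypothetical.

```python
import random

def add_dropouts(matrix, extra_fraction, seed=0):
    """Return a copy of `matrix` with a fraction of non-zero entries zeroed.

    Assumption: "extra dropouts" are drawn uniformly at random from the
    entries that are currently non-zero.
    """
    rng = random.Random(seed)
    nonzero = [(i, j) for i, row in enumerate(matrix)
               for j, v in enumerate(row) if v != 0]
    n_drop = round(extra_fraction * len(nonzero))
    dropped = set(rng.sample(nonzero, n_drop))
    return [[0 if (i, j) in dropped else v for j, v in enumerate(row)]
            for i, row in enumerate(matrix)]

def dropout_rate(matrix):
    """Fraction of entries equal to zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for v in row if v == 0)
    return zeros / total

# Toy matrix: 2 genes x 5 cells, with 2 zeros out of 10 entries.
matrix = [[3, 0, 1, 2, 5],
          [0, 4, 1, 1, 2]]
noisy = add_dropouts(matrix, extra_fraction=0.5)

rate_before = dropout_rate(matrix)
rate_after = dropout_rate(noisy)
```

Repeating this with different seeds gives the kind of replicated random assignment described in the caption (n = 100 per dataset); the toy matrix goes from 20% to 60% zeros when half of its non-zero entries are removed.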

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–4, Supplementary Tables 1–3 and Supplementary Notes 1–2

Life Sciences Reporting Summary

Supplementary Software

scmap software

About this article

Cite this article

Kiselev, V., Yiu, A. & Hemberg, M. scmap: projection of single-cell RNA-seq data across data sets. Nat Methods 15, 359–362 (2018). https://doi.org/10.1038/nmeth.4644

  • DOI: https://doi.org/10.1038/nmeth.4644
