Abstract
Single-cell RNA-seq (scRNA-seq) allows researchers to define cell types on the basis of unsupervised clustering of the transcriptome. However, differences in experimental methods and computational analyses make it challenging to compare data across experiments. Here we present scmap (http://bioconductor.org/packages/scmap; web version at http://www.sanger.ac.uk/science/tools/scmap), a method for projecting cells from an scRNA-seq data set onto cell types or individual cells from other experiments.
Acknowledgements
We thank T. Andrews, K.N. Natarajan, G. Parada, M. Schaub, M. Stubbington, V. Svensson, J. Westoby and F. Wünnermann for helpful discussions, feedback on the manuscript and testing of the cloud implementation of scmap. Amazon Web Services (AWS) Cloud provided credits for running the scmap server for 1 year. V.Y.K., A.Y. and M.H. were supported by core funding to the Wellcome Sanger Institute provided by the Wellcome Trust.
Author information
Contributions
M.H. conceived the study and supervised the research; V.Y.K., A.Y. and M.H. contributed to the computational framework; V.Y.K. and M.H. wrote the manuscript.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Feature-selection methods and self-projections.
(a) Dropout-based feature selection (see Methods) for the Pollen¹ (SMARTer protocol), Baron² (inDrop³ protocol) and Macosko⁴ (Drop-seq⁴ protocol) datasets. The black line is a linear fit to the points, and the red points mark the 500 largest positive residuals of the fit. (b) Cohen’s κ values of self-projections for dropout-based, HVG⁵ and random feature selection. The plot is based on the datasets listed in Table S1. For each dataset, 70% of the cells were sampled to create a reference and to select features, and the remaining 30% of the cells were used as queries. The procedure was repeated n=100 times per dataset. The center of each boxplot is the median and the hinges mark the first and third quartiles (the inter-quartile range); the whiskers extend no more than 1.5 times the inter-quartile range, and data beyond this range are plotted as individual points.
1. Pollen, A. A. et al. Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat. Biotechnol. 32, 1053–1058 (2014).
2. Baron, M. et al. A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas Reveals Inter- and Intra-cell Population Structure. Cell Syst 3, 346–360.e4 (2016).
3. Klein, A. M. et al. Droplet Barcoding for Single-Cell Transcriptomics Applied to Embryonic Stem Cells. Cell 161, 1187–1201 (2015).
4. Macosko, E. Z. et al. Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell 161, 1202–1214 (2015).
5. Brennecke, P. et al. Accounting for technical noise in single-cell RNA-seq experiments. Nat. Methods 10, 1093–1095 (2013).
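The selection rule described in panel (a), a linear fit followed by taking the largest positive residuals, can be sketched as follows. This is a minimal illustration, not the scmap implementation: the choice of axes (log2 of mean expression against log2 of dropout percentage), the pseudocount, and the function name `select_features_by_dropout` are assumptions made for the example.

```python
import numpy as np

def select_features_by_dropout(counts, n_features=500):
    """Pick genes whose dropout rate is higher than a linear fit predicts.

    counts: genes x cells array of non-negative expression values.
    Returns row indices of up to n_features genes with the largest
    positive residuals. The axes used here are an assumption.
    """
    counts = np.asarray(counts, dtype=float)
    mean_expr = counts.mean(axis=1)
    dropout_pct = 100.0 * (counts == 0).mean(axis=1)

    # Keep genes with 0% < dropout < 100% so that both logarithms are finite.
    usable = np.flatnonzero((dropout_pct > 0) & (dropout_pct < 100))
    x = np.log2(mean_expr[usable] + 1.0)   # pseudocount is an assumption
    y = np.log2(dropout_pct[usable])

    # Straight-line fit (the "black line"), then residuals above the line.
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)

    # Largest residuals first; keep only genes that lie above the fit.
    top = np.argsort(residuals)[::-1][:n_features]
    top = top[residuals[top] > 0]
    return usable[top]
```

Genes returned by this sketch are those with more zeros than their mean expression would suggest under the fitted trend, which is the intuition behind the red points in the figure.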
Supplementary Figure 2 scmap performance on positive controls.
scmap performance on positive controls (n=14, listed in Table S2) measured by (a) Cohen’s κ values and (b) percentage of unassigned cells. Values of the similarity thresholds are shown on the right of the plots. (c) scmap performance on negative controls (n=18, listed in Table S3) measured by percentage of unassigned cells. Values of the similarity thresholds are shown on the right of the plots. For all three panels, the middle row (for the scmap-cluster, SVM and RF methods) and the top row (for the scmap-cell method) correspond to Fig. 1b–d. The center of each boxplot is the median and the hinges mark the first and third quartiles (the inter-quartile range); the whiskers extend no more than 1.5 times the inter-quartile range, and data beyond this range are plotted as individual points.
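The two performance measures used in these panels can be computed as in the sketch below. It assumes the unweighted form of Cohen’s κ and a hypothetical label string `"unassigned"` for cells that fall below the similarity threshold; whether unassigned cells are excluded before κ is computed is left to the caller.

```python
from collections import Counter

def cohens_kappa(true_labels, predicted_labels):
    """Unweighted Cohen's kappa between two equal-length label sequences."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    n = len(true_labels)
    if n == 0:
        raise ValueError("label sequences must not be empty")

    # Observed agreement: fraction of cells with identical labels.
    observed = sum(t == p for t, p in zip(true_labels, predicted_labels)) / n

    # Chance agreement from the marginal label frequencies.
    true_counts = Counter(true_labels)
    pred_counts = Counter(predicted_labels)
    expected = sum(true_counts[k] * pred_counts[k] for k in true_counts) / n**2

    if expected == 1.0:
        # Both sequences use a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)

def percent_unassigned(predicted_labels, unassigned="unassigned"):
    """Percentage of cells carrying the (assumed) unassigned label."""
    n_unassigned = sum(p == unassigned for p in predicted_labels)
    return 100.0 * n_unassigned / len(predicted_labels)
```

A κ of 1 indicates perfect agreement with the original annotation, and a κ of 0 indicates agreement no better than chance.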
Supplementary Figure 3 scmap performance on downsampled (by the number of cells) positive controls.
scmap performance on down-sampled (by the number of cells) positive controls (n=14, listed in Table S2) measured by (a) Cohen’s κ values and (b) percentage of unassigned cells. The percentage of cells retained after down-sampling is shown on the right of the plots. For each dataset, n=100 down-samplings were performed. (c) Robustness of feature selection, measured by the Jaccard index between the features selected in the original datasets (listed in Table S1) and in the down-sampled datasets. The percentage of cells retained after down-sampling is shown on the right of the plots. For each dataset, n=100 simulations were performed. The center of each boxplot is the median and the hinges mark the first and third quartiles (the inter-quartile range); the whiskers extend no more than 1.5 times the inter-quartile range, and data beyond this range are plotted as individual points.
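The Jaccard index in panel (c) is the size of the intersection of two feature sets divided by the size of their union. A minimal sketch follows; the function name and the convention for two empty sets are assumptions made for the example.

```python
def jaccard_index(features_a, features_b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two collections of gene names."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets are identical.
        return 1.0
    return len(a & b) / len(union)
```

A value of 1 means the down-sampled dataset selected exactly the same features as the original, and a value of 0 means the two selections share no features.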
Supplementary Figure 4 scmap performance on positive controls.
scmap performance on positive controls (n=12, listed in Table S2, except Shekhar and Macosko) with increased dropout rates measured by (a) Cohen’s κ values and (b) percentage of unassigned cells. The percentage of extra dropouts is shown on the right of the plots. For each dataset, n=100 random dropout assignments were performed. The Shekhar and Macosko datasets were excluded from this analysis because of their already high dropout rates. The center of each boxplot is the median and the hinges mark the first and third quartiles (the inter-quartile range); the whiskers extend no more than 1.5 times the inter-quartile range, and data beyond this range are plotted as individual points. (c) Dependence of Cohen’s κ on the percentage of dropouts for all positive controls (n=14, listed in Table S2). The points were plotted using 200–500 features and similarity thresholds of 0.5 (for scmap-cell) and 0.7 (for scmap-cluster). Lines are linear regression fits to the points. The gradients of the lines are -0.47 (scmap-cluster) and 0 (scmap-cell).
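One way to simulate extra dropouts is to set a random subset of the non-zero entries of the expression matrix to zero, as sketched below. The caption does not state how dropouts were assigned, so the details here are assumptions: the fraction is taken relative to the non-zero entries, the entries are chosen uniformly at random, and `add_random_dropouts` is a hypothetical name.

```python
import numpy as np

def add_random_dropouts(counts, extra_fraction, seed=None):
    """Return a copy of counts with a fraction of non-zero entries zeroed.

    counts: genes x cells array of expression values.
    extra_fraction: share of currently non-zero entries to zero out (0 to 1).
    """
    if not 0.0 <= extra_fraction <= 1.0:
        raise ValueError("extra_fraction must be between 0 and 1")
    rng = np.random.default_rng(seed)

    # Work on a copy so the original matrix is left untouched.
    out = np.array(counts, dtype=float, copy=True)

    # Flat positions of all non-zero entries, then a random subset of them.
    nonzero = np.flatnonzero(out)
    n_drop = int(round(extra_fraction * nonzero.size))
    chosen = rng.choice(nonzero, size=n_drop, replace=False)
    out.flat[chosen] = 0.0
    return out
```

Repeating the call with different seeds gives independent random dropout assignments for the same dataset.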
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–4, Supplementary Tables 1–3 and Supplementary Notes 1–2
Supplementary Software
scmap software
About this article
Cite this article
Kiselev, V., Yiu, A. & Hemberg, M. scmap: projection of single-cell RNA-seq data across data sets. Nat Methods 15, 359–362 (2018). https://doi.org/10.1038/nmeth.4644
DOI: https://doi.org/10.1038/nmeth.4644