Automated mapping of phenotype space with single-cell data

Samusik, Nikolay; Good, Zinaida; Spitzer, Matthew H; Davis, Kara L; Nolan, Garry P

doi:10.1038/nmeth.3863

Brief Communication
Published: 16 May 2016

Nikolay Samusik¹, Zinaida Good¹,², Matthew H Spitzer¹,², Kara L Davis¹,³ & Garry P Nolan¹

Nature Methods volume 13, pages 493–496 (2016)

Abstract

Accurate identification of cell subsets in complex populations is key to discovering novelty in multidimensional single-cell experiments. We present X-shift (http://web.stanford.edu/~samusik/vortex/), an algorithm that processes data sets using fast k-nearest-neighbor estimation of cell event density and arranges populations by marker-based classification. X-shift enables automated cell-subset clustering and access to biological insights that 'prior knowledge' might prevent the researcher from discovering.

**Figure 1: X-shift algorithm design and validation.**

**Figure 2: X-shift clustering reveals novel features of mouse hematopoietic differentiation.**

Acknowledgements

We thank M. Angst, W.J. Fantl, A. Surnov, E. Freeman, and W.H. Wong for help in manuscript editing and preparation. This work was supported by NIH grant R01GM109836 (N.S.); NIH grants U19 AI057229, 1U19AI100627, R01-CA184968, 1R33-CA183654-01, R33-CA183692, 1R01-GM10983601, 1R21-CA183660, 1R01-NS08953301, OPP-1113682, 5UH2-AR067676, 1R01-CA19665701 and R01-HL120724 (G.P.N.); Immunology Training grant 5T32AI007290 (Z.G.); US Department of Defense Teal Innovator Award (G.P.N.); Northrop-Grumman Corporation (G.P.N.); the US Food and Drug Administration grant BAA HHSF223201210194c (G.P.N.) and the Rachford and Carlota A. Harris Endowed Chair (G.P.N.).

Author information

Authors and Affiliations

Department of Microbiology and Immunology, Stanford University School of Medicine, Stanford, California, USA
Nikolay Samusik, Zinaida Good, Matthew H Spitzer, Kara L Davis & Garry P Nolan
Department of Pathology, Stanford University School of Medicine, Stanford, California, USA
Zinaida Good & Matthew H Spitzer
Department of Pediatric Hematology and Oncology, Stanford University School of Medicine, Stanford, California, USA
Kara L Davis

Authors

Nikolay Samusik
Zinaida Good
Matthew H Spitzer
Kara L Davis
Garry P Nolan

Contributions

N.S. conceived and designed the algorithms, performed all computational experiments and wrote the manuscript; Z.G. performed comparisons with other clustering algorithms and hand gating of CyTOF data; M.H.S. performed CyTOF experiments and hand gating of CyTOF data; K.L.D. performed hand gating of CyTOF data; G.P.N. supervised the work and wrote the manuscript.

Corresponding author

Correspondence to Garry P Nolan.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 kNN-DE and X-shift clustering.

(a–d) Workflow of the X-shift algorithm. (a) K-nearest-neighbor density estimate (kNN-DE) in synthetic datasets sampled from three Gaussian distributions in different numbers of dimensions. The theoretical probability density of the datapoints is plotted against the kNN-DE. (b) Visual example of K-nearest-neighbor density estimation in a synthetic two-dimensional dataset. Example sets of 20 nearest neighbors are shown for three randomly chosen datapoints; lines show the connections to the neighbors. The density estimate is inversely proportional to the sum of line lengths. (c) Connecting datapoints against the gradient of the density estimate and finding local maxima. (d) Testing neighboring populations for density separation. (e) Testing the runtime of X-shift on synthetic data with 20 normally distributed populations. The dataset was clustered with X-shift using either the standard exhaustive kNN search algorithm or the custom fast kNN search algorithm, which uses reference points and the triangle inequality to guide the neighbor search. Runtime measurements demonstrate that the fast search allows X-shift to run in sub-quadratic time: the empirical estimate of complexity is O(n^1.77), compared with O(n²) for the exhaustive search. Experiments were performed on a 16-core PC workstation equipped with two Intel Xeon E5-2620 CPUs and 64 GB RAM. X-shift was run on the 64-bit Oracle HotSpot JVM version 7 update 71 under Windows 7 Pro.
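Panels (b) and (c) can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: it uses an exhaustive neighbor search rather than the fast reference-point search, it omits the density-separation test of panel (d), and the function names (`knn_density`, `find_modes`) and the two synthetic Gaussian blobs are assumptions made only for this illustration. The density of each point is taken as the reciprocal of the summed distances to its K nearest neighbors, and each point is then linked to its densest neighbor until a local maximum is reached.

```python
import math
import random


def knn_indices(points, i, k):
    """Indices of the k nearest neighbors of points[i] (exhaustive search)."""
    dists = [(math.dist(points[i], p), j) for j, p in enumerate(points) if j != i]
    dists.sort()
    return [j for _, j in dists[:k]]


def knn_density(points, k):
    """Density estimate inversely proportional to the summed neighbor distances."""
    density, neighbors = [], []
    for i in range(len(points)):
        nn = knn_indices(points, i, k)
        total = sum(math.dist(points[i], points[j]) for j in nn)
        density.append(1.0 / total if total > 0 else float("inf"))
        neighbors.append(nn)
    return density, neighbors


def find_modes(density, neighbors):
    """Link each point to its densest neighbor and follow the links to a local maximum."""
    parent = []
    for i, nn in enumerate(neighbors):
        best = max(nn, key=lambda j: density[j])
        # A point stays its own parent when no neighbor is strictly denser.
        parent.append(best if density[best] > density[i] else i)

    def root(i):
        # Links always lead to strictly higher density, so this loop terminates.
        while parent[i] != i:
            i = parent[i]
        return i

    return [root(i) for i in range(len(parent))]


# Hypothetical test data: two well-separated 2-D Gaussian blobs of 100 points each.
random.seed(0)
blob_a = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100)]
blob_b = [(random.gauss(20, 1), random.gauss(20, 1)) for _ in range(100)]
points = blob_a + blob_b

density, neighbors = knn_density(points, k=20)
labels = find_modes(density, neighbors)  # label = index of the mode each point reaches
print(len(set(labels)), "local density maxima found")
```

Because this sketch never merges neighboring maxima, it may report more than one mode per blob; in the workflow described above, that is the role of the density-separation test in panel (d).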

Supplementary Figure 2 X-shift clustering of a synthetic 15-dimensional data set that consists of a 50:50 mixture of noncentral t-distributions (ν = 3) and multivariate Gaussian distributions, with a total of 10 distributions in the mixture.

Contingency tables show the separation of populations in the linear phase (K=74), at the switch-point (K=22) and in the exponential phase (K=6). The graph shows the dependence of the number of clusters produced by X-shift on the setting of K.
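As a rough illustration of how such a contingency table can be tabulated, the sketch below counts events for every pair of true population and assigned cluster. The function name `contingency_table` and the six-event toy labels are hypothetical and are not taken from the paper's data.

```python
from collections import Counter


def contingency_table(true_labels, cluster_labels):
    """Count events for every (true population, assigned cluster) pair."""
    counts = Counter(zip(true_labels, cluster_labels))
    rows = sorted(set(true_labels))
    cols = sorted(set(cluster_labels))
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


# Hypothetical example: population "A" is split across clusters 0 and 1,
# while population "B" falls entirely into cluster 2.
true_labels = ["A", "A", "A", "B", "B", "B"]
cluster_labels = [0, 0, 1, 2, 2, 2]

rows, cols, table = contingency_table(true_labels, cluster_labels)
for name, row in zip(rows, table):
    print(name, row)
```

A population that is fragmented across clusters appears as a row with more than one nonzero cell, and a cluster that merges populations appears as a column with more than one nonzero cell.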

Supplementary Figure 3 X-shift clustering of a synthetic 25-dimensional data set that consists of a 50:50 mixture of noncentral t-distributions (ν = 4) and multivariate Gaussian distributions, with a total of 20 distributions in the mixture.

Contingency tables show the separation of populations in the linear phase (K=74), at the switch-point (K=10) and in the exponential phase (K=6). The graph shows the dependence of the number of clusters produced by X-shift on the setting of K.

Supplementary Figure 4 X-shift clustering of a synthetic 35-dimensional data set that consists of a 50:50 mixture of noncentral t-distributions (ν = 5) and multivariate Gaussian distributions, with a total of 30 distributions in the mixture.

Contingency tables show the separation of populations in the linear phase (K=74), at the switch-point (K=28) and in the exponential phase (K=6).

Supplementary Figure 5 Biaxial gating hierarchy for the mouse bone marrow CyTOF data set.

Gating strategy that was used to find 24 reference populations in the mouse bone marrow CyTOF data. The pre-gating step involved removal of doublets, dead cells, erythrocytes and neutrophils. The non-neutrophil population was subjected either to cluster analysis or to subsequent gating. Dotted boxes represent the 24 terminal gates that were selected as reference populations for the clustering comparison. Gating of B-cell populations was performed according to Hardy et al. [1]. Gates for CMP and GMP populations were set according to [2]. The table inset lists the Cell Ontology [3] identifiers that correspond to each gated population.

1. Hardy RR, Carmack CE, Shinton SA, Kemp JD & Hayakawa K (1991) J. Exp. Med. 173, 1213–1225

2. Challen GA, Boles N, Lin KK, Goodell MA (2009) Cytometry A 75(1): 14–24

3. Bard J, Rhee SY, Ashburner M (2005) An ontology for cell types. Genome Biol. 6(2): R21

Supplementary Figure 6 Workflows documenting attempts to run AdiCYT and ImmPORT FLOCK that terminated with errors.

Supplementary Figure 7 An R workflow documenting the attempt to run flowClust/flowMerge that terminated with programmatic errors.

Supplementary Figure 8 Comparison of X-shift and other cytometry clustering algorithms on the CyTOF data set generated from mouse bone marrow stained with a panel of 38 antibodies.

The performance of X-shift was compared with that of commonly used flow cytometry clustering methods on the mouse bone marrow CyTOF dataset (see main text). Attempts to run the FLOCK algorithm were unsuccessful because the dataset dimensionality exceeded the technical limit of the grid-based density estimate. AdiCyt and flowMerge runs failed with programmatic errors. flowMeans was run only up to a MaxN of 100, because above that number runs frequently failed with a ‘singular matrix inversion’ error. The FlowPeaks algorithm has two free parameters, h and h0, which, following the suggestion of the user manual, we varied concomitantly within the range [(0.1, 0.15)–(1.0, 1.5)]. Following the user manual, the SamSPECTRAL sigma was fixed at 10.0 and the separation factor was varied between 0.3 and 1.9. PhenoGraph was run with K = 5 to 30, 30 being the default setting.

Supplementary Figure 9 Analysis of cell populations found by X-shift in mouse replicate 7 (X-shift k = 20, 48 clusters).

(a) Divisive marker tree (DMT) of clusters reconstructed from mouse replicate #7 (X-shift K=20, 48 clusters), visualized in the form of a dendrogram. Nodes of the dendrogram are labelled with the marker and the cutoff value: all clusters in the blue sub-branch have median expression of the indicated marker below the cutoff value, whereas all clusters in the red sub-branch have median expression of the indicated marker above the cutoff value. The heatmap represents marker expression values; the color code is relative to the maximum of the median expression of the given marker. Unlike the DMT in Figure 2A, this reconstruction was run on the median, not mean, cluster expression profiles. It produced a different ordering of markers, although the overall grouping remains similar, which suggests that the DMT algorithm is robust to small variations in cluster phenotypes. The table below shows the allocation of hand-gated populations into clusters; numbers represent the fraction of a given population that is contained in the corresponding cluster. (b) F-measures of individual cell populations computed against X-shift clusters (mouse #7, X-shift K=20, 48 clusters), before and after the manual merging of clusters that originate from X-shift finding biologically relevant subdivisions in hand-gated cell populations. When all clusters corresponding to CD4 T cells, CD8 T cells, GMP, plasma cells and pDCs were merged, this led to a general improvement of the F-measures of these cell types. The overall sum of F-measures increased from 16.98 to 18.21, improving the average F-measure per population from 0.70 to 0.76. (c) Median expression profiles of clusters forming the apparent pDC development trajectory (defined by the force-directed layout in Figure 2E). (d) Median expression profiles of clusters forming the branched monocyte development trajectory (defined by the force-directed layout in Figure 2D). 
Upper panel shows early progenitors and the population at the hypothetical branching point between classical and the intermediate/non-classical pathways. Boxes indicate the markers that appear to distinguish the hypothetical branching point population and the early progenitors. Middle panel shows the populations that are situated between the apparent branching point and the mature classical monocytes. Lower panel shows the populations that are situated between the apparent branching point and the mature intermediate/non-classical monocytes. Arrows designate marker expression patterns that appear to differ between the two branches.
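The effect described in panel (b) can be sketched with a toy example. The caption does not spell out the F-measure definition used, so the sketch assumes the common convention of scoring each hand-gated population against its best-matching cluster, with F = 2PR/(P + R). The cell indices, cluster names and the `f_measure` and `best_f` helpers are hypothetical.

```python
def f_measure(population, cluster):
    """Harmonic mean of precision and recall between two sets of cell indices."""
    overlap = len(population & cluster)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)
    recall = overlap / len(population)
    return 2 * precision * recall / (precision + recall)


def best_f(population, clusters):
    """F-measure of a hand-gated population against its best-matching cluster."""
    return max(f_measure(population, c) for c in clusters.values())


# Hypothetical hand-gated population of 10 cells that the clustering
# splits into two sub-clusters (6 and 4 cells), plus one unrelated cluster.
pdc = set(range(10))
clusters = {
    "c1": {0, 1, 2, 3, 4, 5},
    "c2": {6, 7, 8, 9},
    "c3": {10, 11, 12},
}
before = best_f(pdc, clusters)

# Manually merge the two sub-clusters, as described for panel (b).
merged = {"c1+c2": clusters["c1"] | clusters["c2"], "c3": clusters["c3"]}
after = best_f(pdc, merged)

print(f"F-measure before merging: {before:.2f}, after merging: {after:.2f}")
```

In this toy case the best single cluster recovers only part of the population, so recall limits the score; merging the sub-clusters restores full recall without lowering precision, which is the mechanism by which merging can raise the per-population F-measure.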

Supplementary Figure 10 Single-cell force-directed layout (mouse 7, X-shift k = 20, 48 clusters) showing the distribution of hand-gated cell populations.

Single-cell Force Directed Layout identical to the one on Figure 2C, but with color codes showing the distribution of hand-gated cell populations (gating according to Supplementary Figure 12). Grey color represents cells that are not a part of any of the 24 hand-gated populations.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–15 (PDF 3618 kb)

Supplementary Data

xx (XLSX 173 kb)

Source data

Source data to Fig. 1

About this article

Cite this article

Samusik, N., Good, Z., Spitzer, M. et al. Automated mapping of phenotype space with single-cell data. Nat Methods 13, 493–496 (2016). https://doi.org/10.1038/nmeth.3863

Received: 13 January 2016
Accepted: 12 April 2016
Published: 16 May 2016
Issue Date: June 2016
DOI: https://doi.org/10.1038/nmeth.3863
