Accurate identification of cell subsets in complex populations is key to discovering novelty in multidimensional single-cell experiments. We present X-shift (http://web.stanford.edu/~samusik/vortex/), an algorithm that processes data sets using fast k-nearest-neighbor estimation of cell event density and arranges populations by marker-based classification. X-shift enables automated cell-subset clustering and access to biological insights that 'prior knowledge' might prevent the researcher from discovering.
We thank M. Angst, W.J. Fantl, A. Surnov, E. Freeman, and W.H. Wong for help in manuscript editing and preparation. This work was supported by NIH grant R01GM109836 (N.S.); NIH grants U19 AI057229, 1U19AI100627, R01-CA184968, 1R33-CA183654-01, R33-CA183692, 1R01-GM10983601, 1R21-CA183660, 1R01-NS08953301, OPP-1113682, 5UH2-AR067676, 1R01-CA19665701 and R01-HL120724 (G.P.N.); Immunology Training grant 5T32AI007290 (Z.G.); US Department of Defense Teal Innovator Award (G.P.N.); Northrop-Grumman Corporation (G.P.N.); the US Food and Drug Administration grant BAA HHSF223201210194c (G.P.N.) and the Rachford and Carlota A. Harris Endowed Chair (G.P.N.).
The authors declare no competing financial interests.
Integrated supplementary information
(a-d) Workflow of the X-shift algorithm. (a) K-nearest-neighbor density estimate (kNN-DE) in synthetic data sets sampled from three Gaussian distributions in different numbers of dimensions. The theoretical probability density of the data points is plotted against the kNN-DE. (b) Visual example of K-nearest-neighbor density estimation in a synthetic two-dimensional data set. Lines show the connections from three randomly chosen data points to their 20 nearest neighbors. The density estimate is inversely proportional to the sum of line lengths. (c) Connecting data points against the gradient of the density estimate and finding local maxima. (d) Testing neighboring populations for density separation. (e) Runtime of X-shift on synthetic data with 20 normally distributed populations. The data set was clustered with X-shift using either the standard exhaustive kNN search algorithm or the custom fast kNN search algorithm, which uses reference points and the triangle inequality to guide the neighbor search. Runtime measurements demonstrate that the fast search allows X-shift to run in sub-quadratic time: the empirical estimate of complexity is O(n^1.77), compared with O(n^2) for the exhaustive search. Experiments were performed on a 16-core PC workstation equipped with two Intel Xeon E5-2620 CPUs and 64 GB RAM. X-shift was run on the 64-bit Oracle HotSpot JVM version 7 update 71 under Windows 7 Pro.
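The density-estimation and mode-finding steps in panels (b) and (c) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `knn_density` and `assign_to_modes` are hypothetical, the neighbor search is the plain exhaustive one rather than the fast reference-point search, the normalization constant of the density estimate is omitted, and the density-separation test of panel (d) is left out.

```python
import math
import random


def knn_density(points, k):
    """Return (densities, neighbor index lists) using an exhaustive kNN search."""
    densities, neighbors = [], []
    for i, p in enumerate(points):
        # Distance from p to every other point, sorted ascending.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        total = sum(d for d, _ in nearest)
        # Assumption: density taken as 1 / (sum of distances to the k neighbors),
        # i.e. inversely proportional to the summed line lengths, unnormalized.
        densities.append(1.0 / total if total > 0 else float("inf"))
        neighbors.append([j for _, j in nearest])
    return densities, neighbors


def assign_to_modes(densities, neighbors):
    """Link each point to its densest neighbor and follow links to a local maximum."""
    parent = []
    for i, nbrs in enumerate(neighbors):
        best = max(nbrs, key=lambda j: densities[j])
        # A point with no strictly denser neighbor is treated as a local maximum.
        parent.append(best if densities[best] > densities[i] else i)

    def root(i):
        # Links always lead to strictly higher density, so this loop terminates.
        while parent[i] != i:
            i = parent[i]
        return i

    return [root(i) for i in range(len(parent))]


# Toy data: two 2-dimensional Gaussian blobs (hypothetical, for illustration only).
random.seed(0)
blob_a = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100)]
blob_b = [(random.gauss(8, 1), random.gauss(8, 1)) for _ in range(100)]
densities, neighbors = knn_density(blob_a + blob_b, k=20)
labels = assign_to_modes(densities, neighbors)
print("number of local density maxima:", len(set(labels)))
```

Each label is the index of the local density maximum that a point climbs to. With a small K this step alone typically reports more maxima than true populations, which is why a subsequent merging test such as the one in panel (d) is needed.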
Supplementary Figure 2 X-shift clustering of a synthetic 15-dimensional data set that consists of a 50:50 mixture of noncentral t-distributions (ν = 3) and multivariate Gaussian distributions, with a total of 10 distributions in the mixture.
Contingency tables show the separation of populations in the linear phase (K=74), at the switch-point (K=22) and in the exponential phase (K=6). The graph shows the dependence of the number of clusters produced by X-shift on the setting of K.
Supplementary Figure 3 X-shift clustering of a synthetic 25-dimensional data set that consists of a 50:50 mixture of noncentral t-distributions (ν = 4) and multivariate Gaussian distributions, with a total of 20 distributions in the mixture.
Contingency tables show the separation of populations in the linear phase (K=74), at the switch-point (K=10) and in the exponential phase (K=6). The graph shows the dependence of the number of clusters produced by X-shift on the setting of K.
Supplementary Figure 4 X-shift clustering of a synthetic 35-dimensional data set that consists of a 50:50 mixture of noncentral t-distributions (ν = 5) and multivariate Gaussian distributions, with a total of 30 distributions in the mixture.
Contingency tables show the separation of populations in the linear phase (K=74), at the switch-point (K=28) and in the exponential phase (K=6).
Gating strategy that was used to find 24 reference populations in the mouse bone marrow CyTOF data. The pre-gating step involved removal of doublets, dead cells, erythrocytes and neutrophils. The non-neutrophil population was subjected either to cluster analysis or to subsequent gating. Dotted boxes represent the 24 terminal gates that were selected as reference populations for the clustering comparison. Gating of B-cell populations was performed according to Hardy et al. (ref. 1). Gates for CMP and GMP populations were set according to ref. 2. The table inset lists the Cell Ontology (ref. 3) identifiers that correspond to each gated population.
1. Hardy, R.R., Carmack, C.E., Shinton, S.A., Kemp, J.D. & Hayakawa, K. J. Exp. Med. 173, 1213–1225 (1991).
2. Challen, G.A., Boles, N., Lin, K.K. & Goodell, M.A. Cytometry A 75, 14–24 (2009).
3. Bard, J., Rhee, S.Y. & Ashburner, M. Genome Biol. 6, R21 (2005).
Supplementary Figure 8 Comparison of X-shift and other cytometry clustering algorithms on the CyTOF data set generated from mouse bone marrow stained with a panel of 38 antibodies.
The performance of X-shift was compared to commonly used flow cytometry clustering methods on the mouse bone marrow CyTOF data set (see Main Text). Attempts to run the FLOCK algorithm were unsuccessful because the data set dimensionality exceeded the technical limit of the grid-based density estimate. AdiCyt and FlowMerge runs failed with programmatic errors. flowMeans was run with MaxN up to 100, because above that value most runs failed with a ‘singular matrix inversion’ error. The FlowPeaks algorithm has two free parameters, h and h0, which, following the suggestion of the user manual, we varied concomitantly within the range [(0.1, 0.15)–(1.0, 1.5)]. Following the user manual, the SamSPECTRAL sigma was fixed at 10.0 and the separation factor was varied between 0.3 and 1.9. PhenoGraph was run with K = 5 to 30, 30 being the default setting.
Supplementary Figure 9 Analysis of cell populations found by X-shift in mouse replicate 7 (X-shift k = 20, 48 clusters).
(a) Divisive marker tree (DMT) of clusters reconstructed from mouse replicate 7 (X-shift K=20, 48 clusters), visualized in the form of a dendrogram. Nodes of the dendrogram are labelled with the marker and the cutoff value: all clusters in the blue sub-branch have median expression of the indicated marker below the cutoff value, while all clusters in the red sub-branch have median expression of the indicated marker above the cutoff value. The heatmap represents marker expression values; the color code is relative to the maximum median expression of the given marker. Unlike the DMT in Figure 2A, the reconstruction was run on the median cluster expression profiles, not the mean. This reconstruction produced a different ordering of markers, although the overall grouping remains similar, which suggests that the DMT algorithm is robust to small variations in cluster phenotypes. The table below shows the allocation of hand-gated populations into clusters; numbers represent the fraction of a given population that is contained in the corresponding cluster. (b) F-measures of individual cell populations computed against X-shift clusters (mouse 7, X-shift K=20, 48 clusters), before and after the manual merging of clusters that originate from X-shift finding biologically relevant subdivisions in hand-gated cell populations. Merging all clusters corresponding to CD4 T cells, CD8 T cells, GMP, plasma cells and pDCs led to a general improvement of the F-measures of these cell types. The overall sum of F-measures increased from 16.98 to 18.21, improving the average F-measure per population from 0.70 to 0.76. (c) Median expression profiles of clusters forming the apparent pDC development trajectory (defined by the force-directed layout in Figure 2E). (d) Median expression profiles of clusters forming the branched monocyte development trajectory (defined by the force-directed layout in Figure 2D).
Upper panel shows early progenitors and the population at the hypothetical branching point between classical and the intermediate/non-classical pathways. Boxes indicate the markers that appear to distinguish the hypothetical branching point population and the early progenitors. Middle panel shows the populations that are situated between the apparent branching point and the mature classical monocytes. Lower panel shows the populations that are situated between the apparent branching point and the mature intermediate/non-classical monocytes. Arrows designate marker expression patterns that appear to differ between the two branches.
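The effect of cluster merging on the F-measures in panel (b) can be illustrated with a short Python sketch. It rests on assumptions that the caption does not spell out: each hand-gated population is scored by its best-matching cluster, the F-measure is the harmonic mean of precision and recall, and the function names and toy cell identifiers are hypothetical.

```python
def f_measure(population, cluster):
    """Harmonic mean of precision and recall between two sets of cell ids."""
    overlap = len(population & cluster)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)    # fraction of the cluster inside the gate
    recall = overlap / len(population)    # fraction of the gate inside the cluster
    return 2 * precision * recall / (precision + recall)


def best_f_measures(gates, clusters):
    """For each hand-gated population, the highest F-measure over all clusters."""
    return {
        name: max(f_measure(cells, c) for c in clusters.values())
        for name, cells in gates.items()
    }


# Toy example: one gated population of 10 cells that is split into two clusters.
gates = {"CD4 T cells": set(range(10))}
clusters = {"c1": set(range(6)), "c2": set(range(6, 10))}
before = best_f_measures(gates, clusters)

# Manually merging the two sub-clusters recovers the whole gated population.
merged = {"c1+c2": clusters["c1"] | clusters["c2"]}
after = best_f_measures(gates, merged)

print(before, after)
```

In this toy case a population that is split across two clusters is penalized through recall even though both clusters are pure, which is the situation the manual merging step is meant to correct.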
Supplementary Figure 10 Single-cell force-directed layout (mouse 7, X-shift k = 20, 48 clusters) showing the distribution of hand-gated cell populations.
Single-cell Force Directed Layout identical to the one on Figure 2C, but with color codes showing the distribution of hand-gated cell populations (gating according to Supplementary Figure 12). Grey color represents cells that are not a part of any of the 24 hand-gated populations.
About this article
Cite this article
Samusik, N., Good, Z., Spitzer, M. et al. Automated mapping of phenotype space with single-cell data. Nat Methods 13, 493–496 (2016). https://doi.org/10.1038/nmeth.3863