Automated mapping of phenotype space with single-cell data

Abstract

Accurate identification of cell subsets in complex populations is key to discovering novelty in multidimensional single-cell experiments. We present X-shift (http://web.stanford.edu/~samusik/vortex/), an algorithm that processes data sets using fast k-nearest-neighbor estimation of cell event density and arranges populations by marker-based classification. X-shift enables automated cell-subset clustering and access to biological insights that 'prior knowledge' might prevent the researcher from discovering.

Figure 1: X-shift algorithm design and validation.
Figure 2: X-shift clustering reveals novel features of mouse hematopoietic differentiation.

Acknowledgements

We thank M. Angst, W.J. Fantl, A. Surnov, E. Freeman, and W.H. Wong for help in manuscript editing and preparation. This work was supported by NIH grant R01GM109836 (N.S.); NIH grants U19 AI057229, 1U19AI100627, R01-CA184968, 1R33-CA183654-01, R33-CA183692, 1R01-GM10983601, 1R21-CA183660, 1R01-NS08953301, OPP-1113682, 5UH2-AR067676, 1R01-CA19665701 and R01-HL120724 (G.P.N.); Immunology Training grant 5T32AI007290 (Z.G.); US Department of Defense Teal Innovator Award (G.P.N.); Northrop-Grumman Corporation (G.P.N.); the US Food and Drug Administration grant BAA HHSF223201210194c (G.P.N.) and the Rachford and Carlota A. Harris Endowed Chair (G.P.N.).

Author information

Contributions

N.S. conceived and designed the algorithms, performed all computational experiments and wrote the manuscript; Z.G. performed comparisons with other clustering algorithms and hand gating of CyTOF data; M.H.S. performed CyTOF experiments and hand gating of CyTOF data; K.L.D. performed hand gating of CyTOF data; G.P.N. supervised the work and wrote the manuscript.

Corresponding author

Correspondence to Garry P Nolan.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 kNN-DE and X-shift clustering.

(a–d) Workflow of the X-shift algorithm. (a) K-nearest-neighbor density estimate (kNN-DE) in synthetic data sets sampled from three Gaussian distributions in different numbers of dimensions. The theoretical probability density of the data points is plotted against the kNN-DE. (b) Visual example of K-nearest-neighbor density estimation for three randomly chosen data points in a synthetic two-dimensional data set. Lines show the connections from each of the three points to its 20 nearest neighbors. The density estimate is inversely proportional to the sum of the line lengths. (c) Connecting data points up the gradient of the density estimate and finding local maxima. (d) Testing neighboring populations for density separation. (e) Runtime of X-shift on synthetic data with 20 normally distributed populations. The data set was clustered with X-shift using either the standard exhaustive kNN search algorithm or the custom fast kNN search algorithm, which uses reference points and the triangle inequality to guide the neighbor search. Runtime measurements demonstrate that the fast search allows X-shift to run in sub-quadratic time: the empirical estimate of complexity is O(n^1.77), compared to O(n^2) for the exhaustive search. Experiments were performed on a 16-core PC workstation equipped with two Intel Xeon E5-2620 CPUs and 64 GB RAM. X-shift was run on the 64-bit Oracle HotSpot JVM version 7 update 71 under Windows 7 Pro.
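
Panels (b) and (c) can be illustrated with a minimal sketch. It assumes a brute-force neighbor search and takes the density as the plain reciprocal of the summed distances to the K nearest neighbors; the exact normalization used by X-shift is not stated in this caption, and the function names `knn_density` and `link_to_modes` are hypothetical. The density-separation test of panel (d) and the fast neighbor search of panel (e) are not included.

```python
import math
import random


def knn_density(points, k):
    """Brute-force kNN density estimate.

    Assumption: density = 1 / (sum of distances to the k nearest neighbors),
    which follows the "inversely proportional" wording of the caption only
    up to an unknown normalization constant.
    """
    densities, neighbors = [], []
    for i, p in enumerate(points):
        nearest = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        total = sum(d for d, _ in nearest)
        densities.append(1.0 / total if total > 0 else float("inf"))
        neighbors.append([j for _, j in nearest])
    return densities, neighbors


def link_to_modes(densities, neighbors):
    """Link each point to its densest neighbor and follow the links uphill.

    A point with no denser neighbor is treated as a local density maximum.
    Returns, for every point, the index of the maximum it drains into.
    """
    parent = []
    for i, nbrs in enumerate(neighbors):
        best = max(nbrs, key=lambda j: densities[j])
        parent.append(best if densities[best] > densities[i] else i)

    def root(i):
        # Links always point to strictly higher density, so this terminates.
        while parent[i] != i:
            i = parent[i]
        return i

    return [root(i) for i in range(len(parent))]


# Illustrative data: two well-separated 2-D Gaussian blobs (invented values).
rng = random.Random(0)
blob_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
blob_b = [(rng.gauss(50, 1), rng.gauss(50, 1)) for _ in range(100)]
points = blob_a + blob_b

densities, neighbors = knn_density(points, k=20)
labels = link_to_modes(densities, neighbors)
print(len(set(labels)), "local density maxima found")
```

Each blob may contain more than one local maximum at this stage, which is why a merging test such as the one in panel (d) is needed afterwards.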

Supplementary Figure 2 X-shift clustering of a synthetic 15-dimensional data set that consists of a 50:50 mixture of noncentral t-distributions (ν = 3) and multivariate Gaussian distributions, with a total of 10 distributions in the mixture.

Contingency tables show the separation of populations in the linear phase (K=74), at the switch-point (K=22) and in the exponential phase (K=6). The graph shows the dependence of the number of clusters produced by X-shift on the setting of K.
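
As a rough illustration of how such a contingency table can be tabulated, the sketch below counts how many events from each true population fall into each cluster. The function name `contingency_table` and the toy labels are hypothetical and do not come from the synthetic data sets described here.

```python
from collections import Counter


def contingency_table(true_labels, cluster_labels):
    """Count events per (true population, cluster) pair.

    Returns the sorted row labels, the sorted column labels, and a
    list-of-lists table with one row per true population.
    """
    counts = Counter(zip(true_labels, cluster_labels))
    rows = sorted(set(true_labels))
    cols = sorted(set(cluster_labels))
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


# Toy example: population "A" is split across clusters 1 and 2.
true_labels = ["A", "A", "A", "B", "B", "C"]
cluster_labels = [1, 1, 2, 2, 2, 3]

rows, cols, table = contingency_table(true_labels, cluster_labels)
for name, row in zip(rows, table):
    print(name, row)
```

A clean separation shows up as one dominant entry per row and per column; off-diagonal counts indicate populations that were split or merged.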

Supplementary Figure 3 X-shift clustering of a synthetic 25-dimensional data set that consists of a 50:50 mixture of noncentral t-distributions (ν = 4) and multivariate Gaussian distributions, with a total of 20 distributions in the mixture.

Contingency tables show the separation of populations in the linear phase (K=74), at the switch-point (K=10) and in the exponential phase (K=6). The graph shows the dependence of the number of clusters produced by X-shift on the setting of K.

Supplementary Figure 4 X-shift clustering of a synthetic 35-dimensional data set that consists of a 50:50 mixture of noncentral t-distributions (ν = 5) and multivariate Gaussian distributions, with a total of 30 distributions in the mixture.

Contingency tables show the separation of populations in the linear phase (K=74), at the switch-point (K=28) and in the exponential phase (K=6).
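
The captions above do not say how the switch-point between the linear and exponential phases is located. One common heuristic, shown here purely as an assumption and not as the procedure used by X-shift, is to pick the point on the curve of cluster count versus K that lies farthest from the straight line joining the first and last points. The function name `elbow_index` and all numbers below are invented for illustration.

```python
def elbow_index(xs, ys):
    """Index of the point farthest from the chord joining the end points.

    Both axes are rescaled to [0, 1], so the chord runs from (0, 0) to
    (1, 1) and the distance to it is proportional to |x - y|.
    """
    x_span = (xs[-1] - xs[0]) or 1.0
    y_span = (ys[-1] - ys[0]) or 1.0
    nx = [(x - xs[0]) / x_span for x in xs]
    ny = [(y - ys[0]) / y_span for y in ys]
    gaps = [abs(a - b) for a, b in zip(nx, ny)]
    return max(range(len(gaps)), key=gaps.__getitem__)


# Invented curve: the cluster count grows steeply once K becomes small.
ks = [6, 10, 14, 22, 30, 50, 74]
n_clusters = [120, 40, 18, 12, 11, 10, 9]

k_switch = ks[elbow_index(ks, n_clusters)]
print("estimated switch-point K:", k_switch)
```

The rescaling step matters because K and the cluster count live on very different scales; without it, the larger axis would dominate the distance.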

Supplementary Figure 5 Biaxial gating hierarchy for the mouse bone marrow CyTOF data set.

Gating strategy used to find 24 reference populations in the mouse bone marrow CyTOF data. The pre-gating step involved removal of doublets, dead cells, erythrocytes and neutrophils. The non-neutrophil population was then subjected either to cluster analysis or to subsequent gating. Dotted boxes represent the 24 terminal gates that were selected as reference populations for the clustering comparison. Gating of B-cell populations was performed according to Hardy et al. [1]. Gates for CMP and GMP populations were set according to [2]. The table inset lists the Cell Ontology [3] identifiers that correspond to each gated population.

1. Hardy RR, Carmack CE, Shinton SA, Kemp JD, Hayakawa K (1991) J. Exp. Med. 173: 1213–1225.

2. Challen GA, Boles N, Lin KK, Goodell MA (2009) Cytometry A 75(1): 14–24.

3. Bard J, Rhee SY, Ashburner M (2005) An ontology for cell types. Genome Biol. 6(2): R21.

Supplementary Figure 6 Workflows documenting attempts to run AdiCYT and ImmPORT FLOCK that terminated with errors.

Supplementary Figure 7 An R workflow documenting the attempt to run flowClust/flowMerge that terminated with programmatic errors.

Supplementary Figure 8 Comparison of X-shift and other cytometry clustering algorithms on the CyTOF data set generated from mouse bone marrow stained with a panel of 38 antibodies.

The performance of X-shift was compared to that of commonly used flow cytometry clustering methods on the mouse bone marrow CyTOF data set (see Main Text). Attempts to run the FLOCK algorithm were unsuccessful because the data set dimensionality exceeded the technical limit of the grid-based density estimate. Runs of the AdiCyt and flowMerge algorithms failed with programmatic errors. flowMeans was run with MaxN up to 100, because above that value most runs failed with a 'singular matrix inversion' error. The flowPeaks algorithm has two free parameters, h and h0, which, following the suggestion of the user manual, we varied concomitantly over the range from (0.1, 0.15) to (1.0, 1.5). Following the user manual, the SamSPECTRAL sigma was fixed at 10.0 and the separation factor was varied between 0.3 and 1.9. PhenoGraph was run with K = 5 to 30, 30 being the default setting.

Supplementary Figure 9 Analysis of cell populations found by X-shift in mouse replicate 7 (X-shift k = 20, 48 clusters).

(a) Divisive marker tree (DMT) of clusters reconstructed from mouse replicate 7 (X-shift K=20, 48 clusters), visualized in the form of a dendrogram. Nodes of the dendrogram are labelled with the marker and the cutoff value: all clusters in the blue sub-branch have median expression of the indicated marker below the cutoff value, while all clusters in the red sub-branch have median expression of the indicated marker above the cutoff value. The heatmap represents marker expression values; the color code is relative to the maximum median expression of the given marker. Unlike the DMT in Figure 2a, this reconstruction was run on the median cluster expression profiles, not the mean. It produced a different ordering of markers, although the overall grouping remains similar, which suggests that the DMT algorithm is robust to small variations in cluster phenotypes. The table below shows the allocation of hand-gated populations into clusters; numbers represent the fraction of a given population that is contained in the corresponding cluster. (b) F-measures of individual cell populations computed against X-shift clusters (mouse replicate 7, X-shift K=20, 48 clusters), before and after the manual merging of clusters that originate from X-shift finding biologically relevant subdivisions in hand-gated cell populations. Merging all clusters corresponding to CD4 T cells, CD8 T cells, GMP, plasma cells and pDCs led to a general improvement of the F-measures of these cell types. The overall sum of F-measures increased from 16.98 to 18.21, improving the average F-measure per population from 0.70 to 0.76. (c) Median expression profiles of clusters forming the apparent pDC development trajectory (defined by the force-directed layout in Figure 2e). (d) Median expression profiles of clusters forming the branched monocyte development trajectory (defined by the force-directed layout in Figure 2d).
Upper panel shows early progenitors and the population at the hypothetical branching point between classical and the intermediate/non-classical pathways. Boxes indicate the markers that appear to distinguish the hypothetical branching point population and the early progenitors. Middle panel shows the populations that are situated between the apparent branching point and the mature classical monocytes. Lower panel shows the populations that are situated between the apparent branching point and the mature intermediate/non-classical monocytes. Arrows designate marker expression patterns that appear to differ between the two branches.
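
The effect of merging described in panel (b) can be reproduced on toy data. The sketch below assumes the usual definition of the F-measure as the harmonic mean of precision and recall, scored for each hand-gated population against its best-matching cluster; the caption does not spell out the exact matching rule, and the function names and cell indices are hypothetical.

```python
def f_measure(population, cluster):
    """Harmonic mean of precision and recall between two sets of cell IDs."""
    overlap = len(population & cluster)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)
    recall = overlap / len(population)
    return 2 * precision * recall / (precision + recall)


def best_f_measure(population, clusters):
    """Score a population against its best-matching cluster (an assumption)."""
    return max(f_measure(population, c) for c in clusters.values())


# Toy example: a hand-gated population of 10 cells that the clustering
# splits into two sub-clusters (6 and 4 cells); cluster 3 is unrelated.
population = set(range(10))
clusters = {1: set(range(6)), 2: set(range(6, 10)), 3: set(range(10, 15))}

before = best_f_measure(population, clusters)

# Merge the two sub-clusters, as done manually for the populations in (b).
merged = {"1+2": clusters[1] | clusters[2], 3: clusters[3]}
after = best_f_measure(population, merged)

print(f"F-measure before merging: {before:.2f}, after merging: {after:.2f}")
```

Splitting a gated population into valid sub-clusters lowers recall for every individual cluster, so the F-measure penalizes the split even when it is biologically meaningful; merging removes that penalty.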

Supplementary Figure 10 Single-cell force-directed layout (mouse 7, X-shift k = 20, 48 clusters) showing the distribution of hand-gated cell populations.

Single-cell force-directed layout identical to the one in Figure 2c, but color-coded to show the distribution of hand-gated cell populations (gating according to Supplementary Figure 5). Grey represents cells that are not part of any of the 24 hand-gated populations.

Supplementary Figure 11 Single-cell force-directed layout, color-coded by expression of individual protein markers.

Layout is identical to the one in Figure 2c.

Supplementary Figure 12 Single-cell force-directed layout, color-coded by expression of individual protein markers.

Layout is identical to the one in Figure 2c.

Supplementary Figure 13 Single-cell force-directed layout, color-coded by expression of individual protein markers.

Layout is identical to the one in Figure 2c.

Supplementary Figure 14 Single-cell force-directed layout, color-coded by expression of individual protein markers.

Layout is identical to the one in Figure 2c.

Supplementary Figure 15 Single-cell force-directed layout, color-coded by expression of individual protein markers.

Layout is identical to the one in Figure 2c.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–15 (PDF 3618 kb)

Supplementary Data

xx (XLSX 173 kb)

Cite this article

Samusik, N., Good, Z., Spitzer, M. et al. Automated mapping of phenotype space with single-cell data. Nat Methods 13, 493–496 (2016). https://doi.org/10.1038/nmeth.3863
