Science not art: statistically sound methods for identifying subsets in multi-dimensional flow and mass cytometry data sets

Orlova, Darya Y.; Herzenberg, Leonore A.; Walther, Guenther

doi:10.1038/nri.2017.150


Correspondence
Published: 22 December 2017

Nature Reviews Immunology volume 18, page 77 (2018)

Subjects

Flow cytometry

Automated approaches that cluster high-dimensional flow and mass cytometry data simultaneously in multiple dimensions, such as those discussed by Saeys et al. (Computational flow cytometry: helping to make sense of high-dimensional immunology data. Nat. Rev. Immunol. 16, 449–462 (2016))¹, are coming into routine use in biomedical settings. However, the simultaneous clustering approach underlying these methods is fundamentally flawed. The flaw stems from what statisticians call the 'curse of dimensionality'², which is well known to compromise both the statistical validity and the computational performance of clustering methods that operate on multiple dimensions at once.

Although the curse of dimensionality is a well-known problem, its statistical component, which renders clustering outcomes invalid, has not been properly recognized in flow and mass cytometry. This crucial problem arises from the marked increase in statistical uncertainty that occurs as the number of dimensions under consideration grows (even three dimensions can be problematic³).
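One way to get a feel for this growth in uncertainty is to count how many events are available, on average, to estimate density in each cell of a regular grid. The following is a minimal sketch only: the sample size of one million events and the choice of 10 bins per axis are illustrative assumptions, not values from this Correspondence, and the function name is hypothetical.

```python
def mean_events_per_bin(n_events: int, bins_per_axis: int, n_dims: int) -> float:
    """Average number of events per grid cell when every axis is cut into equal bins."""
    return n_events / (bins_per_axis ** n_dims)


n_events = 1_000_000  # hypothetical sample size
for n_dims in (1, 2, 3, 4, 6, 12):
    per_bin = mean_events_per_bin(n_events, bins_per_axis=10, n_dims=n_dims)
    print(f"{n_dims:>2} dimensions: {per_bin:g} events per bin on average")
```

The average overstates what is available in practice, because real events cluster and leave most cells empty. Even so, under these assumptions the average drops below one event per cell once more than six dimensions are binned together.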

That is, as the number of dimensions increases: first, data become increasingly sparsely distributed; second, definitions of density and of distance between points become increasingly meaningless; and third, fitting a mathematical model to the data set becomes infeasible, because the number of possible parameter combinations to be considered rises dramatically once the number of dimensions exceeds three or four. These problems compromise high-dimensional clustering algorithms that rely on estimating density and/or distance, or on fitting mathematical models. Here, we show directly how the curse of dimensionality leads some commonly used clustering methods to invalid conclusions (Fig. 1; Rphenograph⁴, X-shift⁵ and flowMeans⁶).
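The second and third points can be illustrated with a small numerical sketch. It rests on assumptions that are not part of this Correspondence: points are drawn uniformly from a unit hypercube, 'meaningfulness' of distance is summarized by the relative contrast between the farthest and nearest neighbour of a query point, and the model being fitted is a full-covariance Gaussian mixture. Both function names are hypothetical.

```python
import math
import random


def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """(farthest - nearest) / nearest distance from a random query to uniform random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


def gaussian_mixture_param_count(n_components: int, n_dims: int) -> int:
    """Free parameters of a Gaussian mixture with one full covariance matrix per component."""
    per_component = n_dims + n_dims * (n_dims + 1) // 2  # mean vector + symmetric covariance
    return n_components * per_component + (n_components - 1)  # plus mixing weights


for n_dims in (2, 4, 12, 100):
    contrast = relative_contrast(n_points=500, n_dims=n_dims)
    n_params = gaussian_mixture_param_count(n_components=10, n_dims=n_dims)
    print(f"{n_dims:>3} dims: relative contrast {contrast:.2f}, mixture parameters {n_params}")
```

As the dimension grows, the nearest and farthest points end up at nearly the same distance from the query, so 'near' and 'far' stop discriminating between points, while the number of parameters a mixture model must estimate grows quadratically with the dimension.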

**Figure 1: Commonly used high-dimensional clustering methods yield irreproducible results and may report populations that do not exist.**

t-distributed stochastic neighbour embedding (t-SNE)⁷ has recently been introduced into high-dimensional flow cytometry analyses as a preprocessing step intended to reduce data dimensionality before clustering. However, when t-SNE is applied to high-dimensional data with intrinsically high-dimensional structure (that is, when N-dimensional data cannot be closely approximated by some combination of n ≪ N dimensions), it becomes subject to the curse of dimensionality⁷. We used the maximum likelihood estimator (MLE) of intrinsic dimension proposed by Levina and Bickel⁸ to estimate the intrinsic dimensionality of a typical flow cytometry data set. The MLE indicated four intrinsic dimensions for the 12-parameter flow cytometry sample (10 colours plus side and forward scatter) shown in Fig. 1d. As noted above, even three dimensions can be problematic, and the severity of the curse of dimensionality increases sharply thereafter.
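For readers unfamiliar with the intrinsic-dimension estimate, the sketch below shows the form of the Levina and Bickel estimator as it is commonly stated: for each point, the inverse of the mean log ratio between the distance to its k-th nearest neighbour and the distances to its k − 1 closer neighbours, averaged over all points. This is not the code used for Fig. 1d. The brute-force neighbour search, the choice k = 10, the synthetic test data and all function names are assumptions made for illustration.

```python
import math
import random


def mle_intrinsic_dimension(points, k=10):
    """Levina-Bickel maximum likelihood estimate of intrinsic dimension, averaged over points."""
    estimates = []
    for i, x in enumerate(points):
        # Brute-force distances from x to every other point; exact duplicates are skipped.
        dists = []
        for j, y in enumerate(points):
            if j != i:
                d = math.dist(x, y)
                if d > 0.0:
                    dists.append(d)
        dists.sort()
        t_k = dists[k - 1]  # distance to the k-th nearest neighbour
        log_ratio_sum = sum(math.log(t_k / dists[j]) for j in range(k - 1))
        estimates.append((k - 1) / log_ratio_sum)
    return sum(estimates) / len(estimates)


def make_plane_in_5d(n_points, seed=0):
    """Synthetic data: a two-dimensional sheet embedded linearly in five coordinates."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_points):
        u, v = rng.random(), rng.random()
        data.append((u, v, u + v, u - v, 0.5 * u))
    return data


def make_uniform_5d(n_points, seed=1):
    """Synthetic data: points that genuinely fill all five dimensions."""
    rng = random.Random(seed)
    return [tuple(rng.random() for _ in range(5)) for _ in range(n_points)]


print("sheet in 5 coordinates:", round(mle_intrinsic_dimension(make_plane_in_5d(300)), 2))
print("uniform in 5 dimensions:", round(mle_intrinsic_dimension(make_uniform_5d(300)), 2))
```

On the synthetic sheet the estimate should fall near two even though each point has five coordinates, which is the sense in which a 12-parameter sample can have a much lower intrinsic dimensionality. The quadratic neighbour search is only practical for small samples.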

t-SNE also does not preserve either distances or density very well. It preserves only nearest neighbours, and only to some extent. This means that distance-based or density-based clustering algorithms are not usable with t-SNE maps (Fig. 1). Furthermore, these properties of t-SNE, together with the curse of dimensionality, are the primary causes of the lack of reproducibility illustrated in Fig. 1c,d.

The curse of dimensionality thus clearly militates against the use of high-dimensional simultaneous clustering methods for flow and mass cytometry data analysis. In contrast, automation⁹ of sequential analysis methods that have been used for years offers statistically robust clustering and readily usable tools for flow cytometry and other technologies.

There is a reply to this Correspondence by Saeys, Y., Van Gassen, S. and Lambrecht, B. Nat. Rev. Immunol. http://dx.doi.org/10.1038/nri.2017.151-c1 (2017)

References

1. Saeys, Y., Gassen, S. V. & Lambrecht, B. N. Computational flow cytometry: helping to make sense of high-dimensional immunology data. Nat. Rev. Immunol. 16, 449–462 (2016).
2. Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning (Springer-Verlag, 2009).
3. Scott, D. W. Multivariate Density Estimation: Theory, Practice and Visualization (Wiley, 1992).
4. Chen, H. et al. Cytofkit: a bioconductor package for an integrated mass cytometry data analysis pipeline. PLoS Comput. Biol. 12, e1005112 (2016).
5. Samusik, N. et al. Automated mapping of phenotype space with single-cell data. Nat. Methods 13, 493–496 (2016).
6. Broad Institute. Flow cytometry gating and clustering. GenePattern http://software.broadinstitute.org/cancer/software/genepattern/flow-cytometry-gating-and-clustering (2017).
7. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Machine Learn. Res. 9, 2579–2605 (2008).
8. Levina, E. & Bickel, P. in Advances in Neural Information Processing Systems 17 (NIPS 2004) (eds Saul, L. K., Weiss, Y. & Bottou, L.) (MIT Press, 2004).
9. Meehan, S. et al. AutoGate: automating analysis of flow cytometry data. Immunol. Res. 58, 218–223 (2014).
10. Orlova, D. et al. Earth Mover's Distance (EMD): a true metric for comparing biomarker expression levels in cell populations. PLoS ONE 11, e0151859 (2016).

Author information

orlova@stanford.edu
dyorlova@gmail.com

Authors and Affiliations

Darya Y. Orlova and Leonore A. Herzenberg are at the Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA.
Guenther Walther is at the Department of Statistics, Stanford University, Stanford, California 94305, USA.

Corresponding authors

Correspondence to Darya Y. Orlova or Guenther Walther.

About this article

Cite this article

Orlova, D., Herzenberg, L. & Walther, G. Science not art: statistically sound methods for identifying subsets in multi-dimensional flow and mass cytometry data sets. Nat Rev Immunol 18, 77 (2018). https://doi.org/10.1038/nri.2017.150

Issue Date: January 2018
