Automated approaches that cluster high-dimensional flow and mass cytometry data simultaneously in multiple dimensions, such as those discussed by Saeys et al. (Computational flow cytometry: helping to make sense of high-dimensional immunology data. Nat. Rev. Immunol. 16, 449–462 (2016); Ref. 1), are coming into routine use in biomedical settings. However, the simultaneous clustering approach underlying these methods is fundamentally flawed. This is due to what statisticians call the 'curse of dimensionality' (Ref. 2), which is well known to compromise both the statistical validity and the computational performance of clustering methods that operate on multiple dimensions at once.
Although the curse of dimensionality is a well-known problem, its statistical component, which renders clustering outcomes invalid, has not been properly recognized in flow and mass cytometry. This crucial problem arises from the marked increase in statistical uncertainty that occurs as the number of dimensions under consideration increases (even three dimensions can be problematic; Ref. 3).
That is, as the number of dimensions increases: one, data become increasingly sparsely distributed; two, definitions of density and of distance between points become increasingly meaningless; and three, fitting a mathematical model to the data set becomes infeasible because the number of possible parameter combinations to be considered rises dramatically once the number of dimensions exceeds three or four. These problems compromise high-dimensional clustering algorithms that rely on estimating density and/or distance, or on fitting mathematical models. Here, we show directly how the curse of dimensionality leads to invalid conclusions with some commonly used clustering methods (Rphenograph (Ref. 4), X-shift (Ref. 5) and flowMeans (Ref. 6); Fig. 1).
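The loss of meaning in distance can be illustrated with a small simulation. The following is a minimal sketch on synthetic uniform data, not the analysis behind Fig. 1; the function names, sample sizes and seed are assumptions chosen for illustration. It measures the relative contrast, (d_max − d_min) / d_min, between the farthest and nearest neighbours of a query point, a quantity that shrinks as dimensions are added.

```python
import math
import random


def relative_contrast(n_points, n_dims, rng):
    """Return (d_max - d_min) / d_min for distances from one query point
    to all other points drawn uniformly from the unit hypercube."""
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = points[0]
    dists = [math.dist(query, p) for p in points[1:]]
    return (max(dists) - min(dists)) / min(dists)


def mean_relative_contrast(n_dims, n_points=200, n_trials=20, seed=0):
    """Average the relative contrast over several independent synthetic data sets."""
    rng = random.Random(seed)
    total = sum(relative_contrast(n_points, n_dims, rng) for _ in range(n_trials))
    return total / n_trials


# Hypothetical dimensionalities, chosen only to show the trend.
for d in (2, 4, 12, 40):
    print(f"{d:>2} dimensions: mean relative contrast = {mean_relative_contrast(d):.2f}")
```

When the contrast is small, the nearest and farthest neighbours of a point lie at nearly the same distance, so neighbourhood-based estimates of density carry little information.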
t-distributed stochastic neighbour embedding (t-SNE; Ref. 7) has recently been introduced into high-dimensional flow cytometry analyses as a preprocessing step intended to reduce data dimensionality before clustering. However, when t-SNE is applied to high-dimensional data with intrinsically high-dimensional structure (that is, when N-dimensional data cannot be closely approximated by some combination of n ≪ N dimensions), it becomes subject to the curse of dimensionality (Ref. 7). We used the Maximum Likelihood Estimation of Intrinsic Dimension (MLE) method proposed by Levina and Bickel (Ref. 8) to estimate the intrinsic dimensionality of a typical flow cytometry data set. MLE revealed four intrinsic dimensions for the 12-parameter flow cytometry sample (10 colours plus side and forward scatter) shown in Fig. 1d. However, even three dimensions can be problematic, and the severity of the curse of dimensionality increases sharply thereafter.
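For readers unfamiliar with the estimator, the sketch below implements the per-point formula of the Levina and Bickel method, m_k(x) = [(1/(k − 1)) Σ_{j<k} log(T_k(x)/T_j(x))]^(-1), where T_j(x) is the distance from x to its j-th nearest neighbour, and averages it over all points. It is an illustration under stated assumptions, not the code used for Fig. 1d: the function names, the choice k = 10, the seed and the synthetic data (a two-dimensional plane embedded in 12 dimensions, and a cloud that fills all 12 dimensions) are hypothetical.

```python
import math
import random


def mle_intrinsic_dimension(points, k=10):
    """Average Levina-Bickel estimate over all points, using k nearest neighbours.
    Assumes there are no duplicate points (zero distances would break the log ratio)."""
    estimates = []
    for i, x in enumerate(points):
        # Sorted distances from x to every other point.
        dists = sorted(math.dist(x, y) for j, y in enumerate(points) if j != i)
        t_k = dists[k - 1]
        # Sum of log ratios between the k-th neighbour distance and the closer ones.
        log_sum = sum(math.log(t_k / t_j) for t_j in dists[: k - 1])
        estimates.append((k - 1) / log_sum)
    return sum(estimates) / len(estimates)


def embedded_plane(n_points, n_dims, rng):
    """Synthetic data: a 2-D plane placed in n_dims dimensions by a random linear map."""
    basis = [[rng.gauss(0, 1) for _ in range(n_dims)] for _ in range(2)]
    points = []
    for _ in range(n_points):
        u, v = rng.random(), rng.random()
        points.append([u * a + v * b for a, b in zip(basis[0], basis[1])])
    return points


def uniform_cloud(n_points, n_dims, rng):
    """Synthetic data: points that fill all n_dims dimensions of the unit hypercube."""
    return [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]


rng = random.Random(0)
plane_estimate = mle_intrinsic_dimension(embedded_plane(400, 12, rng))
cloud_estimate = mle_intrinsic_dimension(uniform_cloud(400, 12, rng))
print(f"plane embedded in 12-D: {plane_estimate:.2f}")
print(f"cloud filling 12-D:     {cloud_estimate:.2f}")
```

In this sketch the embedded plane should yield an estimate close to two even though each point has 12 coordinates, whereas the cloud that fills the space yields a much larger value. The estimate depends on k and on sample size, so any single number is best read as approximate.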
In addition, t-SNE preserves neither distances nor density well. It preserves only nearest-neighbour relationships, and those only to some extent. This means that distance-based or density-based clustering algorithms are not usable with t-SNE maps (Fig. 1). Furthermore, these properties of t-SNE, together with the curse of dimensionality, are the primary causes of the lack of reproducibility illustrated in Fig. 1c,d.
The curse of dimensionality thus clearly militates against the use of high-dimensional simultaneous clustering methods for flow and mass cytometry data analysis. In contrast, automation (Ref. 9) of sequential analysis methods that have been used for years offers statistically robust clustering and readily usable tools for flow cytometry and other technologies.
There is a reply to this Correspondence by Saeys, Y., Van Gassen, S. and Lambrecht, B. Nat. Rev. Immunol. http://dx.doi.org/10.1038/nri.2017.151-c1 (2017)
References
1. Saeys, Y., Van Gassen, S. & Lambrecht, B. N. Computational flow cytometry: helping to make sense of high-dimensional immunology data. Nat. Rev. Immunol. 16, 449–462 (2016).
2. Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning (Springer-Verlag, 2009).
3. Scott, D. W. Multivariate Density Estimation: Theory, Practice and Visualization (Wiley, 1992).
4. Chen, H. et al. Cytofkit: a bioconductor package for an integrated mass cytometry data analysis pipeline. PLoS Comput. Biol. 12, e1005112 (2016).
5. Samusik, N. et al. Automated mapping of phenotype space with single-cell data. Nat. Methods 13, 493–496 (2016).
6. Broad Institute. Flow cytometry gating and clustering. GenePattern http://software.broadinstitute.org/cancer/software/genepattern/flow-cytometry-gating-and-clustering (2017).
7. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Machine Learn. Res. 9, 2579–2605 (2008).
8. Levina, E. & Bickel, P. in Advances in Neural Information Processing Systems 17 (NIPS 2004) (eds Saul, L. K., Weiss, Y. & Bottou, L.) (MIT Press, 2004).
9. Meehan, S. et al. AutoGate: automating analysis of flow cytometry data. Immunol. Res. 58, 218–223 (2014).
10. Orlova, D. et al. Earth Mover's Distance (EMD): a true metric for comparing biomarker expression levels in cell populations. PLoS ONE 11, e0151859 (2016).
Cite this article
Orlova, D., Herzenberg, L. & Walther, G. Science not art: statistically sound methods for identifying subsets in multi-dimensional flow and mass cytometry data sets. Nat Rev Immunol 18, 77 (2018). https://doi.org/10.1038/nri.2017.150
DOI: https://doi.org/10.1038/nri.2017.150