Automated approaches that cluster high-dimensional flow and mass cytometry data simultaneously in multiple dimensions, such as those discussed by Saeys et al. (Computational flow cytometry: helping to make sense of high-dimensional immunology data. Nat. Rev. Immunol. 16, 449–462; 2016)1, are coming into routine use in biomedical settings. However, the simultaneous clustering approach underlying these methods is fundamentally flawed because of what statisticians call the 'curse of dimensionality' (Ref. 2), which is well known to compromise both the statistical validity and the computational performance of clustering methods that operate on multiple dimensions at once.

Although the curse of dimensionality is a well-known problem, its statistical component, which renders clustering outcomes invalid, has not been properly recognized in flow and mass cytometry. This crucial problem arises from the marked increase in statistical uncertainty that occurs as the number of dimensions under consideration increases (even three dimensions can be problematical3).

Specifically, as the number of dimensions increases: first, data become increasingly sparsely distributed; second, definitions of density and of distance between points become increasingly meaningless; and third, fitting a mathematical model to the data set becomes infeasible, because the number of combinations of possible parameters to be considered grows dramatically once the number of dimensions exceeds three or four. These problems compromise high-dimensional clustering algorithms that rely on estimation of density and/or distance, or on fitting of mathematical models. Here, we show directly how the curse of dimensionality leads to invalid conclusions from some commonly used clustering methods (Fig. 1; Rphenograph4, X-shift5 and flowMeans6).
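The second of these problems, the loss of meaningful distances, can be illustrated with a minimal Python sketch (not part of the original analysis; it assumes only numpy and uses uniformly distributed synthetic points). As the number of dimensions grows, the relative gap between the nearest and farthest neighbour of a query point shrinks, which is the 'distance concentration' that undermines distance- and density-based clustering.

```python
# Sketch: distance concentration with increasing dimensionality.
# For uniformly distributed points, the relative contrast
# (d_max - d_min) / d_min between the farthest and nearest neighbour
# of a query point shrinks as the number of dimensions grows,
# undermining distance- and density-based clustering.
import numpy as np

rng = np.random.default_rng(0)
n_points = 2000

for dim in (2, 3, 10, 20, 50):
    data = rng.uniform(size=(n_points, dim))
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(data - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"{dim:3d} dimensions: relative contrast = {contrast:.2f}")
```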

Figure 1: Commonly used high-dimensional clustering methods yield irreproducible results and may report populations that do not exist.

a | We simulated a mixture of two 20-dimensional (20D) Gaussian distributions with unit variance in each dimension and the following means: M1, [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]; M2, [2.3 2.3 2.3 2 2 2 1 1 2.3 2.3 2 3 5 1 3 2.3 2 4 5 2]. One distribution (Subset 1) consists of 10,000 events; the other (Subset 2) consists of 5,000 events. The t-distributed stochastic neighbour embedding (t-SNE) map for this data set is shown in the figure. b | Algorithms that work directly on the high-dimensional data (Rphenograph4 and X-shift5) and algorithms that are applied to the t-SNE embedded map (ClusterX4 and DensVM4) were run on the 20D data from part a. The output was colour-coded and presented in t-SNE parameter space (Rphenograph, ClusterX and DensVM) or in a force-directed layout (X-shift only). The results of these clustering methods show no connection to the actual population structure in the data. c | We repeated the simulation shown in part a five times. The table shows the number of clusters reported by each of the tested clustering algorithms for each simulation. d | The table shows the number of clusters identified by the five distinct clustering algorithms when applied to the two halves of the same sample (even and odd rank numbers of a single flow cytometry run) and to technical replicates (separate flow cytometry runs) of the same sample. We used a previously published 10-colour data set (see Figure 6b in Ref. 10). Data were compensated, Logicle transformed and pre-gated for live singlets using AutoGate (www.cytoGenie.org). We used the default input parameters provided by each clustering algorithm but omitted the data transformation step, as the data were already Logicle transformed. Clustering results are available at https://drive.google.com/open?id=0B1SkmBF14Q2lOVhuclhDWldOVEU and https://www.dropbox.com/sh/4xbl0k5fb5qpk5s/AAAVEefS3rTUbPu9uJqDpc9Ba?dl=0

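For readers who wish to regenerate the synthetic mixture described in the legend to part a, the following Python sketch (using numpy; the random seed and variable names are arbitrary choices, and the t-SNE and clustering steps are omitted) draws the two 20D Gaussian subsets with the stated means, unit variance and event counts.

```python
# Sketch: generate the two-subset 20D Gaussian mixture described in Fig. 1a.
# Subset 1: 10,000 events centred at the origin; Subset 2: 5,000 events
# centred at M2. Both have unit variance in every dimension.
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, not from the original study

m1 = np.zeros(20)
m2 = np.array([2.3, 2.3, 2.3, 2, 2, 2, 1, 1, 2.3, 2.3,
               2, 3, 5, 1, 3, 2.3, 2, 4, 5, 2])

subset1 = rng.normal(loc=m1, scale=1.0, size=(10_000, 20))
subset2 = rng.normal(loc=m2, scale=1.0, size=(5_000, 20))

data = np.vstack([subset1, subset2])
labels = np.array([0] * len(subset1) + [1] * len(subset2))  # ground truth
# The `data` matrix can then be embedded with t-SNE and passed to the
# clustering tools evaluated in the figure.
```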

t-distributed stochastic neighbour embedding (t-SNE)7 has recently been introduced into high-dimensional flow cytometry analyses as a preprocessing step intended to reduce data dimensionality before clustering. However, when t-SNE is applied to data with intrinsically high-dimensional structure (that is, when N-dimensional data cannot be closely approximated by some combination of n << N dimensions), it too becomes subject to the curse of dimensionality7. We used the maximum likelihood estimation of intrinsic dimension (MLE) method proposed by Levina and Bickel8 to estimate the intrinsic dimensionality of a typical flow cytometry data set. MLE revealed four intrinsic dimensions for the 12-parameter flow cytometry sample (10 colours plus forward and side scatter) shown in Fig. 1d. However, even three dimensions can be problematical, and the severity of the curse of dimensionality increases sharply thereafter.
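For context, the Levina–Bickel estimator forms, for each data point, a local dimension estimate from the log ratios of distances to its k nearest neighbours and then averages these local estimates. The following Python sketch (assuming scikit-learn's NearestNeighbors and an arbitrary choice of k; it is an illustration of the idea, not the exact implementation used for Fig. 1d) shows the computation.

```python
# Sketch of the Levina-Bickel maximum-likelihood estimator of intrinsic
# dimension. For each point, the distances T_1 <= ... <= T_k to its k
# nearest neighbours give a local estimate
#   m_k(x) = [ (1/(k-1)) * sum_{j<k} log(T_k(x) / T_j(x)) ]^(-1),
# and the global estimate is the average of the local estimates.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def intrinsic_dimension_mle(data, k=10):
    # k + 1 neighbours because each point is its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(data)
    dist, _ = nn.kneighbors(data)
    dist = dist[:, 1:]                               # drop the zero self-distance
    log_ratios = np.log(dist[:, -1:] / dist[:, :-1]) # log(T_k / T_j), j = 1..k-1
    local_dims = (k - 1) / log_ratios.sum(axis=1)
    return local_dims.mean()

# Toy check: 12 measured parameters generated from 4 latent dimensions
# should yield an estimate close to 4.
rng = np.random.default_rng(0)
latent = rng.normal(size=(5000, 4))
measured = latent @ rng.normal(size=(4, 12))
print(intrinsic_dimension_mle(measured, k=10))
```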

t-SNE also does not preserve distances or densities well; it preserves only nearest-neighbour relationships, and only to some extent. This means that distance- or density-based clustering algorithms are not usable on t-SNE maps (Fig. 1). Furthermore, these properties of t-SNE, together with the curse of dimensionality, are the primary causes of the lack of reproducibility illustrated in Fig. 1c,d.
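One simple way to see this (a hedged sketch using scikit-learn and scipy with default t-SNE settings, on stand-in synthetic data rather than the cytometry samples analysed above) is to compare pairwise distances before and after embedding: a low rank correlation between the original and embedded distances indicates that neither distances nor densities carry over reliably to the map.

```python
# Sketch: quantify how poorly a t-SNE map preserves pairwise distances.
# A low Spearman rank correlation between distances in the original space
# and in the 2D embedding indicates that distance- and density-based
# reasoning on the t-SNE map is unreliable.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 20))  # stand-in for 20D cytometry data

embedding = TSNE(n_components=2, random_state=0).fit_transform(data)

rho, _ = spearmanr(pdist(data), pdist(embedding))
print(f"Spearman correlation of pairwise distances: {rho:.2f}")
```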

The curse of dimensionality thus clearly militates against the use of simultaneous high-dimensional clustering methods for flow and mass cytometry data analysis. In contrast, automation9 of the sequential analysis methods that have been used for years offers statistically robust clustering and readily usable tools for flow cytometry and other technologies.

There is a reply to this Correspondence by Saeys, Y., Van Gassen, S. & Lambrecht, B. Nat. Rev. Immunol. http://dx.doi.org/10.1038/nri.2017.151-c1 (2017).