Estimating the intrinsic dimension of datasets by a minimal neighborhood information

Analyzing large volumes of high-dimensional data is a problem of fundamental importance in data science, molecular simulations and beyond. Several approaches work on the assumption that the important content of a dataset belongs to a manifold whose Intrinsic Dimension (ID) is much lower than the raw number of coordinates. Such a manifold is generally twisted and curved; moreover, the points on it are typically non-uniformly distributed: two factors that make identifying the ID, and exploiting it, genuinely difficult. Here we propose a new ID estimator that uses only the distances to the first and second nearest neighbor of each point in the sample. This extreme minimality allows us to reduce the effects of curvature and density variation, along with the computational cost. The ID estimator is theoretically exact on uniformly distributed datasets and provides consistent measures in general. When used in combination with block analysis, it allows discriminating the relevant dimensions as a function of the block size. This makes it possible to estimate the ID even when the data lie on a manifold perturbed by high-dimensional noise, a situation often encountered in real-world datasets. We demonstrate the usefulness of the approach on molecular simulations and image analysis.


Supplementary information 1: Distribution of shell volumes for a homogeneous Poisson process
Let $\Phi$ be a homogeneous Poisson process in $\mathbb{R}^2$ with intensity $\lambda$ (see [1] for more information about Poisson processes). In particular, $\Phi$ satisfies the following properties: i) for any disjoint Borel sets $A_1$ and $A_2$, the random variables $N(A_1)$ and $N(A_2)$ describing the number of points falling in $A_1$ and $A_2$ respectively are independent; ii) the number of points $N(A)$ falling in a Borel set $A$ is distributed as a Poisson variable with parameter $\lambda\mu(A)$, where $\mu(A)$ is the measure of $A$:

$$P(A \text{ contains exactly } n \text{ points}) \doteq P(n, A) = \frac{(\lambda\mu(A))^n}{n!}\, e^{-\lambda\mu(A)}.$$

The intensity $\lambda$ corresponds to the average density of points: $\mathbb{E}[N(A)] = \lambda\mu(A)$. Moreover, the second property implies that an infinitesimally small area $dA$ contains no multiple points. From the definition of a Poisson process it also follows that the probability of having no points in a Borel set $A$ (the void probability) is given by

$$P(0, A) = e^{-\lambda\mu(A)}.$$

Given a point $o$ in $\Phi$, let $d_1, d_2, \dots, d_n$ be the ordered distances from $o$ of its first $n$ neighbours. If we define $\Delta v_1$ as the volume of the ball $B_{o,d_1}$, $\Delta v_2$ as the volume of the annulus $C_{d_1,d_2}$, and so on, we see that the distances $d_1, d_2, \dots, d_n$ identify $n$ disjoint volumes $\Delta v_1, \Delta v_2, \dots, \Delta v_n$ that can be seen as the volumes 'occupied' by the neighbours. We want to find an expression for the joint probability distribution $g(\Delta v_1, \Delta v_2, \dots, \Delta v_n)$. To this purpose, we start from a slightly easier problem and look for the joint probability distribution of the distances $f(d_1, d_2, \dots, d_n)$.
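Before proceeding, the void probability can be verified by direct simulation. The following Python sketch (our illustration, not part of the original supplementary material; all parameter choices are arbitrary) samples a planar homogeneous Poisson process on a square window and compares the empirical probability that a ball $A$ contains no points with $e^{-\lambda\mu(A)}$.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 5.0       # intensity of the process (illustrative value)
side = 10.0     # side of the square observation window
radius = 0.3    # radius of the test ball A, centred in the window
n_trials = 20000

empty = 0
for _ in range(n_trials):
    # Sample the process: Poisson number of points, then uniform positions.
    n = rng.poisson(lam * side**2)
    pts = rng.uniform(0.0, side, size=(n, 2))
    # Check whether any point falls inside the ball A.
    in_ball = np.sum((pts - side / 2) ** 2, axis=1) < radius**2
    empty += not in_ball.any()

print("empirical void probability:", empty / n_trials)
print("e^{-lambda mu(A)}:         ", np.exp(-lam * np.pi * radius**2))
```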
The probability that the first distance $d_1$ falls in an infinitesimally small annulus $C_{r_1, r_1+dr_1}$ is given by the probability of having no points in the ball $B_{o,r_1}$ and at least one point in the annulus $C_{r_1, r_1+dr_1}$:

$$P(d_1 \in C_{r_1,r_1+dr_1}) = P\big(N(B_{o,r_1})=0,\ N(C_{r_1,r_1+dr_1})\geq 1\big) = P\big(N(B_{o,r_1})=0\big)\,P\big(N(C_{r_1,r_1+dr_1})\geq 1\big) = e^{-\lambda\pi r_1^2}\left(1 - e^{-\lambda\mu(C_{r_1,r_1+dr_1})}\right).$$

Here the second equality is due to the independence property, while the last one comes from the formula for the void probability. Since $dr_1$ is very small, $\mu(C_{r_1,r_1+dr_1}) \approx 2\pi r_1\, dr_1$, and we conclude that

$$f(r_1) = 2\pi\lambda\, r_1\, e^{-\lambda\pi r_1^2}.$$

The second step is to compute the probability that the second nearest neighbour is found at a distance $r_2$ from $o$ given that the first one is found at a distance $r_1$. By the same argument, using the void probability of the annulus $C_{r_1,r_2}$,

$$f(r_2 \mid r_1) = 2\pi\lambda\, r_2\, e^{-\lambda\pi (r_2^2 - r_1^2)}, \qquad r_2 > r_1.$$

Iterating, the joint distribution of the first $n$ distances is

$$f(r_1, r_2, \dots, r_n) = (2\pi\lambda)^n\, r_1 r_2 \cdots r_n\, e^{-\lambda\pi r_n^2}.$$

Changing variables to the shell volumes $\Delta v_1 = \pi r_1^2$ and $\Delta v_i = \pi(r_i^2 - r_{i-1}^2)$ for $i > 1$ (the Jacobian of the transformation is $(2\pi)^n r_1 \cdots r_n$), and noting that $\pi r_n^2 = \sum_{i=1}^n \Delta v_i$, we obtain

$$g(\Delta v_1, \Delta v_2, \dots, \Delta v_n) = \lambda^n\, e^{-\lambda \sum_{i=1}^n \Delta v_i},$$

that is, the shell volumes are independent and identically distributed exponential variables with rate $\lambda$.
This argument can be easily generalized to $\mathbb{R}^N$: repeating the computation with the volume of the $N$-dimensional ball in place of $\pi r^2$ yields the same exponential law for the shell volumes.
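The exponential law can also be checked numerically. The sketch below (ours, with arbitrary parameter choices) samples a planar Poisson process and verifies that the first two shell volumes have the mean $1/\lambda$ predicted by an Exponential($\lambda$) distribution; by Slivnyak's theorem, distances measured from a fixed origin have the same law as distances from a typical point of the process.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 3.0       # intensity of the process
side = 20.0     # window side, large enough that the border never matters near the centre
n_shells = 2    # check Delta v_1 and Delta v_2
shells = []

for _ in range(5000):
    # Poisson number of points, uniform positions on a window centred at the origin.
    n = rng.poisson(lam * side**2)
    pts = rng.uniform(-side / 2, side / 2, size=(n, 2))
    # Ordered distances d_1 <= d_2 <= ... of the neighbours from the origin o.
    d = np.sort(np.linalg.norm(pts, axis=1))[:n_shells]
    # Shell volumes: Delta v_1 = pi d_1^2, Delta v_i = pi (d_i^2 - d_{i-1}^2).
    shells.append(np.diff(np.pi * d**2, prepend=0.0))

shells = np.array(shells)
# For i.i.d. Exponential(lambda) shells, both columns should have mean 1/lambda.
print("empirical means:", shells.mean(axis=0), "  expected:", 1 / lam)
```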

A comparison between TWO-NN and DANCo
We compare our results with those obtained with DANCo [2] since, according to the analysis in [3], it appears to outperform the other estimators (a public version of the DANCo algorithm is available at https://it.mathworks.com/matlabcentral/fileexchange/40112-intrinsic-dimensionality-estimation-techniques/content/idEstimation/DANCoFit.m). In order to test DANCo on uniform hypercubes with periodic boundary conditions, we modified the computation of distances in the code. First, we analyzed the estimates of DANCo and TWO-NN on datasets with 2500 points and dimension ranging from 1 to 20. The selected datasets are hypercubes without periodic boundary conditions, hypercubes with periodic boundary conditions, Cauchy datasets and Gaussians. We embed the datasets in higher-dimensional spaces through the identity map to prevent the algorithms from using the number of columns as an upper bound. In the case of hypercubes without pbc (panel A), TWO-NN underestimates the ID (by about 1.5 in dimension 10 and 4 in dimension 20), due to the sharp drop in density at the border. This systematic error shrinks as the number of points is increased. A similar but milder effect is visible for Gaussian distributions (panel D): here the density changes rapidly but in a smoother fashion, and we observe an underestimation of around 0.1 in dimension 10 and 3 in dimension 20. Panel B shows that imposing periodic boundary conditions (and thus reproducing a more uniform environment) allows TWO-NN to estimate the ID almost exactly, with an underestimation of order 1 in dimension 20. On the Cauchy dataset (panel C), TWO-NN slightly overestimates the intrinsic dimension. As for DANCo, it slightly overestimates the dimension for the hypercubes and the Gaussians, while it strongly underestimates the ID on the Cauchy dataset (the estimate in dimension 20 is around 13). We believe the origin of this significant systematic error is that DANCo estimates the ID by comparing theoretical functions obtained on the dataset with those retrieved on uniform spheres: this strategy works well in the presence of sharp boundaries but is less suitable in the presence of heavy tails.
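The modification of the distance computation for pbc is not spelled out here; a standard choice is the minimum-image convention, sketched below in Python (our illustration; `pbc_distances` is a hypothetical helper, and the dataset is kept small to bound memory).

```python
import numpy as np

def pbc_distances(X):
    """Pairwise Euclidean distances on the unit hypercube [0, 1)^d with
    periodic boundary conditions (minimum-image convention)."""
    diff = X[:, None, :] - X[None, :, :]   # all pairwise coordinate differences
    diff -= np.rint(diff)                  # wrap each difference into [-0.5, 0.5)
    return np.linalg.norm(diff, axis=-1)

rng = np.random.default_rng(2)
X = rng.random((500, 10))       # uniform points in a 10-dimensional hypercube
D = pbc_distances(X)
np.fill_diagonal(D, np.inf)     # exclude self-distances
nn = np.sort(D, axis=1)[:, :2]  # first and second nearest-neighbour distances
mu = nn[:, 1] / nn[:, 0]        # the ratios used by TWO-NN
```

Wrapping each coordinate difference into $[-0.5, 0.5)$ removes the density drop at the border, which is precisely why pbc reproduce the uniform setting in which the estimator is exact.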

Discarding the points with the highest values of µ
In Section 3 we state that, in order to make the procedure more robust, we discard from the fit the 10% of points characterized by the highest values of µ.
Indeed, outliers in the dataset display high values of µ and can distort the linear fit in a meaningless way. Simply removing the last points from the set S used for the fit makes the procedure robust. The decision to exclude the last 10% of the points is arbitrary, but the estimate of the dimension is robust with respect to this threshold: in Figure 2 the estimated dimension is the same for a fraction of retained points ranging from 80% to 95%, while including all of the points causes instability and underestimation. Cauchy datasets are characterized by heavy tails, so the presence of outliers is important; for uniform hypercubes, by contrast, outliers are nearly absent and the ID estimate is not affected by excluding the points with the highest µ from the fit, as we can see in Figure 3.

…hypercubes embedded through the identity map in a higher-dimensional space; on the latter we test the method applying periodic boundary conditions (pbc) in order to simulate a uniform environment as closely as possible. As suggested in [3], we generated 20 instances of each dataset and averaged the results. The outcome of the tests is summarized in Figure 4. We omit the measure for dataset $M_{10d}$ since its ID is 70, and estimating the dimension of such a dataset is beyond the scope of TWO-NN (indeed, as expected, we observe a strong underestimation, of 41, in this case).

(Figure 4 caption: the datasets of Table 1 plus 7 additional datasets described in Table 2. For each dimension the ID estimate is the average over 20 instances of the dataset. The x-axis and y-axis show the true dimension of the dataset $d$ and the estimated dimension $\hat{d}$, respectively.)
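As a concrete illustration of the fitting procedure and of the discarding threshold discussed above, the sketch below implements the linearized TWO-NN fit under the Pareto law $F(\mu) = 1 - \mu^{-d}$ derived in the paper; the function name and parameter choices are ours.

```python
import numpy as np
from scipy.spatial import cKDTree

def twonn_id(mu, fraction=0.9):
    """Estimate the intrinsic dimension from the ratios mu_i = r2_i / r1_i,
    keeping only the given fraction of points with the smallest mu in the fit.

    Under the Pareto law F(mu) = 1 - mu^(-d), the quantity -log(1 - F)
    is linear in log(mu) with slope d, so d is obtained by a least-squares
    line through the origin."""
    mu = np.sort(mu)
    n = len(mu)
    k = int(fraction * n)            # number of points retained in the fit
    F = np.arange(1, n + 1) / n      # empirical cumulative distribution
    x = np.log(mu[:k])
    y = -np.log(1.0 - F[:k])
    return np.dot(x, y) / np.dot(x, x)

rng = np.random.default_rng(3)
X = rng.random((2500, 7))            # uniform 7-dimensional hypercube, no pbc
r = cKDTree(X).query(X, k=3)[0]      # r[:, 0] = 0 (the point itself)
print(twonn_id(r[:, 2] / r[:, 1]))   # close to 7, slightly low due to border effects
```

Note that with `fraction=1.0` the largest µ has $F = 1$ and $-\log(1-F)$ diverges; including the largest values also exposes the fit to outliers, consistent with the instability seen in Figure 2 when all points are retained.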