Classification at the accuracy limit: facing the problem of data ambiguity

Data classification, the process of analyzing data and organizing it into categories or clusters, is a fundamental computing task of natural and artificial information processing systems. Both supervised classification and unsupervised clustering work best when the input vectors are distributed over the data space in a highly non-uniform way. These tasks become challenging, however, in weakly structured data sets, where a significant fraction of data points is located in between the regions of high point density. We derive the theoretical limit for classification accuracy that arises from this overlap of data categories. By using a surrogate data generation model with adjustable statistical properties, we show that sufficiently powerful classifiers based on completely different principles, such as perceptrons and Bayesian models, all perform at this universal accuracy limit under ideal training conditions. Remarkably, the accuracy limit is not affected by certain non-linear transformations of the data, even if these transformations are non-reversible and drastically reduce the information content of the input data. We further compare the data embeddings that emerge from supervised and unsupervised training, using the MNIST data set and human EEG recordings during sleep. We find for MNIST that categories are significantly separated not only after supervised training with back-propagation, but also after unsupervised dimensionality reduction. A qualitatively similar cluster enhancement by unsupervised compression is observed for the EEG sleep data, but with a very small overall degree of cluster separation. We conclude that the handwritten digits in MNIST can be considered as 'natural kinds', whereas EEG sleep recordings are a relatively weakly structured data set, so that unsupervised clustering will not necessarily recover the human-defined sleep stages.


Scientific Reports | (2022) 12:22121 | https://doi.org/10.1038/s41598-022-26498-z

[...] for a sitting posture than by the object's superficial appearance. Similarly, a professional biologist may assign two cells to the same cell type, even though their microscopic images appear rather dissimilar for somebody without training in microbiology. Therefore, because object classes are defined in a user-dependent way, they need not necessarily correspond with the natural clustering structure of the object distribution in data space (Fig. 1b).
On the other hand, the locations of the objects in data space may not even have a cluster-like structure in the first place, but may be distributed rather homogeneously (Fig. 1d). Such a lack of internal structure in the input data does not, however, prevent an agent from defining useful classes. For example, human-defined color names do not reflect characteristic peaks and troughs in the wavelength spectrum of natural light, but are determined by the physiology of the visual system and by language evolution 2. For a naive observer, it would be impossible to infer the discrete color classes from the objective environmental light spectrum (unsupervised clustering) without knowledge of human physiology and language, but a member of the same language community can of course be trained to use the color names in a proper way (supervised classification). Similarly, a particular organism may treat a certain interval of temperatures as 'attractive' and all others as 'repulsive', even if the temperature distribution in the environment is broad and featureless. Here, too, the two classes reflect the properties and needs of the agent, not the objective structure of the raw data distribution.
As a further complication, different individual agents of the same species (or even the same agent at subsequent times) may place a given fixed object into different classes. For example, animals like the 'Fisher cat' and the 'Fossa mongoose' may be assigned either as cats or dogs by untrained human observers. This type of ambiguity happens frequently in weakly structured data spaces, which may have overlapping regions of slightly larger or smaller point density, but no clearly defined point clusters (Fig. 1c). To classify such spaces, the agent can either draw the boundaries of the classes arbitrarily (leading potentially to inconsistent class assignments between individual agents), or it can define the classes as continuous probability distributions over the data space 3,4 . In the latter case, the class-specific distributions will more or less overlap with each other.
In this paper, we consider the challenges of unsupervised clustering (the assignment of distinct classes to unlabeled data points) and of supervised classification (the learning of a classification model from already labeled data points) in weakly structured data spaces. We arrived at this research question by analyzing electroencephalographic (EEG) signals, recorded from humans during different sleep stages 5-7. We found that each of the five sleep stages (Wake, REM, N1, N2, N3) is associated with characteristic probability distributions of data features, and that these slight differences can be exploited by an automatic classifier to discriminate between the stages. However, the feature distributions belonging to different sleep stages strongly overlap with each other, so that classification accuracies above 70 percent can never be reached, even after extensive supervised training with large data sets. We confirm these older findings in the present work, using a modified pre-processing of the data and different features for discrimination. These results suggest that sleep EEG recordings are a case of weakly structured data in the sense defined above, and they raise the question whether the five human-defined sleep stages are supported well enough by a natural clustering in the raw data, so that these classes could also be discovered by unsupervised clustering without any prior knowledge.

In order to explore the consequences of weakly structured data spaces, we use artificially generated surrogate data (compare e.g. 8), as well as real-world bio-medical data. For the first time (to our knowledge) we derive analytically how the overlap of data classes leads to a theoretical upper limit of classification accuracy. We then compute this limit numerically for a two-dimensional example (Fig. 2) and demonstrate that it depends in a systematic way on the statistical properties of the data set.
We find that sufficiently powerful classifier models of different kinds all perform at this same upper limit of accuracy, even if they are based on completely different operating principles. Interestingly, this accuracy limit is not affected by applying certain non-linear transformations to the data, even if these transformations are non-reversible and drastically reduce the information content (entropy) of the input data.
In a next step, the same three models that reached the common classification limit for artificial data are now applied to human EEG data measured during sleep. In a pre-processing step, two kinds of features are extracted from raw EEG signals, yielding different marginal distributions and mutual correlations. It turns out that a more complex Bayesian model, based on correlated multi-variate Gaussian likelihoods (CMVG), performs worse than two other models (naive Bayes, perceptron), because the statistical properties of the pre-processed features do not match those of the likelihoods. In contrast, the perceptron and the naive Bayes model still show very similar classification accuracies, indicating that both reach the theoretical accuracy limit for sleep stage classification.

Figure 2. Illustration of the accuracy limit in classification tasks. A data source is generating data points from two possible classes $i = 0$ and $i = 1$, which occur with probabilities $w_0 = 0.3$ and $w_1 = 0.7$ and are distributed according to $p_{gen}(x|0)$ and $p_{gen}(x|1)$, respectively (Panels a, e). A classifier trained on these data can end up learning the class distributions perfectly, so that $p_{lea}(x|i) = p_{gen}(x|i)$ (Panel b). It can also end up with an imperfect approximation of the distributions, so that $p_{lea}(x|i) \approx p_{gen}(x|i)$ (Panel f). The learned distributions $p_{lea}(x|i)$ determine the probabilities $q_{cla}(j|x) \in [0, 1]$ that a given input data point $x$ will be assigned to class $j$ (Panels c, g; see Methods for details). In a classifier with 'one-hot' output, the continuous classification probabilities are binarized to $q_{cla}(j|x) \in \{0, 1\}$, so that the input space is partitioned into disjoint, class-specific regions (Panels d, h). The classification probabilities $q_{cla}(j|x)$ together with the true data distributions $p_{gen}(x|i)$ determine the (normalized) confusion matrix $C_{ji} = \int q_{cla}(j|x)\, p_{gen}(x|i)\, dx$ of the classifier, from which the accuracy can be computed as $A = \sum_{i=1}^{K} w_i C_{ii}$ (see matrices and numerical examples below the figure panels). For the case shown in the figure, the accuracy of the imperfect classifier, $A = 0.785$, is smaller than that of the perfect classifier with $A = 0.859$, but even the latter is smaller than one. It represents the accuracy limit of this classification problem, which is determined by the natural overlap of data classes.

Finally, we address the question whether typical human-defined object categories can also be considered as 'natural kinds', that is, whether the data vectors in input space have a built-in cluster structure that can be detected by objective machine-learning models even in non-labeled data.
For this purpose, we use as real-world examples the MNIST data set 9 , as well as the above EEG sleep data. We find that a simple visualization by multidimensional scaling (MDS) 10-13 already reveals an inherent cluster structure of the data in both cases. Interestingly, the degree of clustering, quantified by the general discrimination value (GDV) 14,15 , can be enhanced by a step-wise dimensionality reduction of the data, using an autoencoder that is trained in an unsupervised manner. A perceptron classifier with a layer design comparable to the autoencoder, trained on the same data in a supervised fashion, achieves as expected a much stronger cluster separation. However, the enhancement of clustering by unsupervised data compression, combined with automatic labeling methods, could be a promising way to automatically detect 'natural kinds' in non-labeled data.

Methods
Part 1: accuracy limit. Derivation of theoretical accuracy limit. We assume a data source that provides a large set of vectors $\vec{x}$, each consisting of $N$ real-valued components $x_n$ with $n \in \{1, \dots, N\}$. The data vectors $\vec{x}$ are assumed to fall into $K$ possible classes $i = 1 \dots K$ that occur with fixed probabilities $w_i$, and each class $i$ is characterized by a specific probability density distribution $p_{gen}(\vec{x}|i)$, here called the 'generation density' (Fig. 2a,e).
Classification is the general problem of assigning the 'correct' class label $i = 1 \dots K$ to each given input data vector $\vec{x}$. For this purpose, the classifier has to learn approximations $p_{lea}(\vec{x}|i) \approx p_{gen}(\vec{x}|i)$ of the generation densities, which can in principle (for a large enough training data set) become arbitrarily precise (Fig. 2b,f).
However, even after a perfect learning of the generation densities, when $p_{lea}(\vec{x}|i) = p_{gen}(\vec{x}|i)$, an unambiguous discrimination between classes may not be possible for all of the input data vectors, because the $K$ distributions $p_{gen}(\vec{x}|i)$ can have a significant mutual overlap. In the following we derive the maximal possible accuracy $A_{max}$ that can be achieved by a perfect classifier in the case of non-zero class overlap.
From the learned distributions $p_{lea}(\vec{x}|j)$ we can compute the probabilities that a given input data vector $\vec{x}$ will be assigned to class $j$ by

$$q_{cla}(j|\vec{x}) = \frac{p_{lea}(\vec{x}|j)}{\sum_{k=1}^{K} p_{lea}(\vec{x}|k)} \tag{1}$$

These 'classification probabilities' are continuous numbers, properly normalized to $q_{cla}(j|\vec{x}) \in [0, 1]$ (Fig. 2c,g).
Note that Bayesian classifiers can also learn an estimate of the class occurrence probabilities, $w_{lea,i} \approx w_i$, which can then be used as priors in the computation of the classification probabilities:

$$q_{cla}(j|\vec{x}) = \frac{w_{lea,j}\, p_{lea}(\vec{x}|j)}{\sum_{k=1}^{K} w_{lea,k}\, p_{lea}(\vec{x}|k)} \tag{2}$$

In classifiers with 'one-hot' output, the classification probabilities can be converted to binary class labels $q_{cla}(j|\vec{x}) \in \{0, 1\}$ by assigning to each data vector the label which produces the largest classification probability:

$$q_{cla}(j|\vec{x}) \to \delta_{jk} \quad \text{with} \quad k = \operatorname{argmax}_{m}\, q_{cla}(m|\vec{x}) \tag{3}$$

where $\delta_{jk}$ is the Kronecker delta with $\delta_{jk} = 1$ if $j = k$ and $\delta_{jk} = 0$ if $j \neq k$. We call the binary quantity $q_{cla}(j|\vec{x})$ the 'class indicator function' (Fig. 2d,h). It has the value one for all data points $\vec{x}$ assigned to class $j$, and the value zero for all other data points (see also Fig. 3c,d for two-dimensional examples).
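The step from learned densities to classification probabilities and then to a one-hot class indicator can be illustrated with a few lines of NumPy. This is only a minimal sketch: the function names and the array layout (classes along the first axis, data points along the second) are illustrative choices, not part of the original implementation.

```python
import numpy as np

def classification_probabilities(p_lea, w_lea=None):
    """Continuous classification probabilities q_cla(j|x).

    p_lea: array of shape (K, M) holding the learned densities p_lea(x_m | j)
           for K classes at M data points (illustrative layout).
    w_lea: optional length-K array of learned class priors.
    """
    p = np.asarray(p_lea, dtype=float)
    if w_lea is not None:
        # Weight each class density by its prior before normalising
        p = np.asarray(w_lea, dtype=float)[:, None] * p
    # Normalise over the class axis so that the K probabilities sum to one
    return p / p.sum(axis=0, keepdims=True)

def one_hot(q_cla):
    """Binarise: 1 for the class with the largest probability, 0 otherwise."""
    q = np.asarray(q_cla)
    winner = np.argmax(q, axis=0)
    return (np.arange(q.shape[0])[:, None] == winner[None, :]).astype(float)
```

Ties are resolved in favour of the lower class index here, which is an arbitrary convention of `np.argmax`.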
Next we define a 'confusion density' as the product

$$C_{ji}(\vec{x}) = q_{cla}(j|\vec{x})\, p_{gen}(\vec{x}|i) \tag{4}$$

It can be interpreted as the probability density that the data source is producing a vector $\vec{x}$ under class $i$, which is then assigned to class $j$ by the classifier. Because there is usually a non-zero chance that any vector $\vec{x}$ can occur under any class $i$, we expect that the non-diagonal elements $C_{j \neq i}(\vec{x})$ are larger than zero as well. These non-diagonal confusion densities will have their largest values in regions of data space where the classes $i$ and $j$ overlap (Fig. 3f).
By integrating the confusion density over all possible data vectors $\vec{x}$, we obtain the normalized 'confusion matrix' of the classifier,

$$C_{ji} = \int C_{ji}(\vec{x})\, d\vec{x} = \int q_{cla}(j|\vec{x})\, p_{gen}(\vec{x}|i)\, d\vec{x} \tag{5}$$

the probability that a data point originating from class $i$ is assigned to class $j$. Now we can compute the accuracy $A$ of the classifier as an average over the diagonal elements of the confusion matrix, weighted with the probability of occurrence $w_i$ of each class:

$$A = \sum_{i=1}^{K} w_i\, C_{ii} \tag{6}$$

The theoretical limit of the classification accuracy, denoted by $A_{max}$, is given by the extreme case when the learned distributions coincide with the generation densities, $p_{lea}(\vec{x}|i) = p_{gen}(\vec{x}|i)$.

Figure 3. [...] $\vec{x}$. The first row shows the probability densities $p_{gen}(\vec{x}|i)$ that data point $\vec{x}$ is generated under class $i = 0$ (a) or class $i = 1$ (b). The second row shows the binary probability distributions $q_{cla}(j|\vec{x}) \in \{0, 1\}$ that data point $\vec{x}$ is assigned to class $j = 0$ (c) or class $j = 1$ (d), assuming a perfect classifier. The third row shows the 'confusion densities' $q_{cla}(j|\vec{x})\, p_{gen}(\vec{x}|i)$ that data point $\vec{x}$ is generated under class $i = 0$ and assigned to class $j = 0$ (e) or class $j = 1$ (f). Integrating this density over $\vec{x}$ yields the confusion matrix $C_{ji}$, from which the classification accuracy $A$ can be computed. Panel (g) in the fourth row shows the maximum possible classification accuracy (black curve) when the distance $d$ between the centers of the two data classes (a) and (b) in feature space is increased from 0 to 5. All classifier models considered in this paper (colored curves), except the Naive Bayes model (orange) at small distances, reach the theoretical accuracy limit.

In Fig. 3, the above quantities have been numerically evaluated for a simple Gaussian test data set. For this purpose, the two-dimensional integral in Eq. (5) has been evaluated numerically on a regular grid of linear spacing 0.01, ranging from -8 to +8 in each feature dimension.
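The grid-based evaluation of the confusion matrix and of the accuracy limit might look roughly as follows. The sketch assumes two spherical, unit-variance Gaussian classes with equal weights, centered at $(\pm d/2, 0)$, and a perfect classifier without priors; these settings, the function names and the coarser default spacing of 0.02 (chosen only to keep memory use low) are assumptions of this example rather than details taken from the text.

```python
import numpy as np

def gaussian_2d(x, y, cx, cy):
    """Density of a unit-variance, uncorrelated 2D Gaussian centred at (cx, cy)."""
    return np.exp(-0.5 * ((x - cx) ** 2 + (y - cy) ** 2)) / (2.0 * np.pi)

def accuracy_limit(d, w=(0.5, 0.5), lo=-8.0, hi=8.0, h=0.02):
    """Approximate A_max for two Gaussian classes whose centres are d apart.

    The integral over data space is replaced by a Riemann sum on a regular
    grid (the text uses spacing 0.01; h=0.02 is an assumption made here).
    """
    n = int(round((hi - lo) / h)) + 1
    axis = np.linspace(lo, hi, n)
    x, y = np.meshgrid(axis, axis, indexing="ij")
    # Generation densities p_gen(x|i), stacked along the class axis
    p = np.stack([gaussian_2d(x, y, -d / 2, 0.0),
                  gaussian_2d(x, y, +d / 2, 0.0)])
    # Perfect classifier: class indicator is 1 where class j has the largest density
    winner = np.argmax(p, axis=0)
    q = np.stack([winner == j for j in range(2)]).astype(float)
    # Confusion matrix C[j, i] = sum over grid of q(j|x) * p(x|i) * cell area
    C = np.einsum("jxy,ixy->ji", q, p) * h * h
    # Accuracy: weighted average over the diagonal of the confusion matrix
    A = sum(w[i] * C[i, i] for i in range(2))
    return A, C
```

For this symmetric special case the result can be cross-checked against the closed form $A_{max} = \Phi(d/2)$, with $\Phi$ the standard normal cumulative distribution function, since the optimal decision boundary is the line halfway between the two centers.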

Classifiers and input data. In the following subsections, we provide the implementation details for the different classifier models that are compared in this work. The input data for these models is given as lists of $D$-dimensional feature vectors $\vec{u} = (u_1, u_2, \dots, u_f, \dots, u_D)$, each belonging to one of $K$ possible classes $c$. In the case of artificially generated data, these lists contain 10000 feature vectors distributed equally over the data classes. They are split randomly into training (80%) and test (20%) data sets.
Perceptron model. The perceptron model is implemented using Keras/TensorFlow. It has one hidden layer, containing $N_{neu} = 100$ neurons with ReLU activation function. The output layer has $N_{out}$ neurons with softmax activation function, where $N_{out} = K$ corresponds to the number of data classes. The loss function is categorical cross-entropy. We optimize the perceptron on each training data set using the Adam optimizer over at least 10 epochs with a batch size of 128 and a validation split of 0.2. After training, the accuracy of the perceptron is evaluated with the independent test data set.
Naive Bayesian model. The naive Bayesian model is implemented using the Python libraries NumPy and SciPy.
In the training phase, the training data set is sorted according to the $K$ class labels $c$. Then an individual Gaussian kernel density estimate (KDE, with the bandwidth chosen by Scott's method) is computed for each feature $f$ and class label $c$, corresponding to the empirical marginalized probability densities $p_{f,c}(u_f)$.
In the testing phase, the accuracy of the model is evaluated with the independent test data set as follows: According to the naive Bayes approach, the global likelihood $L(\vec{u}|c)$ of a data vector $\vec{u} = (u_1, u_2, \dots, u_D)$ under class $c$ is approximated by a product of the marginalized probabilities, so that

$$L(\vec{u}|c) = \prod_{f=1}^{D} p_{f,c}(u_f)$$

Since we assume a flat prior probability ($P_{prior}(c) = 1/K$) over the data classes, the posterior probability of data class $c$, given the input data vector $\vec{u}$, is given by

$$P_{post}(c|\vec{u}) = \frac{L(\vec{u}|c)}{\sum_{c'=1}^{K} L(\vec{u}|c')}$$

Naive Bayesian model with Random Dimensionality Expansion (RDE). Since the naive Bayesian model takes into account only the marginal feature distributions $p_{f,c}(u_f)$, it cannot distinguish data classes which accidentally have identical $p_{f,c}(u_f)$ distributions, but differ in the correlations between the features. In principle, this problem can be fixed by multiplying the $D$-dimensional input vectors $\vec{u}$ by a random $D_2 \times D$ matrix $M$, for example with normally distributed entries $M_{ij} \sim N(\mu = 0, \sigma = 1)$, which yields transformed vectors $\vec{v} = M\vec{u}$. Provided that $D_2 \gg D$, at least some of the new feature linear combinations $v_f$ will have marginal distributions that vary between the data classes.
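A compact sketch of the naive Bayes classifier with per-feature kernel density estimates, together with the random dimensionality expansion, is given below. To keep the example dependency-free, the one-dimensional Gaussian KDE is written out in NumPy with Scott's bandwidth factor $n^{-1/5}$, as a stand-in for a library KDE routine; all function names are hypothetical, and the small constant added inside the logarithm is a numerical safeguard introduced here.

```python
import numpy as np

def kde_1d(samples):
    """Return a function evaluating a Gaussian KDE of 1D samples (Scott's rule)."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    bw = samples.std(ddof=1) * n ** (-1.0 / 5.0)  # Scott's rule in one dimension
    def density(u):
        z = (np.asarray(u, dtype=float)[:, None] - samples[None, :]) / bw
        return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bw * np.sqrt(2.0 * np.pi))
    return density

def fit_naive_bayes(U, c):
    """Training: one marginal KDE p_{f,c} per feature f and class label c."""
    U, c = np.asarray(U, dtype=float), np.asarray(c)
    classes = np.unique(c)
    kdes = [[kde_1d(U[c == k, f]) for f in range(U.shape[1])] for k in classes]
    return classes, kdes

def predict_naive_bayes(model, U):
    """Testing: product of marginals (as a log-sum), flat prior, normalised posterior."""
    classes, kdes = model
    U = np.asarray(U, dtype=float)
    log_l = np.stack(
        [sum(np.log(kde(U[:, f]) + 1e-300) for f, kde in enumerate(row)) for row in kdes],
        axis=1,
    )
    log_l -= log_l.max(axis=1, keepdims=True)   # stabilise before exponentiating
    post = np.exp(log_l)
    post /= post.sum(axis=1, keepdims=True)     # flat prior: posterior ~ likelihood
    return classes[np.argmax(post, axis=1)], post

def random_dimensionality_expansion(U, d2, rng):
    """RDE: map D-dimensional vectors u to v = M u with a random d2 x D matrix M."""
    M = rng.standard_normal((d2, np.asarray(U).shape[1]))
    return np.asarray(U, dtype=float) @ M.T
```

Working with log-likelihoods rather than raw products avoids numerical underflow when the number of features becomes large, which matters in particular after an expansion to $D_2 \gg D$ features.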
CMVG Bayesian model. The Correlated Multi-Variate Gaussian (CMVG) Bayesian model is also implemented using the Python libraries Numpy and Scipy.
In the training phase, the training data set is sorted according to the two class labels $c$. Then, for each class label $c$, we compute the mean values $\mu^{(c)}_f$ of the features $f$ and the covariances $\Sigma^{(c)}_{fg}$ between features $f$ and $g$. These quantities are packed as one vector $\vec{\mu}^{(c)}$ and one matrix $\Sigma^{(c)}$ for each class $c$.
In the testing phase, the global likelihood $L(\vec{u}|c)$ of a data vector $\vec{u} = (u_1, u_2, \dots, u_D)$ under class $c$ is computed as the correlated, multi-variate Gaussian probability density

$$L(\vec{u}|c) = \frac{\exp\left(-\frac{1}{2}\, (\vec{u} - \vec{\mu}^{(c)})^{T}\, (\Sigma^{(c)})^{-1}\, (\vec{u} - \vec{\mu}^{(c)})\right)}{\sqrt{(2\pi)^{D} \det \Sigma^{(c)}}}$$

Since we assume a flat prior probability ($P_{prior}(c) = 1/2$) for the two data classes, the posterior probability of data class $c$, given the input data vector $\vec{u}$, is given by

$$P_{post}(c|\vec{u}) = \frac{L(\vec{u}|c)}{\sum_{c'} L(\vec{u}|c')}$$

Part 2: The DSC data model. We consider an artificial classification problem with two multivariate Gaussian data classes $c \in \{0, 1\}$ and with statistical properties that can be tuned by three control quantities: the dimensionality $D$ of the feature space, the separation $S$ between the centers of the point clouds, and the correlation $C$ between features (within the same class), which is associated with the shape of the point cloud. The generation of artificial data within this DSC model works as follows: Starting from a given triple $D, S, C$ of control quantities, we first generate $N_{rep}$ independent parameter sets $[\mu^{(c)}_f, \Sigma^{(c)}_{fg}]$, where $\mu^{(c)}_f$ is the mean value of feature $f$ in class $c$ and $\Sigma^{(c)}_{fg}$ is the covariance of features $f$ and $g$ in class $c$. The mean values $\mu^{(0)}_f$ in class 0 are set to zero, whereas the mean values $\mu^{(1)}_f$ in class 1 are random numbers, drawn from a uniform distribution with values in the range from 0 to $S$. The separation quantity $S$ is therefore the maximum distance between corresponding feature mean values in each dimension $f$.
The diagonal elements $\Sigma^{(c)}_{ff}$ of the symmetric covariance matrix are set to 1 in both classes. The off-diagonal elements $\Sigma^{(c)}_{f \neq g}$ are assigned independent, continuous random numbers $x$, drawn from a box-shaped probability density distribution $q(x, C)$ that depends on the correlation quantity $C$ as follows:

$$q(x, C) = \begin{cases} \dfrac{1}{b - a} & \text{for } a \le x \le b \\ 0 & \text{otherwise} \end{cases} \qquad \text{with} \quad a = \max(0,\, C - 1), \quad b = \min(1,\, C)$$

For $C = 0$, the distribution $q(x, C)$ peaks at $x = 0$, so that $\Sigma_{ij}$ becomes a diagonal unit matrix. For $C = 1$, the distribution $q(x, C)$ is uniform in the range [0, 1], and for $C = 2$ it peaks at $x = 1$, leading to $\Sigma_{ij} = 1$. A plot of the distribution is shown in Fig. 4b.
According to the parameter set $[\mu^{(c)}_f, \Sigma^{(c)}_{fg}]$, we then generate for each of the two classes $c$ a number $N_{vec}/2$ of random, Gaussian data vectors $\vec{u}(t)$, in which the $D$ components (features) are correlated to a degree controlled by quantity $C$. In the limiting case $C = 0$, the $D$ time series $u_f(t)$ become statistically independent, whereas for $C = 2$, the time series become fully correlated and thus identical. The total number of $N_{vec}$ data vectors is combined to a complete data set, in which vectors from the two classes (with corresponding labels $c$) appear in random order. In this way, we obtain for each triple $D, S, C$ of control quantities a total number of $N_{rep}$ independent data sets, each consisting of $N_{vec}$ data vectors. Since each data set obeys its own random parameters $[\mu^{(c)}_f, \Sigma^{(c)}_{fg}]$, the DSC model reflects some of the heterogeneity of typical real world data. Finally, we split each data set into a training set (80%) and a test set (20%).
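The generation of one DSC data set can be sketched as follows. Several details are assumptions of this example and not confirmed by the text: class 0 is centered at the origin, the off-diagonal covariances are drawn uniformly from $[\max(0, C-1), \min(1, C)]$ (the ranges quoted for Fig. 4), and negative eigenvalues of a randomly drawn matrix are clipped to zero so that it can serve as a covariance matrix. The function names are hypothetical.

```python
import numpy as np

def sample_offdiag(C, size, rng):
    """Draw off-diagonal covariances from the box-shaped density q(x, C)."""
    lo, hi = max(0.0, C - 1.0), min(1.0, C)
    return rng.uniform(lo, hi, size=size)

def make_dsc_dataset(D, S, C, n_vec=10000, rng=None):
    """Return (U, labels): n_vec Gaussian vectors from two classes in random order."""
    rng = np.random.default_rng() if rng is None else rng
    # ASSUMPTION: class 0 has zero means; class 1 means are uniform in [0, S]
    mu = [np.zeros(D), rng.uniform(0.0, S, size=D)]
    vectors, labels = [], []
    for c in (0, 1):
        # Symmetric matrix with unit diagonal and random off-diagonal entries
        sigma = np.eye(D)
        iu = np.triu_indices(D, k=1)
        sigma[iu] = sample_offdiag(C, len(iu[0]), rng)
        sigma = sigma + sigma.T - np.eye(D)
        # ASSUMPTION: clip negative eigenvalues so that sigma is a valid covariance
        lam, V = np.linalg.eigh(sigma)
        lam = np.clip(lam, 0.0, None)
        # Colour white noise z so that the rows have covariance V diag(lam) V^T
        z = rng.standard_normal((n_vec // 2, D))
        vectors.append(mu[c] + (z * np.sqrt(lam)) @ V.T)
        labels.append(np.full(n_vec // 2, c))
    U, labels = np.vstack(vectors), np.concatenate(labels)
    order = rng.permutation(len(U))  # shuffle so that the classes appear in random order
    return U[order], labels[order]
```

A call such as `make_dsc_dataset(10, 1.0, 0.5)` would then produce one data set, and repeating the call $N_{rep}$ times would yield independent parameter sets, because the means and covariances are redrawn on every call.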
Before applying different types of classifiers to the DSC data sets, we test that the feature correlations and the class separation can be controlled reliably and over a sufficiently large range, using the quantities C and S (Fig. 4d).

Control of feature correlations by quantity C.
To evaluate correlation control, we fix the quantities $D = 10$ and $S = 1.0$ (note that the separation has no effect on the correlations) and vary $C$ over the complete range of supported values from 0 to 2. For each $C$, we generate $N_{rep} = 100$ independent data sets, each consisting of $N_{vec} = 10000$ data vectors. For each data set, we estimate the empirical covariance matrix $\Sigma^{(0)}_{ij}$ of class 0. Because the matrix is symmetric, we compute the root-mean-square (RMS) average of all matrix elements above the diagonal. The blue line in Fig. 4d shows for each $C$ the mean RMS, averaged over the $N_{rep} = 100$ repetitions (the latter are shown as gray dots). We find an almost linear relation between $C$ and the mean RMS. In particular, we can realize the full range of correlations, including the limiting cases of independently fluctuating features (for $C = 0$) and identically fluctuating features (for $C = 2$).
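The RMS measure of inter-feature correlation used here is simple to compute. The following sketch (with a hypothetical function name) takes the data vectors of one class as rows of an array and returns the RMS of the covariance matrix elements above the diagonal.

```python
import numpy as np

def rms_offdiag_covariance(U):
    """RMS of the above-diagonal elements of the empirical covariance matrix.

    U: array of shape (n_vectors, D), one data vector per row.
    """
    cov = np.cov(np.asarray(U, dtype=float), rowvar=False)
    upper = cov[np.triu_indices(cov.shape[0], k=1)]  # elements above the diagonal
    return float(np.sqrt(np.mean(upper ** 2)))
```

For unit-variance features the value lies close to 0 when the features fluctuate independently and close to 1 when they fluctuate identically.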

Control of class separation by quantity S.
To evaluate separation control, we fix the quantities $D = 10$ and $C = 0.5$ and vary $S$ between 0 and 10. For each $S$, we generate $N_{rep} = 100$ independent data sets, each consisting of $N_{vec} = 10000$ data vectors. For each (labeled) data set, we compute the general discrimination value (GDV), a quantity that has been specifically designed to quantify the separation between classes in high dimensional data sets 14,15. The orange line in Fig. 4d is the mean negative GDV, averaged over the $N_{rep} = 100$ repetitions (the latter are shown as gray dots).
Figure 4. [...] The inter-feature correlations in each data set are described by covariance matrices $\Sigma_{ij}$, in which the level of correlations is controlled by the quantity $C \in [0, 2]$. The off-diagonal elements $\Sigma_{i \neq j}$ are drawn from the $C$-dependent distribution $q(x, C)$ shown in the top. Setting, for example, $C = 0.5$, these elements range from 0 to 0.5 (lower left), and for $C = 1.5$ they range from 0.5 to 1 (lower right). (c) Visualization of the two data classes for a case with $D = 2$ dimensions. The classes correspond to point clouds in feature space, where quantity $S$ affects the distance between the means and quantity $C$ affects the shape of the clouds. (d) When averaged over 100 independent data sets, the class separability (measured by the negative General Discrimination Value GDV) depends in a monotonic way on the separation quantity $S$. Similarly, the degree of inter-feature correlations (measured by the RMS of the off-diagonal covariance matrix elements) is monotonically dependent on the correlation quantity $C$.

The GDV is computed as follows: We consider $N$ points $\vec{x}_{n=1..N} = (x_{n,1}, \dots, x_{n,D})$, distributed within $D$-dimensional space. A label $l_n$ assigns each point to one of $L$ distinct classes $C_{l=1..L}$. In order to become invariant against scaling and translation, each dimension is separately z-scored and, for later convenience, multiplied with $\frac{1}{2}$:

$$s_{n,d} = \frac{1}{2} \cdot \frac{x_{n,d} - \mu_d}{\sigma_d}$$

where $\mu_d$ and $\sigma_d$ denote the mean and the standard deviation of dimension $d$. Based on the re-scaled data points $\vec{s}_n = (s_{n,1}, \dots, s_{n,D})$, we calculate the mean intra-class distances for each class $C_l$

$$\bar{d}(C_l) = \frac{2}{N_l (N_l - 1)} \sum_{i=1}^{N_l - 1} \sum_{j=i+1}^{N_l} d\left(\vec{s}^{(l)}_i, \vec{s}^{(l)}_j\right)$$

and the mean inter-class distances for each pair of classes $C_l$ and $C_m$

$$\bar{d}(C_l, C_m) = \frac{1}{N_l N_m} \sum_{i=1}^{N_l} \sum_{j=1}^{N_m} d\left(\vec{s}^{(l)}_i, \vec{s}^{(m)}_j\right)$$

Here, $N_k$ is the number of points in class $k$, and $\vec{s}^{(k)}_i$ is the $i$-th point of class $k$. The quantity $d(\vec{a}, \vec{b})$ is the Euclidean distance between $\vec{a}$ and $\vec{b}$. Finally, the Generalized Discrimination Value (GDV) is calculated from the mean intra-class and inter-class distances as follows:

$$\mathrm{GDV} = \frac{1}{\sqrt{D}} \left[ \frac{1}{L} \sum_{l=1}^{L} \bar{d}(C_l) \;-\; \frac{2}{L (L - 1)} \sum_{l=1}^{L-1} \sum_{m=l+1}^{L} \bar{d}(C_l, C_m) \right]$$

Here, the factor $\frac{1}{\sqrt{D}}$ is introduced for dimensionality invariance of the GDV, with $D$ as the number of dimensions.
In the case of two Gaussian distributed point clusters, the resulting discrimination value becomes $-1.0$ if the clusters are located such that the mean inter-cluster distance is two times the standard deviation of the clusters.
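The GDV recipe (z-scoring with a factor of one half, mean intra-class and inter-class Euclidean distances, normalization by $\sqrt{D}$) translates directly into NumPy. The sketch below is an illustrative implementation under the assumptions that every dimension has non-zero variance and that each class contains at least two points; it builds full pairwise distance matrices and is therefore only suitable for moderately sized data sets.

```python
import numpy as np

def gdv(X, labels):
    """Discrimination value of labelled points X (shape: n_points x D)."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    # z-score every dimension separately and multiply by 1/2
    S = 0.5 * (X - X.mean(axis=0)) / X.std(axis=0)
    classes = np.unique(labels)
    L, D = len(classes), X.shape[1]
    groups = [S[labels == k] for k in classes]

    def pairwise(A, B):
        # Euclidean distances between all rows of A and all rows of B
        return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1))

    # Mean intra-class distance: average over distinct pairs within each class
    intra = np.mean([pairwise(g, g)[np.triu_indices(len(g), k=1)].mean() for g in groups])
    # Mean inter-class distance: average over all pairs of different classes
    inter = np.mean([pairwise(groups[l], groups[m]).mean()
                     for l in range(L) for m in range(l + 1, L)])
    return float((intra - inter) / np.sqrt(D))
```

Because of the z-scoring, rescaling or shifting the whole data set leaves the returned value unchanged, and well separated classes give clearly negative values while overlapping classes give values near zero.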
Part 3: Comparing classifiers. In Fig. 5, we determine the average accuracy of the three classifier types (see part 1 of the Methods section) for different combinations of the DSC control parameters. For each parameter combination, 100 data sets are sampled from the superstatistical distribution. For every data set, consisting of 8000 training vectors and 2000 test vectors, the three classifiers are trained from scratch and then evaluated. This results in 100 accuracies for each classifier and each parameter combination. We then compute the mean value of these 100 accuracies, and this is the average accuracy plotted as colored lines in Fig. 5c-f. The individual, non-averaged accuracies are plotted as gray points.
Part 4: feature transformations. In Fig. 6, we return to a much simpler test data set, consisting of two 'spherical' Gaussian data clusters in a two-dimensional feature space, which are centered at $\vec{x} = (-\frac{1}{2}, 0)$ and $\vec{x} = (+\frac{1}{2}, 0)$, respectively. All three classifier types reach the theoretical accuracy limit of about 0.69 in this case. In this part we explore how certain non-linear transformations of the original features (that is, replacing each feature $x$ by $f(x)$) affect the classification accuracy. In particular, we investigate the cases $f(x) = \sin(x)$, $f(x) = \cos(x)$ and $f(x) = \mathrm{sgn}(x)$. The signum function yields -1 for negative arguments and +1 for positive arguments. For the special case $x = 0$ it would return zero, but in practice this does not happen, as the features $x$ are continuously distributed random variables.
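For the signum transformation, the effect on this particular data set can be checked with a small Monte Carlo experiment. The sketch below assumes unit-variance clusters and equal class frequencies (consistent with the quoted limit of about 0.69) and uses the fact that, for this geometry, the optimal decision depends only on the sign of the first feature; the function name is hypothetical.

```python
import numpy as np

def sgn_transform_accuracies(n_per_class=100_000, seed=0):
    """Accuracy of the sign-based decision rule on raw and sgn-transformed features."""
    rng = np.random.default_rng(seed)
    # ASSUMPTION: spherical unit-variance clusters centred at (-1/2, 0) and (+1/2, 0)
    x0 = rng.standard_normal((n_per_class, 2)) + np.array([-0.5, 0.0])
    x1 = rng.standard_normal((n_per_class, 2)) + np.array([+0.5, 0.0])
    x = np.vstack([x0, x1])
    labels = np.concatenate([np.zeros(n_per_class, dtype=int),
                             np.ones(n_per_class, dtype=int)])
    # Optimal rule on the raw features: class 1 if the first feature is positive
    acc_raw = np.mean((x[:, 0] > 0).astype(int) == labels)
    # Same rule applied after the non-reversible transformation f(x) = sgn(x)
    x_sgn = np.sign(x)
    acc_sgn = np.mean((x_sgn[:, 0] > 0).astype(int) == labels)
    return float(acc_raw), float(acc_sgn)
```

Both accuracies coincide, because the signum function discards the magnitude of each feature but keeps exactly the information that the optimal decision rule needs for this data set.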
Part 5: sleep EEG data. For a real-world evaluation of classifier performance, we are using 68 multi-channel EEG data sets from our sleep laboratory, each corresponding to a full-night recording of brain signals from a different human subject. The data were recorded with a sampling rate of 256 Hz, using three separate channels F4-M1, C4-M1, O2-M1. In this work, however, the signals from these channels are pooled, effectively treating them as data sets of their own.
The participants of the study included 46 males and 22 females, with an age range between 21 and 80 years. Exclusion criteria were a positive history of misuse of sedatives, alcohol or addictive drugs, as well as untreated sleep disorders. The study was conducted in the Department of Otorhinolaryngology, Head Neck Surgery, of the Friedrich-Alexander University Erlangen-Nürnberg (FAU), following approval by the local Ethics Committee (323-16 Bc). Written informed consent was obtained from the participants before the cardiorespiratory polysomnography (PSG). All methods were performed in accordance with the relevant guidelines and regulations.
After recording, the raw EEG data were analyzed by a sleep specialist accredited by the German Sleep Society (DGSM), who removed typical artifacts 16 from the data and visually identified the sleep stages in subsequent 30-second epochs, according to the AASM criteria (Version 2.1, 2014) 17,18 . The resulting, labeled raw data were then used as a ground truth for testing the accuracy of the different classifier types.
In this work, we are primarily testing the ability of the classifiers to assign the correct sleep label $s$ (Wake, REM, N1, N2, N3) independently to each epoch, without providing further context information. Such a single-channel epoch consists of $30 \times 256 = 7680$ subsequent raw EEG amplitudes $x_{d,e}(t_n)$, where $d$ is the data set, $e$ the number of the epoch within the data set, and $t_n$ the $n$-th recording time within the epoch.
In order to facilitate classification of these 7680-dimensional input vectors x d,e by a simple Bayesian model, or by a flat two-layer perceptron with relatively few neurons, the vectors have to be suitably pre-processed and compressed down to feature vectors u d,e of much smaller dimensionality D ≪ 7680 . Instead of relying on self-organized (and thus 'black-box') features, we are using mathematically well-defined features with a simple interpretation. In particular, we are interested in the case where all D components u f of a feature vector u are fundamentally of the same kind and only differ by some tunable parameter.
Fourier features. Our first type of feature estimates the momentary Fourier component of the raw EEG signal $x_{d,e}(t_n)$ at a certain, tunable frequency $\nu_f$: [...] The set of frequencies $\nu_{f=1} \dots \nu_{f=D}$ is in our case chosen as an equidistant grid between 0 Hz and 30 Hz, because our EEG system is filtering out the higher-frequency components of the raw signals above about 30 Hz.
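One plausible way to compute such frequency-resolved features is sketched below. The exact definition used for the Fourier features is not reproduced here, so the normalization by the number of samples and the use of the magnitude of the complex Fourier sum are assumptions of this example, as is the function name. The sampling rate of 256 Hz is taken from the description of the recordings.

```python
import numpy as np

def fourier_features(x, freqs_hz, fs=256.0):
    """Magnitude of the Fourier component of signal x at each frequency in freqs_hz.

    ASSUMPTION: the feature is |sum_n x(t_n) exp(-2 pi i nu t_n)| / N; the
    original normalisation may differ.
    """
    x = np.asarray(x, dtype=float)
    t = np.arange(x.size) / fs  # recording times t_n within the epoch, in seconds
    phases = np.exp(-2j * np.pi * np.outer(np.asarray(freqs_hz, dtype=float), t))
    return np.abs(phases @ x) / x.size
```

With this normalization, a pure sine wave of unit amplitude yields a feature value of 0.5 at its own frequency and approximately zero at other frequencies that complete an integer number of cycles within the epoch.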
Correlation features. Our second type of feature is the normalized auto-correlation coefficient of the raw EEG signal $x_{d,e}(t_n)$ at a certain, tunable lag-time $t_f$: [...]

Part 6: sleep stage detection. In Fig. 8, we investigate the performance of the three classifier types described in part 1 in the real-world scenario of personalized sleep-stage detection. For this purpose, the classifiers are trained and tested individually on each of our 68 full-night sleep recordings, using as inputs the same 6-dimensional Fourier or correlation features as in Fig. 7 (note that the aggregated distribution functions and covariance matrices in Fig. 7 have been computed by pooling over all data sets and therefore show a much more regular behavior than the individual ones).
As a result, we obtain 68 accuracies for each combination of classifier type (Fig. 8, rows) and used input feature (Fig. 8, columns). The distributions of these accuracies are presented as histograms in the figure.
Part 7: natural data clustering. In Fig. 9, we address the question whether typical real-world data sets have a built-in clustering structure that can be detected (and possibly enhanced) by unsupervised methods of data analysis. For this purpose, we visualize the clustering structure. A frequently used method to generate low-dimensional embeddings of high-dimensional data is t-distributed stochastic neighbor embedding (t-SNE) 19. However, in t-SNE the resulting low-dimensional projections can depend strongly on the detailed parameter settings 20, be sensitive to noise, and may not preserve the global structure of the data distribution 21,22. In contrast, multi-dimensional scaling (MDS) 10-13 is an efficient embedding technique to visualize high-dimensional point clouds by projecting them onto a 2-dimensional plane. Furthermore, MDS has the decisive advantage that it is parameter-free and that all mutual distances of the points are approximately preserved, thereby conserving both the global and local structure of the underlying data. When interpreting patterns as points in high-dimensional space and dissimilarities between patterns as distances between corresponding points, MDS is an elegant method to visualize high-dimensional data. By color-coding each projected data point of a data set according to its label, the representation of the data can be visualized as a set of point clusters. For instance, MDS has already been applied to visualize word class distributions of different linguistic corpora 23, hidden layer representations (embeddings) of artificial neural networks 6,15,24, structure and dynamics of recurrent neural networks 25-28, or brain activity patterns assessed during e.g. pure tone or speech perception 14,23,29, or even during sleep 5,30. In all these cases the apparent compactness and mutual overlap of the point clusters permits a qualitative assessment of how well the different classes separate.
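To illustrate the idea of a distance-preserving projection, the following sketch implements classical (Torgerson) MDS in NumPy. Which MDS variant was used for the figures is not stated here, so this is a stand-in chosen for brevity rather than the method of the study; the function name is hypothetical.

```python
import numpy as np

def classical_mds(X, n_components=2):
    """Embed the rows of X in n_components dimensions via classical MDS."""
    X = np.asarray(X, dtype=float)
    # Squared Euclidean distances between all pairs of points
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    n = sq.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centring matrix
    B = -0.5 * J @ sq @ J                 # double-centred Gram matrix
    lam, V = np.linalg.eigh(B)
    idx = np.argsort(lam)[::-1][:n_components]  # largest eigenvalues first
    return V[:, idx] * np.sqrt(np.clip(lam[idx], 0.0, None))
```

The pairwise distances are reproduced exactly only if the points already lie in a subspace of at most `n_components` dimensions; otherwise the embedding is the best approximation in the sense of the retained eigenvalues. The pairwise distance matrix grows quadratically with the number of points, so large data sets would have to be subsampled first.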
In addition, we objectively measure the degree of clustering, based on given class labels, by the general discrimination value (GDV; see Methods part 2 for details) 14,15. This measure of class separability is 0 for uniform (unstructured) data distributions and becomes -1 for very well separated clusters. It is defined in such a way that the GDV values in neural network layers with different numbers of neurons (that is, in data spaces of different dimensions) can be directly compared.
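The authoritative definition of the GDV is given in Methods part 2 and is not repeated here. The sketch below assumes the commonly cited form: each dimension is z-scored and scaled by 0.5, and the GDV is the mean intra-class distance minus the mean inter-class distance, divided by the square root of the dimension. The function names are hypothetical, and the scaling constants should be checked against the Methods before reuse.

```python
import numpy as np

def pairwise_dist(A, B):
    """Euclidean distances between all rows of A and all rows of B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.sqrt(np.maximum(d2, 0.0))

def gdv(X, labels):
    """General discrimination value under the assumptions stated above."""
    X = np.asarray(X, float)
    labels = np.asarray(labels)
    std = X.std(axis=0)
    std[std == 0] = 1.0
    Z = 0.5 * (X - X.mean(axis=0)) / std      # z-score, scaled by 0.5 (assumed)
    classes = np.unique(labels)
    L, D = len(classes), X.shape[1]
    groups = [Z[labels == c] for c in classes]
    # Mean distance between distinct points of the same class.
    intra = sum(pairwise_dist(G, G)[np.triu_indices(len(G), k=1)].mean()
                for G in groups)
    # Mean distance between points of different classes.
    inter = sum(pairwise_dist(groups[a], groups[b]).mean()
                for a in range(L) for b in range(a + 1, L))
    return (intra / L - 2.0 * inter / (L * (L - 1))) / np.sqrt(D)

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
X_separated = rng.normal(0.0, 1.0, (200, 10)) + 10.0 * y[:, None]
X_noise = rng.normal(0.0, 1.0, (200, 10))
gdv_separated = gdv(X_separated, y)   # strongly negative
gdv_noise = gdv(X_noise, y)           # close to zero
```

Because of the division by the square root of the dimension, values computed in layers of different width remain comparable, which is the property used throughout Part 7.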
For the clustering analysis we analyze two examples of 'natural data': One is the MNIST data set 9 with 10 classes of handwritten digits, in which the input vectors are 784-dimensional (28x28 pixels) and have continuous positive values (between 0 and 1 after normalization).
As the second example we use, again, our full-night EEG recordings with the 5 data classes corresponding to the sleep stages Wake, REM, N1, N2, and N3. In order to reduce setup differences between measurements, we first perform a z-transform over each individual full-night EEG recording, so that the one-channel EEG signal of each participant now has zero mean and unit variance. Next, in order to make the EEG data more comparable with MNIST, we produce one 784-dimensional input vector from each 30-second epoch of the EEG recordings in the following way: The 7680 subsequent one-channel EEG signals of the epoch are first transformed to the frequency domain using Fast Fourier Transform (FFT), yielding 3840 complex amplitudes. Since the phases of the amplitudes change in a highly irregular way between epochs, we discard this information by computing (the square roots of) the magnitudes of the amplitudes. We keep only the first 784 values of the resulting real-valued frequency spectrum, corresponding to the lowest frequencies. By pooling over all epochs and participants, we obtain a long list of these 784-dimensional input vectors. They are globally normalized, so that the components [...]

Figure 5. Performance of three classifier types as a function of data statistics. (a) A Perceptron, a Naive Bayesian classifier and a correlated multi-variate Gaussian (CMVG) Bayesian classifier are applied to the same artificial data, controlled by the quantities D, S and C. (b) Accuracy of a two-layer perceptron (with 2 neurons in the second layer) as a function of the number $N_\mathrm{neu}$ of neurons in the first layer, for fixed values of the separation S = 1.0 and correlation C = 0.5. As the perceptron is reaching the theoretical limit of accuracy for $N_\mathrm{neu} = 100$, this layer size is used in the following. (c,d) Classifier accuracies as a function of dimension D (number of features). All classifiers profit from more available features; however, Perceptron and CMVG Bayes can make use of correlations and thus outperform Naive Bayes for C = 1.0, independent of the class separation S. (e,f) Classifier accuracies as a function of the separation S between the data classes in feature space. In the case without correlations (C = 0), all classifiers reach the theoretical performance limit and thus produce identical accuracies. (g,h) Classifier accuracies as a function of the correlation C between features. The accuracy of Naive Bayes is degrading with increasing correlations. By contrast, Perceptron and CMVG Bayes initially profit from correlations, but abruptly reach a plateau at C ≈ 0.8. Beyond that transition point, accuracy can further improve only for large data separation.

Figure 6. We consider two partly overlapping classes (blue and orange colors) in a two-dimensional feature space. Shown are a scatter plot of the data (first column), as well as the marginal probability densities of the two features (second and third column). In the case of the original data (first row), all three classifier types reach the theoretical accuracy limit of ≈ 0.69. After applying a sine function individually to each feature (second row), the shape of the distributions changes drastically, but the accuracies remain unchanged at the theoretical limit. This remains true even after applying a signum transformation (third row), which reduces the originally continuous data to only four discrete points. However, applying a cosine transform (fourth row) reduces the accuracy to the random baseline of ≈ 0.5, because the two data classes now overlap completely.

It is possible to directly compute the MDS projection of the uncompressed 784-dimensional test data vectors into two dimensions, and also to calculate the corresponding GDV value that quantifies the degree of class separation (using the known sleep stage labeling). In Fig. 9, these uncompressed data distributions are always shown in the upper left scatter plot of each two-by-two block.
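The conversion of a z-transformed EEG recording into 784-dimensional spectral vectors, as described above, might be sketched as follows. The function names are hypothetical. NumPy's `rfft` returns 3841 amplitudes for 7680 real samples (including the zero-frequency and Nyquist terms), so whether the zero-frequency term is kept is an assumption here, and the final global normalization is left out because its exact form is not specified in this section.

```python
import numpy as np

SAMPLES_PER_EPOCH = 7680   # one 30-second epoch (i.e. 256 samples per second)
N_KEEP = 784               # same input dimension as MNIST

def epoch_to_spectrum(epoch, n_keep=N_KEEP):
    """Square roots of the FFT magnitudes, lowest n_keep frequencies only."""
    amplitudes = np.fft.rfft(epoch)          # complex amplitudes, phases included
    spectrum = np.sqrt(np.abs(amplitudes))   # discard phases
    return spectrum[:n_keep]

def recording_to_vectors(signal, samples_per_epoch=SAMPLES_PER_EPOCH):
    """z-transform one full-night recording and convert each epoch."""
    signal = np.asarray(signal, float)
    z = (signal - signal.mean()) / signal.std()
    n_epochs = len(z) // samples_per_epoch
    epochs = z[: n_epochs * samples_per_epoch].reshape(n_epochs, samples_per_epoch)
    return np.array([epoch_to_spectrum(e) for e in epochs])

# Synthetic stand-in for a recording: three full epochs plus a partial one.
rng = np.random.default_rng(0)
fake_recording = rng.normal(size=3 * SAMPLES_PER_EPOCH + 100)
vectors = recording_to_vectors(fake_recording)
```

Incomplete trailing epochs are simply dropped in this sketch; pooling the returned arrays over all participants yields the long list of input vectors mentioned in the text.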
Figure 7. [...] The rows correspond to the five sleep stages s (Wake, REM, N1, N2 and N3). In the covariance matrices, the relatively large diagonal elements are suppressed for better visibility of the inter-feature correlations. The probability distributions $p_s(u_f)$ in row 1 are approximately Gaussian, whereas the distributions $p_s(u_{\Delta t})$ in row 3 are highly non-Gaussian. All distributions change in a systematic way with the feature parameters (frequencies f of Fourier modes, lag-times t of auto-correlations). Both the distributions and covariances also show characteristic differences between the sleep stages s, which can be exploited for automatic classification.

In this context, we also test if step-wise dimensionality reduction in an autoencoder leads to an enhanced clustering. The autoencoder has ReLU activation functions and 7 fully connected layers with the following numbers of neurons: 784, 128, 64, 16, 64, 128, 784. The mean squared error between input vectors and reconstructed vectors is minimized using the Adam optimizer. We also compute the MDS projections and GDV values for layers 2, 3 and 4 (the 16-dimensional bottleneck) of the autoencoder. In Fig. 9, these three compressed data distributions are shown within the two-by-two blocks of scatter plots. As a reference for the resulting MDS projections and GDV values in the unsupervised autoencoder, we also process the two kinds of natural data with a perceptron that is trained in a supervised manner, so that it separates the known classes as far as possible. To make the perceptron comparable to the autoencoder, the first 4 layers have the same design in both networks. However, the decoder part of the autoencoder is replaced by a softmax layer in the perceptron, which has either 10 (MNIST) or 5 (sleep) neurons. The perceptron is trained by back-propagation to minimize the categorical cross-entropy between the true and predicted labels, using the Adam optimizer.
Just as in the autoencoder, we compute MDS projections and GDV values for the first 4 perceptron layers.
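To make the layer bookkeeping concrete, the following sketch runs an untrained forward pass through a stack of fully connected ReLU layers and collects the activations that would be passed on to MDS and GDV. Training (Adam, mean squared error or cross-entropy) is deliberately omitted; the weight initialization, the ReLU on the output layer and all names are assumptions made for illustration.

```python
import numpy as np

# Layer widths as listed in the text; index 0 is the input (layer 1).
AUTOENCODER_SIZES = [784, 128, 64, 16, 64, 128, 784]

def init_layers(sizes, rng):
    """Random (untrained) weights and zero biases for consecutive layers."""
    return [(rng.normal(0.0, np.sqrt(2.0 / m), (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x, layers):
    """Return the activations of every layer, starting with the input."""
    activations = [x]
    for W, b in layers:
        activations.append(np.maximum(activations[-1] @ W + b, 0.0))  # ReLU
    return activations

rng = np.random.default_rng(0)
x = rng.random((32, 784))                       # stand-in batch of input vectors
acts = forward(x, init_layers(AUTOENCODER_SIZES, rng))

reconstruction_error = np.mean((acts[-1] - x) ** 2)   # quantity to be minimized
embeddings = acts[1:4]                                # layers 2, 3 and 4
```

For the supervised reference network, the same first four layers would be followed by a softmax layer with 10 or 5 neurons instead of the decoder.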
Ethical approval and informed consent. The study was conducted in the Department of Otorhinolaryngology, Head Neck Surgery, of the Friedrich-Alexander University Erlangen-Nürnberg (FAU), following approval by the local Ethics Committee (323-16 Bc). Written informed consent was obtained from the participants before the cardiorespiratory polysomnography (PSG).

Results
Part 1: accuracy limit. In order to demonstrate the existence of an accuracy limit in classification tasks, we assume that a statistical process is generating data vectors $\vec{x}$ which are distributed in the input space (subsequently also called feature space) according to given generation densities $p_\mathrm{gen}(\vec{x}\,|\,i)$ that depend on the class $i$. For reasons of mathematical tractability and visual clarity, we start with a simple problem of two Gaussian data classes in a two-dimensional feature space. We assume that class $i = 0$ is centered at $\vec{x} = (x_1, x_2) = (0, 0)$, whereas class $i = 1$ is centered a distance $d$ away, at $\vec{x} = (d, 0)$. As another discriminating property, the two class-dependent distributions are assumed to have different correlations between the features $x_1$ and $x_2$ (compare Fig. 3a,b).
As derived in the Methods section, an ideal classifier would divide the feature space $\{\vec{x}\}$ among the two classes in a way that is perfectly consistent with the true generation densities $p_\mathrm{gen}(\vec{x}\,|\,i)$. The resulting ideal assignment of a discrete class $j = 0$ or $j = 1$ to each data vector $\vec{x}$ can be described by binary class indicator functions $q_\mathrm{cla}(j\,|\,\vec{x})$ (compare Fig. 3c,d). The latter two quantities can be combined to the confusion densities $q_\mathrm{cla}(j\,|\,\vec{x})\,p_\mathrm{gen}(\vec{x}\,|\,i)$, which give the probability density that data point $\vec{x}$ is generated in class $i$ but assigned to class $j$ by the ideal classifier (compare Fig. 3e,f). The parts of feature space where the confusion density is large for $i \neq j$ correspond to the overlap regions of the data classes, and it is this overlap that makes the theoretical limit of the classification accuracy smaller than one.
It is possible to compute the confusion matrix of the ideal classifier by integrating the confusion densities over the entire feature space, which is feasible only in very low-dimensional spaces. The confusion matrix, in turn, yields the theoretical accuracy limit $A_\mathrm{max}$ of the ideal classifier. In our simple example, $A_\mathrm{max}$ is expected to increase with the distance $d$ between the two data classes, as this separation reduces the class overlap. By numerically computing the integral over the two-dimensional feature space of our Gaussian test example, we indeed find a monotonic increase of $A_\mathrm{max} = A_\mathrm{max}(d)$ from about 0.62 at $d = 0$ to nearly one at $d = 5$ (compare Fig. 3g, black line).
Our next goal is to apply different types of classifier models to data drawn from the generation densities $p_\mathrm{gen}(\vec{x}\,|\,i)$ of the Gaussian test example above.
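The numerical integration described above can be sketched on a regular grid. The covariance matrices of the two classes are specified in the Methods, not in this section, so the sketch assumes unit variances, correlations of +0.5 and -0.5, and equal class priors; the accuracy values it produces therefore need not match those of Fig. 3g.

```python
import numpy as np

def gauss2d(X, mean, cov):
    """Density of a bivariate Gaussian, evaluated on an array of points."""
    diff = X - mean
    expo = -0.5 * np.einsum("...i,ij,...j->...", diff, np.linalg.inv(cov), diff)
    return np.exp(expo) / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))

def accuracy_limit(d, rho=0.5, half_width=10.0, n=801):
    """Grid estimate of the confusion matrix C[j, i] and the accuracy limit."""
    x1 = np.linspace(-half_width, half_width + d, n)
    x2 = np.linspace(-half_width, half_width, n)
    X = np.stack(np.meshgrid(x1, x2, indexing="ij"), axis=-1)
    cell = (x1[1] - x1[0]) * (x2[1] - x2[0])
    # Generation densities of the two classes (assumed parameters).
    p = [gauss2d(X, np.array([0.0, 0.0]), np.array([[1.0, rho], [rho, 1.0]])),
         gauss2d(X, np.array([d, 0.0]), np.array([[1.0, -rho], [-rho, 1.0]]))]
    # Ideal class indicator functions: assign x to the class with larger density.
    q1 = p[1] > p[0]
    q = [~q1, q1]
    # Integrate the confusion densities q(j|x) p(x|i) over the grid.
    C = np.array([[np.sum(q[j] * p[i]) * cell for i in range(2)]
                  for j in range(2)])
    return 0.5 * np.trace(C), C      # equal priors: mean of the diagonal

a_at_0, C_at_0 = accuracy_limit(0.0)
a_at_5, C_at_5 = accuracy_limit(5.0)
```

Each column of the returned matrix sums to approximately one, and the diagonal shrinks as the class overlap grows. The same grid approach becomes impractical beyond a few dimensions, because the number of cells grows exponentially with the dimension.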
As an example for a 'black box' classifier, we consider a perceptron with one hidden layer (See Methods section for details). In the training phase, the connection weights of this neural network are optimized using the back-propagation algorithm.
As an example of a mathematically transparent, but simple classifier type, we consider a Naive Bayesian model. Here, correlations between the input features are neglected, and so the global likelihood $L(\vec{u}\,|\,c)$ of a data vector $\vec{u}$, given the data class $c$, is approximated as the product of the marginal likelihood factors for each individual feature $f$ (see Methods section for details). In the 'training phase', the naive Bayesian classifier is simply estimating the distribution functions of these marginal likelihood factors, using Kernel Density Estimation (KDE).
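A minimal sketch of such a naive Bayesian classifier with one-dimensional Gaussian kernel density estimates is given below. The kernel type, the rule-of-thumb bandwidth and the class name `NaiveBayesKDE` are assumptions made for illustration; the settings actually used are described in the Methods.

```python
import numpy as np

def kde_logpdf(train, query, bandwidth):
    """Log of a 1-D Gaussian kernel density estimate at the query points."""
    z = (query[:, None] - train[None, :]) / bandwidth
    dens = np.exp(-0.5 * z**2).mean(axis=1) / (bandwidth * np.sqrt(2.0 * np.pi))
    return np.log(dens + 1e-300)

class NaiveBayesKDE:
    def fit(self, U, c):
        self.classes_ = np.unique(c)
        self.train_ = {k: U[c == k] for k in self.classes_}
        self.log_prior_ = {k: np.log(np.mean(c == k)) for k in self.classes_}
        return self

    def predict(self, U):
        scores = []
        for k in self.classes_:
            T = self.train_[k]
            h = 1.06 * T.std(axis=0) * len(T) ** (-0.2)   # rule of thumb (assumed)
            # Product of marginal likelihoods = sum of marginal log-likelihoods.
            log_like = sum(kde_logpdf(T[:, f], U[:, f], h[f])
                           for f in range(U.shape[1]))
            scores.append(log_like + self.log_prior_[k])
        return self.classes_[np.argmax(np.array(scores), axis=0)]

rng = np.random.default_rng(0)
def sample(n):
    c = rng.integers(0, 2, n)
    return rng.normal(0.0, 1.0, (n, 2)) + 2.0 * c[:, None], c

U_train, c_train = sample(2000)
U_test, c_test = sample(1000)
nb_accuracy = np.mean(NaiveBayesKDE().fit(U_train, c_train).predict(U_test) == c_test)
```

In this toy example the features are independent within each class, which is the situation in which the naive factorization loses no information.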
Finally, we consider a Correlated Multi-Variate Gaussian (CMVG) Bayesian model as an example of a mathematically transparent classifier that can also account for correlations in the data, but which assumes that all features are normally distributed (See Methods section for details). In the training phase, the CMVG Bayesian classifier has to estimate the mean values and covariances of the data vectors.
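The CMVG idea can be sketched in a few lines: estimate one mean vector and one covariance matrix per class, then assign each vector to the class with the largest Gaussian log-likelihood plus log-prior. The class name `GaussianBayes` and the toy data (two classes with identical means but opposite feature correlations of ±0.5) are assumptions for illustration.

```python
import numpy as np

class GaussianBayes:
    def fit(self, U, c):
        self.classes_ = np.unique(c)
        self.params_ = []
        for k in self.classes_:
            Uk = U[c == k]
            # "Training" = estimating mean, covariance and class frequency.
            self.params_.append((Uk.mean(axis=0),
                                 np.cov(Uk, rowvar=False),
                                 np.log(len(Uk) / len(U))))
        return self

    def log_joint(self, U):
        out = []
        for mu, cov, log_prior in self.params_:
            diff = U - mu
            maha = np.einsum("ni,ni->n", diff, np.linalg.solve(cov, diff.T).T)
            logdet = np.linalg.slogdet(cov)[1]
            out.append(-0.5 * (maha + logdet + U.shape[1] * np.log(2.0 * np.pi))
                       + log_prior)
        return np.array(out)

    def predict(self, U):
        return self.classes_[np.argmax(self.log_joint(U), axis=0)]

rng = np.random.default_rng(0)
def sample(n):
    X0 = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], n)
    X1 = rng.multivariate_normal([0, 0], [[1.0, -0.5], [-0.5, 1.0]], n)
    return np.vstack([X0, X1]), np.repeat([0, 1], n)

U_train, c_train = sample(5000)
U_test, c_test = sample(5000)
cmvg_accuracy = np.mean(GaussianBayes().fit(U_train, c_train).predict(U_test) == c_test)
```

Although the marginal distributions of the two toy classes are identical, the classifier performs clearly above chance, because the class information resides entirely in the correlations.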
When applying these three classifiers to the Gaussian test data, we indeed find that all models reach the same theoretical classification limit, even though their operating principles are very different (compare Fig. 3g). The only exception is the Naive Bayes classifier at small class distances d (compare Fig. 3g, orange line). This model fails because it can only use the marginal feature distributions, which happen to be identical for both classes in the case d = 0. However, the problem can be easily fixed by multiplying the original two-dimensional feature vectors with a random, non-square matrix (see Methods section for details) and thereby creating many new linear feature combinations, some of which usually have significantly different marginal distributions. Such a Random Dimensionality Expansion (RDE), as proposed in Yang et al. 31, allows even the Naive Bayes model to reach the accuracy limit in strongly overlapping data classes (compare Fig. 3g, olive line).
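The effect of the random dimensionality expansion on the marginal distributions can be illustrated as follows. The matrix shape (2 by 8), the Gaussian matrix entries and the correlation values of ±0.5 are assumptions; the construction actually used is given in the Methods.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
# Two classes at d = 0 with identical marginals but opposite correlations.
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], n)
X1 = rng.multivariate_normal([0, 0], [[1.0, -0.5], [-0.5, 1.0]], n)

def random_expansion(X, R):
    """Create new linear feature combinations with a non-square matrix R."""
    return X @ R

R = rng.normal(size=(2, 8))          # expands 2 features to 8
Z0, Z1 = random_expansion(X0, R), random_expansion(X1, R)

# Ratio of the class-wise standard deviations, feature by feature.
ratio_original = X0.std(axis=0) / X1.std(axis=0)   # close to 1: no marginal cue
ratio_expanded = Z0.std(axis=0) / Z1.std(axis=0)   # differs from 1 for most columns
```

A mixed feature such as $r_1 x_1 + r_2 x_2$ has variance $r_1^2 + r_2^2 \pm 2\rho\, r_1 r_2$ in the two classes, so the correlation difference becomes visible in the marginal widths, which a naive Bayesian model can use.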
Part 2: the DSC data model. In order to investigate how the performance of different classifiers depends on the statistical properties of the data, we generate large numbers of artificial data sets with two labeled classes $c \in \{0, 1\}$, in which the dimensionality D of the individual data vectors $\vec{u}$, the degree of correlations C between their components $u_{f=1 \ldots D}$ (here also called features), and the separation S between the two classes in feature space can be independently adjusted (see Fig. 4b,c for an illustration of C and S). To replicate some of the heterogeneity of real world data, we design our data generator as a two-level superstatistical model 32, in which the low-level parameters of the data distributions are themselves random variables. They are drawn from certain meta-distributions, which are in turn controlled by the three quantities D, S, C (see Methods for details, as well as Fig. 4a).
Using the General Discrimination Value (GDV), a measure designed to quantify the separability of labeled point sets (data classes) in high-dimensional spaces 15, we show that the mean separability of data classes in the DSC-model is indeed monotonically increasing with the control quantity S (orange line in Fig. 4d), whereas the separability of individual data sets is fluctuating heavily around this mean value (grey dots in Fig. 4d).
Moreover, we quantify the degree of correlation between the D features of the data vectors in each class c by the root-mean-square average of the upper triangular matrix elements in the covariance matrix $\Sigma^{(c)}$. We show that this RMS-average is an almost linear function of C (blue line in Fig. 4d) and can be varied between zero (corresponding to independently fluctuating features, or statistical independence) and one (corresponding to identically fluctuating features, or perfect correlations).

Part 3: comparing classifiers. Next, we apply the three classifier types to artificial data, with statistical properties controlled by the quantities D, S and C. We first investigate the accuracy of the classifiers as a function of data dimensionality D (Fig. 5c,d), considering correlated data (C = 1.0).
When the separation of the data classes in feature space is small ( S = 0.1 , panel (c)), the classification accuracy for one-dimensional data ( D = 1 ) is very close to the minimum possible value of 0.5 (corresponding to a purely random assignment of the two class labels) in all three models. As data dimensionality D increases, all three models monotonically increase their average accuracies (colored lines), whereas the accuracies of individual cases show a large fluctuation (gray dots). However, the Naive Bayes classifier (orange line) does not perform well even for large data dimensionality, because the point clouds corresponding to the two classes are strongly overlapping in feature space. By contrast, the CMVG Bayes classifier (red line) and the Perceptron (blue line) eventually achieve a very good performance, because they can exploit the correlations in the data. The similarity of the latter two accuracy-versus-D plots is remarkable, considering that these two classifiers work in completely different ways (the Bayesian model performing theory-based mathematical operations with estimated probability distributions, the neural network computing quite arbitrary non-linear transformations of weighted sums). We therefore conclude that the latter two models approach the theoretical optimum of accuracy for each combination of the control quantities D, S, C.
As the separation of the data classes in feature space gets larger ( S = 1.0 , panel (d)), the accuracy-versus-D plots are qualitatively similar to panel (c), but for one-dimensional data ( D = 1 ) the common accuracy is now slightly above the random baseline, at 0.6. By comparing panels (c) and (d) we note that Naive Bayes is profiting from the larger class separation, but the other two classifiers reach the theoretical performance maximum even without this extra separation.
Next, we investigate the accuracy of the classifiers as a function of class separation S (Fig. 5e,f), considering five-dimensional data (D = 5). Without correlations (C = 0, panel (e)), all three models show exactly the same monotonic increase of accuracy with separation S, starting at the random baseline of 0.5 and finally approaching perfect accuracy of 1.0.
With feature correlations present ( C = 1.0 , panel (f)), the Naive Bayes classifier shows the same behavior as in panel (e), whereas the other two correlation-sensitive models now already start with a respectable accuracy of 0.8 at zero class separation.
Finally, we investigate the accuracy of the classifiers as a function of the feature correlations C (Fig. 5g,h), considering again five-dimensional data (D = 5). For strongly overlapping data classes (S = 0.1, panel (g)), Naive Bayes cannot exceed an accuracy of about 0.55, whereas the two correlation-sensitive models show a super-linear increase of accuracy with increasing feature correlations. However, this increase ends rather abruptly at about C ≈ 0.7. Above this transition point, both models stay at a plateau accuracy of about 0.8, independent of the correlation quantity. Note that this discontinuity of the slope of the accuracy-versus-C plots is likely not an artifact of the DSC data, since the RMS-average of empirical correlations versus C (Fig. 4d) did not show such an effect at C ≈ 0.7. Moreover, the fact that functionally distinct classifiers such as CMVG Bayes and Perceptron produce an almost identical behaviour here suggests that the accuracy plateau in the strong correlation regime indeed reflects the theoretical performance maximum.
As the class separation is increased (S = 1.0, panel (h)), all three models start at a larger accuracy of about 0.75 in the uncorrelated case. Now the performance of Naive Bayes is even declining with increasing C, because this model wrongly assumes uncorrelated data. The other two models show again the super-linear increase up to C ≈ 0.7. However, now a further improvement of performance is possible with increasing correlations.

Part 4: feature transformations. The accuracy limit is determined by the overlap of data classes, that is, by the possibility that different classes $i \neq j$ produce exactly the same data vector $\vec{x}^*$. Transformations $\vec{x} \to \vec{f}(\vec{x})$ of the input features can drastically change the distributions of data points (as an example, compare the rows in Fig. 6). However, they cannot be expected to reduce the fundamental amount of class overlap, because transformations are just redirecting the common points $\vec{x}^*$ to new locations in feature space. In particular, invertible transformations can be viewed as variable substitutions in the integral Eq. (5) for the confusion matrix. They do not affect the resulting matrix values and thus leave the accuracy invariant.
In order to test this expectation, we start with two overlapping Gaussian data classes in a two-dimensional feature space (Fig. 6, top row), resulting in an accuracy limit of ≈ 0.69 . All three classifiers actually reach this limit with the original data as input.
Next we perform simple non-linear transformations on the input data, by replacing each of the two features $x_1$ and $x_2$ with a function of themselves (in particular: sin, sgn, and cos). We find that the application of the sin-transformation (second row in Fig. 6) has indeed no effect on the accuracy of the three classifiers, even though the joint (first column) and marginal distributions (second and third column) are now strongly distorted. Even the application of the sgn-transformation (third row), which collapses all data onto just 4 possible points in feature space, leaves the accuracies invariant. This works because the two classes in our simple example can be distinguished by the sign of the $x_1$-feature, and both the sin- as well as the sgn-transformation leave this information intact. By contrast, the application of the cos-transformation destroys this crucial information, and consequently all accuracies drop to the random baseline of 0.5. The above numerical experiments illustrate that transformations of the input data can reduce (by destroying information that is essential for class discrimination), but never increase the theoretical accuracy limit, which is an inherent property of the data. Of course, the subsequent data transformations which are taking place in the layers of deep neural networks are still useful, because they re-shape data distributions until classes can be linearly separated in the final layer of the network.
Part 5: sleep EEG data. In our artificial data sets, all features were normally distributed.
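The invariance under sin and sgn, and the collapse under cos, can be reproduced with a small histogram-based estimate of the class overlap. The sketch assumes a toy setup that is not taken from the Methods: unit-variance Gaussians centered at $x_1 = -0.5$ and $x_1 = +0.5$ with equal priors, which gives an accuracy limit of about 0.69. Only the informative feature $x_1$ is simulated, since a second feature with identical distributions in both classes would not change the result.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
x_class0 = rng.normal(-0.5, 1.0, n)   # assumed class centers and widths
x_class1 = rng.normal(+0.5, 1.0, n)

def overlap_accuracy(a, b, bins=50):
    """Accuracy of the best bin-wise decision rule, for equal class priors."""
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    edges = np.linspace(lo, hi, bins + 1)
    ha = np.histogram(a, edges)[0] / len(a)
    hb = np.histogram(b, edges)[0] / len(b)
    # In each bin, the ideal classifier picks the class with the larger mass.
    return 0.5 * np.sum(np.maximum(ha, hb))

transforms = {"identity": lambda x: x, "sin": np.sin, "sgn": np.sign, "cos": np.cos}
accuracies = {name: overlap_accuracy(f(x_class0), f(x_class1))
              for name, f in transforms.items()}
```

The estimate is slightly biased upwards by sampling noise, which is why a large sample and a moderate number of bins are used. The cosine is an even function and therefore maps both classes onto the same distribution in this symmetric setup.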
Moreover, it was possible to introduce extremely strong correlations between these features, which could then be exploited by two of the three classifier models. It is however unclear if the ability of a classifier to detect correlations is always crucial in real-world problems.
We therefore turn in a next step to actually measured EEG data, recorded over-night from 68 different sleeping human subjects. In this case, our final goal is to assign to each 30-second epoch of a raw one-channel EEG signal one of the five sleep stages (Wake, REM, N1, N2, N3).
At our sample rate, a single epoch of EEG data corresponds to 7680 subsequent amplitudes. Such high-dimensional data vectors $\vec{x}$ are however not suitable as direct input for a Bayesian classifier, nor for a flat neural network with only ≈ 100 neurons. For this reason, we first compress the raw data vectors $\vec{x} = (x_1, \ldots, x_{7680})$ into suitable feature vectors $\vec{u} = (u_1, \ldots, u_D)$ of strongly reduced dimensionality $D \approx 10$. Since we aim to develop a fully transparent classifier system, we use mathematically well-defined, human-interpretable features $u_f = G(\vec{x}, \alpha_f)$, which depend on a freely tunable parameter $\alpha$. The dimensionality D of the feature space is then determined by how many of these parameters $\alpha_{f=1 \ldots D}$ are chosen.
The huge literature on brain waves suggests that the momentary Fourier components of the EEG signal are suitable features for the classification of sleep stages. The parameter α is then naturally given by the frequency ν of the Fourier component (For details see methods). In a first experiment, we use a set of six equally spaced frequencies ( ν 1 = 5 Hz, ν 2 = 10 Hz, . . . ν 6 = 30 Hz). Based on training data sets that have been manually labeled by a sleep specialist, we then compute the marginal probability density functions of these Fourier features, as well as their covariance matrices, for each of the 5 sleep stages s (Fig. 7, left two columns). We find that within each sleep stage, the Fourier features have unimodal distributions, with peak positions and widths depending quite systematically on the frequency ν . There are characteristic differences between the sleep stages (in particular the distributions are wider in the wake stage), but they are not very pronounced. In the covariance matrices, we find that the off-diagonal elements are significantly smaller than the diagonal elements (The latter have been set to zero in Fig. 7 to emphasize the actual inter-feature correlations), with the exception of the wake state. Also the N1 state has slightly larger inter-feature correlations compared to the REM, N2 and N3 states.
As an alternative or complement to the Fourier features, we also consider the normalized (Pearson) autocorrelation coefficients of the raw EEG signal (Fig. 7, right two columns. For details see methods). The feature parameter α is in this case given by the lag-time t , for which we choose six equally spaced values ( 1, 3, . . . , 11 in units of the EEG sampling period). Since these correlation features cannot exceed the value of one by definition, the marginal distributions are highly non-Gaussian with pronounced tails towards small values. These tails show relatively strong differences between some of the sleep stages, but also surprising similarities, in particular for REM and N2. In the covariance matrices, we find the strongest inter-feature correlations in the wake and N1 stages. Again, the covariance matrices are very similar in REM and N2.
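The autocorrelation features can be sketched as Pearson correlation coefficients between an epoch and a copy of itself shifted by the lag. The exact estimator is described in the Methods; the function name `autocorr_features` and the synthetic AR(1) test signal are assumptions made for illustration.

```python
import numpy as np

def autocorr_features(x, lags=(1, 3, 5, 7, 9, 11)):
    """Pearson autocorrelation of x at the given lags (in sampling periods)."""
    x = np.asarray(x, float)
    return np.array([np.corrcoef(x[:-k], x[k:])[0, 1] for k in lags])

# Synthetic stand-in for one epoch: an AR(1) process with coefficient 0.9,
# whose theoretical autocorrelation at lag k is 0.9**k.
rng = np.random.default_rng(0)
noise = rng.normal(size=7680)
epoch = np.zeros(7680)
for t in range(1, len(epoch)):
    epoch[t] = 0.9 * epoch[t - 1] + noise[t]

features = autocorr_features(epoch)   # six values, one per lag
```

By construction each feature lies between -1 and 1, which is the bound that produces the skewed, non-Gaussian marginal distributions discussed above.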
Part 6: sleep stage detection. Next, we apply our three classifier models to the above sleep EEG data.
However, while the feature distributions and correlations in Fig. 7 were based on the global data, pooled over all 68 full-night EEG recordings, we are considering here the task of personalized sleep-stage recording. That is, the classifiers are trained and evaluated individually on each of the 68 data sets. Because the amount of training data is severely limited in this task, classification accuracies are expected to be rather low and strongly dependent on the participant. We therefore compute the distributions of accuracies over the 68 personalized data sets (histograms in Fig. 8) for all three classifiers and for the two types of pre-processed features.
We find that the CMVG Bayes model is performing very poorly in this task, presumably because the feature distributions are non-Gaussian and only weakly correlated except in the wake stage. In particular, for some participants the classification accuracy is less than the random baseline of about 0.2, corresponding to consistent misclassifications. This can happen in Bayesian classifiers when the likelihood distributions learned from the training data set do not match the actual distributions in the test data set.
By contrast, the Naive Bayes model can properly represent the non-Gaussian feature distributions by KDE approximations, and it furthermore profits from the lack of correlations. The performance of the Perceptron is comparable to that of the Naive Bayes model. Both for Fourier and correlation features, these two models show accuracies well above the baseline, roughly in the range from 0.3 to 0.6.

Part 7: natural data clustering. Both the ten digits in MNIST, as well as the five sleep stages in overnight EEG recordings, are human-defined classes. It is therefore unclear whether these classes can also be considered as 'natural kinds'. After a suitable pre-processing that brings both data sets into the same format of 784-dimensional, normalized feature vectors (for details see Methods section), we address this question by computing two-dimensional MDS projections, coloring the data points according to the known, human-assigned labels (in Fig. 9, see the upper left scatter plot in each 2-by-2 block). Indeed, the projected data distributions show a small degree of clustering, which is also quantitatively confirmed by the corresponding GDV values (−0.061 for MNIST and −0.035 for sleep EEG data). Note that in the sleep data, a large number of extreme outliers are found which might not correspond to any of the standard classes.
The purpose of classifiers is to transform and re-shape the data distribution in such a way that the final network layer (often a softmax layer with one neuron for each data class) can separate the classes easily from each other. Although, as we have shown above, these re-shaping transformations cannot reduce the natural overlap of classes (which would push the accuracy beyond the data-inherent limit), they might as a side-effect lead to a larger 'centrality' of the clusters associated with each class. This would show up quantitatively as a decrease of the General Discrimination Value (GDV) in the higher network layers of the classifier, as compared to the original input data. In order to test this hypothesis, we have trained a four-layer perceptron (see Methods section for details) in a supervised manner on both the MNIST and sleep EEG data. In the case of MNIST, we indeed observe a systematic decrease of the GDV in subsequent network layers: GDV(L0) = −0.061, GDV(L1) = −0.174, GDV(L2) = −0.250, and GDV(L3) = −0.300 (see Fig. 9b). An analogous layer-wise decrease is found for the sleep EEG data: GDV(L0) = −0.035, GDV(L1) = −0.096, GDV(L2) = −0.122, and GDV(L3) = −0.181 (see Fig. 9d).
We finally address the question whether a natural clustering in novel, unlabeled data sets can be automatically detected, and possibly enhanced, in an unsupervised manner. For this purpose, we consider an autoencoder that performs a layer-wise dimensionality reduction of the data, and then re-expands these low-dimensional embeddings back to the original number of dimensions. During this process of 'compression' and 're-expansion', fine details of the data have to be discarded, and it appears reasonable that this might go hand in hand with a 'sharpening' of the clusters. Again, in our test case where the labels of the data points are actually known, this enhancement of cluster centrality can be quantitatively measured by the GDV. For comparability, we have used an autoencoder that has the same design as the perceptron for the first four network layers.
In the case of the MNIST data set, we indeed find that unsupervised compression enhances cluster centrality over subsequent layers [...] Fig. 9c). However, the overall degree of clustering is much smaller than with MNIST. Moreover, a peculiar and so far unexplained feature is the loss of clustering in layer L3, the bottleneck of the autoencoder. In order to test if this is just a coincidental effect, we analyze an additional real-world data set of urban sounds (in the same way as the EEG data) and find a very similar behavior (see Fig. 2 of the Supplemental Material).
These results would suggest that the sleep EEG (as well as the urban sound) recordings represent relatively weakly structured data sets. However, the observed low initial degree of clustering and its only marginal enhancement by unsupervised compression do not exclude the possibility that more refined methods could actually reveal a more pronounced clustering structure in these data that might even partly recover the human-defined sleep stages. In particular, our simple way of preprocessing the EEG time series by Fourier transformation into the frequency domain could be replaced by a wavelet transformation, so that also temporal information becomes available. Moreover, alternative methods of unsupervised data compression, such as Principal Component Analysis, might facilitate a considerably stronger cluster enhancement.

Discussion and outlook
In this work, we have addressed various aspects of data ambiguity: the fact that multi-dimensional data spaces usually contain vectors that cannot be unequivocally assigned to any particular class. The probability of encountering such ambiguous vectors is easily underestimated in machine learning, because the data sets used to train classifiers are not sampled randomly from the entire space of possible data, but typically represent just a tiny, pre-selected subset of 'reasonable' examples. For instance, the space of monochrome images with full HD resolution and 256 gray values contains $256^{1920 \times 1080} \approx 10^{4993726}$ possible vectors. The fraction of these images that resemble any human-recognizable objects is virtually zero, whereas the largest part would be described as noise by human observers. One may argue that these 'structure-less' images should not play any role in real-world applications. However, it is conceivable that sensors in autonomous intelligent systems, such as self-driving cars, can produce untypical data under severe environmental conditions, such as snow storms. How to deal with data ambiguity is therefore a practically relevant problem 34. Moreover, as we have tried to illustrate in this paper, data ambiguity has interesting consequences from a theoretical point of view.
In part one, we have derived the theoretical limit $A_\mathrm{max}$ of accuracy that can be achieved by a perfect classifier, given a data set with partially overlapping classes. By generating artificial data classes with Gaussian probability distributions in a two-dimensional feature space and with a controllable distance d between the maxima, we verified that different types of classifiers (the CMVG Bayesian model with multi-variate Gaussian likelihoods and a perceptron) exactly follow the predicted accuracy limit $A_\mathrm{max}(d)$ (Fig. 3g). The naive Bayesian model, which cannot exploit correlations to distinguish between data classes, originally yields sub-optimal accuracies for small distances d, but this problem can be fixed by applying a random dimensionality expansion to the data as a pre-processing step 31. We have restricted ourselves to only two features (dimensions) for this test, because predicting the accuracy limit involves the exact computation of the confusion matrix, which in turn is an integral over the entire data space. Note, however, that for high-dimensional data with known class-dependent generation densities $p_\mathrm{gen}(\vec{x}\,|\,i)$, the integral could be approximated by Monte Carlo sampling. In this case, the element $C_{ji}$ of the confusion matrix would be computed by drawing random vectors $\vec{x}$ from class i. The class indicator function $q_\mathrm{cla}(k\,|\,\vec{x})$ of the perfect classifier, which is fully determined by the generation densities, yields the corresponding predicted classes k for these data vectors. The matrix element $C_{ji}$ is then given by the fraction of cases where k = j.
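The size of the image space quoted above follows from a one-line logarithm, since $256^{N} = 10^{N \log_{10} 256}$ for $N = 1920 \times 1080$ pixels. The short check below uses only the Python standard library.

```python
import math

n_pixels = 1920 * 1080                       # full HD resolution
exponent = n_pixels * math.log10(256)        # log10 of 256 ** n_pixels
order_of_magnitude = math.floor(exponent)    # power of ten of the image count
print(n_pixels, order_of_magnitude)
```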
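The Monte Carlo procedure outlined above can be sketched for Gaussian generation densities of arbitrary dimension. Equal class priors, the function names and the ten-dimensional toy example are assumptions made for illustration; for this particular example the accuracy limit is also known in closed form (about 0.785), which provides a check of the estimate.

```python
import numpy as np

def log_gauss(X, mean, cov):
    """Log-density of a multivariate Gaussian for each row of X."""
    diff = X - mean
    maha = np.einsum("ni,ni->n", diff, np.linalg.solve(cov, diff.T).T)
    return -0.5 * (maha + np.linalg.slogdet(cov)[1]
                   + len(mean) * np.log(2.0 * np.pi))

def mc_confusion(means, covs, n_samples=50_000, seed=0):
    """Monte Carlo estimate of C[j, i]: generated in class i, assigned to j."""
    rng = np.random.default_rng(seed)
    n_classes = len(means)
    C = np.zeros((n_classes, n_classes))
    for i in range(n_classes):
        # Draw random vectors from the generation density of class i.
        X = rng.multivariate_normal(means[i], covs[i], n_samples)
        # Perfect classifier: pick the class k with the largest density.
        logp = np.array([log_gauss(X, means[k], covs[k])
                         for k in range(n_classes)])
        k_pred = np.argmax(logp, axis=0)
        for j in range(n_classes):
            C[j, i] = np.mean(k_pred == j)
    return C

D = 10
means = [np.zeros(D), 0.5 * np.ones(D)]
covs = [np.eye(D), np.eye(D)]
C_mc = mc_confusion(means, covs)
accuracy_mc = np.mean(np.diag(C_mc))    # equal priors assumed
```

In contrast to grid integration, the cost of this estimate grows only linearly with the dimension, and its statistical error shrinks with the inverse square root of the number of samples.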
It is important to note that the theoretical accuracy limit can only be reached by actual classifiers under otherwise ideal conditions. For example, the classifier model must have sufficient capacity, which in a perceptron is determined by parameters such as the number and size of layers (see Fig. 5b). Moreover, the total number of data vectors in the training data set has to be large enough for the complexity of the classification problem, the training data have to represent an unbiased sub-sample of the actual data, and the number of training cycles has to be sufficient so that the accuracy has actually converged to the optimum value. If one or more of these conditions are not met, which is the case in most real-world applications, the classification accuracy will fall below the theoretical limit.
In part two, we have constructed a two-level model to generate artificial test data (Fig. 4). The model has high-level parameters D, S and C, which control the number of dimensions (features), the average separation of the two classes in feature space, and the average correlation between the features, respectively. For each triple of high-level parameters D, S, C, a large number of low-level parameters µ, Σ are randomly drawn according to specified distributions, and these are in turn used to generate the final test data sets. The super-statistical nature of the model allows us to prescribe the essential statistical features of dimensionality, separation and correlation, while at the same time ensuring a large variability of the test data. By using the General Discrimination Value (GDV), a quantitative measure of class separability (centrality), we have confirmed that the high-level parameter S controls the class separability as intended. Moreover, the proper action of parameter C was confirmed by computing the root-mean-square average over the elements of the data's covariance matrix.
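To make the role of the GDV more concrete, the following sketch computes such a value for two synthetic Gaussian classes. It assumes the commonly cited definition of the GDV (each feature is z-scored and scaled by 0.5, then the mean intra-class distance minus the mean inter-class distance is divided by the square root of the dimensionality); this should be read as our assumption rather than as the exact implementation behind the figures, and the helper names are hypothetical. More negative values indicate better separated classes.

```python
import numpy as np

def mean_pairwise_distance(a, b=None):
    """Mean Euclidean distance within a (if b is None) or between a and b."""
    if b is None:
        diff = a[:, None, :] - a[None, :, :]
        dist = np.sqrt(np.sum(diff**2, axis=-1))
        # Use each unordered pair of distinct points exactly once.
        return dist[np.triu_indices(a.shape[0], k=1)].mean()
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt(np.sum(diff**2, axis=-1)).mean()

def gdv(points, labels):
    """General Discrimination Value under the definition assumed above."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    n_dim = points.shape[1]
    std = points.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    scaled = 0.5 * (points - points.mean(axis=0)) / std
    groups = [scaled[labels == c] for c in np.unique(labels)]
    intra = np.mean([mean_pairwise_distance(g) for g in groups])
    inter = np.mean([
        mean_pairwise_distance(groups[i], groups[j])
        for i in range(len(groups))
        for j in range(i + 1, len(groups))
    ])
    return (intra - inter) / np.sqrt(n_dim)

rng = np.random.default_rng(0)

def two_classes(separation, n=300):
    """Two unit-variance Gaussian classes shifted apart along the first axis."""
    a = rng.normal(size=(n, 2)) + np.array([-separation / 2, 0.0])
    b = rng.normal(size=(n, 2)) + np.array([separation / 2, 0.0])
    return np.vstack([a, b]), np.repeat([0, 1], n)

gdv_overlapping = gdv(*two_classes(0.0))
gdv_separated = gdv(*two_classes(4.0))
print(gdv_overlapping, gdv_separated)
```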
In part three, we have applied our three types of classifiers to the test data generated with the DSC-model. Without intra-class feature correlations ($C = 0$), we find that the average accuracy of all three models increases monotonically with the separation parameter S, and in exactly the same way (Fig. 5e). Although the exact computation of $A_{\max}$ is not possible in this five-dimensional data space, the perfect agreement of the three different classifiers indicates that they all have reached the accuracy limit. When intra-class feature correlations are present ($C \neq 0$), we find by systematically varying the parameters D, S and C that the resulting accuracies of the CMVG-Bayes classifier and of the perceptron are extremely similar in all considered cases, indicating again that they have reached the theoretical accuracy limit. As expected, the naive Bayesian classifier shows sub-optimal accuracies in all cases where feature correlations are required to distinguish between the classes. In general, this analysis shows that the accuracy of classification can be systematically enhanced by providing more features (larger data dimensionality D) as input. Extra features that do not provide additional useful information are 'automatically ignored' by the classifiers and never reduce the achievable accuracy. Moreover, accuracy can be enhanced by providing features that are correlated with each other (larger parameter C), but differently in each data class. Such class-specific feature correlations can be exploited for discrimination by models such as CMVG Bayes and the perceptron, but not by the naive Bayes model. Finally, we find that the theoretical accuracy maximum as a function of the correlation parameter C shows an interesting abrupt change of slope at around C ≈ 0.8 (Fig. 5g,h). The origin of this effect is at present unclear, but will be explored in follow-up studies.
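The difference between classifiers that can and cannot exploit class-specific correlations can be illustrated with a deliberately extreme toy case. In the sketch below, which is our own construction and not the DSC-model, both classes have identical means and identical marginal variances and differ only in the sign of the correlation between two features. A Gaussian model with full covariance matrix should then approach the optimum of $1/2 + \arcsin(\rho)/\pi \approx 0.795$ for $\rho = 0.8$, whereas a model restricted to diagonal covariances (as in naive Bayes) has no access to the discriminating information. All names are hypothetical.

```python
import numpy as np

def log_gauss(x, mean, cov):
    """Row-wise log-density of a multivariate normal distribution."""
    diff = x - mean
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (maha + logdet + mean.shape[0] * np.log(2.0 * np.pi))

def fit_and_score(train, test, diagonal_only):
    """Fit one Gaussian per class and return the test accuracy (equal priors)."""
    params = []
    for x in train:
        cov = np.cov(x, rowvar=False)
        if diagonal_only:
            # Naive-Bayes-like model: discard all inter-feature correlations.
            cov = np.diag(np.diag(cov))
        params.append((x.mean(axis=0), cov))
    correct, total = 0, 0
    for label, x in enumerate(test):
        log_p = np.stack([log_gauss(x, m, c) for m, c in params], axis=1)
        correct += int(np.sum(np.argmax(log_p, axis=1) == label))
        total += x.shape[0]
    return correct / total

rng = np.random.default_rng(1)
rho = 0.8
# Same means and marginal variances; only the sign of the correlation differs.
covs = [np.array([[1.0, rho], [rho, 1.0]]), np.array([[1.0, -rho], [-rho, 1.0]])]
train = [rng.multivariate_normal(np.zeros(2), c, size=5000) for c in covs]
test = [rng.multivariate_normal(np.zeros(2), c, size=5000) for c in covs]

acc_full = fit_and_score(train, test, diagonal_only=False)
acc_naive = fit_and_score(train, test, diagonal_only=True)
print(acc_full, acc_naive)
```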
In part four, we have investigated the effect of non-linear feature transformations, applied as a pre-processing step, on classification accuracy (Fig. 6). Since the achievable accuracy in a classification task is limited by the degree of overlap between the data classes, feature transformations can certainly reduce the accuracy to below the limit $A_{\max}$ (when they destroy information that is essential for discrimination), but they can never push the accuracy to above $A_{\max}$. This is indeed confirmed in a simple test case where all three classifier types perform at the accuracy maximum with the non-transformed data: Applying a feature-wise sine-transformation drastically changes the data distributions $p_{\mathrm{gen}}(\vec{x}\,|\,i)$, but leaves the accuracies unchanged at $A_{\max}$. The accuracy remains invariant even under a signum-transformation, although this non-invertible operation reduces the data distributions to only four possible points in feature space. In this extreme case, most of the detailed information about the input data vectors is lost, but the part that is essential for class discrimination, namely the sign of the feature $x_1$, is retained. This example demonstrates that classification is a type of lossy data processing where irrelevant information can be safely discarded. For this reason, neural-network based classifiers usually project the input data vectors into spaces of ever smaller dimensions, up to the final discrimination layer which needs only as many neural units as there are data classes. In this context, it is interesting that biological organisms with nervous systems, relying on an efficient classification of objects in their environment for survival, have probably evolved sensory organs and filters that only transmit the small class-discriminating part of the available information to the higher stages of the neural processing chain. As a consequence, our human perception is almost certainly not a veridical representation of the world [35–37].
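The invariance under the signum-transformation can be reproduced in a small numerical experiment. The sketch below assumes a simplified setup of our own choosing: two unit-variance Gaussian classes in two dimensions, centred at $(\mp d/2, 0)$, so that the optimal decision on the raw data is a threshold on $x_1$ at zero. After the transformation only four points remain, and a simple lookup table (majority class per point, fitted and evaluated on the same samples for brevity) serves as the classifier.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000   # samples per class
d = 1.0      # distance between the class centres along x_1

# Class 0 is centred at (-d/2, 0), class 1 at (+d/2, 0), both with unit variance.
x = np.vstack([
    rng.normal(size=(n, 2)) + np.array([-d / 2, 0.0]),
    rng.normal(size=(n, 2)) + np.array([d / 2, 0.0]),
])
y = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

# Optimal rule on the raw data for this symmetric setup: threshold x_1 at zero.
acc_raw = np.mean((x[:, 0] > 0).astype(int) == y)

# Signum-transformation: every vector collapses onto one of four points,
# encoded here as a cell index 0..3 built from the signs of x_1 and x_2.
cell = (x[:, 0] > 0).astype(int) * 2 + (x[:, 1] > 0).astype(int)

# Lookup-table classifier on the transformed data: majority class per cell.
table = np.array([np.round(y[cell == c].mean()) for c in range(4)]).astype(int)
acc_sign = np.mean(table[cell] == y)

print(acc_raw, acc_sign)
```

In this setup both accuracies coincide, because the majority vote in each of the four cells depends only on the sign of $x_1$, which is exactly the information used by the optimal rule on the raw data.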
In part five, we have analyzed full-night EEG recordings of sleeping humans, divided into epochs of 30 seconds that have been labeled by a specialist according to the five sleep stages. Such recordings can be used as training data for automatic sleep stage classifiers, an application of machine learning that could in the future remove a large work load from clinical sleep laboratories. In our context of data ambiguity, sleep EEG is an interesting case because different human specialists agree about individual sleep-label assignments only in 70-80% of the cases, even if multiple EEG channels and other bio-signals (such as electro-oculograms or electro-myograms) are provided [38]. This low inter-rater reliability suggests that a considerable fraction of the 30-second epochs is actually ambiguous with respect to sleep stage classification, in particular when only the time-dependent signal of a single EEG channel is available as input data. Our first goal is a suitable dimensionality reduction of the raw data, which (at a sample rate of 256 Hz) consist of 7680 subsequent EEG values in each epoch. As a pre-processing step, we map each 7680-dimensional raw data vector onto a feature vector of only 6 dimensions, so that our Bayesian classifiers (naive and CMVG) can be efficiently used. We consider as features the real-valued Fourier amplitudes at different frequencies, as well as the auto-correlation coefficients at different lag-times (Fig. 7).
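The mapping from a raw 7680-sample epoch to a six-dimensional feature vector could look roughly as follows. This is only a sketch: the specific frequencies and lag-times are hypothetical placeholders (the values actually used are those of Fig. 7), the Fourier features are taken here as magnitudes of the complex amplitudes, and a synthetic 10 Hz oscillation with noise stands in for real EEG data.

```python
import numpy as np

SAMPLE_RATE_HZ = 256
EPOCH_SECONDS = 30
EPOCH_LENGTH = SAMPLE_RATE_HZ * EPOCH_SECONDS  # 7680 samples per epoch

def fourier_features(epoch, frequencies_hz):
    """Magnitudes of the Fourier amplitudes closest to the given frequencies."""
    spectrum = np.abs(np.fft.rfft(epoch)) / epoch.size
    freqs = np.fft.rfftfreq(epoch.size, d=1.0 / SAMPLE_RATE_HZ)
    idx = [int(np.argmin(np.abs(freqs - f))) for f in frequencies_hz]
    return spectrum[idx]

def autocorrelation_features(epoch, lags):
    """Normalized auto-correlation coefficients at positive integer lags (in samples)."""
    centered = epoch - epoch.mean()
    variance = np.mean(centered**2)
    return np.array(
        [np.mean(centered[:-lag] * centered[lag:]) / variance for lag in lags]
    )

# Hypothetical choices of six frequencies and six lag-times (illustration only).
frequencies_hz = [1.0, 2.0, 4.0, 8.0, 10.0, 16.0]
lags = [1, 2, 4, 8, 16, 32]

# Synthetic stand-in for one EEG epoch: a 10 Hz oscillation plus white noise.
rng = np.random.default_rng(3)
t = np.arange(EPOCH_LENGTH) / SAMPLE_RATE_HZ
epoch = np.sin(2.0 * np.pi * 10.0 * t) + 0.5 * rng.normal(size=EPOCH_LENGTH)

f_feat = fourier_features(epoch, frequencies_hz)
c_feat = autocorrelation_features(epoch, lags)
print(f_feat)
print(c_feat)
```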

The Fourier features are expected to be particularly useful, as it is well-known that the activity in different EEG frequency bands varies in characteristic ways over the five sleep stages. The correlation features have been successfully applied for Bayesian classification in a former study [7]. In our present study, we are using either Fourier or correlation features, but no combinations of those. By performing a statistical analysis of the features, we find that within the same sleep stage, the six features have significantly different marginal probability distributions. However, these distributions are quite similar in all sleep stages, so that their value for the classification task is limited. Moreover, the correlations between features, which could be exploited by the CMVG Bayes classifier and by the perceptron, turn out to be very weak, except for the Fourier features in the wake stage. Another problem is the strongly non-Gaussian shape of the marginal probability distributions in the case of the correlation features, which cannot be properly represented by the CMVG Bayes model.
In part six, we have used our three classifier models, based on the above Fourier and correlation features, for personalized sleep stage detection. In this very hard task, the classifiers are trained and tested, independently, on the full-night EEG data set of a single individual only. Since an individual data set typically contains fewer than 1000 epochs (each corresponding to one feature vector), random deviations from the 'typical' sleeping patterns are likely to be picked up during the training phase. We consequently find that the accuracies vary widely between the individual data sets.
As expected, the CMVG Bayes model performs badly in this task, because there are almost no inter-feature correlations present that could be exploited for sleep stage discrimination, and because feature distributions are non-Gaussian. Interestingly, both the naive Bayesian classifier and the perceptron achieve relatively good accuracies, mainly in the range from 0.3 to 0.6. However, these accuracies may be further increased by using more sophisticated neural network architectures [6,39], and hence do not represent the accuracy limit.
In the final part seven, we have started to explore whether the distinct classes in typical real-world data sets are defined arbitrarily (and therefore can only be detected after supervised learning), or if the differences between these classes are so prominent that even unsupervised machine learning methods can recognize them as distinct clusters in feature space. Besides the (pooled) sleep EEG data, we have used the MNIST data set to test for any inherent clustering structure. For this investigation, the individual data points, each corresponding to either one epoch of EEG signal or one handwritten digit, have been brought into the same format of 784-dimensional, normalized vectors. Directly computing the General Discrimination Value (GDV) of the MNIST data, based on the known labels, has indeed revealed a small amount of 'natural clustering', even in this raw data distribution. This quantitative result was qualitatively confirmed by a two-dimensional visualization using multi-dimensional scaling (MDS). However, the cluster structure would hardly be visible without the class-specific coloring (upper left scatter plots in Fig. 9a,b). By contrast, no natural clustering was found for the raw sleep EEG data when the 7680 values in each epoch were simply down-sampled in the time-domain to 784 values (data not shown). This approach presumably fails because the relevant class-specific signatures appear randomly at different temporal positions within each epoch, so that the Euclidean distance between two data vectors is not a good measure of their dissimilarity. However, when we instead used as data vectors the magnitudes of the 784 Fourier amplitudes with lowest frequencies, a weak natural clustering was found also in the sleep data (upper left scatter plots in Fig. 9c,d).
We have furthermore demonstrated that the degree of clustering (for both data sets) is systematically increasing in the higher layers of a perceptron that has been trained to discriminate the classes in a supervised manner (Fig. 9, right column). Finally, we have used a multi-layer autoencoder to produce embeddings of the data distributions with reduced dimensionality in an unsupervised setting (compare e.g. [40]). It has turned out that the degree of clustering (with respect to the known data classes) tends to increase systematically with the degree of dimensional compression (Fig. 9, left column). This interesting finding, previously reported in Schilling et al. [15], suggests that unsupervised dimensionality reduction could be used to automatically detect and enhance natural clustering in unlabeled data. In combination with automatic labeling methods, such as Gaussian Mixture Models [41], or the concept of successor representations to map complex data structures [24,42], this may provide an objective way to define 'natural kinds' in arbitrary data sets.

Data availability
Data and analysis programs will be made available upon reasonable request. Please contact C. Metzner (claus.metzner@gmail.com).