Using phidelta diagrams to discover relevant patterns in multilayer perceptrons

Understanding the inner behaviour of multilayer perceptrons during and after training is a goal of paramount importance for many researchers worldwide. This article experimentally shows that relevant patterns emerge upon training, typically related to the difficulty of the underlying problem. The occurrence of these patterns is highlighted by means of ⟨φ,δ⟩ diagrams, a 2D graphical tool originally devised to support the work of researchers on classifier performance evaluation and on feature assessment. Under the assumption that multilayer perceptrons are powerful engines for feature encoding, hidden layers have been inspected as if they were in fact hosting new input features. Interestingly, there are problems that appear difficult when dealt with using a single hidden layer, whereas they turn out to be easier upon the addition of further layers. The experimental findings reported in this article give further support to the standpoint according to which implementing neural architectures with multiple layers may help to boost their generalisation ability. A generic training strategy inspired by some relevant recommendations of deep learning has also been devised. A basic implementation of this strategy has been used throughout the experiments aimed at identifying relevant patterns inside multilayer perceptrons. Further experiments performed in a comparative setting have shown that it can be adopted as a viable alternative to the classical backpropagation algorithm.

Informal introduction to ϕ, δ diagrams. Although ϕ, δ diagrams play a supporting role in this article, their main concepts are summarised hereinafter for the sake of clarity. Further details on this matter are reported in the Supplementary Information (Supplementary section S1) available online. The interested reader may also find an extensive study on their semantics in Armano and Giuliani 32 .
The measures that give rise to the ϕ, δ space are defined as ϕ = ρ̄ − ρ and δ = ρ̄ + ρ − 1, where ρ̄ and ρ denote specificity and sensitivity, respectively. Upon normalisation, the values of any feature that occurs in the given dataset -- as found in the available training samples -- can always be seen as the outputs of an elementary (single-feature) classifier. Hence, to draw the class signature of the dataset, first the "performance" of each feature with respect to the positive category is evaluated, and then all corresponding values (in terms of φ and δ) are reported in a diagram. Figure 1 shows an example of how ϕ, δ diagrams can be used for feature assessment, by depicting the signature of the dataset optdigits. It is worth highlighting that φ coincides with the horizontal axis, whereas δ coincides with the vertical one. In the cited figure, points have different colours, which account for their importance; in particular, blue points are assumed to be highly useful for the classification process, and vice versa for red points. Note, however, that hereinafter the colour map will be automatically rescaled according to the difference between the minimum and maximum value of |δ|. This choice has been made to help the reader visually rank points even when the values of |δ| have low dynamics. According to the given definitions, both measures range in the interval [−1, +1], and their underlying semantics is the following: ϕ estimates the bias of a feature with respect to the positive (and negative) category on the given dataset, whereas δ gives the accuracy of a feature with respect to the positive category (though stretched into [−1, +1]). Note that, being given in terms of specificity and sensitivity, φ and δ are both independent of the actual balance between negative and positive samples. It can be easily shown that the ϕ axis is the locus of points at which the mutual information between a feature and the positive (or negative) category drops to zero.
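As an illustration, the two measures can be computed for any binary feature treated as an elementary classifier of the positive category. The following minimal Python sketch is a reconstruction consistent with the definitions above, not code from the article; the function name and the sign convention ϕ = specificity − sensitivity are assumptions.

```python
import numpy as np

def phi_delta(feature, labels):
    """Compute (phi, delta) for one binary feature viewed as an
    elementary classifier of the positive category.

    phi   = specificity - sensitivity      (bias; sign convention assumed)
    delta = specificity + sensitivity - 1  (accuracy stretched into [-1, +1])
    """
    feature = np.asarray(feature, dtype=bool)
    labels = np.asarray(labels, dtype=bool)
    sensitivity = feature[labels].mean()      # true-positive rate
    specificity = (~feature)[~labels].mean()  # true-negative rate
    return specificity - sensitivity, specificity + sensitivity - 1

# A feature identical to the labels is maximally covariant (delta = +1),
# whereas its negation is maximally contravariant (delta = -1).
phi_cov, delta_cov = phi_delta([1, 1, 0, 0], [1, 1, 0, 0])
phi_con, delta_con = phi_delta([0, 0, 1, 1], [1, 1, 0, 0])
```

Drawing a class signature then amounts to evaluating `phi_delta` on every (normalised and binarised) feature of the dataset and plotting the resulting points.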
Hence, features lying close to this axis (i.e., such that |δ| ≈ 0) are expected to provide limited or negligible support to the classification process. This is true for the whole axis, although with different semantics (in particular, features independent of either class label lie at the crossing of the ϕ and δ axes).

Figure 1. Class signature of the dataset optdigits, downloaded from the UC Irvine machine learning repository (UCI, hereinafter). Each sample is encoded with an image of 32 × 32 B/W pixels, for a total of 1024 binary features. The multiclass problem has been binarised considering the digit 0 as positive category and 1, 2, . . . , 9 as negative category. Each point in the diagram represents the "performance" of a feature, considered as an elementary classifier. Feature importance is highlighted by a scale of colours: from red (not relevant) to blue (highly relevant). Intermediate values are represented with yellow, green and light blue, depending on the corresponding feature importance (from lower to higher). Due to the presence of several points with high values of |δ|, the problem is expected to be easy.

Scientific Reports | (2020) 10:21334 | https://doi.org/10.1038/s41598-020-76517-0

As for δ, by construction, it gives information about the degree of agreement (or disagreement) between a feature and the positive category. In case of agreement (upper corner), the feature is said to be covariant with the positive category, whereas in case of disagreement (lower corner) it is said to be contravariant. Features whose δ value is close to the upper or lower corner are expected to give strong support to the classification process. The fact that both highly covariant and highly contravariant features are equally important should not be surprising, as selecting a class as positive or negative is just an arbitrary choice.
Patterns that arise on easy and difficult problems. The first experimental benchmark has been set up with the goal of identifying whether specific patterns occur inside MLPs that can account for their generalisation ability on the problem at hand. To this end, both easy and difficult problems have been investigated. Figure 2 reports the signatures of two exemplar problems, i.e., kidney-disease and dota2. The left-hand side points out that several features of kidney-disease are in high agreement with the negative category.
Having at least one feature highly covariant with either category (i.e., with high |δ|) is a sufficient condition for the easiness of a problem, due to the expected performance of the corresponding elementary classifier on the test set (in fact, the blue feature not far from the bottom would make it possible to reach an accuracy of about 85%). Conversely, the right-hand side shows that in dota2 a very poor agreement holds between features and categories, as all features therein lie on (or very close to) the ϕ axis. Notably, having all features with low |δ| is not a necessary condition for a problem to be difficult, as by construction a class signature can only report the agreement/disagreement between each feature taken in isolation and the positive category. However, exploiting their capability of acting as feature encoders, MLPs can be used to put into evidence whether any useful combination of input features exists.
To verify that the first problem is in fact easy, and to assess whether the second is difficult or not, an MLP with a single hidden layer equipped with 20 neurons has been trained on both problems using 10-fold cross validation. Learning rate, momentum and number of epochs have been set to typical values (i.e., 0.01, 0.05 and 40, respectively), whereas the sigmoid has been used as activation function. The easiness of the former problem has been confirmed by an accuracy of 100%, whereas the difficulty of the latter has been confirmed by an accuracy of about 57%, very close to random guessing. More interestingly, the hidden layers found after training the MLP on the selected problems are very different and deserve attention. Let us separately examine the ϕ, δ diagrams drawn by considering the outputs of each hidden layer as if they were in fact input features. The left-hand side of Fig. 3, which reports the hidden layer of the MLP trained on the dataset kidney-disease, highlights the occurrence of a characteristic pattern. In particular, part of the neuron outputs lie close to the upper corner, while most of the others lie close to the lower corner. The interpretation of this feature recombination process is clear: the MLP has been able to generate new features highly covariant (upper corner) or highly contravariant (lower corner) with the positive category (recall that both covariant and contravariant features are highly discriminant). Provided that the training set is representative of the dataset at hand, finding this pattern implies that the MLP has been able to come up with a very good predictive model through generalisation. The right-hand side of Fig. 3, which reports the hidden layer of the MLP trained on the dataset dota2, highlights the occurrence of a completely different pattern. In particular, all neurons therein lie close to the φ axis, several of them also being close to the centre of the diagram.
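The inspection step described above can be sketched as follows. Since the article's own tool is not reproduced here, a scikit-learn MLP stands in for the 20-neuron, sigmoid, single-hidden-layer network, and a synthetic dataset replaces kidney-disease/dota2; the forward pass up to the hidden layer is computed manually so that its outputs can be scored as elementary classifiers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Stand-in dataset and network (parameters follow the text: lr 0.01,
# momentum 0.05, 40 epochs, sigmoid activation, 20 hidden neurons).
X, y = make_classification(n_samples=500, n_features=30, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(20,), activation='logistic',
                    solver='sgd', learning_rate_init=0.01, momentum=0.05,
                    max_iter=40, random_state=0).fit(X, y)

# Forward pass up to the hidden layer: its outputs are the "new features".
hidden = 1.0 / (1.0 + np.exp(-(X @ mlp.coefs_[0] + mlp.intercepts_[0])))

# Binarise each hidden output at 0.5 and score it as an elementary classifier.
pos, neg = (y == 1), (y == 0)
points = []
for h in hidden.T:
    pred = h > 0.5
    sens = pred[pos].mean()        # sensitivity (true-positive rate)
    spec = (~pred)[neg].mean()     # specificity (true-negative rate)
    points.append((spec - sens, spec + sens - 1))  # (phi, delta)
```

Plotting the `points` list yields the ϕ, δ diagram of the hidden layer; neurons near the upper or lower corner correspond to highly covariant or contravariant encodings.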
Again, the interpretation of this feature recombination process is clear: the MLP has not been able to generate any new feature highly covariant or contravariant with the positive category. Rather, such a hidden layer highlights the inability to generalise, with the MLP reduced to random guessing: those neurons and the positive (or negative) category tend to be statistically independent.

Figure 2. Two typical class signatures, for easy and difficult classification problems. In particular, the toy problem kidney-disease is reported at the left-hand side, whereas the (expected to be) difficult problem dota2 is reported at the right-hand side. Both datasets have been downloaded from UCI.
Failure patterns that may occur depending on training parameters. Understanding the impact of MLP parameters on the generalisation ability of the learning algorithm is a major issue. In particular, it is well known that changes to learning rate and momentum must be handled with care, otherwise-also depending on the selected activation function-the learning process might end up in failure. As an example, let us consider how the generalisation ability of an MLP equipped with one hidden layer depends on the value of momentum. Figure 4 reports the hidden layers obtained after training on the dataset CNAE-9 with two different values of momentum. The left-hand side shows the signature obtained with momentum equal to 0.01, and the right-hand side the one obtained with momentum equal to 0.05. The former case highlights a pattern of generalisation success, whereas the latter highlights a clear pattern of generalisation failure. Note that in both layers there are neurons whose ϕ, δ values do not show any useful correlation with the positive or negative category (see the left- and right-hand corners of the diagrams). Indeed, any such neuron would clearly emit −1 (left) or +1 (right), regardless of the actual input. This phenomenon occurs in neurons that operate in saturation, and-in the absence of further neurons with medium/high covariance or contravariance with the positive category-it implies a generalisation failure. It is worth noting that, besides a more careful choice of training parameters, the occurrence of residual neurons that underwent saturation can be dealt with in several ways. For instance, alternative activation functions could be adopted, e.g., the rectified linear unit (ReLU, for short)-see, for example, Hahnloser et al. 33 and Glorot et al. 34 . In the event that a layer-wise training strategy is adopted, the best way to deal with this problem would be to prune these residual neurons off, for they cannot carry any useful information onwards.

Figure 4. Hidden layers obtained after training an MLP on the CNAE-9 dataset (from UCI) with two different values of momentum. In particular, the hidden layer at the left-hand side shows that the MLP has been able to generalise, despite the fact that some neurons (i.e., those located at the left- and right-hand corners) operate in saturation, whereas the other highlights a clear pattern of failure, as all neurons operate in saturation.
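The pruning of saturated neurons mentioned above can be sketched as follows. This is a hypothetical illustration, assuming sigmoid activations in [0, 1]: a saturated neuron emits an almost constant extreme value regardless of the input, so its column can be dropped before training the next layer. The function name and tolerance are assumptions, not part of the article.

```python
import numpy as np

def saturated(hidden_outputs, tol=0.01):
    """Flag neurons whose output is (almost) constant at an extreme value.

    `hidden_outputs` is an (n_samples, n_neurons) array of sigmoid
    activations in [0, 1]. A saturated neuron stays within `tol` of 0 or 1
    on every sample, hence carries no useful information onwards.
    """
    near_low = (hidden_outputs < tol).all(axis=0)
    near_high = (hidden_outputs > 1 - tol).all(axis=0)
    return near_low | near_high

# Toy layer: first neuron saturated high, second one healthy.
H = np.column_stack([np.full(100, 0.999),
                     np.random.default_rng(0).random(100)])
mask = saturated(H)
H_pruned = H[:, ~mask]   # columns fed to the next layer in a PT setting
```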
Improving the generalisation ability of MLPs by adding layers. Although it is widely acknowledged that an MLP with a single hidden layer can be a universal function approximator 35,36 , with the rise of deep learning it has been shown that many real-world problems admit very compact representations as the number of layers increases-see, in particular, Bengio 37 . According to this view, one can hypothesise that deploying a difficult problem over an MLP with more than one hidden layer may help its generalisation ability. A simple way to verify this hypothesis consists of thrusting each intermediate layer, but the last, to act as a feature encoder. Unfortunately, classical backpropagation (BP) is not compliant with this perspective, as in principle all layers are concurrently updated every time a training sample is processed. Being aware of this limitation, progressive training (PT) has been used in place of BP, the main characteristic of the former being its ability to train each hidden layer in isolation, starting from the first and going onwards. Technical details on this layer-wise strategy are given in the "Methods" section. Let us examine the behaviour of an MLP that underwent progressive training using an example in which the problem at hand is apparently difficult. Figure 5 reports the signature of the classical dataset arrhythmia. This signature suggests that the problem is expected to be difficult, as almost all features lie in proximity of the ϕ axis. Running 10-fold cross validation on an MLP that embeds a single hidden layer yielded an accuracy of about 65%, regardless of the adopted MLP parameters and of the number of hidden neurons (i.e., 10, 20 and 50 at different trials). On the contrary, training an MLP that embeds two hidden layers using PT yielded an accuracy of about 75%. Figure 6 shows the signature corresponding to each hidden layer, making clear that they have separate responsibilities.
In particular, the first performs a successful feature encoding, so that several new "features" are now more correlated with the positive or negative category. The second layer highlights that the attempt to find a good model (upon the outputs of the previous layer) has been almost successful, as some neurons are now covariant with the positive category, whereas others are contravariant. It is clear, however, that the success pattern found for the arrhythmia dataset in fact lies in between the patterns of failure and success. The extent to which any such pattern approximates failure or success depends on the distance of the neural clusters from their reference (upper or lower) corner. This phenomenon may be due to residual noise or to the limits of classical gradient descent. In either case, this "half-way" pattern deserves attention, being highly relevant to various aspects of the behaviour of MLPs.
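The layer-wise strategy can be sketched in Python as follows. This is a hedged reconstruction of the idea described in the text (each hidden layer trained in isolation, then frozen, with its outputs feeding the next stage), not the authors' exact PT implementation; scikit-learn and the `progressive_train` helper are assumptions introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

def progressive_train(X, y, shape=(24, 10, 6)):
    """Train each hidden layer in isolation, first to last.

    Every stage is a one-hidden-layer MLP fitted on the (frozen) outputs
    of the previous stage; the last stage also provides the output layer.
    """
    stages, inputs = [], X
    for width in shape:
        stage = MLPClassifier(hidden_layer_sizes=(width,),
                              activation='logistic', solver='sgd',
                              learning_rate_init=0.01, momentum=0.05,
                              max_iter=40, random_state=0).fit(inputs, y)
        stages.append(stage)
        # Frozen forward pass: hidden outputs become the next-stage inputs.
        inputs = 1.0 / (1.0 + np.exp(-(inputs @ stage.coefs_[0]
                                       + stage.intercepts_[0])))
    return stages, inputs

X, y = make_classification(n_samples=300, n_features=20, random_state=1)
stages, top = progressive_train(X, y)
```

Between stages, the intermediate `inputs` array can be plotted as a ϕ, δ diagram to watch the patterns discussed above emerge layer by layer.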
Effectiveness of progressive training. Beyond the need to make clearer the process of feature encoding and pattern formation inside an MLP, PT has also shown to be very effective. Three kinds of experiments are reported hereinafter: (i) a preliminary experiment on a synthetic dataset in which all features have been randomly generated, and two of them determine the labelling of samples through the xor logical function; (ii) experiments performed on some non-trivial datasets from UCI; and (iii) experiments on well-known medium-size datasets, showing that relevant patterns can be found also in problems characterised by a high number of samples and/or features. All experiments shed light on the fact that relevant patterns occur regardless of the size of the datasets and regardless of the origin of the data (i.e., synthetic or real).
Experiments on the xor dataset. As a preliminary experiment, the power of layer-wise training has been assessed on a variant of the classical xor problem. This groundwork has been devised to verify that the ability of MLPs to combine features persists when moving from BP to PT. The corresponding experimental setup is the following: a synthetic dataset with 98 real-valued features and 1000 samples has been generated. Afterwards, two logical features have been added, randomly selecting their values for each sample. Then, the labelling of each sample has been set by calculating the xor between the two extra features. An MLP has been trained on this dataset, first using BP and then using PT. The corresponding architecture (i.e., three hidden layers with 10, 4, and 2 neurons) has been left unchanged. To prevent experiment outcomes from being biased by a single run, multiple runs have been performed, with typical values for learning rate and momentum (i.e., 0.01 and 0.05, respectively). A negligible classification error (i.e., about 0.5%) has been observed during the experiments, meaning that in both cases the MLP was able to identify the two relevant input features. Figure 7 illustrates the content of the third hidden layer, as obtained by enforcing both kinds of training strategies (these contents have been extracted from a single, though representative, run). The cited figure highlights that both MLPs are able to cope with the xor problem. However, the rise of a success pattern is clearer for PT. The feature extraction process is highlighted by Fig. 8, which summarises the informative content of the input layer and of all hidden layers, as obtained by running PT. Notably, the input layer makes clear that this problem cannot be solved by univariate analysis, as all features-taken in isolation-are almost completely independent of the class label (recall that perfect independence occurs at the crossing between the horizontal and vertical axes).
Nevertheless, the first hidden layer shows that several combinations of input features are in fact relevant to the classification process. An initial attempt at reaching a success pattern is then shown at the second layer, whereas a full success pattern occurs at the third hidden layer.
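A dataset of the kind described above can be generated with a few lines of NumPy; this is a sketch consistent with the setup in the text (98 irrelevant real-valued features, two random logical features, label equal to their xor), with the seed and variable names chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples = 1000
noise = rng.random((n_samples, 98))        # 98 irrelevant real-valued features
f1 = rng.integers(0, 2, n_samples)         # first random logical feature
f2 = rng.integers(0, 2, n_samples)         # second random logical feature
X = np.column_stack([noise, f1, f2])       # 100 features in total
y = np.logical_xor(f1, f2).astype(int)     # label = xor of the two extras
```

By construction each of `f1` and `f2`, taken in isolation, is statistically independent of `y`, which is why the input-layer signature shows all features near the crossing of the axes.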
Experiments on non-trivial datasets from UCI. Some preliminary comments on the experimental setup follow. A fixed policy has been adopted for parameter setting, with the goal of ensuring a fair assessment of the algorithms under testing. In particular, learning rate and momentum have been set to 0.01 and 0.05, and the number of epochs to 40. To ensure statistical significance, training and test runs have been repeated 100 times for each dataset using random splitting. A two-sample Welch t-test has been evaluated on the available data, with the significance level for p-values set to 0.05. As for MLP architectures, a fixed choice has been adopted as well. In particular, a hidden layer equipped with 30 neurons has been used for experiments made on MLPs equipped with a single hidden layer (BP only), whereas a network shape with 24, 10 and 6 neurons has been used for experiments made on MLPs equipped with more hidden layers (for both BP and PT). The number of variants being sizeable, commands have been run in batch mode. Some details about the format of batch command files are given in the Supplementary Information available online (Supplementary section S2). Log files concerning experimental results (which include confusion matrix, specificity, sensitivity, accuracy, ϕ and δ) are also automatically stored in a similar format during each run; in so doing, any further analysis is greatly simplified. As for the criteria used for assessment, they have been derived from two typical perspectives: one focused on the classifier behaviour under the hypothesis that it must adapt to the statistics of the data to be classified, and the other on the behaviour expected on balanced data. The simplest way to put the former perspective into practice is to adopt accuracy as performance measure.
However, in scenarios in which there are practical reasons to focus on type-I or type-II errors (e.g., in biomedicine), other measures may be used instead, including precision-recall diagrams 38 , coverage plots 39 , or generalised ϕ, δ diagrams 32 . On the other hand, an equally simple way to account for the latter perspective is to focus on the accuracy that would be measured if the data were in fact balanced. This shift can be implemented by substituting accuracy with its "unbiased" variant as performance measure. Note that the two kinds of accuracy are strictly related; in particular, with n and p the fractions of negative and positive samples, the former (say a) can be defined as a = n · ρ̄ + p · ρ, whereas the latter (say a_u) is obtained by setting n = p = 1/2 in the previous formula, yielding a_u = (ρ̄ + ρ)/2. Note that δ can also be easily reformulated in terms of unbiased accuracy, as δ = ρ̄ + ρ − 1 = 2 · (ρ̄ + ρ)/2 − 1 = 2 · a_u − 1, meaning that δ is in fact the unbiased accuracy stretched into the interval [−1, +1].

Table 1. Comparison, with focus on accuracy, between BP and PT applied to non-trivial datasets (all from UCI). For each dataset, 100 training and test sets have been generated by random splitting. On the backpropagation side, experiments have been performed on two kinds of MLPs: one equipped with one hidden layer and the other with a shape identical to the one selected for PT. The best result obtained on the backpropagation side has been retained for each dataset. Results in favour of/against PT are highlighted with black/white circles, whereas results with no significant difference are highlighted with an equal sign. A two-sample Welch t-test has been used to check the similarity between the outcomes of the different kinds of classifiers. The significance level for p-values has been set to 0.05. The standard deviation is also reported for accuracy. Legend: 1HL/nHL = MLP with one/more than one hidden layer, ρ̄ = specificity, ρ = sensitivity, and a = accuracy.

Table 2. Comparison, with focus on δ (and on ϕ), between BP and PT applied to non-trivial datasets (all from UCI). The comparison is in fact focused on unbiased accuracy-i.e., on the accuracy measured as if the datasets were in fact balanced. Results in favour of/against PT are highlighted with black/white circles, whereas results with no significant difference are highlighted with an equal sign. The standard deviation is reported for both ϕ and δ.
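The relation among plain accuracy, unbiased accuracy and δ can be checked numerically; the sketch below uses arbitrary illustrative values for the class fractions, specificity and sensitivity (none of them taken from the article's tables).

```python
# With n and p the fractions of negative and positive samples, and
# spec/sens the specificity and sensitivity:
#   a     = n*spec + p*sens        (plain accuracy)
#   a_u   = (spec + sens) / 2      (unbiased accuracy, n = p = 1/2)
#   delta = spec + sens - 1 = 2*a_u - 1
n, p = 0.7, 0.3            # illustrative class imbalance
spec, sens = 0.9, 0.6      # illustrative per-class accuracies

a = n * spec + p * sens
a_u = (spec + sens) / 2
delta = spec + sens - 1
```

Note how the imbalanced accuracy `a` (0.81) overstates the balanced one `a_u` (0.75) because the majority (negative) class is classified more accurately.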

The experimental results reported hereinafter are commented on by following this twofold approach. In particular, Tables 1 and 2 are respectively focused on accuracy and on unbiased accuracy. However, having stressed the strict relation that holds between a_u and δ, and considering that this article is centred on ϕ and δ as performance measures, Table 2 actually reports ϕ and δ values. Notably, the pair ϕ and δ can also give useful information about the generalisation capability of a trained MLP. In fact, from a pragmatic point of view, generalisation occurs when no performance reduction is observed while moving the focus from the training to the test set, and at the same time specificity and sensitivity are maximised. Moreover, they should be maximised to the same extent as much as possible, a desired property being that the classifier should have null or minimal bias towards the positive or negative category. Equivalently, one may impose that the sum of sensitivity and specificity (i.e., ρ̄ + ρ) be maximised and their difference in absolute value (i.e., |ρ̄ − ρ|) minimised. Not incidentally, δ = ρ̄ + ρ − 1, whereas ϕ = ρ̄ − ρ. Besides, the above recommendations for assessing the generalisation capability of an MLP, i.e., low |φ| and high δ, are perfectly compliant with those regarding success patterns, the only difference being that for success patterns the sign of δ is no longer relevant-as highly contravariant neurons are also very helpful for the classification task. Due to lack of space, on the BP side only the best performing result has been retained for each dataset, choosing between 1HL and nHL architectures (with 1HL = MLP equipped with one hidden layer, and nHL = MLP equipped with more hidden layers). In particular, a better performance of the latter (i.e., nHL) has been found only on the datasets autos and bank, for both accuracy and unbiased accuracy.
Looking at Table 1, PT appears more effective than BP, as five out of ten results are in favour of the former, whereas the others are equivalent. As for Table 2, no clear indication supports the conjecture that one algorithm is better than the other (three datasets are in favour of PT and one in favour of BP). As for the generalisation capability, given the results in terms of ϕ and δ, it is worth pointing out that in some datasets the standard deviations on ϕ reveal a non-negligible degree of instability over runs. Summarising, the experimental results show a slight superiority of PT. However, considering that the comparison has been made on a limited number of datasets, and also taking into account that no specific trial-and-error strategy has been put into practice for parameter tuning and MLP shape identification, a cautious conclusion is that PT is expected to be at least as effective as BP.
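The comparison protocol can be sketched as follows; the per-run accuracies below are simulated placeholders (the article's actual 100-run results are not reproduced), and only the Welch t-test machinery matches the described setup.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Hypothetical per-run accuracies for two algorithms (100 runs each).
acc_bp = rng.normal(0.80, 0.02, 100)
acc_pt = rng.normal(0.82, 0.02, 100)

# equal_var=False selects the Welch variant of the two-sample t-test.
stat, pvalue = ttest_ind(acc_pt, acc_bp, equal_var=False)

if pvalue >= 0.05:
    verdict = 'no significant difference'       # "=" in the tables
elif acc_pt.mean() > acc_bp.mean():
    verdict = 'PT better'                       # black circle
else:
    verdict = 'BP better'                       # white circle
```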
Experiments on medium-size datasets. To complete the assessment concerning the existence of relevant patterns inside MLPs and the effectiveness of PT, further experiments have been performed on medium-size datasets (all downloaded from the Kaggle ML repository). As for the experiments performed on UCI datasets, two tables are reported here as well (i.e., Tables 3 and 4), which account for the behaviour found by focusing on accuracy and on unbiased accuracy (through δ). Given that running experiments on medium-size datasets is a time-consuming task, only 4 runs have been performed for each dataset and algorithm under testing. BP (nHL only) and PT have been tested using the same MLP architecture and the same training parameters. With the exception of one dataset (see below), hidden layers have been equipped with 40, 30, 20, and 10 neurons. As for the relevant parameters, learning rate and momentum have been set to 0.01 and 0.05, whereas the number of epochs has been set to 40. Notably, this choice of experimental settings did not affect the emergence of relevant patterns; in fact, for easy problems (e.g., MNIST), the adoption of more layers than needed had the effect of showing a success pattern at the first hidden layer and the same pattern almost unchanged at subsequent layers. Table 3 highlights that the performances of BP and PT are equivalent on the selected datasets, whereas Table 4 points out that high levels of generalisation have occurred on all datasets, δ being typically high and |ϕ| always low. As for the occurrence of relevant patterns, this set of experiments has shown no difference between small- and medium-size datasets. As an example, Fig. 9 reports the occurrence of success patterns for the Credit Card Fraud Detection 40 dataset. The problem is expected not to be difficult (see the signature of the input layer), and indeed a success pattern occurs already at the first hidden layer.
Notably, notwithstanding the reduction of neurons per layer, the pattern is preserved on subsequent layers (the last hidden layer has not been reported for the sake of brevity). Almost all experiments, except for Diabetic Retinopathy Detection, have been run without performing any feature selection. The cited exception is a collection of images of size 224 × 224, encoded with three RGB values per pixel, for a total of 150,528 features. Given such a huge number of features, neither BP nor PT was able to generalise. Hence, a simple yet effective strategy has been put into practice. First, feature ranking has been performed on the training dataset. Recalling that, for a feature f, it is not important to be covariant or contravariant with the positive category, the function |δ(f)| has been used to generate the ordering. Then, the first 500 features have been retained and used for training. The MLP architecture has also been adapted to this dataset. In particular, hidden layers have been equipped with 100, 50, 40, and 20 neurons.
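The |δ(f)|-based ranking step can be sketched as follows; the helper name and the toy binary data are assumptions introduced here for illustration, and the absolute value reflects the remark above that covariant and contravariant features are equally useful.

```python
import numpy as np

def rank_features_by_abs_delta(X_bin, y, k=500):
    """Rank binary features by |delta(f)| and keep the indices of the top k."""
    pos, neg = (y == 1), (y == 0)
    sens = X_bin[pos].mean(axis=0)          # per-feature sensitivity
    spec = 1.0 - X_bin[neg].mean(axis=0)    # per-feature specificity
    delta = spec + sens - 1.0
    order = np.argsort(-np.abs(delta))      # descending |delta|
    return order[:k], delta

# Toy data: 50 random binary features, one of which (index 7) is planted
# to coincide with the labels.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 400)
X_bin = rng.integers(0, 2, (400, 50))
X_bin[:, 7] = y
top, delta = rank_features_by_abs_delta(X_bin, y, k=5)
```

The retained columns (`X_bin[:, top]`) would then be fed to the MLP in place of the full feature set.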

Discussion
In this section, further comments are given on the following aspects: (i) the capability of MLPs to act as feature encoders when trained with PT; (ii) the occurrence of relevant patterns inside MLPs; (iii) the role of ϕ, δ diagrams in the task of devising more effective training strategies; and (iv) the possibility of using MLPs and ϕ, δ diagrams in combination to perform multivariate analysis.
The insight that lies behind the work described in this article is that MLPs are very effective feature encoders and that relevant patterns arise upon training. The soundness of this insight is also supported by the ongoing research on deep learning, which has fully adopted autoencoders (see Bourlard and Kamp, and Hinton and Zemel 42 ) as a means for implementing feature extraction from an unsupervised perspective. However, the process of encoding features is not necessarily unsupervised. In fact, applying backpropagation makes it possible to generate different encodings of the input features at each hidden layer, driven by the errors observed at the output layer. This process of feature encoding is particularly effective for PT, in which hidden layers are trained one at a time, using mean square error as loss function. In so doing, the transformation of inputs from one layer to another is progressively driven by a generative thrust aimed at making the new encodings more covariant or contravariant with the positive category. Besides, many of the observations reported in Shwartz-Ziv and Tishby 21 have been experimentally assessed while checking the soundness of PT. In fact, on average, the proposed strategy appears at least as effective as BP, better highlights the input compression mechanism, and typically shows a shorter training time.

Figure 10. The sequence highlights that a pattern of success occurs at the first hidden layer and that it is slightly improved at the subsequent layers. The last layer is not reported for the sake of brevity.
As for the occurrence of relevant patterns inside MLPs, experimental assessments made over a great number of datasets have shown that the following significant kinds of pattern hold: failure, success and partial success. Failure patterns point to the inability of the MLP to come up with new feature spaces lying as far as possible from the ϕ axis. This kind of pattern has been extensively studied, and failures due to the inability of the network to come up with a suitable generalisation have been distinguished from those related to algorithmic issues. In particular, the former case is typically evidenced by neurons located close to the centre of the ϕ, δ diagram, whereas the latter (which highlights the occurrence of saturation) is typically evidenced by neurons that lie in proximity of the left and/or right corner(s). Needless to say, these two cases can occur jointly. As for success patterns, they hold when the training activity has generated neurons that are highly covariant or highly contravariant with the positive category. Typically, one cluster is found in proximity of the upper corner and another in proximity of the lower one; however, a single cluster may also occur. Experimental results demonstrated that, once achieved, any such pattern tends to be steady. In particular, assuming that a researcher has decided that the MLP at hand should have a shape of length N and that a success pattern has been found at a layer k* < N, the point here is whether training the remaining layers may improve the final performance or not. Experiments performed on many different problems highlight that negligible improvement can be obtained after k*, as at all layers k > k* similar patterns of generalisation success would be reproduced. Figure 10 reports the signatures of an MLP equipped with four hidden layers (all with the same number of neurons) and trained on the WBC dataset.
The figure shows that only marginal changes characterise each hidden layer with respect to the others, highlighting that a pattern of success holds on all layers. The lack of relevance of these variations has been assessed by removing all hidden layers but the first: as expected, the classification performance was not affected by the removal. Notably, the steadiness of success patterns appears very important for devising adaptive training strategies in which the shape of an MLP is not defined a priori. Patterns of partial success occur when clusters of neurons tend to attain the upper and/or the lower corner, but with limited success (in other words, these clusters typically stand half-way between the ϕ-axis and the upper and/or lower corners). One reasonable hypothesis that may explain their occurrence is that they originate from the intrinsic difficulty of the dataset at hand. A more interesting hypothesis, which does not rule out the previous one, is that this kind of pattern accounts for the limitations of BP. In the event that at least part of the responsibility for the occurrence of half-way patterns lies with the gradient descent enforced by backpropagation, there is room for experimenting with alternative training strategies (e.g., SGD) able to inject the pseudo-random behaviour required to escape from local minima. These patterns also tend to be steady, thus highlighting that there is no guarantee that PT will come up with a better generalisation by just adding further layers. Notably, the presence of half-way patterns is consistent with the findings of Salakhutdinov and Murray 43 and of Larochelle and Bengio 44, who point out that deep architectures cannot be considered better than shallow ones on every problem. This proposal also highlights the role of ϕ, δ diagrams in the task of devising more effective training strategies.
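The pattern categories discussed above can be made concrete with a small sketch. Assuming the common ϕ, δ convention in which δ = sensitivity + specificity − 1 and ϕ = sensitivity − specificity (so that highly covariant neurons sit near the upper corner, contravariant ones near the lower corner, and saturated ones near the left/right corners), the signature of a hidden layer can be computed by binarising each neuron's activations. Function names and thresholds are illustrative, not taken from the article:

```python
import numpy as np

def phi_delta(pred, y):
    """Return (phi, delta) for a binarised predictor against binary labels.

    Convention assumed here: delta = sensitivity + specificity - 1
    (covariance with the positive category) and phi = sensitivity - specificity
    (bias). A perfect neuron then sits at the upper corner (0, 1), an
    anti-correlated one at the lower corner (0, -1), and a saturated
    always-on/always-off neuron at (+1, 0) / (-1, 0).
    """
    pred, y = np.asarray(pred, bool), np.asarray(y, bool)
    sens = np.mean(pred[y])      # true positive rate
    spec = np.mean(~pred[~y])    # true negative rate
    return sens - spec, sens + spec - 1.0

def layer_signature(activations, y, threshold=0.5):
    """phi-delta signature of a hidden layer: one (phi, delta) point per neuron."""
    return np.array([phi_delta(a >= threshold, y) for a in activations.T])

def classify_pattern(signature, hi=0.7, lo=0.3):
    """Crude labelling of the three pattern kinds (thresholds are illustrative)."""
    d_max = np.abs(signature[:, 1]).max()
    if d_max >= hi:
        return "success"
    if d_max >= lo:
        return "partial success"
    return "failure"
```

In this sketch a layer is labelled by its best neuron, reflecting the idea that a single cluster near the upper or lower corner is enough to establish a success pattern.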
In particular, the fact that the inner behaviour of MLPs can be investigated with proper visual and computational tools (i.e., ϕ, δ diagrams and measures) opens new scenarios, in which further relevant techniques could be borrowed from the machine learning and pattern recognition communities and adapted to this research topic. For instance, assuming that PT is used, one may focus on the advantage of pruning the current layer before training the next. The benefit of applying pruning should not be surprising, as the ϕ, δ analysis performed on hidden layers made it possible to verify that they may contain neurons which are apparently not useful for the classification task (e.g., those that lie in proximity of the left or right corner). A simple pruning strategy would consist of devising a proper cost function and using it for ranking neurons, with the goal of identifying candidates for deletion. Rather than adopting entropy or the Gini index, which appear more convenient for identifying neurons characterised by low accuracy in a MAP setting 45, one may be interested in minimising the bias as well (i.e., in trying to make specificity and sensitivity as equal as possible while maximising the accuracy). Which cost functions are most appropriate for enforcing bias minimisation is still a matter of investigation. In any case, the opportunity of visually inspecting hidden layers by means of ϕ, δ diagrams should greatly help rapid advances in this research topic. The reader interested in pruning and compression techniques may also consult, for instance, the review articles of Augasta and Kathirvalavakumar 46 and of Cheng et al. 47. Alternative pruning strategies may be devised also considering that, at least in principle, any layer of an MLP could be seen as an ensemble, the role of individual classifiers being played by the corresponding neurons.
According to this view, an appropriate pruning strategy might be devised in accordance with the proposals concerning the tradeoff between diversity and accuracy, as investigated by the community of classifier ensembles (see for example Kuncheva and Whitaker 48 and Bhatnagar et al. 49 ). Notably, this view is indirectly confirmed by the adoption of softmax 50 in two relevant scenarios: (i) as output blender for classifier ensembles, e.g., Memisevic et al. 51 , and (ii) as output combiner for CNNs, e.g., Krizhevsky et al. 52 and Liu et al. 53 .
The last comments of this section are devoted to highlighting the potential of using MLPs and ϕ, δ diagrams in combination to perform multivariate analysis. In 28 the author points to the ability of ϕ, δ diagrams to support feature-importance analysis by calculating the class signature on the given dataset (recall that class signatures fall into the broad category of univariate analysis, as they concentrate on each feature taken in isolation). Class signatures are in fact "semi-decidable", meaning that when at least one feature highly covariant or contravariant with the positive category is found, then the problem at hand is certainly easy. Unfortunately, the converse is not true. In other words, when no good features are found, one may conjecture that the problem is difficult, but further supporting information is needed to complete the assessment. Fortunately, training an MLP on the given dataset can shed more light on its actual difficulty. Again, a failure (i.e., an MLP with limited generalisation capability) would not necessarily mean that the problem is in fact hard to solve. However, the identification of a success pattern at any hidden layer would turn the appraisal from potentially difficult to easy. This support for decision making is obtained by providing the researcher with several class signatures, one for each layer (including the one evaluated on the inputs). Although each signature is made according to a univariate perspective, those evaluated at the hidden layers can give strong support to multivariate analyses, as the neurons therein may provide useful combinations of input features (depending on the work done by the adopted training algorithm). Evidence of the shortcomings of univariate analysis has been given above with the synthetic xor problem.
This is a characteristic example of how univariate analysis may fail while multivariate analysis (obtained by the combined use of MLPs and ϕ, δ diagrams) may succeed.
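The xor case can be verified directly. In the toy sketch below (using the δ = sensitivity + specificity − 1 convention assumed throughout these examples), each input feature taken in isolation has δ = 0, i.e., it sits on the ϕ-axis and looks useless, whereas a combined feature of the kind a trained hidden layer may encode reaches δ = 1:

```python
import numpy as np

# Truth table of the synthetic xor problem: two binary inputs, label = x1 XOR x2.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], bool)
y = X[:, 0] ^ X[:, 1]

def delta(pred, y):
    """delta = sensitivity + specificity - 1 (covariance with the positives)."""
    return np.mean(pred[y]) + np.mean(~pred[~y]) - 1.0

# Univariate view: each input feature, taken in isolation, sits at the centre.
print(delta(X[:, 0], y), delta(X[:, 1], y))          # both 0.0

# Multivariate view: a combination such as a hidden layer might learn.
hidden = (X[:, 0] | X[:, 1]) & ~(X[:, 0] & X[:, 1])  # equivalent to xor
print(delta(hidden, y))                              # 1.0
```

The two raw features are indistinguishable from noise in a univariate signature, while the combined feature is perfectly covariant with the positive category, which is exactly the semi-decidability argument made above.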

Methods
This section is devoted to giving some details about PT, whose most distinctive characteristic is the ability to perform greedy layer-wise training. To understand PT from an algorithmic perspective, let us concentrate on Fig. 11, which gives a snapshot of the training activity under the hypothesis that the shape of the MLP at hand is ⟨240, 80, 40, 20, 1⟩, meaning 240 inputs, a first hidden layer with ⟨240 × 80⟩ weights, a second with ⟨80 × 40⟩ weights, a third with ⟨40 × 20⟩ weights, and an output layer with ⟨20 × 1⟩ weights. The process starts by training the first layer. Then this layer is frozen and used as feature encoder to provide the inputs for training the second layer. Again, this second layer is frozen and used (in pipeline with the first) as feature encoder to provide the inputs for training the third layer, and so on until the output layer is reached. Notably, PT is compliant with the vision of Shwartz-Ziv and Tishby 21, who adopt a "supervised" perspective while analysing the content of a multilayer neural network from an information-theoretic standpoint. In particular, assuming that X and Y denote the input and output of a multilayer ANN, the authors point out that any internal representation of the input, say T, can be seen as the combination of an encoder, P(T|X), and a decoder, P(Ỹ|T), with Ỹ representing the output of the network. In fact, PT provides an effective greedy strategy aimed at generating the encoder entrusted with feeding the final layer. To this end, at each iteration, an MLP with one hidden layer (say, the trainee) is trained, whose input is taken from the encoder generated so far (of course, at the first iteration, the actual input is given to the trainee). At the end of the iteration, the hidden layer of the trainee is appended to the encoder, whereas its decoder is discarded. Then, the training algorithm starts over again with the next iteration, until the final architecture has been generated.
Note that, at the last iteration, the output layer (i.e., the decoder) of the trainee is appended to the encoder. In so doing, PT gives an effective solution to the problem of finding useful representations of the given input according to the information provided by sample labelling. This conceptualisation has been implemented by defining a Python class, say MLPP, in which several trainees (i.e., different MLPs equipped with a single hidden layer) are trained, and their hidden layers are progressively embedded into a proper internal slot (of the MLPP object) that provisionally plays the role of input encoder (see also the Supplementary Information, Supplementary section S3, available online). It is worth pointing out that, unlike other proposals that typically rely on autoencoders, here feature extraction is performed according to a supervised perspective. In fact, in PT the information about class labelling is made part of the whole process at each iteration. As for its roots, PT shares the practice of performing layer-wise training with several proposals. Pioneering work on this topic was done by Fahlman and Lebiere 54, who proposed the cascade correlation (CC) architecture and algorithm. Instead of just adjusting the weights in a network of fixed topology, the CC algorithm begins with a minimal network, and then automatically trains and adds new hidden units one by one, creating a multi-layer structure. Once a new hidden unit has been added to the network, its input-side weights are frozen, so that the unit becomes a permanent feature detector in the network under construction. In 2006, Hinton et al. 55 proposed a layer-wise learning algorithm for training deep belief networks. The authors illustrate a fast, greedy learning algorithm for constructing multilayer directed networks, one layer at a time. This algorithm is used to initialise a slower learning procedure aimed at performing fine-tuning. Bengio et al.
30 propose to pre-train DNNs using a layer-wise unsupervised strategy, followed by fine-tuning. In a 2010 article, Erhan et al. 56 analyse why pre-training helps the learning process. Following the insight on pre-training, Furusho et al. 57 investigate how relevant information-theoretic measures are related to the generalisation error, how the representations change as the number of hidden layers increases, and how pre-training affects input encoding. Arnold and Ollivier 58 propose a layer-wise training procedure based on the best latent marginal, which is able to approximate the global optimum. Duan et al. 59 propose a novel family of connectionist models based on kernel machines and consider the problem of learning concepts layer by layer. In particular, the authors propose a method to "kernelize" any ANN, which makes it possible to obtain a counterpart of any given ANN that is actually powered by kernel machines instead of neurons. Considering the two-layer case without loss of generality, the authors illustrate a framework (and an algorithm) able to minimise the objective function of the network according to a greedy training scheme that learns one layer at a time.
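The greedy layer-wise scheme described above can be sketched in a few lines of numpy. The following is a minimal illustration, not the article's MLPP implementation: each trainee is a one-hidden-layer MLP with sigmoid units trained under MSE by plain gradient descent; its hidden layer is then frozen and appended to the encoder, while its decoder is discarded (training of the final output layer is omitted for brevity, and all hyperparameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_trainee(X, y, n_hidden, epochs=2000, lr=0.5):
    """Train a one-hidden-layer MLP (sigmoid units, MSE loss) by full-batch
    gradient descent and return only its hidden-layer weights: the decoder
    (W2, b2) is discarded, as in PT."""
    W1 = rng.normal(0, 1, (X.shape[1], n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 1, (n_hidden, 1));          b2 = np.zeros(1)
    t = y.reshape(-1, 1)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)        # hidden activations
        O = sigmoid(H @ W2 + b2)        # trainee output
        dO = (O - t) * O * (1 - O)      # MSE gradient through output sigmoid
        dH = (dO @ W2.T) * H * (1 - H)  # backpropagated to the hidden layer
        W2 -= lr * H.T @ dO; b2 -= lr * dO.sum(0)
        W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(0)
    return W1, b1

def progressive_train(X, y, shape):
    """Greedy layer-wise training: each hidden layer is trained via a trainee,
    frozen, and used (in pipeline with the previous ones) as feature encoder
    providing the inputs for the next trainee."""
    encoder, Z = [], X
    for n_hidden in shape:
        W, b = train_trainee(Z, y, n_hidden)
        encoder.append((W, b))          # freeze the trainee's hidden layer
        Z = sigmoid(Z @ W + b)          # encoded inputs for the next layer
    return encoder, Z
```

For instance, progressive_train(X, y, (4, 2)) on the xor truth table grows a two-layer encoder whose second layer is trained on the outputs of the frozen first one; the returned Z plays the role of the final internal representation T fed to the output layer.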
As final comments, note that the vanishing (or exploding) gradient problem (see for example Hochreiter et al. 60) is strongly mitigated by PT, as each layer is trained separately from the others. Hence, the push for change that motivated the adoption of ReLU in place of a sigmoid or hyperbolic tangent as the "right" activation function should become a minor issue under this framework. Moreover, using PT, the adoption of proper pruning strategies is expected to nullify the problem of residual neurons found to be in saturation.

Conclusion
This article has shown that relevant patterns arise inside MLPs upon training, as clearly highlighted by the ϕ, δ signatures of their hidden layers. In particular, the analysis has experimentally demonstrated that, upon training, clear patterns representing success or failure typically hold. The existence of patterns that lie half-way between success and failure has also been highlighted. To facilitate the analysis of MLPs equipped with more than one hidden layer, a layer-wise training strategy, called progressive training, has been devised and implemented. In this strategy, layers are individually trained, starting from the one in charge of processing the given inputs. Although this is a methodological article, experimental results show that progressive training appears to be a viable alternative to the backpropagation algorithm. As for future work, several research activities are under way, including (i) getting a better understanding of the connection between relevant MLP patterns and training parameters; (ii) devising a general criterion for stopping the training as soon as a success pattern is found; (iii) devising proper pruning strategies, entrusted with the deletion of neurons deemed useless; (iv) investigating the causes that generate half-way patterns; and (v) characterising progressive training from an information-theory perspective. Notably, after reaching theoretical and/or experimental findings on the cited issues, setting up an adaptive training strategy in which the final shape of a neural architecture is not known in advance should no longer be an unattainable goal.

Data availability
The data that support the findings of this study are available from the following Machine Learning Repositories: (a) UC Irvine (URL https://archive.ics.uci.edu/ml/datasets.php) and (b) Kaggle (URL https://www.kaggle.com/data). All datasets used in this research are publicly available.