Modern synergetic neural network for imbalanced small data classification

Deep learning's performance on imbalanced small data is substantially degraded by overfitting. Recurrent neural networks retain better performance in such tasks by constructing dynamical systems for robustness. The synergetic neural network (SNN), a synergetics-based recurrent neural network, excels at eliminating recall errors and pseudo memories, but is subject to frequent association errors. Since the cause remains unclear, most subsequent studies use genetic algorithms to adjust parameters for better accuracy, which occupies the parameter optimization space and hinders task-oriented tuning. To solve the problem and promote SNN's application capability, we propose the modern synergetic neural network (MSNN) model. MSNN solves the association error by correcting the state initialization method in the working process, liberating the parameter optimization space. In addition, MSNN optimizes the attention parameter of the network with the error backpropagation algorithm and the gradient bypass technique, which allows the network to be trained jointly with other network layers. The self-learning of the attention parameter enables adaptation to imbalanced sample sizes, further improving the classification performance. In 75 classification tasks on small UC Irvine Machine Learning datasets, MSNN achieves the best average rank against 187 neural and non-neural machine learning methods.

wrong basin of attraction in its working procedure. Since the cause of the problem has not been revealed, parameter tuning has become the dominant route. Due to the lack of explicit objectives, optimization methods are often based on genetic algorithms. Such a research route substantially complicates SNN's application process and occupies the optimization space of the parameters, so task-oriented parameter training is difficult to introduce simultaneously. These problems have led related research to a standstill.
In this paper, we propose a modern synergetic neural network (MSNN) model to properly apply the advantages of SNN to practical problems. We first address the association error and release the parameter tuning space by defining and remodeling the state initialization method of SNN. Although SNN's first study and some subsequent studies suggest that its initial state characterizes the similarity between samples and memories 11,23,32-34 , we prove that the initial state does not conform to the principles of a similarity metric. Therefore, we distill the network's method of calculating the initial state and remodel it as a definitive solution. Since the new solution isolates the parameter tuning process, the whole optimization space can be reserved for the task properties. We design an error backpropagation (EBP) based attention parameter training method that allows MSNN to be co-trained with other network layers for automatic data distribution adaptation. Experimental results on 75 imbalanced small UC Irvine Machine Learning (UCI) datasets show that these improvements make MSNN outperform 187 neural and non-neural methods.

Contribution of this work.
(1) Revealing the root of the SNN association error to be the wrong state initialization method.
(2) Updating SNN's working process to solve the association error and release the parameter tuning space.
(3) Proposing an EBP-based training method to enable adaptation of the built-in attention parameters to the data distribution.

Related work
In general, classification methods for imbalanced data include data preprocessing, training target modification, and proposing targeted methods 35 . Since most classification networks are not designed with data imbalance in mind, related research focuses on data preprocessing methods. Depending on the distributional characteristics of the data, preprocessing methods can be categorized into oversampling 36,37 , undersampling 38,39 , and hybrid methods of the two 40,41 . In recent years, with the increasing data volume demanded by classification networks and the introduction of pattern generation methods based on generative adversarial networks, oversampling of minority classes has gradually become a mainstream method 42,43 . In contrast, our solution belongs to the category of targeted methods: our network natively supports training on imbalanced data, which distinguishes it from the above studies.

SNN overview
SNN's working procedure. SNN is a 3-layer RNN whose network structure is shown in Fig. 1. The updating formulas of SNN 11 for its input, hidden, and output layers are
$$\xi = V^{+}x, \tag{1}$$
$$\xi^{new} = \mathrm{Syn}(\xi), \tag{2}$$
$$x^{new} = V\xi^{new}. \tag{3}$$
x is the normalized query pattern. V = [v_1, ..., v_M] is the matrix of normalized static prototypes representing memories. x^new is the new input transmitted to Eq. (1). ξ is the vector of order parameters. V⁺ is the Moore-Penrose inverse 44,45 of V. Syn is the synergetic activation function. γ is the learning rate. Network parameters include λ, b, and c. λ is the attention parameter to the prototypes with default value 1. Higher attention brings greater chances of association. b and c control the convergence speed with default value 1. SNN requires that all prototypes are mutually independent and that their total number is less than the dimension, such that the product of V⁺ and V is the identity matrix. Substituting Eq. (3) into Eq. (1) gives ξ = V⁺x^new = V⁺Vξ^new = ξ^new. Thus, the update formula can be interpreted as constructing a dynamical system of ξ. ξ is the dynamical state, and its initial value is the initial state. The variation of ξ is reflected to x through V.
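As a minimal sketch of the two linear layers described above, the following NumPy snippet builds a small prototype matrix, computes its Moore-Penrose inverse, and checks that V⁺V is the identity, so that ξ = V⁺x recovers the order parameters of a query. The prototype values and the helper name `build_adjoint` are illustrative assumptions, not taken from the paper, and the Syn activation is not modeled here.

```python
import numpy as np

def build_adjoint(prototypes):
    """Normalize the prototype columns and return (V, V_plus)."""
    V = np.asarray(prototypes, dtype=float)
    V = V / np.linalg.norm(V, axis=0, keepdims=True)  # unit-norm columns v_1..v_M
    V_plus = np.linalg.pinv(V)                        # Moore-Penrose inverse, shape (M, N)
    return V, V_plus

# Hypothetical example: M = 2 independent prototypes in N = 3 dimensions (M < N).
V, V_plus = build_adjoint([[1.0, 1.0],
                           [0.0, 1.0],
                           [0.0, 0.0]])

# Independent columns make V_plus @ V the M x M identity matrix.
identity_check = V_plus @ V

# A pattern built from known order parameters is mapped back to them.
xi_true = np.array([0.3, -0.7])
x = V @ xi_true      # output layer: pattern = V xi
xi = V_plus @ x      # input layer: xi = V+ pattern
```

Because the input and output layers are inverse to each other on the span of the prototypes, the recurrent loop effectively iterates on ξ alone.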
SNN converges to three kinds of stationary points: the target stable point, the saddle point, and the local maximum point. The convergences are shown in Fig. 2. Generally, SNN reaches the target stable point. The target stable point is reached when ξ is a positive or negative one-hot encoding. The single nonzero order parameter is called the winner parameter. The network outputs ±v at this point, which reflects the association from x to v. The saddle point is reached when ξ has more than one identical nonzero value, which stems from multiple identical extremes in the initial state. The local maximum point is reached when all elements of ξ are 0. In this case, the division-by-0 error in Eq. (2) blocks the network from working.

SNN's basin of attraction.
In describing the convergence process to the target stable point, SNN proposes the "winner-takes-all" property, i.e., the order parameter with the biggest absolute initial value is the winner parameter, but a detailed proof is lacking. Therefore, we prove this property by showing that |ξ_m^new| is the largest when |ξ_m| is the largest. The detailed proof is shown in SI 1A. From the perspective of dynamical systems, the "winner-takes-all" property can be interpreted as extreme-based basin partitioning. The basin of an SNN attractor is the set of all initial states with the same sign and extreme value index as the attractor itself. The attractors, basins, and trajectories of random initial states of SNN in 2D and 3D are shown in Fig. 3. It can be seen that such a division allows the order parameter with the biggest absolute value to retain its winner position throughout the convergence.
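The "winner-takes-all" behaviour can be illustrated with a short simulation. The sketch below assumes the classical Haken order-parameter equation, discretized with an Euler step; this is a stand-in for Syn, whose exact form (including the normalization that causes the division by 0 mentioned above) may differ. The function names `haken_step` and `run` and the initial state are hypothetical.

```python
import numpy as np

def haken_step(xi, lam, b=1.0, c=1.0, gamma=0.1):
    """One Euler step of Haken's classical order-parameter equation (assumed form)."""
    total = np.sum(xi ** 2)          # sum of xi_k'^2 over all k'
    others = total - xi ** 2         # same sum without the k-th term
    return xi + gamma * xi * (lam - b * others - c * total)

def run(xi0, steps=2000, lam=None):
    """Iterate the assumed dynamics from the initial state xi0."""
    xi = np.asarray(xi0, dtype=float)
    lam = np.ones_like(xi) if lam is None else np.asarray(lam, dtype=float)
    for _ in range(steps):
        xi = haken_step(xi, lam)
    return xi

# The entry with the largest absolute value is the negative second one.
xi0 = np.array([0.5, -0.6, 0.3])
xi_final = run(xi0)
winner = int(np.argmax(np.abs(xi_final)))
```

With all parameters at their default value 1, the state ends at the negative one-hot vector for the second entry, i.e., the attractor that shares the sign and extreme-value index of the initial state.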

MSNN
A nonlinear dynamical system is sensitive to its initial state, so the initialization should be carefully designed. However, the initialization method of SNN was proposed without in-depth analysis. Although the associative memory task requires the correct association to be the most similar memory to the input, we prove that the existing initialization method can designate an order parameter with smaller similarity as the winner parameter. Due to the "winner-takes-all" property of SNN's convergence, the selected winner will converge to ±1, so the network will output the less similar memory as the association result, leading to the association error. To address this problem, we redesign the state initialization method to correct the winner designation process. The new approach ensures the consistency of the winner selection and the association target, fundamentally solving the association error problem of SNN. In addition, the new initialization method makes EBP-based parameter learning feasible.

SNN's erroneous state initialization method.
The working target of SNN is to converge to the most similar memory. The initial state controls the convergence, so the initialization method should be proposed under a similarity metric. However, the similarity between the sample x and the memory v_m cannot be characterized by the metric of SNN's state initialization method, S(x, v_m) = v_m⁺x, where v_m⁺ is the m-th row of V⁺. Although there are at least 67 different metrics applied in various fields 46 , all similarity metrics shall satisfy the following three principles 47 :
1. Commonality related. The more commonality they share, the more similar they are.
2. Difference related. The more differences they have, the less similar they are.
3. The maximum is reached when identical.
However, S actually characterizes the scaled cosine of the angle between v_m⁺ and x, conforming to none of the above principles. For Principle (3), since x is normalized, S reaches its maximum when x is the normalized adjoint vector v_m⁺/‖v_m⁺‖_2. From "SNN overview", V⁺V is the identity matrix, so
$$v_i^{+} v_j = \delta_{ij},$$
where δ_ij is the Kronecker delta, which means that v_m⁺ is perpendicular to the hyperplane of all prototypes except v_m. Since the inner product of v_m⁺ and v_m is 1, the angle between v_m⁺ and v_m takes values in the range [0, 0.5π). SNN requires v_m to be normalized, so ‖v_m⁺‖_2 ≥ 1. S may therefore achieve a bigger value when x is not equal to v_m, so S does not satisfy Principle (3). For Principles (1) and (2), as x gradually approaches v_m⁺ from v_m, its commonality with v_m decreases and the difference increases, but S increases rather than decreases. Therefore, S does not satisfy Principles (1) and (2).
The conflict between S and the similarity metric causes the association error. From the previous section, the order parameter with the largest absolute value in the initial state is the winner parameter. SNN will pick the wrong winner when S assigns the largest order parameter to a less similar v, which leads to an association error.
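A small numerical counterexample makes the conflict concrete. The sketch below takes the SNN initial state to be ξ = V⁺x, as in the overview, and uses cosine similarity as the reference notion of "most similar"; the prototypes and the query are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical unit-norm prototypes (columns): M = 3 memories in N = 4 dimensions.
V = np.array([[1.0, 0.0, 2.0 / 3.0],
              [0.0, 1.0, 2.0 / 3.0],
              [0.0, 0.0, 1.0 / 3.0],
              [0.0, 0.0, 0.0]])
V_plus = np.linalg.pinv(V)                  # rows are the adjoint vectors v_m^+

# Normalized query that lies closest to the first prototype.
x = np.array([0.9, 0.0, np.sqrt(0.19), 0.0])

cosine = V.T @ x                            # cosine similarity to each memory
xi0 = V_plus @ x                            # SNN initial state

most_similar = int(np.argmax(cosine))       # memory the task expects
snn_winner = int(np.argmax(np.abs(xi0)))    # memory picked by "winner-takes-all"
adjoint_norms = np.linalg.norm(V_plus, axis=1)
```

Here the query has cosine similarity 0.9 to the first prototype and about 0.745 to the third, yet the third initial order parameter (about 1.31) dominates, so the winner is the less similar memory. The adjoint norms are all at least 1, in line with the argument for Principle (3).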
MSNN's remodeling of the state initialization method. The association error originates from the wrong initial state, so MSNN needs to redesign the initialization method. Since SNN's basin of attraction depends on the parameter's absolute value, simply using the similarity measure as the state initialization method of SNN may allow the smallest negative order parameter to be the winner, making the network associate the least similar memory. To avoid this problem, we propose the new initialization method as
$$\xi^{(0)} = \mathrm{ReLU}\big(S(x, V)\big),$$
where S is the similarity measure between the query and the memory. ReLU sets negative values to zero, eliminating the possibility of a negative order parameter becoming the winner. In summary, MSNN initializes the order parameters in this way and then iterates as SNN does; its network structure is shown in Fig. 4. The new initialization method ensures the correct association while improving the running speed. This method only allows positive values to be the initial value of the order parameter, so the most similar memory must correspond to the largest order parameter. From the "winner-takes-all" property, the largest order parameter becomes the winner, and the most similar memory becomes the association result. The new initialization method sparsifies ξ by setting negative order parameters to zero, thus speeding up the hardware computation.
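The ReLU-rectified initialization can be sketched in a few lines. The similarity measure is assumed to be cosine similarity here, which is one choice that satisfies the three principles above; the helper name `msnn_init`, the prototypes, and the query are hypothetical.

```python
import numpy as np

def msnn_init(x, V):
    """ReLU-rectified similarity as the initial order parameters.

    Assumption: S is the cosine similarity. Since x and the columns of V
    are unit norm, the cosine reduces to an inner product.
    """
    similarity = V.T @ x
    return np.maximum(similarity, 0.0)      # ReLU: negative similarities become 0

# Hypothetical unit-norm prototypes (columns) and a normalized query.
V = np.array([[1.0, 0.0, 2.0 / 3.0],
              [0.0, 1.0, 2.0 / 3.0],
              [0.0, 0.0, 1.0 / 3.0],
              [0.0, 0.0, 0.0]])
x = np.array([0.9, -0.3, np.sqrt(0.10), 0.0])

xi0 = msnn_init(x, V)
winner = int(np.argmax(xi0))                # largest entry = most similar memory
```

The second memory has a negative similarity to the query, so its order parameter starts at exactly zero and cannot win, while the largest remaining entry belongs to the most similar memory.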
MSNN's attention parameter self-learning. SNN's genetic algorithm-based parameter learning is hard to co-train with other modern network layers, so we design an EBP-based learning method. The new learning method adjusts the attention parameter to assign greater attention to classes with smaller sizes for imbalanced data self-adaptation. When EBP is applied, Syn repeatedly imposes a polynomial function onto the input, which may lead to exploding or vanishing gradients. The gradient problem is so severe that conventional means like gradient clipping can barely circumvent the non-convergence. To solve this problem, we first normalize ξ and divide Syn into two terms. EBP is performed normally for the former term, and the latter term uses the gradient bypass technique 48,49 . This technique passes the gradient of certain network layer outputs directly to the input during backpropagation, which circumvents activation functions that cause the gradient to explode or vanish, or that even block the gradient through discontinuity.
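To see why a repeatedly applied polynomial is problematic and what the bypass does, consider the toy sketch below. It uses the cubic map s -> s³ as a stand-in for the polynomial part of Syn (an assumption made only for illustration) and tracks the exact derivative through the iterations by hand; `bypass_backward` is a hypothetical name for the pass-through rule.

```python
import numpy as np

def iterate_cubic(s0, steps):
    """Apply the toy map s -> s**3 repeatedly; return the output and d(output)/d(s0)."""
    s, grad = float(s0), 1.0
    for _ in range(steps):
        grad *= 3.0 * s ** 2     # chain rule through one polynomial step
        s = s ** 3
    return s, grad

def bypass_backward(upstream_grad):
    """Gradient bypass: hand the output gradient to the input unchanged."""
    return upstream_grad

upstream = 1.0                                  # hypothetical dL/d(output)
_, true_small = iterate_cubic(0.5, steps=5)     # derivative collapses towards 0
_, true_large = iterate_cubic(1.5, steps=5)     # derivative blows up
bypassed = bypass_backward(upstream)            # stays at the upstream value
```

After only five iterations the exact derivative is many orders of magnitude away from 1 in either direction, which clipping cannot repair, whereas the bypassed gradient keeps the scale of the upstream error signal.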
The parameter learning requirement can be satisfied by directly applying EBP to λ. Let the error of ξ_i^new be δ_i. Since ξ_i ≥ 0, the adjustment Δλ_i has a different sign than δ_i. δ_i > 0 means ξ_i^new is too large, and Δλ_i ≤ 0 means the network will not increase its attention to ξ_i, giving it a higher chance to converge to 0. δ_i < 0 means ξ_i^new is too small, and Δλ_i ≥ 0 leads ξ_i to a higher chance of converging to 1. Therefore, EBP satisfies the parameter learning requirement of λ.
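The sign argument can be written out explicitly under an assumed update rule. The derivation below takes the hidden-layer update to have the classical Haken form, which may differ from the exact Syn used here (in particular, the normalization of ξ is ignored); δ_i denotes ∂L/∂ξ_i^new and η is a hypothetical learning rate for λ.

```latex
\xi_i^{new} = \xi_i + \gamma\,\xi_i\Big(\lambda_i - b\sum_{j\neq i}\xi_j^{2} - c\sum_{j}\xi_j^{2}\Big)
\;\Longrightarrow\;
\frac{\partial \xi_i^{new}}{\partial \lambda_i} = \gamma\,\xi_i \ge 0,
\qquad
\Delta\lambda_i = -\eta\,\delta_i\,\frac{\partial \xi_i^{new}}{\partial \lambda_i}
               = -\eta\,\gamma\,\xi_i\,\delta_i .
```

Since η, γ, and ξ_i are non-negative, Δλ_i and δ_i have opposite signs whenever ξ_i > 0, and Δλ_i = 0 when ξ_i = 0.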

Experiments
Dataset and network configuration. We test MSNN on the small datasets of the UCI repository, a collection of 121 datasets used as pattern classification tasks to benchmark both neural network and non-neural network machine learning algorithms. These datasets are divided into 75 small and 46 large datasets by the threshold of 1000 samples 50 . All of these datasets are imbalanced after the train-test set division. We compare our network against 187 neural and non-neural machine learning algorithms. Their configurations and performances are detailed in the literature 16,17,50 . See SI 1B.2 for dataset configuration details.
Figure 4. The working process of the MSNN after the correction of the state initialization method. The network input x is mapped to ξ^(0) through the corrected state initialization, ξ^(0) is activated by the ReLU function and input to the hidden layer by the initial order parameter feedforward layer, and then the network starts to work in iterations.
As for the network architecture, we use embedding layers consisting of {0, 1, 7} fully connected layer(s) with ReLU activation functions and {32, 128, 1024} hidden units per embedding layer. These embedding layers are followed by SNN with {0} to {9} iterations and a mapping to the output vector whose dimension is the number of classes. The prototype matrix is obtained by intra-class K-means clustering, and the adjoint matrix is the M-P inverse of the prototype matrix. The network structure used for the experiments is shown in Fig. 5. On each dataset, we use EBP to train SNN's attention parameter and perform a grid search to determine the best hyperparameter setting for the embedding layers, the memory number, and SNN's iteration number. The hyperparameter search space of the grid search is listed in Table 1. All models are trained for 100 epochs with a mini-batch size of 4 samples using the softmax cross-entropy loss and the AdamW optimizer 51 . After each epoch, the model accuracy is computed on a separate validation set. Using the gradient direct transmission technique 48,49 , the gradient of MSNN's output layer in the error backpropagation stage is passed directly to the state initialization layer, bypassing the polynomial-shaped activation function of the SNN to circumvent gradient exploding or vanishing. With early stopping, the model with the best validation set accuracy averaged over 16 consecutive epochs is selected as the final model. This final model is then evaluated against the test set to determine the accuracy.
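The model selection rule in the training setup can be sketched as a moving average over the validation curve. The text does not say which epoch inside the best 16-epoch span is kept; the sketch below assumes the last one. The function name `select_epoch` and the validation curve are hypothetical.

```python
import numpy as np

def select_epoch(val_acc, window=16):
    """Return (epoch, score): last epoch of the best `window`-epoch span and its mean accuracy."""
    val_acc = np.asarray(val_acc, dtype=float)
    if len(val_acc) < window:
        raise ValueError("need at least `window` validation accuracies")
    # means[i] is the average accuracy over epochs i .. i + window - 1
    means = np.convolve(val_acc, np.ones(window) / window, mode="valid")
    start = int(np.argmax(means))            # first epoch of the best span
    return start + window - 1, float(means[start])

# Hypothetical validation curve over 100 epochs (0-indexed).
curve = np.full(100, 0.70)
curve[40:56] = 0.90                          # a plateau of exactly 16 epochs
best_epoch, best_score = select_epoch(curve)
```

Averaging over consecutive epochs favours a stable plateau over a single lucky epoch, which matters with validation sets as small as those of the UCI tasks.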

Classification performances validation.
The Friedman rankings of these methods across datasets are presented in Table 2. MSNN outperforms all other methods on small datasets, setting a new state-of-the-art for 12 datasets (balance-scale, breast-cancer, congressional-voting, heart-cleveland, ionosphere, low-res-spect, monks-2, monks-3, planning, post-operative, soybean, and spect). See SI 1B for more details.
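For readers unfamiliar with Friedman rankings, the average rank of a method is obtained by ranking all methods on each dataset and averaging those ranks over datasets. The sketch below shows this computation with ties sharing their mean rank; the function names and the toy accuracy matrix are hypothetical and unrelated to the values in Table 2.

```python
import numpy as np

def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    tie_sums = np.bincount(inverse, weights=ranks)
    return tie_sums[inverse] / counts[inverse]

def friedman_mean_ranks(accuracy):
    """accuracy[d, m] is the accuracy of method m on dataset d; rank 1 is the best method."""
    accuracy = np.asarray(accuracy, dtype=float)
    per_dataset = np.vstack([average_ranks(-row) for row in accuracy])
    return per_dataset.mean(axis=0)

# Toy matrix: 3 datasets (rows) x 3 methods (columns).
toy = [[0.90, 0.80, 0.70],
       [0.60, 0.90, 0.80],
       [0.95, 0.90, 0.90]]
mean_ranks = friedman_mean_ranks(toy)
```

A lower mean rank is better, so in this toy case the first method would be listed first.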
Figure 5. Network structure for experiments. The network input x is preprocessed through the embedding layer and subsequently transformed into the initial state of the SNN through the initialization layer, i.e., the initial value of the order parameter ξ_0. ξ_0 is passed into the hidden layer of the SNN, and the update of the order parameter and the association of memory are realized with the cycle of the network. The prototype pattern obtained by association is passed into the classifier by the output layer of the SNN to obtain the label y. The parameters are trained using error backpropagation with the loss function CrossEntropy(y, label), where label is the true label of the data. Using the gradient direct transmission technique, the gradient ∇_{ξ^new} L of ξ^new in the error backpropagation stage (red line in the figure) is passed directly to ξ_0, bypassing the polynomial-shaped activation function of the SNN to circumvent gradient explosion or vanishing.

Table 1. Hyperparameter search space for grid search on small UCI datasets. All models are trained for 100 epochs with a minibatch size of 4 using AdamW with early stopping based on the accuracy. The number of stored patterns is 1 or 8 times the number of the target classes of the individual task.

Parameter           Values
Embedding layers    {0, 1, 7}

Imbalanced data adaptation performance. We analyze the performance of MSNN for datasets with different levels of imbalance in terms of the percentage of the majority class, %Maj 17 . %Maj reflects the level of imbalance in the dataset: the higher the %Maj, the higher the imbalance. The classifier is prone to focus on the majority class when applied to imbalanced data, i.e., labeling all samples as the majority class, which brings the overfitting problem. The more severe the overfitting problem is, the closer the accuracy of the classifier will be to %Maj. Thus, the accuracy over %Maj, denoted by σ, can reflect the extent to which minority class samples are correctly classified. See SI 1B for the %Maj of each UCI dataset. We rank the datasets in ascending order of %Maj and mark the accuracies of the top three Friedman ranking methods in Fig. 6a. We merge the adjacent datasets in groups of five and calculate their average σ for better visualization. The results are shown in Fig. 6b. MSNN outperforms other methods in most cases, and the average σ improvement is most obvious in groups 2-12 (except group 10) with the %Maj interval of (30.93, 67.83), which indicates that MSNN has good adaptive performance for both mildly and moderately imbalanced datasets. In groups 13-15 with %Maj greater than 73.53, the average σ of MSNN decreases compared to other methods, which suggests that the linear classifier and the standard associative memory network have a more stable performance for heavily imbalanced datasets.
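The grouping used for Fig. 6b can be sketched as follows. The sketch reads "accuracy over %Maj" as the ratio accuracy / %Maj, which is an assumption (a difference would be another possible reading); datasets left over after forming full groups of five are dropped. The function name and the numbers are hypothetical.

```python
import numpy as np

def grouped_sigma(accuracy, maj_percent, group_size=5):
    """Sort datasets by %Maj, compute sigma = accuracy / %Maj, and average per group."""
    accuracy = np.asarray(accuracy, dtype=float)
    maj_percent = np.asarray(maj_percent, dtype=float)
    order = np.argsort(maj_percent, kind="stable")     # ascending %Maj
    sigma = accuracy[order] / maj_percent[order]       # assumed ratio reading
    n_groups = len(sigma) // group_size
    return sigma[: n_groups * group_size].reshape(n_groups, group_size).mean(axis=1)

# Hypothetical values (both in percent) for ten datasets, i.e., two groups of five.
maj = [50, 40, 60, 45, 55, 90, 80, 85, 75, 95]
acc = [75, 60, 90, 67.5, 82.5, 90, 80, 85, 75, 95]
groups = grouped_sigma(acc, maj)
```

Under this reading, a group average near 1 means the classifier does no better than always predicting the majority class, as in the second toy group, and larger values indicate that minority samples are being recovered. With 75 datasets the same call yields 15 groups.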
Order parameter initialization validation. We verify the effectiveness of MSNN's order parameter initialization method for correcting association errors by comparing its accuracy to that of SNN. We use the balanced parameter configuration (all parameters default to 1), so the association target is the most similar memory to the query. The performance of SNN and MSNN is shown in Fig. 7. MSNN achieves 100% accuracy for all datasets, while SNN achieves 100% accuracy in only 5 datasets (acute-inflammation, acute-nephritis, horse-colic, monks-3, and trains). SNN's average accuracy is 66.47%.
Attention parameter learning performance. MSNN mitigates overfitting by attention parameter self-learning to provide greater attention to classes with small sample sizes. Ideally, the elements of λ should negatively correlate with the number of imbalanced samples. Due to the diversity of data sources, MSNN cannot guarantee fitting effectiveness on all datasets, and the attention parameter's learning performance can hardly be reflected in the underperforming datasets. In addition, the correlation between λ and the number of imbalanced samples is challenging to model in a parameter configuration with multiple attention parameters corresponding to one class. Since the objective is to verify the ideal cases rather than all cases, we dropped results from 28 datasets that did not meet the following criteria: (1) self-learning has a positive effect on performance; (2) the number of elements of λ is equal to the class number. We use Spearman correlation analysis to verify the correlation between λ and the sample sizes. Spearman analysis requires at least four samples, yet a significant proportion of the UCI datasets are 2- or 3-class tasks. Therefore, we apply the 1-norm to λ from different datasets and integrate them. The integrated λ contains 192 samples with a correlation coefficient of about −0.170, corresponding to a p value of about 0.019. Thus, λ has a statistically significant negative correlation with the sample sizes, indicating that EBP is applicable to the learning of λ.
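The pooling step can be sketched as below: each dataset's λ is divided by its 1-norm, the normalized values are concatenated across datasets, and a Spearman coefficient is computed as the Pearson correlation of the ranks. Converting class sizes to per-dataset fractions is an assumption of this sketch, the p value is not computed, and the λ values and class sizes are made-up toy inputs rather than results from the experiments.

```python
import numpy as np

def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    tie_sums = np.bincount(inverse, weights=ranks)
    return tie_sums[inverse] / counts[inverse]

def pooled_spearman(lams, sizes):
    """Spearman correlation between 1-norm-normalized lambda and class fractions, pooled over datasets."""
    pooled_lam, pooled_size = [], []
    for lam, size in zip(lams, sizes):
        lam = np.asarray(lam, dtype=float)
        size = np.asarray(size, dtype=float)
        pooled_lam.extend(lam / np.abs(lam).sum())   # 1-norm normalization of lambda
        pooled_size.extend(size / size.sum())        # assumed: class fractions per dataset
    rank_lam = average_ranks(pooled_lam)
    rank_size = average_ranks(pooled_size)
    return float(np.corrcoef(rank_lam, rank_size)[0, 1])

# Toy inputs: one 2-class and one 3-class task.
lams = [[1.2, 0.8], [0.9, 1.0, 1.4]]
sizes = [[30, 70], [50, 40, 10]]
rho = pooled_spearman(lams, sizes)
```

Normalizing per dataset puts attention parameters from tasks with different class counts on a comparable scale before pooling, which is what allows 2- and 3-class tasks to contribute to a single correlation estimate.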

Conclusion
In this paper, we propose the MSNN model to further improve RNN's classification performance on imbalanced small data. MSNN first resolves SNN's association errors, which existing studies misattribute to under-optimized parameters, by modifying the state initialization method in its working process, releasing the whole parameter optimization space to task requirements. Then, MSNN adjusts SNN's built-in attention parameter through a learning method based on EBP and gradient bypass, so that the network adapts to imbalanced data while being trained jointly with other network layers. Experimental results on 75 small UCI datasets show that MSNN retains error-free associations on all datasets, and the attention parameters spontaneously establish a significant negative correlation with the imbalanced sample size. These improvements make MSNN outperform 187 methods and achieve a new state-of-the-art.
Our study allows the theoretical advantages of synergetics to be successfully applied in artificial neural networks, and we plan to further extend these advantages to other areas in future work, including optimization methods for attention mechanisms and self-learning methods of representative prototypes.

Figure 1. SNN structure. The activation function of the hidden layer is Syn. The input and output layers are linear mappings with weights of the adjoint matrix V⁺ and the prototype matrix V, respectively.

Figure 2. The convergence to stationary points of SNN. The different colors of the curves are used to distinguish order parameters. (a,b) are the convergence to target stable points (the positive or negative one-hot), i.e., one order parameter converges to ±1 while others converge to 0; (c) is the convergence to the saddle point (multiple identical values) that stems from multiple identical extreme values in the initial value of ξ; (d) is the convergence to the local maximum point (all values are zeros) that stems from the zero-vector initialization of ξ. Note that chart (d) is only for representation. The divide-by-0 error terminates the network's further iterations.

Figure 3. SNN's attractors, basins, and trajectories of random initial states in 2D and 3D space. One coordinate of the attractor is ±1 and others are 0. The trajectories and the basin corresponding to the same attractor are marked with the same type of color. The basin contains all initial states with the same sign and maximum absolute value index as its attractor.
https://doi.org/10.1038/s41598-023-42689-8

Figure 6. (a) Majority class percentage %Maj of small UCI datasets and the accuracy of Friedman ranking top three methods. Datasets are ranked in ascending order of %Maj. (b) The average accuracy over %Maj (average σ) of the three methods with the adjacent datasets merged in groups of five. MSNN has the highest average σ in the first 12 groups except group 10, with a decrease in groups 13-15, indicating its better performance for mildly and moderately imbalanced datasets.

Figure 7. The associative accuracy of SNN and MSNN. The datasets are ordered by name. MSNN reaches 100% accuracy in all datasets, while SNN's performance fluctuates significantly and reaches 100% accuracy in only 5 datasets.

Table 2. Friedman ranking and average accuracy (%) for each classifier, ordered by increasing Friedman ranking. Significant values are in [bold].