A memetic dynamic coral reef optimisation algorithm for simultaneous training, design, and optimisation of artificial neural networks

Artificial Neural Networks (ANNs) have been used in a multitude of real-world applications given their predictive capabilities, and algorithms based on gradient descent, such as Backpropagation (BP) and its variants, are usually considered for their optimisation. However, these algorithms have been shown to get stuck at local optima, and they require a careful design of the architecture of the model. This paper proposes a novel memetic training method for simultaneously learning the structure and weights of ANNs based on the Coral Reef Optimisation algorithms (CROs), global-search metaheuristics based on corals' biology and coral reef formation. Three versions based on the original CRO combined with a Local Search procedure are developed: (1) the basic one, called Memetic CRO (M-CRO); (2) a statistically guided version called Memetic SCRO (M-SCRO) that adjusts the algorithm parameters based on the population fitness; and (3) an improved dynamic statistically-driven version called Memetic Dynamic SCRO (M-DSCRO). M-DSCRO is designed to improve on M-SCRO during the evolutionary process by evaluating whether the fitness distribution of the population of ANNs is normal, in order to automatically decide the statistic to be used for assigning the algorithm parameters. Furthermore, all algorithms are adapted to the design of ANNs by means of the most suitable operators. The performance of the different algorithms is evaluated with 40 classification datasets, showing that the proposed M-DSCRO algorithm outperforms the other two versions on most of the datasets. In the final analysis, M-DSCRO is compared against four state-of-the-art methods, demonstrating its superior efficacy in terms of overall accuracy and minority class performance.

tion, or the number of less healthy corals that must die to allow empty positions for the next generation of evolution (stages of the evolutionary process), among others.
• M-SCRO This version improves the setting of parameters, and it is based on the Statistically-driven CRO (SCRO) algorithm [41]. M-SCRO automatically adjusts the parameters of the standard CRO algorithm, eliminating the need for extensive experimental procedures. M-SCRO selects the corals involved in the operators at different stages of the evolutionary process based on each individual's fitness and the average fitness of the entire reef. This version assumes that the FL of the population follows a normal distribution.
• M-DSCRO The third version of the algorithm is called Memetic Dynamic SCRO (M-DSCRO). M-DSCRO studies the fitness of the entire reef throughout the evolutionary process, checking whether the distribution is normal or not. It does not assume that the FL of the population always follows a normal distribution. The corals involved in each algorithm stage will vary depending on the FL. The selection of corals will be based on either the mean and standard deviation of the population or the median and interquartile range, depending on whether the population follows a normal distribution or not. Therefore, M-DSCRO also avoids adjusting multiple algorithm parameters.
The contributions of this paper are briefly summarised below:
• Three versions of the CRO metaheuristic for the simultaneous training, design and optimisation of ANNs. We have hybridised the algorithms (M-CRO, M-SCRO and M-DSCRO) using the iRprop+ algorithm as LS procedure. To the best of our knowledge, this metaheuristic has not been used before for the purpose of designing and optimising ANNs. The LS procedure is added at the three reproductive stages of the reef. Additionally, a method for reinitialising the reef in case of possible stagnation is included. This is concisely described in the following sections of this work.
• Adaptation of evolutionary operators used with ANNs to the reproduction scheme presented by CRO metaheuristics. The implementation of specific operators is crucial for the optimisation of ANNs using EAs [42][43][44]. In this work, we have identified crossover and mutation operators used in ANNs that are appropriate for integration into the scheme of a CRO algorithm, more specifically for the sexual and asexual reproduction stages.
• The use of robust estimators instead of assuming normality of the fitness distribution during the evolutionary process. Thanks to them, M-DSCRO enhances the performance of the M-CRO and M-SCRO algorithms. The algorithm also automatically determines the individuals to be used in each stage of the evolutionary process, eliminating the need for researchers to establish these parameters.
• The M-SCRO and M-DSCRO algorithms improve the accuracy compared to the basic version, eliminating the need for parameter setting. A statistical study with 40 classification datasets has been carried out to compare the three developed algorithms and other state-of-the-art methodologies in pattern classification. The study considers global accuracy metrics as well as accuracy per class. The results indicate that M-DSCRO outperforms other methodologies on most datasets.
The rest of the paper is organised as follows: "Background and evolutionary stages" section includes the main phases or stages of the evolutionary process of the three algorithms, together with the strategies used in the selection of individuals in the M-CRO and M-SCRO algorithms. It continues by referencing the LS algorithm used to exploit solutions in the evolutionary process, a description of the reef restarting procedure, the use of the operators adapted to ANNs, and it concludes with a brief introduction to the theory of robust estimators. "M-DSCRO algorithm" section describes in detail the FL strategy used in the M-DSCRO algorithm and the reproductive stages. "Experiments" section shows information about the datasets used in the experimentation, the ANN models used, the metrics employed to evaluate the performance of the algorithms and the experimental design. "Results and discussion" section discusses the results, including their statistical analysis. Finally, the conclusions are shown in "Conclusions" section.

Background and evolutionary stages
This section presents in a general way the stages of the three Memetic CRO algorithms developed. It continues with a reference to the iRprop+ algorithm used as an LS optimisation method, an explanation of the reef restarting procedure in case of premature convergence and low diversity, a brief description of the problem of using crossover operators in ANNs, and it concludes with a theoretical explanation of the robust estimators used in the M-DSCRO algorithm.
Essentially, the evolutionary stages present in all three algorithms are similar. Hence, this section presents both the shared stages and the distinctions in parameter configuration methods. This encompasses the manual parameter definition for each stage of the M-CRO algorithm, as well as the automatic setup utilised in the M-SCRO algorithm. M-DSCRO also employs automated parameter configuration, which will be introduced within the algorithm's pseudocode in "M-DSCRO algorithm" section.
Figure 1 illustrates the stages of the M-CRO, M-SCRO and M-DSCRO algorithms, summarising the procedures discussed below.
The use of CRO methodologies to design and optimise ANNs combines Evolutionary Algorithms and Simulated Annealing. Additionally, the dynamic version (M-DSCRO) eliminates the need for researchers to establish several metaheuristic parameters by analysing the fitness landscape of the population during the evolutionary process.

Understanding M-CRO and the statistical version, M-SCRO
The standard CRO algorithm [20] is a type of EA that tries to solve search and optimisation problems by simulating the biological processes occurring in a coral reef. In general, each position $c_{i,j}$ of a reef formed by $P$ individuals organised in a $P_1 \times P_2$ matrix is a possible solution of the problem to be solved, where $P_1$ and $P_2$ are the number of rows and columns, respectively. On the other hand, the SCRO algorithm [41] with self-adaptive parameters was proposed for time series segmentation problems with the idea of removing the high number of parameters that need to be set in the standard CRO. For further understanding of SCRO, let us define two significant variables of this algorithm. Taking into account that the quality of a coral is measured by a fitness function $f$ (see Eq. 19), we can define the fitness values of the $N_j$ corals in the $j$-th generation of the algorithm as $\{f_{1j}, f_{2j}, \ldots, f_{N_j j}\}$, for $j = 1, \ldots, G$, where $G$ is the total number of generations. Assuming that the fitness distribution is normal, the variance of the population in the $j$-th generation, $S^2_{f_j}$, is estimated by Eq. (1) from the fitness $f_{ij}$ of the $i$-th individual in the $j$-th generation and the average fitness value of all the individuals of the generation, $\bar{f}_j$, which is given by Eq. (2). Using Eqs. (1) and (2), SCRO avoids assigning multiple parameters at different stages of the evolutionary process.

Figure 1. Flowchart of the three Memetic CRO algorithms. The "Background and evolutionary stages" section and the "M-DSCRO algorithm" section describe in detail each of these stages.
Both M-CRO and M-SCRO are based on the considerations that have just been described and that are clarified in the following subsections.
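For reference, the two statistics used by SCRO can be written as below. This is a hedged reconstruction from the definitions in the text: the $1/(N_j - 1)$ normalisation of the variance (the usual unbiased sample estimator) is an assumption, and a $1/N_j$ normalisation would be the other common choice.

```latex
% Sample variance of the fitness in generation j (normalisation assumed)
S^2_{f_j} = \frac{1}{N_j - 1} \sum_{i=1}^{N_j} \left( f_{ij} - \bar{f}_j \right)^2,
\qquad j = 1, \ldots, G

% Average fitness of the N_j corals in generation j
\bar{f}_j = \frac{1}{N_j} \sum_{i=1}^{N_j} f_{ij}
```

With these two quantities, every interval used by M-SCRO (for example $(\bar{f}_j - S_{f_j}, 1]$) is fully determined by the current population.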

Initialisation in M-CRO and M-SCRO
The M-CRO algorithm starts by initialising a random subset of positions from the total number of positions $P$, leaving the remaining positions empty. The percentage of initially free positions in the reef is determined by a parameter $\rho$ with $0 < \rho < 1$, which indicates the ratio of the reef that remains empty initially. The idea is that these positions allow for the settlement and growth of corals in the later stages of the algorithm.
To avoid the parameter $\rho$, the M-SCRO algorithm initialises a complete coral reef with $P$ positions, and then those corals whose fitness $f_{i1}$ does not lie in the interval $(\bar{f}_1 - S_{f_1}, 1]$ are deleted. Thus, the parameter $\rho$ is unnecessary. If $(\bar{f}_1 - S_{f_1}) \le 0$, no corals are removed from the reef.
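A minimal sketch of the M-SCRO initialisation rule described above, assuming fitness values in [0, 1] where higher is better. The function name `mscro_initial_reef` and the use of the sample standard deviation are assumptions of this sketch, not details taken from the paper.

```python
from statistics import mean, stdev

def mscro_initial_reef(fitness):
    """Return the reef after initialisation; None marks an emptied position.

    Corals whose fitness is outside (mean - std, 1] are removed. If the
    threshold is <= 0, no coral is removed. stdev (sample standard
    deviation) is an assumed choice of estimator.
    """
    threshold = mean(fitness) - stdev(fitness)
    if threshold <= 0:
        return list(fitness)
    return [f if f > threshold else None for f in fitness]

reef = mscro_initial_reef([0.2, 0.5, 0.6, 0.7, 0.9])
```

In this example only the coral with fitness 0.2 falls below the threshold, so a single position is left empty for later settlement.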
Once the initialisation stage has been performed, the evolutionary block of the algorithms simulates the processes of reproduction and reef formation, using different operators sequentially applied over a number of generations.

Sexual reproduction plus Local Search in M-CRO and M-SCRO
Within sexual reproduction, two processes can be distinguished: broadcast spawning (also called external sexual reproduction) and brooding (also called internal sexual reproduction).
For the M-CRO algorithm, in each $i$-th iteration, the broadcast spawning procedure selects a uniform random fraction $F_b$ of corals to be broadcast spawners. To form a new larva, a crossover operator is usually applied. "Crossover operator in artificial neural networks" section explains the problem of using crossover operators for ANNs. On the other hand, the remaining fraction of corals on the reef, $1 - F_b$, simulates reproduction in hermaphroditic corals using the brooding operator. Each coral mutates to generate a new larva that becomes part of the candidate solution pool.
The M-SCRO algorithm does not need to configure these parameters. Instead, the corals with a fitness value in the interval $f_{ij} \in (\bar{f}_j - S_{f_j}, 1]$ are selected for broadcast spawning. Brooding is applied to the remaining ones, i.e. those whose fitness $f_{ij} \in [0, \bar{f}_j - S_{f_j}]$. Therefore, $F_b$ is not necessary.
At this stage of sexual reproduction, an optimisation procedure is also applied to the best individuals resulting from the broadcast spawning and brooding operators. This optimisation is applied using iRprop+ as the LS algorithm (detailed in "iRprop+ local search algorithm" section). Therefore, the best individual resulting from broadcast spawning is optimised with iRprop+ and added to the coral pool, and the same goes for the best individual obtained from brooding.
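The two ways of splitting the reef between broadcast spawning and brooding can be contrasted with a short sketch. The function names, the rounding of $F_b \cdot N$ and the use of the sample standard deviation are assumptions made for illustration only.

```python
import random
from statistics import mean, stdev

def split_mcro(n_corals, f_b, rng=random):
    """M-CRO: a uniform random fraction f_b spawns, the rest broods."""
    indices = list(range(n_corals))
    rng.shuffle(indices)
    cut = round(f_b * n_corals)  # rounding is an assumption
    return sorted(indices[:cut]), sorted(indices[cut:])

def split_mscro(fitness):
    """M-SCRO: fitness in (mean - std, 1] spawns, [0, mean - std] broods."""
    threshold = mean(fitness) - stdev(fitness)
    spawners = [i for i, f in enumerate(fitness) if f > threshold]
    brooders = [i for i, f in enumerate(fitness) if f <= threshold]
    return spawners, brooders

spawners, brooders = split_mscro([0.2, 0.5, 0.6, 0.7, 0.9])
```

The M-SCRO variant needs no $F_b$: the split follows from the current fitness distribution alone.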

Coral pool in M-CRO and M-SCRO
The corals obtained from two types of sexual reproduction and asexual reproduction, detailed below, are stored in a coral pool (emptied for each generation), along with the two optimised individuals with the LS algorithm, so that they are the individuals that will be considered for the Settlement stages.All individuals in the pool are evaluated prior to the Settlement stages.

Settlement in M-CRO and M-SCRO
Once the sexual and asexual reproduction procedures have been completed, each larva in the candidate pool attempts to settle and grow at a random position $(i, j)$ on the reef ($P_1 \times P_2$). The larva settles if the position is empty or if it is healthier than the existing coral at that position, i.e. its fitness value is better. In addition, a maximum number $\nu$ of attempts is established for the larva to search for a feasible position. A robust value for $\nu$ is 2, i.e. a larva has two attempts to settle on the reef [41].
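The settlement rule can be sketched as follows, assuming the $P_1 \times P_2$ reef is flattened into a list that stores one fitness value per position (`None` for an empty position). The function `settle` and this representation are illustrative assumptions.

```python
import random

def settle(reef, larva_fitness, attempts=2, rng=random):
    """Try to settle one larva; return the position taken or None.

    The larva settles at a random position if it is empty or holds a
    coral with worse (lower) fitness; it gets `attempts` tries (nu = 2).
    """
    for _ in range(attempts):
        pos = rng.randrange(len(reef))
        if reef[pos] is None or larva_fitness > reef[pos]:
            reef[pos] = larva_fitness
            return pos
    return None

reef = [None, 0.4, 0.8, None]
position = settle(reef, 0.6, rng=random.Random(0))
```

A larva that fails both attempts is simply discarded, which keeps the reef size fixed.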

Asexual reproduction plus local search in M-CRO and M-SCRO
For the M-CRO algorithm, the asexual reproduction mimics the reproduction of corals by budding or fragmentation. The mechanism consists of (i) ranking corals according to their fitness value, (ii) selecting a small fraction $F_a$ of the best corals (we have verified that the performance obtained by choosing a random solution instead of a fraction is similar), (iii) duplicating the fraction of best corals to ensure their survival and adding them to a new candidate pool, and (iv) settling the corals from the candidate pool.
In the M-SCRO algorithm, in order to eliminate the $F_a$ parameter, a random solution from the set of corals whose fitness verifies $f_{ij} \in (\bar{f}_j + S_{f_j}, 1]$ is selected to be asexually reproduced. If $(\bar{f}_j + S_{f_j}) \ge 1$, asexual reproduction is not carried out.
After asexual reproduction, a mutated individual is randomly selected and duplicated, and then the LS is applied. Next, another Settlement process is carried out, so that the optimised individual attempts to establish itself on the reef. We have experimentally verified that the results obtained are similar regardless of whether the individual is chosen randomly or the best one is chosen.
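A minimal sketch of the M-SCRO choice of the coral to be asexually reproduced, under the same assumptions as before (fitness in [0, 1], sample standard deviation). The name `select_for_budding` is hypothetical.

```python
import random
from statistics import mean, stdev

def select_for_budding(fitness, rng=random):
    """Return the index of one coral with fitness in (mean + std, 1].

    Returns None when mean + std >= 1 (asexual reproduction is skipped)
    or when no coral lies above the threshold.
    """
    threshold = mean(fitness) + stdev(fitness)
    if threshold >= 1:
        return None
    candidates = [i for i, f in enumerate(fitness) if f > threshold]
    return rng.choice(candidates) if candidates else None

chosen = select_for_budding([0.2, 0.5, 0.6, 0.7, 0.9])
```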

Depredation in M-CRO and M-SCRO
Finally, the M-CRO algorithm applies a depredation procedure to a percentage $F_d$ of the worst corals in the reef with a given probability $P_d$. It simulates the death of less healthy corals to free positions for the next generation of evolution.
In the case of the M-SCRO algorithm, it eliminates the set of individuals whose fitness verifies $f_{ij} \in [0, \bar{f}_j - 2S_{f_j}]$. In this case, the parameters $F_d$ and $P_d$ do not need to be configured. If $(\bar{f}_j - 2S_{f_j}) \le 0$, depredation is not carried out.
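The M-SCRO depredation rule can be sketched on a reef that may already contain empty positions. As in the other sketches, the list-with-`None` representation, the function name and the sample standard deviation are assumptions.

```python
from statistics import mean, stdev

def depredate(reef):
    """Empty the positions whose fitness lies in [0, mean - 2 * std].

    `reef` holds one fitness value per position, None for an empty one.
    If mean - 2 * std <= 0, depredation is skipped.
    """
    alive = [f for f in reef if f is not None]
    threshold = mean(alive) - 2 * stdev(alive)
    if threshold <= 0:
        return list(reef)
    return [None if f is not None and f <= threshold else f for f in reef]

reef = depredate([0.9] * 9 + [None, 0.1])
```

Only corals far below the bulk of the population are removed, so a homogeneous reef is left untouched.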

Stop condition
The stopping condition of the algorithms is met when a maximum number of generations has been reached.

iRprop+ local search algorithm
In this work, the iRprop+ algorithm is used in the three algorithms implemented as an LS procedure to establish a balance between the exploration and exploitation of the population of ANNs. iRprop+ is an improvement of the well-known BP algorithm, and its good performance for ANN weight optimisation has already been proven [27,45]. This algorithm employs a sign-based scheme to update the weights in order to eliminate the influence of the derivatives' magnitude on the weight updates, and it applies a backtracking strategy that uses a heuristic to decide whether to take a step back along a weight direction.
In [45], the reader can find a detailed description of iRprop+ adapted to the softmax activation function and the cross-entropy error function, which are used to discern the output class provided by each ANN and as the optimisation function, respectively. The error and fitness functions used in this work can be consulted in "Evaluation metrics" section.
As seen above, the LS is applied to only three individuals in each generation. We have verified that involving more individuals would enormously increase the computational cost, and the results would not be better.
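To make the sign-based update and the backtracking heuristic concrete, the following is a minimal sketch of the generic iRprop+ rule applied to a toy quadratic error. It is not the softmax and cross-entropy adaptation referenced above; the step-size constants (1.2, 0.5, the bounds and the initial step) are commonly used defaults assumed here, and all names are illustrative.

```python
def sign(x):
    return (x > 0) - (x < 0)

def irprop_plus(grad, error, w, steps=200, delta0=0.1, eta_plus=1.2,
                eta_minus=0.5, delta_min=1e-6, delta_max=50.0):
    """Minimise `error` with iRprop+; `grad` returns the gradient list."""
    w = list(w)
    n = len(w)
    delta = [delta0] * n        # per-weight step sizes
    prev_grad = [0.0] * n
    prev_step = [0.0] * n
    prev_error = float("inf")
    for _ in range(steps):
        g, e = grad(w), error(w)
        for i in range(n):
            product = prev_grad[i] * g[i]
            if product > 0:      # same sign: enlarge the step
                delta[i] = min(delta[i] * eta_plus, delta_max)
                prev_step[i] = -sign(g[i]) * delta[i]
                w[i] += prev_step[i]
            elif product < 0:    # sign change: shrink the step
                delta[i] = max(delta[i] * eta_minus, delta_min)
                if e > prev_error:   # backtrack only if the error grew
                    w[i] -= prev_step[i]
                g[i] = 0.0           # forces the plain branch next time
            else:
                prev_step[i] = -sign(g[i]) * delta[i]
                w[i] += prev_step[i]
            prev_grad[i] = g[i]
        prev_error = e
    return w

target = [1.0, 1.0]
error = lambda w: sum((wi - ti) ** 2 for wi, ti in zip(w, target))
grad = lambda w: [2 * (wi - ti) for wi, ti in zip(w, target)]
solution = irprop_plus(grad, error, [3.0, -2.0])
```

Only the sign of each partial derivative is used, so the update is insensitive to the gradient magnitude, which is the property highlighted in the text.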

Reef restarting procedure
By incorporating an LS algorithm during the reproductive stages on the coral reef, the diversity of the population may be quickly reduced during the search procedure, leading to premature convergence to a local optimum. To prevent this possibility, we have introduced in the three algorithms a reef restarting check after each stage (Sexual reproduction, Asexual reproduction and Depredation).
The restarting procedure is similar to the Initialisation stage. However, in this case, the best coral is maintained, and the rest of the corals in the reef are randomly initialised. Empty positions are created according to the procedure described in the Initialisation subsections of the three algorithms.
The reef is restarted if one of these two conditions is reached:
• The difference in fitness, $bw_j$, between the best coral ($f_{bj}$) and the worst one ($f_{wj}$) is lower than a threshold, named $t_a$:
$$bw_j = f_{bj} - f_{wj} < t_a$$
• The $S^2_{f_j}$ value of the population is lower than a threshold, named $t_b$:
$$S^2_{f_j} < t_b$$
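The two restart conditions can be checked with a few lines. The threshold values used in the example are hypothetical, and the sample variance is an assumed estimator for $S^2_{f_j}$.

```python
from statistics import variance

def needs_restart(fitness, t_a, t_b):
    """True if the best-worst gap or the fitness variance is too small."""
    best_worst_gap = max(fitness) - min(fitness)   # bw_j
    return best_worst_gap < t_a or variance(fitness) < t_b

# Hypothetical thresholds, chosen only for illustration.
stagnated = needs_restart([0.80, 0.81, 0.805], t_a=0.05, t_b=1e-4)
```

When the check succeeds, the best coral would be kept and the remaining positions re-initialised, as described above.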

Crossover operator in artificial neural networks
Crossover is common in genetic algorithms and EAs, but certain drawbacks discourage the use of a crossover operator when the EA considers ANNs as individuals [42,46]. The cause is that crossover is impractical in environments where the fitness of an individual in the population is not correlated with the expected ability of its representational components. Such environments are called deceptive [47,48]. Deception is a significant feature in most representations of ANNs, so crossover should be avoided in EANNs [42]. In particular, a problem arises when crossing networks with similar structures and weights. That could lead to offspring that contain repeated components in their structure, so that the ability of those components in the parents would be lost. It could also happen that the descendant is identical to one of the parents, since changing the order of the nodes does not alter the individual itself (problem of deception [42]). Thus, the result would be, in either case, offspring equal to or worse than their parents, so the crossover operator is useless.
A similar drawback occurs when crossing two networks with the same structure but different weights. Each hidden node plays a specific role in each ANN, and it is the set of hidden nodes that have evolved together that [...] Finally, crossing two networks with different structures will surely be incompatible, reducing the possibility of producing offspring, so, even if successful, good individuals are unlikely to be generated.
Therefore, it is a complicated task to design a crossover operator for EAs with ANNs, since these drawbacks must be taken into account and compensated in some way to generate good offspring from the crossover of two individuals.
For this reason, only two types of mutations have been used for the development of the M-CRO, M-SCRO and M-DSCRO algorithms in their sexual reproduction stages: structural and parametric mutations [49,50]. They are detailed below.

Operators used in the sexual reproduction stages
Taking into account the problems of using the crossover operator in ANNs discussed above, this subsection describes the mutation operators used in the sexual reproduction stage of the implemented algorithms. The types of mutations applied are the same for the M-CRO, M-SCRO and M-DSCRO algorithms.
For broadcast spawning, structural mutations are applied (they explore the search space). Structural mutations affect the topology of an ANN and allow different regions in the search space to be explored. This type of mutation modifies the number of hidden neurons and the number of links between neurons in the hidden layer, as well as the number of neurons in the input and output layers. The mutations implemented for the developed algorithms are: node deletion, connection deletion, node addition, connection addition and node fusion.
The node addition and deletion mutations randomly add or remove a number of neurons bounded by a minimum and a maximum. For deleted neurons, their links are also deleted. For added neurons, the links are randomly established with a value drawn from an interval. The node fusion mutation randomly chooses two neurons, A and B, and replaces them with a new neuron, C. The connections from neuron C to the nodes shared by neurons A and B are preserved. Additionally, those that are not common are kept with a probability of 0.5 (see example in Fig. 2).
In the link addition and deletion mutations, links are added or deleted between the input and the hidden layer and between the hidden and the output layer.
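The node fusion mutation described above can be sketched at the level of connection sets. This is an illustration only: each neuron is represented by the set of nodes it is linked to, the labels such as `("in", 0)` are hypothetical, and the treatment of the weight values of the fused neuron is not covered.

```python
import random

def fuse_nodes(conn_a, conn_b, rng=random):
    """Return the connections of neuron C obtained by fusing A and B.

    Connections shared by A and B are always kept; each connection
    present in only one of them is kept with probability 0.5.
    """
    common = conn_a & conn_b
    non_common = conn_a ^ conn_b
    kept = {c for c in sorted(non_common) if rng.random() < 0.5}
    return common | kept

node_a = {("in", 0), ("in", 1), ("out", 0)}
node_b = {("in", 1), ("in", 2), ("out", 0)}
node_c = fuse_nodes(node_a, node_b, random.Random(0))
```

The fused neuron always keeps the two shared links, `("in", 1)` and `("out", 0)`, while the links to inputs 0 and 2 survive at random.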
For brooding, parametric mutations are applied (they exploit the search space). The parametric mutation modifies the model coefficients by adding Gaussian noise, using a self-adaptive annealing process [51,52]. The variance of the Gaussian distribution depends on a temperature factor $T(i)$ (Eq. 5) based on the aptitude $A(i)$ (see "Evaluation metrics" section) of each individual $i$. The temperature decreases along the evolution to avoid aggressive mutations at the end of that process.
Specifically, the parametric mutation affects the weights $w_{kl}$ (weight for the $k$-th input of the $l$-th neuron of the hidden layer) and $\beta_{lq}$ (weight of the connection from the $l$-th hidden neuron to the $q$-th neuron of the output layer) of the network as follows:
$$w_{kl}(j+1) = w_{kl}(j) + \epsilon_1(j), \qquad \beta_{lq}(j+1) = \beta_{lq}(j) + \epsilon_2(j)$$
where $\epsilon_1(j)$ and $\epsilon_2(j)$ represent one-dimensional normally distributed random values from $N(0, \alpha_1(j)T(i))$ and $N(0, \alpha_2(j)T(i))$, respectively, $\alpha_1(j)$ and $\alpha_2(j)$ are parameters that, together with the temperature (Eq. 5), determine the variance of the distribution, which varies during evolution adapting the learning process, and $j$ is the generation number in the evolutionary process.
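A minimal sketch of the parametric mutation for one group of weights follows. The temperature is taken here as $T(i) = 1 - A(i)$, which is an assumption of this sketch (a common choice in the self-adaptive annealing literature) and not a formula confirmed by the text; `alpha` is a hypothetical value for $\alpha_1(j)$ or $\alpha_2(j)$.

```python
import random

def parametric_mutation(weights, aptitude, alpha, rng=random):
    """Add N(0, alpha * T) noise to every weight of an individual.

    ASSUMPTION: temperature T = 1 - aptitude, so fitter individuals
    receive smaller perturbations. alpha * T is treated as a variance.
    """
    temperature = 1.0 - aptitude
    sigma = (alpha * temperature) ** 0.5   # standard deviation
    return [w + rng.gauss(0.0, sigma) for w in weights]

mutated = parametric_mutation([0.5, -1.0, 0.25], aptitude=0.2, alpha=0.5,
                              rng=random.Random(1))
```

Under this assumption an individual with aptitude 1 is left unchanged, which reflects the idea of avoiding aggressive mutations for good solutions late in the evolution.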

Robust estimators
In the above-mentioned M-SCRO algorithm, the distribution of the fitness function is assumed to be normal throughout the evolution, using the mean as the centralisation statistic and the standard deviation as the dispersion statistic. If the fitness distribution is not normal, it is necessary to change these statistics. The use of these other estimators (fully described in "Selection of individuals in M-DSCRO" section) relies on the theory of robust estimators. A robust estimator is fully efficient for an assumed distribution but maintains high efficiency for plausible alternatives [53,54]. The robustness property can be studied through the breakdown point and the influence function of any estimator.
Robust statistics is an area of mathematical statistics that appeared in the 1960s. Its foundations include mainly three works [53,55,56], which are the basis of later studies on these statistics [57].
One type of robust estimator is the M-estimator, a generalisation of the estimators obtained by the maximum likelihood method, whose objective function is a sample average [58]. Thus, given a sample $x_1, \ldots, x_n$ and a function $\psi$, $T$ is said to be an M-location estimator based on the function $\psi$ if:
$$\sum_{i=1}^{n} \psi(x_i - T) = 0$$
According to the previous expression, the sample mean is a function-based M-estimator, as is the median. The M-estimator concept is introduced by analysing the robustness of two important location estimators, the mean and the median. From a sample of size $n$, both can be obtained by solving two optimisation problems:
$$\min_{T} \sum_{i=1}^{n} (x_i - T)^2$$
for the first estimator and:
$$\min_{T} \sum_{i=1}^{n} |x_i - T|$$
for the second, from which the following equations are obtained:
$$\sum_{i=1}^{n} (x_i - T) = 0, \qquad \sum_{i=1}^{n} \mathrm{sign}(x_i - T) = 0$$
whose solutions are, respectively, the mean and the median of the fitness distribution. Thus, when the distribution is unknown, the sample median is a better estimator of the location parameter than the sample mean.
The robustness of these estimators is closely tied to the properties of the function on which they are based. This is discussed below:
• As for the mean, its associated function is the identity, which explains the excessive sensitivity of this estimator to the presence of extreme values in the sample.
• The sample median is based on a bounded function, making it a less sensitive estimator to the presence of outliers in the sample.
In terms of obtaining robust estimators of scale, there are multiple proposals. The most classical estimator, given a random sample of size $n$, $X_1, X_2, \ldots, X_n$, is the median absolute deviation from the sample median, MAD, defined as $\mathrm{MAD} = \mathrm{median}\{|X_i - MD|;\ i = 1, 2, \ldots, n\}$, where $MD$ is the sample median. The authors in [59] proposed another standard estimator found in many statistical packages: the interquartile range, which has a breakdown point of 25%. The breakdown point is the proportion of contaminated observations beyond which an estimator becomes useless. It is a measure of robustness; the larger the breakdown point, the better the estimator. If an estimator has a high breakdown point, it may be called a resistant statistic.
Regarding breakdown points, MAD has the best possible breakdown point, 50%, and its influence function is bounded, with the sharpest possible bound among all scale estimators. This property of the MAD estimator makes it a better auxiliary scale estimator than the interquartile range. Nevertheless, it also has disadvantages, as its efficiency on Gaussian distributions is low because it first estimates the MD and then assigns equal importance to positive and negative deviations from it. In contrast, the interquartile range does not present this problem because the quartiles need not be equally far from the centre. For this reason, the interquartile range has been used in this paper for the M-DSCRO algorithm, detailed in more depth in the next section.
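The difference in sensitivity between classical and robust estimators can be illustrated with a small numerical example. The sample values are made up for illustration, and the quartile convention is the default of Python's `statistics.quantiles`, which is an assumption (other packages interpolate quartiles differently).

```python
from statistics import mean, median, quantiles, stdev

def mad(sample):
    """Median absolute deviation from the sample median."""
    md = median(sample)
    return median(abs(x - md) for x in sample)

def iqr(sample):
    """Interquartile range Q3 - Q1 (default 'exclusive' quartiles)."""
    q1, _, q3 = quantiles(sample, n=4)
    return q3 - q1

clean = [0.70, 0.72, 0.74, 0.76, 0.78]
dirty = clean + [0.05]          # one extreme fitness value

shift_mean = abs(mean(dirty) - mean(clean))
shift_median = abs(median(dirty) - median(clean))
```

A single outlier moves the mean by more than ten times as much as the median, and the MAD of the contaminated sample stays far below its standard deviation.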

M-DSCRO algorithm
This section describes in more detail the evolutionary stages of the M-DSCRO algorithm, which are similar to those described in "Background and evolutionary stages" section but differ in how individuals are selected in the initialisation and reproduction processes.
Both the M-SCRO and M-DSCRO algorithms developed in this work avoid configuring additional parameters in the stages of the evolutionary process. Furthermore, M-DSCRO also checks whether the fitness of the coral reef population follows a normal distribution; if it does not, the automatic setting of the parameters differs from that of the M-SCRO algorithm, as can be seen in the following subsection.

Selection of individuals in M-DSCRO
As mentioned above, M-SCRO assumes that the FL of the population follows a normal distribution. However, it is reasonable to expect that this is not always the case in the evolutionary process. With the M-DSCRO algorithm, whether the FL follows a normal distribution is checked several times during the evolution. Specifically, a Kolmogorov-Smirnov test [60] is used to check the normality of the fitness distribution. Based on the theory of robust estimators explained in "Robust estimators" section, the process for selecting corals in the initialisation and reproductive stages of the algorithm varies.
If the distribution is normal, the coral selection process is the same as that established in the M-SCRO algorithm. Otherwise, M-DSCRO uses:
• The median of the fitness at the $j$-th generation, $MD_{f_j}$, instead of the mean $\bar{f}_j$ as centralisation estimator. For the sake of simplicity, the notation $C_j$ stands for $\bar{f}_j$ or $MD_{f_j}$, depending on whether the fitness distribution is normal or not, respectively.
• The interquartile range, $IRQ_{f_j}$, instead of $S_{f_j}$ as scale estimator. For simplicity, the notation $SC_j$ stands for $S_{f_j}$ or $IRQ_{f_j}$, depending on whether the fitness distribution is normal or not, respectively.
The Kolmogorov-Smirnov test is computed only in the following cases:
• When the population is initialised (after generating and evaluating the reef), that is, in the first generation.
• If $G$ is the total number of generations, M-DSCRO also applies the Kolmogorov-Smirnov test in generations $2G/7$, $4G/7$ and $6G/7$ of the evolutionary process. That is, the statistics selected by the test results in generation $2G/7$ are used until generation $4G/7$, and so on. For instance, if the fitness distribution is not normal at generation $2G/7$, then $C_j = MD_{f_j}$ and $SC_j = IRQ_{f_j}$ are used until generation $4G/7$, when the test is applied again.
• If the condition for the coral reef to be reset occurs.
The above values are set in this way for two reasons: (1) more checks throughout the evolution did not provide better results and increased the computational cost; (2) after the initial check, we continue at $2G/7$ to give the population some time to evolve, do another check about halfway through the evolution, and finish with $6G/7$ to give the population some time to evolve after the third check and the possible change of statistics.
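The schedule of normality checks and the switch between statistics can be sketched as follows. The outcome of the normality test is passed in as a flag (in practice it could come from a Kolmogorov-Smirnov routine such as `scipy.stats.kstest`); the rounding used when $G$ is not a multiple of 7, the quartile convention and the function names are assumptions of this sketch.

```python
from statistics import mean, median, quantiles, stdev

def ks_check_generations(G):
    """Generations 2G/7, 4G/7 and 6G/7 (rounded, an assumption)."""
    return [round(k * G / 7) for k in (2, 4, 6)]

def centre_and_scale(fitness, is_normal):
    """Return (C_j, SC_j) for the current population.

    Mean and standard deviation if the fitness looks normal,
    otherwise median and interquartile range.
    """
    if is_normal:
        return mean(fitness), stdev(fitness)
    q1, _, q3 = quantiles(fitness, n=4)
    return median(fitness), q3 - q1

checks = ks_check_generations(700)
c_j, sc_j = centre_and_scale([0.1, 0.2, 0.3, 0.4, 0.5], is_normal=False)
```

Between two checks the same pair of estimators is reused, so the test adds only a handful of evaluations over the whole run.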

Initialisation in M-DSCRO
M-DSCRO initialises $P$ corals on the reef, i.e. $P$ random ANNs representing feasible solutions to our problem. In the initialisation phase, instead of using the interval $(\bar{f}_1 - S_{f_1}, 1]$ described in "Initialisation in M-CRO and M-SCRO" section, the interval $(C_1 - SC_1, 1]$ is used ($\rho$ is unnecessary). That is, those corals whose fitness is not in the above interval are eliminated. If $(C_1 - SC_1) \le 0$, no corals are removed from the reef. Algorithm 1 summarises this procedure.

Input: Random number of neurons, links and weights for each coral; size of coral reef.
Output: Initial population.
1: for each position of the coral reef do
2:   Generate a coral (random ANN).

Sexual reproduction in M-DSCRO
Following the philosophy of the M-CRO and M-SCRO algorithms, there are two types of sexual operators. On the one hand, external sexual reproduction or broadcast spawning must explore the search space. As mentioned in "Crossover operator in artificial neural networks" section, using crossover operators with EANNs has several drawbacks. Therefore, structural mutations are applied for those corals whose fitness function satisfies:
$$f_{ij} \in (C_j - SC_j, 1]$$

Settlement in M-DSCRO
The larvae settlement follows the same structure as in M-CRO and M-SCRO algorithms.
If the settlement is carried out after sexual reproduction, the individuals to be established are in the pool. If the settlement takes place after asexual reproduction, it is the duplicated and optimised individual that tries to establish itself on the reef. Algorithm 4 shows this procedure.

Save the best individual in the new coral.
3: for each position of the coral reef do
4:   Generate a coral (random ANN).

Stop condition in M-DSCRO
Finally, note that, in addition to checking whether the maximum number of generations has been reached in the stop condition, M-DSCRO applies a Kolmogorov-Smirnov test in generations $2G/7$, $4G/7$ and $6G/7$ of the evolutionary process to check whether the fitness distribution of the population is normal.

Experiments
In this section, detailed information will be provided about the datasets used in the experimentation, the ANN model used as individuals, the metrics employed to evaluate the performance of the different algorithms, and the experimental setup of the parameters needed.

Datasets used in our experiments
In this work, the performance of M-CRO, M-SCRO and M-DSCRO is evaluated considering a total of 40 datasets for supervised classification problems, whose main characteristics are summarised in Table 1: identifier (ID), assigned by ordering the datasets alphabetically, name (Dataset), number of patterns (#Patt.), characteristics (#Char.), classes (#Classes), their distribution (Class Dist.), the imbalance ratio (IR), and the source of information (Source). As can be seen, this selection includes various types of classification problems with different fields of application (medical, energy or benchmarks, among others). Also, they cover a wide variety in terms of the number of patterns (from 132 to 19020), number of attributes (2 to 121), classes (2 to 15) and imbalance ratio (1 to 40). The underlying idea is to test the memetic algorithms on a wide variety of datasets and to verify that the dynamic proposal (M-DSCRO), as the paper's contribution, improves on the other two algorithms.

Table 1. Characteristics of the selected classification datasets, sorted alphabetically. Note that some databases have undergone preprocessing, such as removing missing values or binarising categorical variables. This causes the number of patterns or attributes of these databases to vary slightly. However, the final databases will be available in Section Data availability. Within the table, #Patt. represents the number of patterns, #Char. denotes the number of characteristics, #Classes signifies the number of classes present in the dataset, Class Dist. indicates the distribution of classes, and IR refers to the Imbalance Ratio, providing a comprehensive overview of the dataset features.

ANN model
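We use MLPs with one hidden layer, sigmoidal transfer functions (sigmoidal units, SUs) in the hidden layer, and linear units in the output layer, whose functional model can be represented as:

$$f_q(\mathbf{x}) = \beta_0^q + \sum_{l=1}^{M} \beta_l^q B_l(\mathbf{x}, \mathbf{w}_l), \quad q = 1, \ldots, Q,$$

replacing $B_l(\mathbf{x}, \mathbf{w}_l)$ by:

$$B_l(\mathbf{x}, \mathbf{w}_l) = \frac{1}{1 + \exp\left(-\left(w_{0l} + \sum_{k=1}^{K} w_{kl} x_k\right)\right)},$$

where $\mathbf{w}_l = (w_{1l}, \ldots, w_{Kl})$ is the vector of weights of the connections between the input layer and the $l$-th hidden node, $w_{0l}$ is the bias of that node, $\beta_l^q$ and $\beta_0^q$ are the weights and bias of the $q$-th output unit, $M$ is the number of sigmoidal units in the hidden layer, $Q$ is the number of classes of the problem, $K$ is the number of features in each pattern to be classified, $\mathbf{x}$ is the input pattern and $B_l(\mathbf{x}, \mathbf{w}_l)$ is the sigmoidal basis function.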
Taking the softmax activation function into account, represented in Eq. (16), it can be observed that the class predicted by the classifier corresponds to the neuron in the output layer whose a posteriori probability is greatest:

$$g_q(\mathbf{x}) = \frac{\exp f_q(\mathbf{x})}{\sum_{j=1}^{Q} \exp f_j(\mathbf{x})}, \quad q = 1, \ldots, Q, \tag{16}$$

where $f_q(\mathbf{x})$ is the output of the $q$-th output neuron for pattern $\mathbf{x}$, and $g_q(\mathbf{x})$ is the probability that pattern $\mathbf{x}$ belongs to the $q$-th class. Since these probabilities sum to one, one of the classes does not need to be estimated.
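As an illustration of the model described above, the following is a minimal sketch of the forward pass of a single-hidden-layer network with sigmoidal hidden units, linear output units and a softmax. The function names, the toy weights and the use of all Q outputs (rather than Q − 1) are assumptions made for this example only; it is not the authors' implementation.

```python
import math

def sigmoid_unit(x, w, bias):
    """Sigmoidal basis function of one hidden node (numerically stable form)."""
    z = bias + sum(wk * xk for wk, xk in zip(w, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def forward(x, hidden, output):
    """Return the class probabilities g_q(x) for one input pattern x.

    hidden: list of (weights, bias) pairs, one per sigmoidal unit (M items).
    output: list of (weights, bias) pairs, one per linear output unit;
            each weight vector has M entries.
    """
    basis = [sigmoid_unit(x, w, b0) for w, b0 in hidden]
    f = [b0 + sum(wl * bl for wl, bl in zip(w, basis)) for w, b0 in output]
    # Softmax; subtracting max(f) avoids overflow without changing the result.
    m = max(f)
    e = [math.exp(v - m) for v in f]
    total = sum(e)
    return [v / total for v in e]

# Hypothetical toy network: K = 2 inputs, M = 2 hidden units, Q = 3 classes.
hidden = [([0.5, -1.0], 0.1), ([1.5, 0.3], -0.2)]
output = [([1.0, -1.0], 0.0), ([0.2, 0.4], 0.1), ([-0.5, 0.5], 0.0)]
probs = forward([1.0, 2.0], hidden, output)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```

The predicted class is simply the index of the largest probability, which matches the decision rule stated in the text.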

Evaluation metrics
When considering classification problems, the most common metric is the percentage of patterns correctly classified or Correctly Classified Rate (CCR), which is defined formally as follows:

$$CCR = \frac{1}{N} \sum_{q=1}^{Q} n_{qq},$$

where $N$ is the number of patterns in the training or generalisation set, and $n_{qq}$ is the number of patterns from the $q$-th class that are correctly classified.
CCR is a general approach to assess the goodness of the classification model. However, CCR only captures the global accuracy of the model without considering the minority classes in imbalanced datasets. As seen in Table 1, many of the datasets are imbalanced (IR > 1), so another metric should be used for comparison purposes. In this sense, the Minimum Sensitivity (MS) 45 is the accuracy rate of the worst classified class. MS is defined as follows:

$$MS = \min\{S_q;\; q = 1, \ldots, Q\},$$

where $S_q$ is the percentage of examples correctly predicted as belonging to the $q$-th class. Thus, $S_q = n_{qq}/n_q$, where $n_q$ represents the total count of patterns belonging to the $q$-th class. This metric is used because it is more directly interpretable than other alternatives considered (such as a multiclass F1-score): by calculating this ratio for the worst classified class (minimum value of the sensitivity), the classifier is ensured to obtain at least the given performance for all classes of the problem.
In this way, we obtain not only a value for the overall performance of the evaluated algorithms but also a measure of how they behave in the worst classified class.
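The two metrics can be computed directly from a confusion matrix. The sketch below assumes that rows hold the true class and columns the predicted class, and that every class has at least one pattern; the matrix itself is a made-up example.

```python
def ccr(confusion):
    """Correctly Classified Rate: sum of the diagonal over the total count."""
    n = sum(sum(row) for row in confusion)
    return sum(confusion[q][q] for q in range(len(confusion))) / n

def minimum_sensitivity(confusion):
    """Minimum over classes of S_q = n_qq / n_q (row q holds true class q)."""
    return min(row[q] / sum(row) for q, row in enumerate(confusion))

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted).
confusion = [
    [90, 5, 5],
    [10, 30, 10],
    [2, 2, 6],
]
ccr_value = ccr(confusion)
ms_value = minimum_sensitivity(confusion)
print(ccr_value, ms_value)
```

Here 126 of the 160 patterns are on the diagonal, so CCR is about 0.79, while the two smaller classes are each recognised only 60% of the time, which is what MS reports.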
However, the error function used during the evolutionary training process of the ANNs is the cross-entropy function, E. The E error function is continuous, which makes the convergence more robust than when using CCR. E takes values between 0 and ∞:

$$E = -\frac{1}{N} \sum_{n=1}^{N} \sum_{q=1}^{Q} y_n^{q} \ln g_q(\mathbf{x}_n),$$

where $y_n^{q}$ is equal to 1 if pattern $n$ belongs to the $q$-th class, and 0 otherwise, and where $g_q(\mathbf{x}_n)$ is the predicted probability (Eq. 16) that pattern $n$ belongs to class $q$. Finally, the metric is transformed to be maximised in the interval (0, 1] by the expression $A = 1/(1 + E)$ so that the statistically-driven evolutionary procedure makes sense.
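A minimal sketch of the error and fitness computation follows. It assumes the usual mean cross-entropy, E = −(1/N) Σ_n Σ_q y_n^q ln g_q(x_n), and adds a small clipping constant `eps` (an assumption of this example, not mentioned in the text) so that the logarithm stays finite.

```python
import math

def cross_entropy(y_true, probs, eps=1e-12):
    """Mean cross-entropy E over N patterns.

    y_true: list of class indices (the class q for which y_n^q = 1).
    probs:  list of probability vectors g(x_n), one per pattern.
    eps:    hypothetical clipping constant that keeps log() finite.
    """
    total = 0.0
    for label, g in zip(y_true, probs):
        total -= math.log(max(g[label], eps))
    return total / len(y_true)

def fitness(error):
    """Map E in [0, inf) to a fitness A = 1 / (1 + E) that is maximised."""
    return 1.0 / (1.0 + error)

# Made-up predictions for three patterns of a 3-class problem.
y_true = [0, 1, 2]
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
e = cross_entropy(y_true, probs)
a = fitness(e)
print(e, a)
```

A perfect classifier has E = 0 and therefore A = 1, while A approaches 0 as the error grows.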

Experimental setting
Each dataset is divided using a stratified hold-out, with 75% of the patterns for training and the remaining 25% for testing. Given the stochasticity of the algorithms, they are run 30 times with different seeds to obtain statistically representative average results.
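The stratified hold-out can be sketched as follows. The function name and the rounding rule for the per-class cut are assumptions of this example; the text does not state whether the seed also changes the partition or only the algorithm's random choices, so the sketch covers the split alone.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, train_fraction=0.75, seed=0):
    """Return (train_idx, test_idx) keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Assumption: the cut is rounded to the nearest integer per class.
        cut = round(train_fraction * len(members))
        train_idx.extend(members[:cut])
        test_idx.extend(members[cut:])
    return sorted(train_idx), sorted(test_idx)

# Made-up imbalanced label vector: 80 patterns of class 0, 20 of class 1.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_holdout(labels, seed=1)
print(len(train_idx), len(test_idx))
```

Splitting class by class is what keeps the minority class represented in both partitions, which matters for the MS metric.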
A multilayer neural network with a single hidden layer has been shown to be a universal approximator. Thus, a shallow network with a single layer of hidden units has been used, given that it is sufficient to represent any function with the necessary degree of accuracy 67. For the hidden and output layers, the initialisation of the weights is random, and a bias value needs to be trained for each SU and linear unit. The number of outputs corresponds to the number of classes minus one for a given dataset (given that the softmax function is used in the output layer).
For the structural mutations, the probability of choosing each type of mutation is equal to 1/5. One or two neurons are added or deleted during these mutations. For adding or deleting links, we randomly add or delete 30% of the links between the input and hidden layers and 5% of those between the hidden and output layers. Weights are assigned using a uniform distribution defined over two intervals: [-5, 5] for connections between the input layer and the hidden layer, and [-10, 10] for connections between the hidden layer and the output layer. For the parametric mutation, $\alpha_1(0) = 0.5$ and $\alpha_2(0) = 1$. All these parameter values have been taken from previous references 49, 50, which present an EA with similar mutators. Note that, in any case, the use of an EA, which dynamically adapts to the problem evaluated, results in a performance that is negligibly affected by minor changes in these parameters.
Based on the literature 20, the coral reef size (P) has been set to 100 individuals, and the number of settlement attempts is 2 for each individual of the pool. As mentioned above, M-CRO requires a more extensive configuration than its statistical versions. The ratio of free positions on the reef ($\rho$) is set to 0.1, the percentage for asexual reproduction ($F_a$) is set to 0.05, the percentage for broadcast spawning is set to 0.75, with the remaining 0.25 selected for brooding, and, finally, the percentage and probability of depredation are 0.05 and 0.1, respectively.
Although the thresholds for the coral reef reset conditions are configurable, a threshold value of 2% (0.02) for the $bw_j$ parameter is a reasonable choice. In the same way, a threshold value of 0.05 for $S^2_{f_j}$ is robust. For the iRprop+ algorithm, the number of epochs is set to 25 (a larger number of epochs does not improve the results), $\eta^+ = 1.2$, $\eta^- = 0.5$, $\Delta_0 = 0.0125$ (the initial value of the $\Delta_{ij}$), $\Delta_{\min} = 0$ and $\Delta_{\max} = 50$ (see 27 for iRprop+ parameter descriptions).
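To make the role of these parameters concrete, the following is a sketch of the iRprop+ update as it is commonly described (sign-based steps with per-weight step sizes and weight backtracking when the error increases), using the parameter values quoted above as defaults. The toy quadratic loss, the function names and the starting point are assumptions of this example and do not come from the paper.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def irprop_plus(grad, loss, w, epochs=25, eta_plus=1.2, eta_minus=0.5,
                delta0=0.0125, delta_min=0.0, delta_max=50.0):
    """Minimise loss(w) with iRprop+ and return the final weight list."""
    w = list(w)
    n = len(w)
    delta = [delta0] * n          # per-weight step sizes
    prev_grad = [0.0] * n
    prev_step = [0.0] * n
    prev_loss = float("inf")
    for _ in range(epochs):
        g = grad(w)
        cur_loss = loss(w)
        for i in range(n):
            s = prev_grad[i] * g[i]
            if s > 0:             # same sign: accelerate
                delta[i] = min(delta[i] * eta_plus, delta_max)
                prev_step[i] = -sign(g[i]) * delta[i]
                w[i] += prev_step[i]
                prev_grad[i] = g[i]
            elif s < 0:           # sign change: slow down, maybe backtrack
                delta[i] = max(delta[i] * eta_minus, delta_min)
                if cur_loss > prev_loss:
                    w[i] -= prev_step[i]
                prev_grad[i] = 0.0
            else:                 # first step or step after a sign change
                prev_step[i] = -sign(g[i]) * delta[i]
                w[i] += prev_step[i]
                prev_grad[i] = g[i]
        prev_loss = cur_loss
    return w

# Hypothetical toy problem: a quadratic bowl with its minimum at (1, -2).
def toy_loss(w):
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

def toy_grad(w):
    return [2.0 * (w[0] - 1.0), 2.0 * (w[1] + 2.0)]

w_final = irprop_plus(toy_grad, toy_loss, [0.0, 0.0])
print(w_final, toy_loss(w_final))
```

Because only the sign of the gradient is used, the small initial step of 0.0125 grows geometrically by a factor of 1.2 until the minimum is overshot, which is why a short budget of 25 epochs can be enough for a local refinement.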
For the sake of conciseness, we have chosen to present only the results of the memetic versions, since they are all better than their standard versions and the same analysis can be extracted.
To further validate the effectiveness of the proposed M-DSCRO method, it was compared against four well-known state-of-the-art methods: C4.5 Decision Tree, Logistic Regression (LR), Multilayer Perceptron (MLP), and Support Vector Machine (SVM). The aim was to surpass these established algorithms in performance. The selection of hyperparameters for each algorithm involved a nested 10-fold cross-validation process, repeated three times on the training dataset, focusing on minimising the cross-validation error. The optimal hyperparameter set, resulting in the lowest cross-validation error, was then applied to the entire training dataset to evaluate the final performance on the test set. The hyperparameter tuning was conducted as follows. For LR, both l1 and l2 penalty functions were considered, with the cost parameter ($C$) ranging from $10^{-3}$ to $10^{3}$. In the case of SVM with a Gaussian kernel, $C$ and the kernel width ($\sigma$) were varied within the same range of $10^{-3}$ to $10^{3}$. The C4.5 Decision Tree's configuration included the Gini index and entropy as criteria, with the maximum tree depth set between 3 and 6, and the minimum number of samples required at a leaf node ranging from 2 to 10. For the MLP, a single hidden layer was used, with the number of neurons in this layer chosen from the set {2, 4, 6, 8, 10} (similar to our neural networks). The training iterations were set within {500, 1000, 1500}, the learning rate ($\alpha$) was chosen from {0.1, 0.5, 1}, and the momentum ($\mu$) from {0.3, 0.5, 0.7, 0.9}.
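The search space for the MLP baseline can be written down as a small grid, as sketched below. The dictionary keys and the helper `expand` are hypothetical names; the power-of-ten grid for $C$ is also an assumption, since the text only gives the range.

```python
from itertools import product

# Values listed in the text for the MLP baseline.
mlp_grid = {
    "hidden_neurons": [2, 4, 6, 8, 10],
    "iterations": [500, 1000, 1500],
    "learning_rate": [0.1, 0.5, 1.0],
    "momentum": [0.3, 0.5, 0.7, 0.9],
}

# Assumption: C is sampled at powers of ten; only the range is given.
c_values = [10.0 ** e for e in range(-3, 4)]

def expand(grid):
    """Yield one dict per hyperparameter combination."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

mlp_candidates = list(expand(mlp_grid))
print(len(mlp_candidates), c_values)
```

With these sets the MLP grid contains 5 × 3 × 3 × 4 = 180 candidate configurations, each of which would be scored by the cross-validation procedure described in the text.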

Results and discussion
The results of CCR and MS are shown in Table 2, in which the mean values and standard deviations ($\text{Mean}_{\text{std}}$) of the 30 runs for each algorithm have been calculated, as well as the average number of neurons (#Neur.) and links (#Links) used by the best methodology in CCR. Also, each method's mean ranking (r) has been included, assigning 1 to the best method and 3 to the worst. The best method for each dataset is in bold, while the second one is in italics, considering the two metrics separately.
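The mean ranking can be obtained by ranking the algorithms within each dataset and averaging the ranks over datasets. The sketch below assumes that tied scores share the average of the ranks they span, a convention the text does not specify; the score table is made up.

```python
def rank_row(scores):
    """Rank the scores of one dataset: 1 = best (highest); ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks

def mean_rankings(table):
    """Average the per-dataset ranks of each algorithm (one column each)."""
    per_dataset = [rank_row(row) for row in table]
    n = len(table)
    return [sum(r[j] for r in per_dataset) / n for j in range(len(table[0]))]

# Hypothetical CCR values: rows are datasets, columns are three algorithms.
table = [
    [0.80, 0.82, 0.85],
    [0.90, 0.90, 0.91],
    [0.70, 0.75, 0.74],
]
rankings = mean_rankings(table)
print(rankings)
```

With three algorithms the mean rankings always add up to 6, which is a quick sanity check on any ranking column.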
The proposed method, M-DSCRO, achieves the best results in CCR. It obtains the highest value in 36 out of 40 datasets and the second-best value in 3 out of 40 datasets. This demonstrates that the algorithm is capable of achieving excellent global accuracy in almost all databases, showcasing its robustness across various applications and databases with diverse characteristics. The second-best performing algorithm is M-SCRO. It obtains the highest and second-best results in 3 and 30 datasets, respectively. In contrast, M-CRO achieves the best value in only 2 datasets, indicating its weak performance when compared to the other two methods. These results are consistent with the literature, which shows that SCRO performs better than CRO. Moreover, the dynamic approach enhances their overall performance. The average rankings confirm this analysis, with M-DSCRO having the lowest value (closest to one), and M-SCRO coming in second position.
Regarding the other metric, the algorithms improve less in MS than in CCR. This is to be expected, as the algorithm is set up to improve entropy, a metric directly related to the percentage of total patterns correctly classified. Nevertheless, the results obtained are reasonable in almost all datasets.
In this sense, it can be observed that the behaviour is similar to that of CCR but with more variability. Specifically, the M-DSCRO algorithm has the highest value in 26 out of 40 datasets and is the second-best in 11. Additionally, the differences between M-CRO and M-SCRO are even smaller when looking at this metric. In this case, M-SCRO has the highest MS in 8 datasets, while M-CRO is the best one in 7. These results align with the findings of a previous study 41, where SCRO outperformed CRO, but the difference was not significant. It is worth noting that the mean rankings of M-CRO and M-SCRO (2.44 and 2.11) are much closer when analysing MS than when analysing CCR. Hence, the statistical version of the algorithm can improve the global accuracy while reducing the number of parameters that need to be determined. However, the accuracy of the worst classified class only improves slightly. The dynamic proposal (M-DSCRO) overcomes this disadvantage by providing a significant improvement, as observed in this analysis.

Table 2. Mean and standard deviation values ($\text{Mean}_{\text{std}}$) of Correct Classification Rate (CCR) and Minimum Sensitivity (MS) obtained by all the algorithms in each dataset for the 30 runs. Average number of neurons (#Neur.) and links (#Links) used by the best methodology in CCR. The mean rankings of all algorithms are also included. The best method for each dataset is in bold, while the second one is in italics.

When exploring stochastic algorithms, the standard deviation of the different runs is an important feature to consider. If the deviations are close to 0, the algorithms are robust and not dependent on random initialisation. This is the case for all three algorithms being evaluated. In addition, the proposed algorithm (M-DSCRO) not only improves in average terms but also reduces the standard deviation in almost all the databases, demonstrating the excellent stability of the algorithm.
In order to analyse the results from a statistical point of view, a set of statistical tests has been used. Firstly, CCR values are analysed. A Friedman test 68 has been applied to the CCR rankings, which states that, for a level of significance α = 5%, the confidence interval is $C_0 = (0, F_{0.05} = 3.11)$, and the F-distribution statistical value is $F^* = 68.45 \notin C_0$. Therefore, the test rejects the null hypothesis stating that all algorithms perform equally in mean ranking of CCR. That is, the algorithm effect is statistically significant. Because of this, the best performing method in CCR is considered as the control method for a post-hoc test 69, comparing this algorithm with the other methods. It has been found that comparing all algorithms with a given algorithm (control method) is more sensitive than comparing all algorithms with each other. Holm's test compares the $i$-th and $j$-th algorithms with the following statistic:

$$z = \frac{r_i - r_j}{\sqrt{\dfrac{k(k+1)}{6N}}},$$

where $r_i$ is the mean ranking of the $i$-th algorithm, $k$ is the number of algorithms, and $N$ is the number of datasets.
With the value of z, the corresponding probability under a normal distribution is obtained and compared with a level of significance α. Holm's test adjusts the value of α to compensate for multiple comparisons, using a procedure that sequentially tests the hypotheses ordered by their significance. The ordered p-values are denoted by $p_1, p_2, \ldots, p_{k-1}$, so that $p_1 \le p_2 \le \ldots \le p_{k-1}$. The test compares each $p_i$ with $\alpha^*_i = \alpha/(k - i)$, starting with the most significant p-value. If $p_1$ is lower than $\alpha/(k - 1)$, the corresponding hypothesis is rejected, and $p_2$ is then compared with $\alpha/(k - 2)$, and so on. As soon as a null hypothesis cannot be rejected, all the remaining ones are retained as well.
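The step-down procedure can be sketched as follows, assuming the usual statistic z = (r_i − r_j) / sqrt(k(k+1)/(6N)) and two-sided normal p-values (the text does not say whether one- or two-sided values are used). The M-CRO and M-SCRO values are the MS mean rankings quoted in the text; the M-DSCRO value of 1.45 is an assumption derived from the fact that three mean rankings must add up to 6.

```python
import math

def holm_vs_control(mean_ranks, control, n_datasets, alpha=0.05):
    """Holm's step-down test of every algorithm against a control.

    Returns a list of (name, p_value, corrected_alpha, rejected) tuples,
    ordered from the most to the least significant comparison.
    """
    k = len(mean_ranks)
    se = math.sqrt(k * (k + 1) / (6.0 * n_datasets))
    results = []
    for name, r in mean_ranks.items():
        if name == control:
            continue
        z = (r - mean_ranks[control]) / se
        p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
        results.append((name, p))
    results.sort(key=lambda item: item[1])
    decisions = []
    still_rejecting = True
    for i, (name, p) in enumerate(results, start=1):
        threshold = alpha / (k - i)
        # Once one hypothesis is retained, all later ones are retained too.
        still_rejecting = still_rejecting and p < threshold
        decisions.append((name, p, threshold, still_rejecting))
    return decisions

# MS mean rankings; the control value 1.45 is an assumption (see above).
ranks = {"M-CRO": 2.44, "M-SCRO": 2.11, "M-DSCRO": 1.45}
decisions = holm_vs_control(ranks, control="M-DSCRO", n_datasets=40)
for name, p, threshold, rejected in decisions:
    print(name, p, threshold, rejected)
```

Under these assumptions both comparisons fall below their corrected thresholds (0.025 and 0.05), so both null hypotheses are rejected in favour of the control.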
The results of Holm's test are reported in Table 3. When using M-DSCRO as the control algorithm (CA), Holm's test shows that $p_i < \alpha^*_i$ in all cases, for α = 0.05, confirming that there are statistically significant differences favouring M-DSCRO. In addition, M-SCRO is statistically better than M-CRO using CCR as a comparison metric (although the differences are smaller).
Similarly, to determine the existence of statistical differences when comparing MS, a Friedman test has been carried out showing that, for a level of significance α = 5%, the F-distribution value obtained is $F^* = 13.23$, which is outside the confidence interval $C_0 = (0, F_{0.05} = 3.11)$. So, again, there are significant differences between the algorithms and, consequently, a Holm's test has been run with M-DSCRO as CA. The results presented in Table 3 confirm that M-DSCRO is statistically better than the other two methods. Furthermore, in this case, there are no significant differences between M-CRO and M-SCRO, as previously suggested.
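The critical values quoted in the text (3.11 for three algorithms, 2.43 for five, both with 40 datasets) appear consistent with an F distribution with k − 1 and (k − 1)(N − 1) degrees of freedom, which suggests the Iman-Davenport form of the Friedman statistic. The sketch below computes that form from mean rankings under this assumption; the M-DSCRO ranking of 1.45 is again derived, not reported, from the requirement that three mean rankings add up to 6.

```python
def friedman_chi2(mean_ranks, n_datasets):
    """Friedman chi-square statistic computed from the mean rankings."""
    k = len(mean_ranks)
    sum_sq = sum(r * r for r in mean_ranks)
    return 12.0 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)

def iman_davenport_f(mean_ranks, n_datasets):
    """F-distributed correction of the Friedman statistic (assumed variant)."""
    k = len(mean_ranks)
    chi2 = friedman_chi2(mean_ranks, n_datasets)
    return (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)

# MS mean rankings for M-CRO, M-SCRO and (assumed) M-DSCRO.
ms_ranks = [2.44, 2.11, 1.45]
f_stat = iman_davenport_f(ms_ranks, n_datasets=40)
print(f_stat)
```

With these rounded rankings the statistic comes out at roughly 13.3, close to the value of 13.23 reported for MS; the small gap is compatible with rounding of the rankings.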

Examining the imbalance ratio
A range of imbalanced datasets was used as part of the experimental validation. As stated, for each classification dataset, the IR has been calculated as the ratio of the number of patterns in the majority class to the number of patterns in the minority class. This information is reported in the column IR in Table 1. Furthermore, Fig. 3 shows a graph summarising the performance in CCR (a) and MS (b) with the databases sorted in increasing order of IR, which facilitates the discussion of the results by analysing this characteristic.
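The imbalance ratio as defined here is straightforward to compute from the class labels. In the sketch below the function name and the label vector are made up for illustration.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Size of the majority class divided by the size of the minority class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Made-up label vector with classes of 120, 30 and 15 patterns.
labels = ["a"] * 120 + ["b"] * 30 + ["c"] * 15
ir = imbalance_ratio(labels)
print(ir)
```

A perfectly balanced dataset gives IR = 1, the lower end of the range covered by the selected datasets.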
After sorting the databases based on their IR and analysing the results, we noticed that for CCR, the M-DSCRO algorithm outperforms the rest in almost all databases, regardless of imbalance.
When studying MS, it is observed that M-DSCRO performs the best in classifying the minority classes of both balanced and imbalanced databases. However, for extremely imbalanced databases with IR greater than eight, those located to the right of the orange vertical line in Fig. 3b, M-DSCRO is the best in two out of six, with M-CRO performing better in the remaining four. This suggests that while M-SCRO improves global performance for extremely imbalanced databases, it worsens the performance in minority classes. However, this issue is partly resolved with the proposed dynamic version of the algorithm.

Examining the number of classes
The datasets also cover a wide range in terms of the number of classes, from 2 to 15, as listed in column #Classes in Table 1. As in the previous part, Fig. 4 shows the CCR (a) and MS (b) performance of the three algorithms on the datasets sorted in ascending order by the number of classes. Regardless of the number of classes, M-DSCRO performs equally well in CCR. Nonetheless, it can be observed that the few cases in which it is not the best occur in databases with 2 classes (those to the left of the orange line in Fig. 4a). This implies that the dynamic methodological approach excels in more complex databases in terms of overall accuracy.
In MS, the M-DSCRO algorithm performs very well for 2 and 3 classes (see the vertical orange line in Fig. 4b). However, for a larger number of classes, the algorithm is as competent as M-CRO, while M-SCRO performs the worst. In other words, the performance of M-SCRO decreases as the number of classes increases. Therefore, thanks to the dynamic approach, the statistical version becomes more powerful without compromising performance for databases with few classes, solving its disadvantages.

Examining the size
A final analysis has been conducted to check whether the database size affects the results obtained. For this purpose, the number of patterns, the number of attributes, and the total size, considered as the product of both, have been studied.
However, no relationship has been found for any of these elements, since the best results of the algorithms are not concentrated in small or large databases but are distributed regardless of size.

Comparison with state-of-the-art models
In this phase, the performance of the newly introduced M-DSCRO technique is evaluated against other state-of-the-art machine learning models, including C4.5 decision trees, Logistic Regression (LR), Multilayer Perceptron (MLP), and Support Vector Machines (SVM). Table 4 summarises the mean (Mean) values of CCR and MS (we omit the standard deviation to improve the readability of the results). As previously, an average ranking r for each approach is provided, where 1 represents the top-performing method, and 5 is the least effective. The leading method for every dataset is highlighted in bold, and the runner-up is denoted in italics, with both metrics evaluated independently. Regarding the CCR metric, the M-DSCRO algorithm proposed herein secures the highest performance across 18 databases and attains the second-highest performance in 9 others, as evidenced by its mean rank (r) of 2.15, positioning it as the foremost method. Following closely is the SVM model, which leads in 11 databases, culminating in an average rank of 2.7. This performance underscores the proposal's competitive edge in overall accuracy when compared with state-of-the-art methods.
Furthermore, according to the MS metric, the M-DSCRO algorithm outperforms the others, securing the top position in 18 datasets and the second-highest in 9, with an average rank of 2.03. In this scenario, a tie for second place emerges between the C4.5 and LR models, each with an average ranking of 2.96, demoting the SVM model to the fourth position. This indicates that our method maintains superior performance across the board, including for minority classes, whereas the SVM model, despite its overall effectiveness, falls short in adequately addressing minority classes, which are often of paramount importance.
As with previous analyses, to assess whether significant statistical differences exist in the performance metrics CCR and MS, two Friedman tests were conducted. These tests revealed that, at a significance level of α = 5%, the F-distribution values achieved were $F^* = 7.67$ for CCR and $F^* = 8.37$ for MS. Both values exceed the bounds of the confidence interval $C_0 = (0, F_{0.05} = 2.43)$, indicating significant disparities among the evaluated methods. Consequently, Holm's post-hoc test was applied to each metric with the M-DSCRO method serving as the control algorithm, and the findings are compiled in Table 5.
Although the M-DSCRO algorithm exhibits superior performance in terms of CCR, Holm's test indicates that its advantage over the SVM model is not statistically significant, whereas significant disparities are observed when compared to C4.5, LR, and MLP. In contrast, the MS metric clearly demonstrates the algorithm's superior performance relative to the other methods, with significant differences in every comparison. Based on these findings, the adoption of the M-DSCRO methodology for classification problems is confidently endorsed.

Conclusions
This paper proposes three memetic algorithms for training and optimising the topology and weights of ANNs simultaneously. Concretely, CRO and its statistical version, SCRO, have been implemented and adapted using suitable operators for this purpose, resulting in the M-CRO and M-SCRO algorithms. Also, an improved version of M-SCRO has been proposed in which the hypothesis of normal fitness distribution is tested during the evolution, motivated by the theory of robust estimators. In this way, the algorithm dynamically selects the intervals based on the calculated centralisation and scaling estimators, resulting in the so-called M-DSCRO. The results show that M-SCRO statistically improves on M-CRO in terms of CCR but is equal in terms of MS. However, M-SCRO does not require parameter tuning. The newly proposed M-DSCRO methodology outperforms the other two algorithms while retaining the advantage of M-SCRO, i.e. eliminating the need for manual parameter value adjustment, thanks to a dynamic update of the parameters during evolution that considers robust estimators and avoids the assumption of normality of the fitness distribution.

According to the study performed, the results of M-DSCRO are significantly better in terms of CCR and MS. The M-DSCRO algorithm demonstrated superior CCR at diverse imbalance levels and performed well in MS across balanced and imbalanced datasets. However, in cases of high imbalance (IR > 8), M-CRO outperformed M-DSCRO in minority class classification in some instances, so the dynamic version only partially bridges the gap left by M-SCRO. The datasets varied from 2 to 15 classes, with M-DSCRO showing consistent CCR performance and excelling in datasets with fewer classes. While effective in MS for 2 and 3 classes, its performance aligned with that of M-CRO as the number of classes increased, with the effectiveness of M-SCRO waning with more classes. The dynamic approach of M-DSCRO addressed the limitations of M-SCRO without affecting performance on datasets with few classes. An analysis of the influence of database size found no clear link between database dimensions and algorithm performance. Finally, a comparison against four state-of-the-art algorithms shows how M-DSCRO excels, proving its superior effectiveness in terms of both overall performance and minority class performance.
For future lines of research, the authors plan to extend this work by using CCR along with MS in a multiobjective evolutionary algorithm. It has been shown in 45 that both objectives are opposed, especially at certain levels. At the beginning of a learning or evolutionary process, CCR and MS could be cooperative, but after a certain level, the objectives become competitive and an improvement in one objective tends to involve a decrease in the other one. MS can be considered a complementary measure of CCR whose value must be maximised. It complements CCR, which can be seen as a weighted average of the correct classification rates of the Q classes. In this way, the pair (CCR, MS) tries to find a point between the scalar accuracy measure and the multidimensional ones based on misclassification rates.

Figure 2. Mutation node fusion. A and B are the initial neurons involved in the fusion, and C is the neuron resulting from the mutation.

Algorithm 6. Check reset in M-DSCRO.
9: Delete those corals whose fitness is not in the interval $(C_1 - SC_1, 1]$.
10: end if
11: return A new coral reef.

Figure 3. Performance in CCR (a) and MS (b) of the three algorithms on the databases sorted in increasing order of IR: M-CRO (red), M-SCRO (blue) and M-DSCRO (green).

Figure 4. Performance in CCR (a) and MS (b) of the three algorithms on the databases sorted in increasing order of number of classes: M-CRO (red), M-SCRO (blue) and M-DSCRO (green).

$= t_b$: Calculate $C_1$ and $SC_1$.
7: Delete those corals whose fitness is not in the interval $(C_1 - SC_1, 1]$.

Table 3. Holm test results considering M-DSCRO as the control algorithm. Its average CCR and MS are compared to those of M-CRO and M-SCRO: corrected α values, compared methods and p-values, all of them ordered by the number of the comparison (i). Cases in which M-DSCRO is statistically better are marked with (*).

Table 4. Mean values of Correct Classification Rate (CCR) and Minimum Sensitivity (MS) obtained by all the algorithms in each dataset. The mean rankings of all algorithms are also included. The best method for each dataset is in bold, while the second one is in italics.

Table 5. Holm test results considering M-DSCRO as the control algorithm. Its average CCR and MS are compared to those of C4.5, LR, MLP and SVM: corrected α values, compared methods and p-values, all of them ordered by the number of the comparison (i). Cases in which M-DSCRO is statistically better are marked with (*).