Universal activation function for machine learning

This article proposes a universal activation function (UAF) that achieves near-optimal performance in quantification, classification, and reinforcement learning (RL) problems. For any given problem, gradient descent algorithms are able to evolve the UAF into a suitable activation function by tuning the UAF's parameters. For CIFAR-10 classification using the VGG-8 neural network, the UAF converges to a Mish-like activation function, which has near-optimal performance $F_{1}=0.902\pm 0.004$ when compared to other activation functions. In the graph convolutional neural network on the CORA dataset, the UAF evolves to the identity function and obtains $F_1=0.835\pm 0.008$. For the quantification of simulated 9-gas mixtures in 30 dB signal-to-noise ratio (SNR) environments, the UAF converges to the identity function, which has a near-optimal root mean square error of $0.489\pm 0.003~\mu\mathrm{M}$. In the ZINC molecular solubility quantification using graph neural networks, the UAF morphs into a LeakyReLU/Sigmoid hybrid and achieves $\mathrm{RMSE}=0.47\pm 0.04$. For the BipedalWalker-v2 RL dataset, the UAF achieves the 250 reward in $961\pm 193$ epochs with a brand-new activation function, which gives the fastest convergence rate among the activation functions.

Automated machine learning first identifies whether a given task is a quantification, classification, or reinforcement learning problem. Subsequently, the best neural network is selected from a pre-built model zoo and retrained to get the best results. One of the core tasks for automated machine learning is to find an optimal activation function for a specific model. However, many activation functions have been proposed over the history of machine learning, and this makes the selection difficult. Richards 11 developed the sigmoid activation function family, which spans S-shaped curves like the tanh 12 function and the sigmoid function. Other activation functions in the family include the step function, the clipped tanh function, and the clipped sigmoid function. Subsequently, the first neural networks 13,14 used the sigmoid activation function for modeling biological neuron firing. For the most part, activation functions from the sigmoid activation function family are used for classifying objects, where the output is constrained to the range [0, 1].
The ReLU activation function 15 is another popular activation function that is used for quantification, classification, and reinforcement learning problems. The ReLU activation function is part of the ReLU activation function family, in which every function behaves linearly, y = x, when x > 0. The identity, LeakyReLU 16, Elu 17, and softplus 18 activation functions are also included in this family. The LeakyReLU activation function is a version of the ReLU activation function with a non-zero slope y = αx when x < 0, where the non-zero slope prevents the gradient from reaching zero. One of the major problems of the ReLU and LeakyReLU activation functions is the gradient discontinuity at x = 0, which produces undefined gradients 19 and causes the gradient descent optimizer to fail. The Elu and softplus activation functions solve this problem by being smooth and continuous around x = 0 18. Newer activation functions such as Mish 20 and Swish 21 have built-in regularization to prevent over-fitting of models.
The Gaussian activation function 22 has a bell-shaped curve and is useful for modeling Gaussian-distributed random variables. For example, a neural network predicting the speed of a car might use the Gaussian function for regression because the speed of a car is Gaussian distributed 23. Moreover, the Gaussian function is also used for classifying the existence of objects 24. The Gaussian function is a special case of the radial basis function (RBF) activation function family 24, whose members have bell-shaped curves. Other members of the RBF family include the polyharmonic spline and the bump function.
Among the many basic activation functions, selecting the one that best suits a specific task is hard. Researchers have addressed this problem by creating adaptable activation functions that can evolve to suit a specific task. The adaptable activation functions are controlled by trainable parameters, which are then optimized using gradient descent algorithms. PReLU 25 is an example of an adaptive activation function, where the slope α of a LeakyReLU function is a trainable parameter. Bodyanskiy et al. 26 developed an adaptable RBF that can be trained in real time. Qian et al. 27 proposed adaptive ReLU functions for CNNs. Campolucci et al. 28 used an adaptive spline activation function that approximates the curves of a sigmoid activation function. However, adaptive spline activation functions suffer from over-fitting and discontinuities. Each individual spline is constructed independently of the other segments, and the segments are then joined together to form a complete activation function. As a result, continuity is not guaranteed at the segment joints because the derivatives of two adjacent segments might not agree. Furthermore, too many segments introduce a large number of trainable parameters, and this might cause over-fitting 29.
We propose a simple universal activation function (UAF) to solve the problem of finding the optimal activation function for a specific task. The 5 trainable parameters of the UAF allow it to approximate any of the activation functions listed above. Without any additional constraints, the UAF is continuous and differentiable for all parameter values. Due to these properties, gradient descent algorithms are able to smoothly evolve the UAF into a near-optimal activation function, which may be an existing activation function in the literature or a new activation function. Adopting the UAF in neural networks automates the search for a good activation function and reduces the total training time. For example, NEAT 4 and ENAS 8 discretely search through the activation functions one by one; every time the activation function changes, these neural networks need to be retrained from scratch. Instead of retraining the neural networks, the activation functions and the weights can be continuously updated together to reduce training time. These papers 30,31 prove that adaptive activation functions converge faster for certain problems such as stiff ordinary differential equations and partial differential equations.
The paper is organized as follows. Section "Construction of UAF" describes the properties of the UAF and its training procedure. Section "Experiments" shows the UAF's performance on CIFAR-10 32 classification, the infrared spectra database for 9 gas quantification 33, BipedalWalker-v2 34 RL, the Planetoid/CORA publication classification dataset 35, and the ZINC molecular solubility quantification dataset 36. A conclusion is presented in Section "Conclusion and future work". Finally, Supplementary Information S.2 gives implementation details about the UAF.

Construction of UAF
In this section, the UAF is derived from the softplus activation function. For x ≫ 0, the ReLU activation function can be approximated by the softplus activation function, as sketched below.
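The equations themselves are not reproduced in this extraction, so the following LaTeX block is a hedged reconstruction of the construction described in this section and the next paragraph, consistent with the five parameters A, B, C, D, E used later in the gradient derivation; the published equations should be treated as authoritative.

```latex
% Hedged reconstruction; the published equations are authoritative.
\begin{align}
  \operatorname{softplus}(x) &= \ln\left(1 + e^{x}\right)
      \approx \operatorname{ReLU}(x) \quad \text{for } x \gg 0,\\
  % Generalize with a slope A and a horizontal shift B:
  f(x) &= \ln\left(1 + e^{A(x + B)}\right),\\
  % Subtracting a second softplus term with parameter D contributes a
  % monotonically decreasing component (a LeakyReLU-like negative branch);
  % C adds curvature and E a vertical offset:
  f_{\mathrm{UAF}}(x) &= \ln\left(1 + e^{A(x + B) + C x^{2}}\right)
      - \ln\left(1 + e^{D(x - B)}\right) + E.
\end{align}
```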
Furthermore, the softplus function can be generalized by adding two new trainable parameters A and B, where A controls the slope and B controls the horizontal shift. The LeakyReLU activation function can then be approximated by adding another, monotonically decreasing function controlled by a new parameter D.

UAF error analysis using RMSE table.
In this subsection, we examine the errors of the UAF in the range x ∈ [−5, 5], because every maximum absolute error occurs within this range. Table 1 shows the root mean square error (RMSE), mean absolute error (MAE), maximum absolute error, and location of the maximum absolute error for each activation function. The UAF models the identity function and the softplus function without any error because the UAF is built from those functions. For continuous activation functions such as the sigmoid, tanh, and Gaussian, the UAF models them well with a small RMSE. For activation functions with gradient discontinuities, like the ReLU and LeakyReLU, the RMSE is slightly higher because the continuous UAF cannot reproduce the discontinuities exactly. A more thorough evaluation of the UAF's error analysis is available in the Supplementary Information S.1.

UAF error analysis using error plots.
To further illustrate the errors between the UAF and the targeted activation functions, we have made error plots of the UAF, as shown in Fig.
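As an illustration of how such error numbers can be reproduced, the sketch below fits the reconstructed UAF form (an assumption; see "Construction of UAF") to tanh over x ∈ [−5, 5] and reports the RMSE, using NumPy and SciPy.

```python
import numpy as np
from scipy.optimize import least_squares

def uaf(x, A, B, C, D, E):
    # Assumed five-parameter form from the reconstruction above;
    # np.logaddexp(0, z) computes ln(1 + e^z) in a numerically stable way.
    return (np.logaddexp(0.0, A * (x + B) + C * x**2)
            - np.logaddexp(0.0, D * (x - B)) + E)

x = np.linspace(-5.0, 5.0, 1001)
target = np.tanh(x)

# Least-squares fit of the UAF's five parameters to the target activation.
fit = least_squares(lambda p: uaf(x, *p) - target,
                    x0=[1.0, 0.0, 0.0, 1.0, 0.0])
residual = uaf(x, *fit.x) - target
print("RMSE vs tanh on [-5, 5]:", np.sqrt(np.mean(residual**2)))
```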

Training the UAF's parameters
Unlike regular activation functions, the UAF has trainable parameters and it requires a unique training procedure to achieve the best performance. The exact same training procedure is followed for each dataset in "Experiments". The UAF's training procedure is divided into phase 1 and phase 2. Starting with training phase 1, gradients of the weights, biases, and UAF's parameters are computed. Afterwards, the ADAM optimizer 37 updates the weights, biases, and UAF's parameters concurrently using the computed gradients. When the loss function hits a plateau, training phase 1 ends and training phase 2 begins. In training phase 2, the ADAM optimizer only updates the weights and biases of the neural network, while the UAF's parameters are not updated. This is done to reduce the over-fitting of the model and to prevent training instability. In order to update the UAF's parameters, the ADAM optimizer requires the UAF's gradients. Derivation of the UAF's gradients is presented below.
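The following is a minimal PyTorch sketch of this two-phase procedure; it assumes the reconstructed UAF form from "Construction of UAF", and the layer sizes and learning rates are illustrative rather than the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UAF(nn.Module):
    # Assumed five-parameter form (see "Construction of UAF").
    def __init__(self):
        super().__init__()
        self.A = nn.Parameter(torch.tensor(1.0))
        self.B = nn.Parameter(torch.tensor(0.0))
        self.C = nn.Parameter(torch.tensor(0.0))
        self.D = nn.Parameter(torch.tensor(0.0))
        self.E = nn.Parameter(torch.tensor(0.0))

    def forward(self, x):
        # softplus(z) = ln(1 + e^z); with this initialization the UAF
        # starts out as a vertically shifted softplus.
        return (F.softplus(self.A * (x + self.B) + self.C * x**2)
                - F.softplus(self.D * (x - self.B)) + self.E)

act = UAF()  # one UAF shared by all layers, as in the experiments
model = nn.Sequential(nn.Linear(10, 64), act, nn.Linear(64, 64), act,
                      nn.Linear(64, 1))

uaf_params = list(act.parameters())
net_params = [p for m in model if not isinstance(m, UAF)
              for p in m.parameters()]

# Phase 1: ADAM updates weights, biases, and UAF parameters concurrently.
phase1 = torch.optim.Adam(net_params + uaf_params, lr=1e-3)

# Phase 2 (after the loss plateaus): freeze the UAF's parameters and keep
# updating only the weights and biases, at a smaller learning rate.
for p in uaf_params:
    p.requires_grad_(False)
phase2 = torch.optim.Adam(net_params, lr=1e-4)
```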
Derivation of the UAF's gradients.
Suppose an MSE loss function J needs to be minimized by tuning the predicted output ŷ to match the actual output y, and that ŷ is modeled by a single-layer MLP with the UAF, where xᵢ are the inputs, wᵢ are the weights, and v is the bias. Firstly, the UAF's gradients ∇f_UAF(x, A, B, C, D, E) are computed; the ADAM optimizer then uses them to update the UAF's parameters. The ADAM optimizer also requires the learning rates, which are described in the next subsection.
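Because the derivation itself is not reproduced here, the following LaTeX sketch works out one representative gradient under the assumptions above (MSE loss, single-layer MLP, and the reconstructed UAF form):

```latex
% Hedged sketch: J = (y - \hat{y})^2, \hat{y} = f_{\mathrm{UAF}}(s),
% with pre-activation s = \sum_i w_i x_i + v.
\begin{align}
  \frac{\partial J}{\partial A}
    &= \frac{\partial J}{\partial \hat{y}}\,
       \frac{\partial \hat{y}}{\partial A}
     = -2\,(y - \hat{y})\,
       \frac{\partial f_{\mathrm{UAF}}}{\partial A},\\
  \frac{\partial f_{\mathrm{UAF}}}{\partial A}
    &= (s + B)\,\sigma\!\bigl(A(s + B) + C s^{2}\bigr),
\end{align}
% where \sigma is the logistic sigmoid; the gradients with respect to
% B, C, D, and E follow from the same chain rule.
```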
Learning rates for phase 1 and phase 2.
In training phase 1, the learning rate is held constant, α(t) = α₀, for epochs 0 < t < t₀. When the loss does not decrease for Z epochs, the loss is considered to have plateaued at epoch t₀, and this starts training phase 2. In training phase 2, the new learning rate α(t) = α₁ is significantly smaller than the previous learning rate, α₁ < α₀. Moreover, the learning rate decreases further whenever the loss has plateaued for Z epochs.
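A plateau-triggered decay of this kind can be expressed with PyTorch's built-in ReduceLROnPlateau scheduler; the sketch below is illustrative, and the values of Z, α₁, and the decay factor are assumptions rather than the paper's settings.

```python
import torch

# Minimal sketch of the phase-2 schedule: a small constant rate alpha_1
# that decays whenever the loss has plateaued for Z epochs.
Z = 10
params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.Adam(params, lr=1e-4)  # alpha_1 < alpha_0
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    opt, factor=0.5, patience=Z)

for epoch in range(100):
    loss = 1.0               # stand-in for the epoch's validation loss
    scheduler.step(loss)     # halves the rate after Z non-improving epochs
print(opt.param_groups[0]["lr"])
```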

Experiments
In this article, five experiments are used to benchmark the UAF against other activation functions. To show the effectiveness of the UAF, an animation depicting the evolution of the UAF on these datasets is available in the Supplementary Materials (V.2).

CIFAR-10 image classification.
The goal of the CIFAR-10 dataset 32 is to take 32 × 32 pixel RGB images and classify them into 10 different categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The VGG 8 layer CNN 38, which contains 6 CNN layers, 6 max pooling layers, and 2 dense layers, is applied to the CIFAR-10 dataset. Each CNN layer uses 3 × 3 pixel kernels, and the CNN layers are interspersed with the max pooling layers. Each dense layer has 1,024 neurons, and the dense layers produce the output classification result.
To ensure fairness in the tests, all neurons and all layers are set to the same type of activation function. In the case of the UAF, a single UAF is applied to all neurons and to all layers. The CIFAR-10 dataset contains 60,000 images in total, where 50,000 images are used for training and 10,000 images are used for testing. The training and testing splits for CIFAR-10 are not randomized, to allow comparisons between papers. After executing the 1 × 10 fold training and testing, the precision, recall, and F1 scores of the various activation functions are recorded in Table 2. The ReLU activation function has the worst score F1 = 0.018 ± 0.001 because the ReLU's gradient sometimes gets stuck and stops the weights from updating 19. The identity, sigmoid, tanh, and ELU activation functions have poor scores F1 = 0.795 ± 0.02, 0.881 ± 0.006, 0.835 ± 0.010, and 0.886 ± 0.004 because their gradients do not back-propagate well across many different CNN layers. On the other hand, the Mish and LeakyReLU functions are designed to stop the gradient from reaching zero; as a result, they perform better and have higher scores F1 = 0.891 ± 0.008 and 0.893 ± 0.003. Softplus and the UAF have the highest scores F1 = 0.902 due to the smoothness of the functions, which allows the optimizer to approach the global minimum. This means softplus and the UAF are superior at classifying objects when compared to the other activation functions, although the UAF requires more training time for its parameters to converge. Figure 2a shows the evolution of the UAF on the CIFAR-10 dataset. Upon being initialized as the identity activation function, the UAF converges to a Mish activation function that is shifted to the right and has a different slope.
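The sketch below is one hedged reading of this protocol in PyTorch: the channel widths and the 10-way output layer are assumptions, and the point is that a single activation module instance is shared by every layer, so one swap changes the activation everywhere.

```python
import torch
import torch.nn as nn

def vgg8(act: nn.Module) -> nn.Sequential:
    # Hypothetical reading of the described stack: 6 conv layers with
    # 3x3 kernels interspersed with 6 max pooling layers, then dense
    # layers. Channel widths and the 10-way output are assumptions.
    layers, in_ch = [], 3
    for out_ch in (64, 64, 128, 128, 256, 256):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                   act,
                   nn.MaxPool2d(2, ceil_mode=True)]
        in_ch = out_ch
    layers += [nn.Flatten(), nn.Linear(256, 1024), act, nn.Linear(1024, 10)]
    return nn.Sequential(*layers)

# One module instance is reused everywhere, so all neurons and all layers
# get the same activation function, matching the fairness protocol above.
model = vgg8(nn.Mish())  # or nn.ReLU(), nn.Softplus(), a shared UAF, ...
print(model(torch.zeros(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```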
Planetoid/CORA publication classification.
In the Planetoid/CORA publication classification dataset 35, uncategorized published papers and their publication metadata are given, and the papers must be classified into one of seven academic fields. The input to the network is a graph of published papers, where each node contains the extracted keywords of a paper and each edge represents a citation between two papers. If a keyword exists within a paper, it is labeled as 1; otherwise it is labeled as 0. The prediction uses a 64 layer graph convolutional network (GCN) 39 that has 64 hidden channels and 1 dense layer. Bias weights are not used because they cause overfitting and performance degradation. To ensure fairness in the tests, all neurons and all layers are set to the same type of activation function. In the case of the UAF, a single UAF is applied to all neurons and to all layers. The Planetoid/CORA dataset contains 2708 publications in total. After randomly shuffling the dataset, 140 publications are randomly selected for training and 1000 publications are randomly selected for testing. Table 3 shows the 1 × 10 fold training and testing of the various activation functions. Sigmoid performs poorly ( F1 = 0.129 ± 0.01 ) due to label prediction imbalance: in the absence of bias weights, the sigmoid squashes the input domain of [0, 1] into the output range of [0.5, 0.731], and this leads to the overprediction of label 1 compared to label 0. The same label prediction imbalance causes LogSigmoid, Hardswish, softplus, and SiLU to perform poorly. The ELU, identity, LeakyReLU, Mish, PReLU, ReLU, tanh, and UAF perform significantly better because they do not require bias weights. In Fig. 2b, the UAF converged to the identity function and failed to obtain the best result because the ADAM optimizer stopped at a local minimum. Nevertheless, its F1 score of 0.835 ± 0.008 is close to that of the best performing ReLU ( F1 = 0.845 ± 0.011 ), which is able to preserve the information from the keywords.

Table 2. CIFAR-10 image classification using VGG 8 layers. 1 × 10 fold macro averaged results. Confidence interval of 2σ. The UAF is the activation function described in this paper; the non-bold items are the other activation functions used for comparison.

Table 3. Planetoid/CORA classification using graph convolutional neural networks. 1 × 10 fold macro averaged results. Confidence interval of 2σ. The UAF is the activation function described in this paper; the non-bold items are the other activation functions used for comparison.

9 gas quantification.
The objective of the infrared spectra database is to predict the concentrations of 9 gasses using 1 × 1000 images of the gasses' IR spectra. We generated the dataset using a procedure similar to 33 and made the gas concentrations uniformly distributed between 0 and 10 µM. The total dataset contains 100,000 images. After shuffling, 80,000 images are randomly sampled for training and 20,000 images are randomly sampled for testing.
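As a concrete reading of this setup, the hedged sketch below maps a 1 × 1000 spectrum to 9 concentrations in PyTorch; the hidden sizes and depth are illustrative assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn as nn

# Hedged sketch: an MLP mapping a 1x1000 IR spectrum to 9 gas
# concentrations. Hidden sizes and depth are illustrative assumptions.
act = nn.Identity()          # swap in nn.LeakyReLU(), nn.Tanh(), or a UAF
model = nn.Sequential(
    nn.Linear(1000, 256), act,
    nn.Linear(256, 256), act,
    nn.Linear(256, 9),       # one predicted concentration per gas
)
spectrum = torch.rand(1, 1000)   # stand-in for a 30 dB SNR spectrum
print(model(spectrum).shape)     # torch.Size([1, 9])
```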

Table 4 shows the 1 × 10 fold testing on the 30 dB SNR IR spectra database 33. The ReLU activation function again gets stuck and produces a high RMSE = 1.2 ± 1.7. Moreover, the softplus, sigmoid, and tanh activation functions have high RMSE = 0.90 ± 0.03, 0.95 ± 0.01, and 0.694 ± 0.002 because they are not suited for quantification. On the other hand, MLPs using the identity, LeakyReLU, and UAF activation functions obtain the lowest RMSE = 0.489 ± 0.004, 0.488 ± 0.004, and 0.489 ± 0.003 because these functions are suitable for quantification. As a result, MLPs with the identity, LeakyReLU, and UAF are able to predict the concentrations of the gasses more accurately than MLPs with the other activation functions. Figure 2c shows the evolution of the UAF during the training procedure. The UAF begins as the identity function, changes to a parabolic function, and finally converges back to the identity function, which is close to the optimal activation function.
ZINC molecular solubility quantification.
The objective of the ZINC molecular solubility quantification dataset 36 is to predict an unknown chemical's solubility given its molecular structure. A graph neural network with principal neighbourhood aggregation 40 is used to predict the solubility values. For testing, a single type of activation function is applied to all layers and all neurons. The input to the neural network is the molecular structure in the form of a graph, where each node represents an atom and each edge represents a bond between two atoms. The entire ZINC dataset contains 250,000 different molecular graphs; 220,011 molecular graphs are randomly sampled for training and 5,000 molecular graphs are randomly sampled for testing. Table 5 shows the results of the various activation functions on the ZINC dataset after executing the 1 × 10 fold testing. Sigmoid and LogSigmoid perform poorly ( RMSE = 0.6 ± 0.1 and 0.51 ± 0.05 ) because they are not designed for quantification. Identity performs poorly ( RMSE = 0.56 ± 0.05 ) as it does not filter out invalid values such as negative solubilities. The UAF morphs into a LeakyReLU/Sigmoid hybrid and achieves RMSE = 0.47 ± 0.04.

Table 5. ZINC molecular solubility quantification using graph neural networks with principal neighbourhood aggregation. 1 × 10 fold macro averaged results. Confidence interval of 2σ. The UAF is the activation function described in this paper; the non-bold items are the other activation functions used for comparison.

BipedalWalker-v2 reinforcement learning.
The goal of the BipedalWalker-v2 34 RL benchmark is to move the robot past the finish line while adapting to large changes in the simulated terrain. The neural networks control the torques of the robot's legs in order to move the robot forward and to prevent it from falling over. The reward function depends on the furthest distance traveled and the total amount of energy used to move the robot: maximizing the distance traveled and minimizing the energy used increases the reward. Moreover, the neural networks should converge in the least number of epochs; high rewards and a low number of epochs imply that the models run efficiently. Table 6 shows the results of the Deep Deterministic Policy Gradient 41 on BipedalWalker-v2. 1 × 10 fold testing is used on the dataset, and each fold has randomly generated terrain. The sigmoid activation function achieves the 100 reward in 818 ± 213 epochs, which is the least number of epochs. The UAF is slightly slower, achieving the 100 reward in 859 ± 209 epochs. However, the UAF is the fastest at achieving the 250 reward, with 961 ± 193 epochs. In the long run, the UAF achieves the best performance in terms of the highest rewards in the least number of epochs. Figure 2e shows the evolution of the UAF in BipedalWalker-v2. The UAF is initialized as the identity function and subsequently evolves into an unusual parabolic activation function. The parabolic function is a new activation function that performs well for this specific problem: it limits the torque of the bipedal robot to y ∈ [−1, ∞), which decreases the energy needed to move the robot. As the energy needed decreases, the reward increases.

Conclusion and future work
The UAF was developed as a generic activation function that can approximate many others, such as the identity, ReLU, LeakyReLU, sigmoid, tanh, softplus, and Gaussian, as well as evolve into entirely new shapes. This versatility allows the UAF to achieve near-optimal performance in classification, quantification, and reinforcement learning. As demonstrated, incorporating the UAF in a neural network leads to the best or close-to-best performance without the need to try many different activation functions during the design.
In the current setup, a single UAF is applied to the entire neural network. As for future work, each layer or each neuron could have its own UAF. Each UAF would then specialize to a specific task. This would enable the neural networks to model more non-linear processes and to solve more difficult problems. Moreover, the UAF could be used for transfer learning. The activation functions from one neural network could be transferred to another neural network. This would enable multiple neural networks to learn from each other and to converge faster.

Data availability
The majority of the datasets used in this paper are publicly available. Private datasets can be given upon request.

Code availability
The UAF's code is available for Tensorflow and Pytorch upon request.