Research on improved convolutional wavelet neural network

Artificial neural networks (ANN), which include deep learning neural networks (DNN), suffer from problems such as the local minimum problem of the Back propagation neural network (BPNN), the instability problem of the Radial basis function neural network (RBFNN) and the limited maximum precision of the Convolutional neural network (CNN). The performance (training speed, precision, etc.) of BPNN, RBFNN and CNN is therefore expected to be improved. The main works are as follows: Firstly, based on the existing BPNN and RBFNN, a Wavelet neural network (WNN) is implemented in order to obtain better performance and to prepare for further improving CNN. WNN adopts the network structure of BPNN in order to obtain faster training, and adopts a wavelet function as the activation function, whose form is similar to the radial basis function of RBFNN, in order to solve the local minimum problem. Secondly, a WNN-based Convolutional wavelet neural network (CWNN) is proposed, in which the fully connected layers (FCL) of CNN are replaced by a WNN. Thirdly, comparative simulations of BPNN, RBFNN, CNN and CWNN on the MNIST and CIFAR-10 datasets are implemented and analyzed. Fourthly, a wavelet-based Convolutional Neural Network (WCNN) is proposed, in which the wavelet transformation is adopted as the activation function of the Convolutional Pool Neural Network (CPNN) part of CNN. Fifthly, simulations of WCNN are implemented and analyzed on the MNIST dataset. The effects are as follows: Firstly, WNN solves the problems of BPNN and RBFNN and achieves better performance. Secondly, the proposed CWNN reduces the mean square error and the error rate of CNN, which means CWNN has better maximum precision than CNN. Thirdly, the proposed WCNN reduces the mean square error and the error rate of CWNN, which means WCNN has better maximum precision than CWNN.


Results
Four experiments are implemented on BPNN, WNN, CWNN and WCNN in order to prove the improvement effects. Firstly, the "feasibility experiment" is designed to verify the feasibility (convergence) of WNN and to prove that WNN can solve the problems of BPNN and RBFNN. Secondly, the "hyperparameter optimization experiment" is designed to verify the best performances (such as maximum precision and minimum error) of BPNN, RBFNN and WNN. Thirdly, the "CWNN experiment" is designed to prove that the performance of the proposed CWNN is better than that of CNN. Fourthly, the "WCNN experiment" is designed to prove that the performance of the proposed WCNN is better than that of CWNN and CNN.
Definition 1: One completed simulation process (1 CSP) is a complete training process from the beginning (time 0) to the completion time (the time when the training error becomes less than the target error). Definition 2: One simulation time (1 CT) is a single training calculation; 1 CSP contains many CTs. The parameters Δw^(3)_jk, Δw^(2)_ij, a_j and b_j are each calculated once per CT.
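The two definitions can be sketched as a simple training-loop skeleton (a minimal illustration; the names run_one_csp and train_step are ours, while err_goal and max_epoch follow the parameter names used in the experiment design):

```python
def run_one_csp(train_step, err_goal=0.1, max_epoch=20_000):
    """Run one completed simulation process (1 CSP).

    Each call of train_step() is one training calculation (1 CT) and
    returns the current training error; a CSP ends when the error
    falls below the target error or the CT budget is exhausted.
    """
    err = float("inf")
    for ct in range(1, max_epoch + 1):
        err = train_step()           # 1 CT
        if err < err_goal:           # error below target: CSP complete
            return ct, err
    return max_epoch, err            # CSP not completed within max_epoch CTs

# toy usage: an error that decays geometrically with each CT
errors = iter(0.5 * 0.99 ** t for t in range(100_000))
cts, final_err = run_one_csp(lambda: next(errors))
```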
Result of feasibility experiment. The dataset of the "feasibility experiment" is generated by our own design, as described in the "Data" section. The results of the comparative simulations are discussed in two ways: Firstly, the error descending curves and error surfaces in 1 CSP are plotted in Fig. 1.
All the simulations of BPNN, RBFNN and WNN are repeated for 10 CSPs. The condition to stop a simulation is that the training error becomes less than a fixed target value. Hence, the average CTs per CSP, the maximum error (between the target value and the calculated output) and the mean square error in each CSP can be calculated as follows. For BPNN, the average CTs per CSP is 39,802, the maximum error is 0.050000 and the mean square error is 0.000319; the error descending curve and error surface are drawn in Fig. 1a,d. For RBFNN, the average CTs is 1580, the maximum error is 0.049570 and the mean square error is 0.000314; the error descending curve and error surface are drawn in Fig. 1b,e. For WNN, the average CTs is 1006, the maximum error is 0.049995 and the mean square error is 0.000445; the error descending curve and error surface are drawn in Fig. 1c,f.
Secondly, statistical details of the simulations over 10 CSPs are listed in Table 1 and are discussed as follows: the 10 CSPs of the BPNN, RBFNN and WNN simulations are compared to find the differences in training times, mean square error and maximum error. Columns 2, 5 and 8 (XNN training times) show how many CTs each XNN requires to complete the simulation in each CSP. Columns 3, 6 and 9 (XNN mean square error) show the final mean square error after each CSP. Columns 4, 7 and 10 (XNN maximum error) show the final maximum error after each CSP. The maximum error is expected to be less than the target error; therefore, if the training is completed within 20,000 CTs, the maximum error is less than err_goal = 0.1. The results show that WNN solves the problems of BPNN and RBFNN with better performance, which prepares for the improvement from CNN to CWNN.
According to Fig. 1 and Table 1, we can draw the following conclusions. Firstly, the BPNN, RBFNN and WNN algorithms are all convergent: according to columns 2, 5 and 8, all the values are less than max_epoch = 20,000, which means all the training processes converge (all the training errors fall below the target error). This proves that all the algorithms are feasible. Secondly, the average training times of WNN (CTs = 343) is the smallest, while the average training times of BPNN and RBFNN are 1305 and 2630, respectively; the average training times of RBFNN is the largest. This indicates that WNN is the fastest algorithm and RBFNN is the slowest. Thirdly, the error descending curve of BPNN keeps decreasing slowly, which reflects the local minimum problem; BPNN also suffers from problems such as slow convergence speed 23. However, the local minimum problem is avoided by RBFNN and WNN because of their network structures and activation functions. The error curve of RBFNN in Fig. 1b not only reduces the training time but also avoids the local minimum, while in the error curve of WNN in Fig. 1c, significant changes (breaks in the decreasing process) happen only at the very beginning of the training process.

Result of hyperparameter optimization experiment. The "hyperparameter optimization experiment" is designed and implemented in order to verify the best performances (maximum precision, minimum error) of BPNN, RBFNN and WNN, and the comparative results are analyzed. The simulation results of the above three algorithms are shown in Table 2 and are discussed as follows. Columns 2, 5 and 8 (XNN training times) show the number of CTs in each CSP; a value of 20,000 means that XNN cannot complete training within 20,000 CTs (i.e., the square error does not fall below the target error within 20,000 training CTs). Columns 3, 6 and 9 (XNN mean square error) show the final mean square error after each CSP.
Columns 4, 7 and 10 (XNN maximum error) show the final maximum error after each CSP. The maximum error is expected to be less than the target error; therefore, if the training is completed within 20,000 CTs, the maximum error is less than err_goal = 0.02.
According to Table 2, we can draw the following conclusions. The success rate of WNN training is the highest (60% of the WNN training processes are completed), while only 40% of the RBFNN training processes and 0% of the BPNN training processes are completed. The precision of WNN is the highest because when the target precision … 26,27, which is also a feature of WNN. Thirdly, the wavelet transform performs excellently at tasks such as function approximation 28,29 and pattern classification 30; it has been proved that the wavelet neural network is an excellent approximator for fitting single-variable functions 31.

Problems and disadvantages of WNN are as follows: WNN cannot complete complex learning tasks because of structural limitations, etc.; similar problems also exist in BPNN, RBFNN and the FCL. WNN is designed as follows: Firstly, the structure of BPNN is adopted as the basic structure of WNN; Secondly, the form of the activation function in the hidden layer of RBFNN is adopted; Thirdly, the wavelet transform function is adopted as the activation function. The structure of WNN is shown in Fig. 2. In Eq. (1), ψ(t) is the wavelet function (activation function), a_j(t) and b_j(t) are the scaling and translation parameters of the wavelet function, and t is the number of the training time.
The number of nodes in the hidden layer can be calculated according to linear correlation theory: redundant (repetitive or useless) nodes can be found and deleted by comparing the parameters of ψ_a,b(t) in each hidden-layer node. The wavelet function (activation function) can be selected according to frame theory: the closer the frame is to the boundary, the better the stability of the wavelet function, but the closer the frame is to the boundary, the more the problem of data redundancy occurs.
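The pruning rule described above can be illustrated with a small sketch: two hidden nodes whose (a_j, b_j) pairs nearly coincide produce almost linearly dependent activations ψ_a,b(t), so one of them is redundant. The function name and the tolerance below are illustrative choices, not the paper's procedure:

```python
def redundant_nodes(a, b, tol=1e-3):
    """Flag hidden-layer nodes whose wavelet parameters (a_j, b_j)
    nearly coincide with an earlier node's parameters; such nodes
    produce (almost) linearly dependent activations and can be pruned.
    The tolerance tol is an illustrative choice."""
    keep, drop = [], []
    for j, (aj, bj) in enumerate(zip(a, b)):
        if any(abs(aj - a[k]) < tol and abs(bj - b[k]) < tol for k in keep):
            drop.append(j)
        else:
            keep.append(j)
    return keep, drop

a = [1.0, 2.0, 1.0005, 3.0]
b = [0.0, 0.5, 0.0002, 1.0]
keep, drop = redundant_nodes(a, b)   # node 2 duplicates node 0
```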
The wavelet function that satisfies the frame conditions is selected in Eq. (2). The loss function we select is the mean square error (MSE): E is the MSE over all samples, which is formulated as Eq. (3). The reasons we use the MSE are as follows: Firstly, the outputs of WNN can be negative; Secondly, the cross-entropy loss function includes a logarithm function, which requires non-negative inputs.
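As a concrete illustration, a Morlet-type wavelet is a common choice that satisfies frame conditions; the paper's exact Eq. (2) wavelet is not reproduced here, so the Morlet form below is an assumption, while the MSE follows Eq. (3):

```python
import numpy as np

def morlet(t):
    """Morlet-type mother wavelet, a common WNN activation choice;
    the paper's Eq. (2) wavelet is assumed to be of this form."""
    return np.cos(1.75 * t) * np.exp(-t ** 2 / 2)

def psi_ab(x, a, b):
    """Scaled/translated wavelet psi((x - b) / a) used as the
    hidden-layer activation."""
    return morlet((x - b) / a)

def mse(y_pred, y_true):
    """Mean square error E over all samples (Eq. (3)); chosen because
    WNN outputs can be negative, which rules out cross-entropy."""
    return np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)
```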
In WNN, the back propagation of errors (δ^(3)_k in the output layer, δ^(2)_j in the hidden layer and δ^(1)_i in the input layer) can be calculated as in Eqs. (4) to (6).
The gradient descent method is adopted to adjust the weights and biases of the neural network. Parameters such as Δw^(2)_ij, Δa^(2)_j, Δb^(2)_j, Δw^(3)_kj and Δb^(3)_k are adjusted in each training step, where i, j and k are the indices of the neurons in each layer. Δw^(2)_ij represents the change of the weight between neurons in the input layer and the hidden layer; Δa^(2)_j and Δb^(2)_j represent the changes of the wavelet parameters between the input layer and the hidden layer; Δw^(3)_kj represents the change of the weight between neurons in the hidden layer and the output layer; Δb^(3)_k represents the change of the bias between the hidden layer and the output layer; net^(2)_j represents the input of the hidden layer. At training time t, the above parameters can be expressed as Eqs. (7) to (11). According to the above descriptions, the pseudocode of the WNN method used in the simulations is shown in Algorithm 1.

Data
Datasets generation. The data of the first two experiments (the "feasibility experiment" and the "hyperparameter optimization experiment") are generated in the following steps. The first step is to generate the training set: the training set has two features, x and y, and the label of the training set is z = f(x, y). The relationship between the two-dimensional features and the one-dimensional label is expressed in Eq. (17).
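Algorithm 1 itself is not reproduced here, but the forward pass and the gradient-descent updates of Eqs. (7) to (11) can be sketched as follows. This is a minimal single-sample version under stated assumptions: a Morlet-type wavelet and a linear output layer are our choices, and the inertia term la is omitted for brevity:

```python
import numpy as np

def morlet(t):
    # assumed Morlet-type wavelet activation
    return np.cos(1.75 * t) * np.exp(-t ** 2 / 2)

def dmorlet(t):
    # derivative of the Morlet-type wavelet, used in back propagation
    return (-1.75 * np.sin(1.75 * t) - t * np.cos(1.75 * t)) * np.exp(-t ** 2 / 2)

class WNN:
    def __init__(self, n_in, n_hid, n_out, lr=0.01, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        self.lr = lr
        self.w2 = rng.normal(scale=0.5, size=(n_hid, n_in))   # input -> hidden
        self.w3 = rng.normal(scale=0.5, size=(n_out, n_hid))  # hidden -> output
        self.a = np.ones(n_hid)      # scale parameters a_j
        self.b = np.zeros(n_hid)     # translation parameters b_j
        self.c = np.zeros(n_out)     # output-layer bias

    def forward(self, x):
        self.net = self.w2 @ x                 # net^(2)_j, input of hidden layer
        self.u = (self.net - self.b) / self.a  # wavelet argument (net - b) / a
        self.h = morlet(self.u)                # hidden-layer outputs
        return self.w3 @ self.h + self.c       # linear output layer (assumption)

    def step(self, x, d):
        """One training calculation (1 CT): forward pass plus
        gradient-descent updates of w2, w3, a, b and the output bias."""
        y = self.forward(x)
        d3 = y - d                              # output-layer delta
        d2 = (self.w3.T @ d3) * dmorlet(self.u) # hidden-layer delta
        self.w3 -= self.lr * np.outer(d3, self.h)
        self.c -= self.lr * d3
        self.w2 -= self.lr * np.outer(d2 / self.a, x)
        self.b += self.lr * d2 / self.a         # dE/db = -d2 / a
        self.a += self.lr * d2 * self.u / self.a  # dE/da = -d2 * u / a
        return 0.5 * np.sum((y - d) ** 2)
```

Repeated calls of step() on the same sample should drive the squared error down, mirroring the error descending curves of Fig. 1.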

Design of simulation
Architecture of WNN. "Feasibility experiment" is designed as follows. Common parameters for all the algorithms are set according to the same rules: Firstly, the initial parameters such as w_jk and w_ij are set with the same randomization rules. Secondly, the target precision is set to a very low value (i.e., the target error is set to a very large value), so that all the algorithms can converge easily. Thirdly, the training time limit is set to a large value, so that all the algorithms have enough time to complete training. The initialization of the specific parameters is as follows. Firstly, the maximum training time limit is set to a very large number (max_epoch = 20,000) to ensure that BPNN, RBFNN and WNN have enough time to complete training. Secondly, the target error is set to a very large number (err_goal = 0.1) to make the training easy to complete. Thirdly, the training process of each algorithm is repeated 10 times (10 CSPs, where CSP is defined in Definition 1; case_repeat = 10) to observe the statistical characteristics. Fourthly, the learning efficiency parameter is set to lr = 0.2. Fifthly, the inertia coefficient parameter is set to la = 0.3. Many values of lr and la were tested, and 0.2 and 0.3 were optimal for lr and la, respectively. The termination conditions are: Firstly, when the current training error is less than the target error, the training process is completed; Secondly, when the current training time exceeds the maximum training time limit, the training is stopped.
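The repetition over 10 CSPs and the two termination conditions can be sketched as follows (make_trainer and the statistic collected are illustrative; case_repeat, err_goal and max_epoch follow the parameter names above):

```python
import statistics

def run_experiment(make_trainer, case_repeat=10, err_goal=0.1, max_epoch=20_000):
    """Repeat the training case_repeat times (10 CSPs) and collect the
    CTs needed per CSP, as reported in Table 1. make_trainer() returns
    a fresh train_step() closure for each CSP."""
    cts_per_csp = []
    for _ in range(case_repeat):
        step = make_trainer()
        for ct in range(1, max_epoch + 1):
            # termination 1: training error below the target error
            if step() < err_goal:
                break
        # termination 2: if never below target, the loop ends at max_epoch
        cts_per_csp.append(ct)
    return statistics.mean(cts_per_csp), cts_per_csp

# toy trainer whose error is 1/ct, so each CSP completes at ct = 11
def make_trainer():
    n = [0]
    def step():
        n[0] += 1
        return 1.0 / n[0]
    return step

avg, all_cts = run_experiment(make_trainer)
```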
"Hyperparameter optimization experiment" is designed as follows. Parameters for this simulation are set as follows: Firstly, the initial parameters such as w_jk and w_ij were set with the same randomization rules. Secondly, the target precision was set very high in order to find the highest precision of BPNN, RBFNN and WNN. Thirdly, the training time limit was set to a large value, so that all the algorithms have enough time to complete training. The initialization of the specific parameters is as follows. Firstly, the maximum training time limit was set to a very large number (max_epoch = 20,000). Secondly, the target error was set to a very small number (err_goal = 0.02). Thirdly, the simulations of BPNN, RBFNN and WNN were repeated for 10 CSPs (case_repeat = 10). Fourthly, the learning efficiency parameter was set to lr = 0.2. Fifthly, the inertia coefficient parameter was set to la = 0.3. More values of lr and la were tested in different simulations, and the values lr = 0.2 and la = 0.3 listed above are the best. The termination conditions are: Firstly, when the current training error is less than the target error, the training process is completed; Secondly, when the current training time exceeds the maximum training time limit, the training is stopped.

Architecture of CWNN.
Simulation of CWNN is designed as follows: the network structures and parameters of the CNN and CWNN simulations are listed in Table 6. Different values of the learning rate η and the coefficient of inertia α were tested in repeated simulations, and the values of η and α listed in Table 6 are the best ones.
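A minimal PyTorch sketch of the CWNN idea, in which the fully connected classifier of a CNN is replaced by a WNN: the Morlet-type wavelet with learnable per-unit scale a and translation b is an assumption, and the layer sizes are illustrative rather than the Table 6 configuration:

```python
import torch
from torch import nn

class MorletActivation(nn.Module):
    """Wavelet activation psi((x - b) / a) with a learnable scale a and
    translation b per hidden unit (the Morlet form is an assumption)."""
    def __init__(self, n):
        super().__init__()
        self.a = nn.Parameter(torch.ones(n))
        self.b = nn.Parameter(torch.zeros(n))

    def forward(self, x):
        u = (x - self.b) / self.a
        return torch.cos(1.75 * u) * torch.exp(-u ** 2 / 2)

class CWNN(nn.Module):
    """CNN whose fully connected classifier is replaced by a WNN,
    following the CWNN idea; sizes are illustrative (MNIST 28x28)."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.wnn = nn.Sequential(
            nn.Flatten(),
            nn.Linear(8 * 12 * 12, 32),
            MorletActivation(32),      # wavelet hidden layer of the WNN
            nn.Linear(32, n_classes),
        )

    def forward(self, x):
        return self.wnn(self.features(x))
```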

Architecture of WCNN.
Simulation of WCNN is designed as follows: the network structures and parameters of the WCNN simulation are listed in Table 7. Different values of the learning rate η and the other parameters were tested in repeated simulations, and the values of η listed in Table 7 are the best ones.
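A minimal PyTorch sketch of the WCNN idea, in which the activation function of the convolutional layer is replaced by a wavelet transformation: the Morlet form and the per-channel parameterization are assumptions, and the sizes are illustrative rather than the Table 7 configuration:

```python
import torch
from torch import nn

class ConvMorlet(nn.Module):
    """Elementwise wavelet activation for feature maps, with one
    learnable scale/translation pair per channel (an assumption
    for illustration)."""
    def __init__(self, channels):
        super().__init__()
        self.a = nn.Parameter(torch.ones(channels, 1, 1))
        self.b = nn.Parameter(torch.zeros(channels, 1, 1))

    def forward(self, x):
        u = (x - self.b) / self.a
        return torch.cos(1.75 * u) * torch.exp(-u ** 2 / 2)

class WCNN(nn.Module):
    """CNN whose convolutional-layer activation is replaced by a
    wavelet transformation (the WCNN idea); sizes are illustrative."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=5), ConvMorlet(8), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(8 * 12 * 12, n_classes),
        )

    def forward(self, x):
        return self.net(x)
```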

Conclusions
In this paper, WNN and CNN are implemented, CWNN and WCNN are proposed, and all of them are simulated and compared. The conclusions are as follows. Firstly, the structure of BPNN, the form of the activation function in the hidden layer of RBFNN and the wavelet transform function are all adopted in the design of WNN. The comparative results of BPNN, RBFNN and WNN are shown in Table 8.
According to Table 8, the following conclusions can be drawn: the mean MSE and mean error rate of WNN are the lowest, the training speed of WNN is the fastest, and WNN has no local minimum issue.
Secondly, CWNN is proposed, in which the fully connected neural network (FCNN) of CNN is replaced by a WNN. Thirdly, WCNN is proposed, in which the activation function of the convolutional layer in CNN is replaced by the wavelet scale transformation function. The comparative simulations of CNN, CWNN and WCNN are shown in Table 9.
According to Table 9, the following conclusions can be drawn: all of CNN, CWNN and WCNN can complete the classification task on MNIST; the training accuracy of CWNN is higher than that of CNN and the classification ability of CWNN is better than that of CNN; and the training accuracy of WCNN is higher than that of CWNN and the classification ability of WCNN is better than that of CWNN.
There are still some limitations of our methods, although we have made some improvements based on CNN. Firstly, limitations of the convolution layer: the back propagation algorithm is not a very efficient learning method, because it needs the support of large-scale datasets, and the parameters near the input layer are adjusted very slowly when the network has too many layers. Secondly, limitations of the pooling layer: a lot of valuable information, such as the relationship between local features and the whole, is lost in the pooling layer. Finally, the features extracted by each convolution layer cannot be explained, because a neural network is a black-box model that is difficult to interpret.
For further research: Firstly, try to improve the learning ability and learning speed of CNN, WCNN and CWNN by changing the network structure. Secondly, use WCNN and CWNN as neurons to build a larger and more powerful neural network. Thirdly, design more experiments to prove the feasibility and verify the performance of the improved methods described above.