An innovative network based on double receptive field and Recursive Bi-directional Long Short-Term Memory

Sequence recognition of natural scene images has long been an important research topic in the field of computer vision. CRNN has proven to be a popular end-to-end character sequence recognition network. However, the problem of wide characters is not considered in the design of CRNN, and CRNN is less effective at recognizing long sequences of dense, small characters. To address these shortcomings, we propose an improved CRNN network, named CRNN-RES, based on BiLSTM and multiple receptive fields. Specifically, on the one hand, CRNN-RES uses a dual pooling kernel to enhance the feature extraction ability of the CNN. On the other hand, by improving the last RNN layer, the BiLSTM is changed to a shared-parameter BiLSTM network with recursive residuals, which reduces the number of network parameters and improves the accuracy. In addition, we designed a structure, called the CRFC layer, that can flexibly configure the length of the data sequence input to the RNN layer. Comparing the proposed CRNN-RES network with the original CRNN network in extensive experiments, we find that when recognizing English characters and digits, CRNN-RES has 8,197,549 parameters, 133,752 fewer than CRNN. On the public datasets ICDAR 2003 (IC03), ICDAR 2013 (IC13), IIIT 5k-word (IIIT5k), and Street View Text (SVT), CRNN-RES obtains accuracies of 96.90%, 89.85%, 83.63%, and 82.96%, which are higher than CRNN by 1.40%, 3.15%, 5.43%, and 2.16%, respectively.


We use BiLSTM 2,3 as the basic network of the RNN layer. BiLSTM is a bidirectional LSTM network. A traditional LSTM can only learn the one-way feature dependency of an image sequence, but the sequences we recognize may also have reverse dependencies: for the words "google" and "brother", we can predict them from "googl" and "brothe", or equally from "oogle" and "rother". We therefore choose the bidirectional LSTM as the basic network in the RNN module of CRNN-RES.
We use a recursive training strategy: CRNN-RES adds a short-circuit connection between the output of the convolution layer and the output of the BiLSTM layer, and the first output is taken as part of the input to the BiLSTM. Denoting the output of the convolution layer by CONO and the operation of the BiLSTM by $F$, the specific process is as follows. First, CONO is fed into the BiLSTM layer, and $L_1$ (the output of the BiLSTM layer) is added to CONO to obtain $O_1$:

$$O_1 = F(CONO) + CONO$$

Then $O_1$ is fed into the BiLSTM again, and $L_2$ (the result of the second BiLSTM pass) is added to $O_1$ and CONO, giving $O_2$, the final output of the RNN layer:

$$O_2 = F(O_1) + O_1 + CONO$$

The process is shown in Fig. 1. In terms of tensor shapes, the RNN layer works as follows: we first input data of shape [timestep, batchSize, 512] into the BiLSTM and obtain an output of shape [timestep, batchSize, 512]. This output is then added to the output of the convolution layer, again giving a tensor of shape [timestep, batchSize, 512]. That result is fed into the BiLSTM once more, with both the input and output of this second pass having shape [timestep, batchSize, 512].
The result of the second BiLSTM pass is then added to the two previous outputs, and the output dimension of the data is changed by the fully connected layer. The shape of its output is [timestep * batchSize, nclass]; finally, we reshape this output to [timestep, batchSize, nclass]. Here, timestep is the length of the time series, batchSize is the number of pictures in each batch input to the network during training, and nclass is the number of classification categories.
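To make the data flow above concrete, the following is a minimal PyTorch sketch of the recursive residual RNN layer. The hidden size of 256 per direction and the class count of 37 are illustrative assumptions, not values taken from the paper's released code.

```python
# A minimal sketch of the recursive residual RNN layer described above.
import torch
import torch.nn as nn


class RecursiveResidualBiLSTM(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, nclass=37):
        super().__init__()
        # One shared BiLSTM (256 hidden units per direction -> 512-d output),
        # applied twice instead of stacking two separate BiLSTM layers.
        self.bilstm = nn.LSTM(feat_dim, hidden, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, nclass)

    def forward(self, cono):                 # cono: [T, B, 512] from the CNN
        l1, _ = self.bilstm(cono)            # first pass through the shared BiLSTM
        o1 = l1 + cono                       # O1 = F(CONO) + CONO
        l2, _ = self.bilstm(o1)              # second pass, same parameters
        o2 = l2 + o1 + cono                  # O2 = F(O1) + O1 + CONO
        t, b, c = o2.shape
        logits = self.fc(o2.view(t * b, c))  # [T*B, nclass]
        return logits.view(t, b, -1)         # [T, B, nclass]


# Example: a sequence of 16 time steps, batch of 8, 512-d CNN features.
rnn = RecursiveResidualBiLSTM()
out = rnn(torch.randn(16, 8, 512))
print(out.shape)  # torch.Size([16, 8, 37])
```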
Mathematical theory of RNN layer.

For CRNN-RES networks, the calculation principle is the same as that of LSTM 3 or BiLSTM. To keep the exposition concise and easy to follow, we only present the calculations for a one-way LSTM. All mathematical calculations are based on the network architecture shown in Fig. 2.
1) Calculate the information of the forget gate:

$$f_t = \sigma(w_f \cdot [h_{t-1}, x_t] + b_f)$$

where $w_f$ is the weight, $b_f$ is the bias term, $h_{t-1}$ is the output of the previous hidden unit of the LSTM, $x_t$ is the input of the current hidden unit, and $\sigma$ is the sigmoid activation function.
2) Calculate the information of the memory gate and the candidate cell state:

$$i_t = \sigma(w_i \cdot [h_{t-1}, x_t] + b_i)$$

$$\tilde{C}_t = \tanh(w_c \cdot [h_{t-1}, x_t] + b_c)$$

where $w_i$ and $w_c$ are weights, $b_i$ and $b_c$ are bias terms, and $h_{t-1}$ and $x_t$ are as above; sigmoid is again used as the gate activation function.
3) Calculate the cell state at the current moment:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

where $C_{t-1}$ is the state of the cell at the previous moment.

4) Calculate the information of the output gate:

$$o_t = \sigma(w_o \cdot [h_{t-1}, x_t] + b_o)$$

where $w_o$ is the weight and $b_o$ is the bias term.

5) Calculate the output of the current hidden layer:

$$h_t = o_t \odot \tanh(C_t)$$

where $C_t$ is the state of the cell at the current moment. At this point, the calculation of the LSTM unit is complete.

6) Next, the residual connections are applied. With $x_t$ the input of the current hidden-layer neural unit and $h_t$ the output of the current LSTM unit, we have

$$output_1 = h_t + x_t$$

$output_1$ is fed into the LSTM unit again, producing $h'_t$, and the final output is

$$output_2 = h'_t + output_1 + x_t$$

CNN and RNN flexible convergence layer.

In CRNN, the general approach is to input the features output by the CNN layer directly into the RNN layer after a dimensional transformation. Assume that the shape of the network's input data is [8, 1, 32, 128], where 8 is the number of pictures in each batch, 1 indicates a single-channel grayscale picture, 32 is the height of the picture, and 128 is its width. After the CNN layer of CRNN, the output feature shape is [8, 512, 1, 16]. CRNN then converts this feature into a feature of shape [16, 8, 512] and inputs features with a sequence length of 16 into the RNN layer. Under this setting there are two problems: (a) If the maximum character length over all pictures is 2, i.e. each picture contains at most 2 characters to be recognized, the features will still be divided into 16 sequences and sent to the RNN layer, although in fact two sequences are sufficient for prediction. Feeding 16 sequences needlessly increases the parameters of the RNN network, and too many sequences reduce the accuracy and convergence speed of the network.
(b) If the minimum character length over all pictures is 17, i.e. each picture contains at least 17 characters to be recognized, the features will still be divided into 16 sequences and sent to the RNN layer, and the network will produce a prediction for each sequence. The maximum length of the recognition result for any picture is therefore only 16 characters, which means that the recognition rate of the network will always be 0 and it will never converge.
In order to solve the above problems, we designed a structure that can flexibly configure the sequence length of the data input to the RNN layer for different target data. We call this flexible connection layer between the CNN and RNN layers the CRFC layer; its structure is adjusted dynamically according to the shape of the features output by the CNN layer. For the case where the length of the target character sequence is much smaller than the width of the output feature, its structure is shown in Fig. 3. The specific process is as follows: the features output by the CNN layer are fed into a pooling operation and a fully connected operation, each of which transforms the width W of the feature into T, and the two results are added. Let K denote the kernel size of the pooling layer, S its stride, P its padding, T the number of sequences we need to input to the RNN layer, and W the width of the feature output by the CNN layer. K, S, and P are chosen so that the pooled width satisfies

$$T_1 = \left\lfloor \frac{W + 2P - K}{S} \right\rfloor + 1 = T$$

where $T_1$ represents the number of final sequences of data that we obtain and input into the RNN layer. The structure of this layer is shown in Fig. 4, where B represents the batch size, C the number of channels, H and W the height and width of the feature respectively, and T the number of sequences. When the length of the target character sequence is greater than the width of the output feature, the CRFC layer directly uses a fully connected layer to convert the width of the feature from W to T; this structure is shown in Fig. 5.
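As a concrete illustration, the following is a minimal sketch of a CRFC-style layer under the width constraint above. The use of max pooling, the parallel fully connected branch, and the particular choice of kernel, stride, and zero padding are assumptions made for the sketch; the exact configuration used in CRNN-RES may differ.

```python
# A minimal sketch of a CRFC-style layer: pool + FC fusion when T < W,
# direct FC width expansion when T >= W.
import torch
import torch.nn as nn


class CRFC(nn.Module):
    def __init__(self, width_w, seq_t):
        super().__init__()
        self.w, self.t = width_w, seq_t
        if seq_t < width_w:
            stride = width_w // seq_t
            kernel = width_w - (seq_t - 1) * stride   # yields exactly T pooled columns
            self.pool = nn.MaxPool1d(kernel_size=kernel, stride=stride, padding=0)
        self.fc = nn.Linear(width_w, seq_t)           # maps width W directly to T

    def forward(self, x):            # x: [B, C, 1, W] from the CNN layer
        x = x.squeeze(2)             # [B, C, W]
        if self.t < self.w:
            x = self.pool(x) + self.fc(x)   # fuse pooled and FC branches
        else:
            x = self.fc(x)                  # T >= W: direct width expansion
        return x.permute(2, 0, 1)    # [T, B, C] sequence for the RNN layer


# Example: CNN output [8, 512, 1, 16] mapped to 2 or 22 sequence steps.
crfc_small = CRFC(width_w=16, seq_t=2)    # at most 2 characters per image
print(crfc_small(torch.randn(8, 512, 1, 16)).shape)   # torch.Size([2, 8, 512])

crfc_large = CRFC(width_w=16, seq_t=22)   # up to 22 characters per image
print(crfc_large(torch.randn(8, 512, 1, 16)).shape)   # torch.Size([22, 8, 512])
```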

Modifications of the convolution layer.
In order to improve the recognition of narrow characters, the pooling layers of the original CRNN use a narrow pooling kernel, but this kernel does not take wide characters into account, and in real data narrow and wide characters coexist. In addition, we improve the CNN layer to give the network stronger feature extraction ability. The convolutional layer of CRNN-RES extracts the feature information of the image and can be regarded as the feature extraction layer of the network. The purpose of our modifications to the convolutional network is to give it stronger feature extraction capabilities and to ensure that each convolutional layer can extract richer image feature information. As can be seen from Fig. 4, compared with the convolutional network of CRNN, we added a BatchNormalization 13 layer after the third convolutional layer, so that the network fits the image features better and the features are less likely to become too complex and exceed the fitting ability of the network. After the first and second convolutional layers, a pooling layer with a kernel size of 1 × 2 is added in parallel, and its output is fused with that of the pooling layer with a kernel size of 2 × 2. In the last two pooling stages, a pooling layer with a kernel size of 3 × 2 is added, and its output is fused with that of the pooling layer with a kernel size of 1 × 2. The purpose of these added pooling branches is to obtain different receptive fields, so that the network extracts features through multiple receptive fields and handles both wide and narrow characters well, thereby improving the recognition accuracy for characters of different sizes.
Figure 5. When the length of the target character sequence is greater than the width of the output feature, the CRFC layer directly uses a fully connected layer to convert the width of the feature from W to T.

To facilitate the reader's understanding, we use algebra and figures to illustrate our changes to the convolutional layer. Assume that the result of max-pooling branch 1 is A and the result of max-pooling branch 2 is M. Then the final result P obtained by the pooling layer is

$$P = A + M \quad (14)$$

Our modifications to the CNN module are shown in Table 1, which readers can consult for the detailed structure and parameters of the CRNN-RES network.
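The fusion P = A + M of two max-pooling branches can be sketched as follows. The 2 × 2 and 1 × 2 kernel sizes follow the text, while the common stride of 2 is an assumption chosen here so that both branches produce feature maps of the same shape; the exact strides and paddings are given in Table 1.

```python
# A minimal sketch of the dual-pooling (double receptive field) fusion P = A + M.
import torch
import torch.nn as nn


class DualPool(nn.Module):
    """Fuse two max-pooling branches with different receptive fields."""

    def __init__(self):
        super().__init__()
        # Branch 1: square 2x2 window (as in the original CRNN).
        self.branch1 = nn.MaxPool2d(kernel_size=(2, 2), stride=2)
        # Branch 2: narrow 1x2 window over the same stride grid.
        self.branch2 = nn.MaxPool2d(kernel_size=(1, 2), stride=2)

    def forward(self, x):                  # x: [B, C, H, W]
        a = self.branch1(x)                # A: features seen through the 2x2 field
        m = self.branch2(x)                # M: features seen through the 1x2 field
        return a + m                       # P = A + M  (Eq. 14)


pool = DualPool()
print(pool(torch.randn(8, 64, 32, 128)).shape)  # torch.Size([8, 64, 16, 64])
```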

Evaluation
In this section, we introduce the implementation details, evaluation metrics, and evaluation results of our method, including comparisons with the baseline methods.

Dataset and hyperparameters.
In order to compare with CRNN and to show more intuitively the performance improvement brought by the improved network, we chose the same synthetic dataset (Synth) 14 as the training data. The dataset contains 8 million training images and their corresponding ground-truth words. The same test sets as in CRNN were used, namely ICDAR 2003 (IC03) 4, ICDAR 2013 (IC13) 5, IIIT 5k-word (IIIT5k) 6, and Street View Text (SVT), and the partition of the datasets was not modified, so the data used is exactly the same as for CRNN. Before the images are input into the network, they are uniformly scaled to a size of 100 × 32. We tested the effects of modifying the CNN layer, the RNN layer, and the CRFC layer separately on the IIIT5k dataset. Since the maximum character length in IIIT5k is 22, we set the hyperparameter T to 22. All experimental results were computed on a Tesla V100 GPU.
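For reference, a minimal preprocessing sketch consistent with the setup described above (grayscale input rescaled to 100 × 32) might look as follows; the normalization values and the file name are illustrative assumptions.

```python
# Preprocessing sketch: grayscale, resize to 100x32, normalize.
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),   # 1-channel input
    transforms.Resize((32, 100)),                  # (height, width) = (32, 100)
    transforms.ToTensor(),                         # [1, 32, 100] in [0, 1]
    transforms.Normalize(mean=[0.5], std=[0.5]),   # scale to [-1, 1] (assumed values)
])

img = Image.open("word.png")                       # hypothetical sample image
batch = preprocess(img).unsqueeze(0)               # [1, 1, 32, 100]
print(batch.shape)
```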
Experimental results and discussion.

As can be seen from Tables 2 and 3, the method proposed in this paper is 1.40%, 3.15%, 5.43%, and 2.16% higher than CRNN on the four datasets, respectively. Compared with CRNN, the proposed method therefore improves the accuracy significantly. We also performed ablation experiments to verify the effects of the different structures on the model. As shown in Table 4, when the CRFC layer is added, the accuracy improves from 0.782 to 0.813, which proves the effectiveness of the CRFC layer. After we use double pooling to modify the CNN layer of CRNN, the accuracy improves from 0.782 to 0.790, which proves the effectiveness of the double-pooling modification. Moreover, after replacing the double-layer BiLSTM of the RNN with a single recursively trained BiLSTM, the accuracy improves from 0.790 to 0.804 and the recognition time is reduced from 7.01 ms to 6.71 ms. This shows that our model not only improves the accuracy but also increases the speed of the model.
In addition, we show the recognition results on specific samples in Table 5, which indicate that our method is more robust when recognizing character-dense images and wide-character images. In short, the CRNN-RES network proposed in this paper achieves higher recognition accuracy than CRNN while being faster and having a smaller model.

Conclusion
This paper introduces a novel neural network method based on BiLSTM to improve the performance of the CRNN network. Our method reduces the number of network parameters while achieving higher accuracy. Extensive experiments on public datasets demonstrate the effectiveness of our proposed method.