A multi-scale gated multi-head attention depthwise separable CNN model for recognizing COVID-19

Coronavirus disease 2019 (COVID-19) is a new acute respiratory disease that has spread rapidly throughout the world. In this paper, a lightweight convolutional neural network (CNN) model named multi-scale gated multi-head attention depthwise separable CNN (MGMADS-CNN) is proposed, which is based on the attention mechanism and depthwise separable convolution. A multi-scale gated multi-head attention mechanism is designed to extract effective feature information from COVID-19 X-ray and CT images for classification. Moreover, depthwise separable convolution layers are adopted as the backbone of MGMADS-CNN to reduce the model size and the number of parameters. LeNet-5, AlexNet, GoogLeNet, ResNet, VGGNet-16, and three MGMADS-CNN models are trained, validated and tested with tenfold cross-validation on X-ray and CT images. The results show that MGMADS-CNN with three attention layers (MGMADS-3) achieves an accuracy of 96.75% on X-ray images and 98.25% on CT images. The specificity and sensitivity are 98.06% and 96.6% on X-ray images, and 98.17% and 98.05% on CT images. The size of the MGMADS-3 model is only 43.6 MB. In addition, the detection times of MGMADS-3 on X-ray images and CT images are 6.09 ms and 4.23 ms per image, respectively. These results indicate that MGMADS-3 can detect and classify COVID-19 faster, with higher accuracy and efficiency.

Methodology
MGMADS-CNN model. Aiming to improve the detection accuracy and reduce the model size, a novel CNN-based model inspired by the multi-scale gated multi-head attention mechanism, namely MGMADS-CNN, is proposed.
The structure of the proposed MGMADS-CNN model is shown in Fig. 1, and the outputs and parameters of each layer are shown in Table 1. The model consists of convolutional extraction blocks and attention modules. From left to right in Fig. 1, the first two convolutional blocks (B1 and B2 in Table 1) are each composed of two depthwise separable convolution layers and a maximum pooling layer. The following three convolutional blocks (B3, B4 and B5 in Table 1) each include two depthwise separable convolution layers, a standard convolution layer, and a maximum pooling layer. Each depthwise separable convolution layer is followed by batch normalization. The multi-head attention blocks are represented by 'A' in Fig. 1. 'Multi-scale' means that MGMADS-CNN can be built with different numbers of attention blocks: MGMADS-1 is the structure including 'A1' in Table 1, MGMADS-2 is the structure including 'A1' and 'A2', and MGMADS-3 is the structure including 'A1', 'A2' and 'A3'. To ensure the integrity of the input information, the model uses global average pooling instead of a fully connected layer. Global average pooling avoids the loss of shallow feature information, reduces the number of parameters, and helps prevent overfitting during training.
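The global average pooling step described above can be illustrated with a minimal plain-Python sketch. It averages every channel of an H × W × C feature map over its spatial positions, so the result is a C-dimensional vector obtained without any trainable parameters. The function name `global_average_pooling` and the toy input are assumptions made for illustration and are not taken from the model's implementation.

```python
def global_average_pooling(feature_map):
    """Average an H x W x C feature map (nested lists) over H and W.

    Returns a list with one mean per channel; no trainable parameters
    are involved, unlike a fully connected layer.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    sums = [0.0] * channels
    for row in feature_map:
        for pixel in row:
            for c, value in enumerate(pixel):
                sums[c] += value
    return [s / (height * width) for s in sums]


# Toy 2 x 2 feature map with 2 channels (hypothetical values).
toy_map = [
    [[1.0, 10.0], [2.0, 20.0]],
    [[3.0, 30.0], [4.0, 40.0]],
]
pooled = global_average_pooling(toy_map)
print(pooled)
```

Whatever the spatial size of the last feature map, the pooled vector has exactly one entry per channel, which is why this step adds no weights to the model.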
The input images are 256 × 256. During convolution sampling, it is necessary to increase the number of image channels and convolution kernels to preserve the image features and avoid information loss. After convolution, the number of image channels is increased from 3 to 64, and then gradually to 512. The stride of the maximum pooling layer is 2, so the image size is halved after each maximum pooling; the scale of the image is gradually reduced and the image information is continuously compressed. In Table 1, an entry such as '(None, 256, 256, 3)' denotes the output shape of a layer in the model: 'None' means that the 'batch_size' is not specified, '(256, 256)' is the size of the input feature map, and '3' is the number of channels in the feature map.
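The shape bookkeeping described above can be traced with a short sketch. It assumes that the convolutions preserve the spatial size (so only the stride-2 pooling changes it) and uses a hypothetical channel progression of 64, 128, 256, 512, 512 for the five blocks; the text only fixes the start (3), the first value (64) and the final value (512).

```python
def trace_shapes(input_size=256, input_channels=3,
                 block_channels=(64, 128, 256, 512, 512), pool_stride=2):
    """Return the (height, width, channels) shape after each conv block.

    block_channels is an assumed progression, not the values of Table 1.
    Convolutions are assumed to keep the spatial size unchanged.
    """
    shapes = [(input_size, input_size, input_channels)]
    size = input_size
    for channels in block_channels:
        size //= pool_stride  # each block ends with a stride-2 max pooling
        shapes.append((size, size, channels))
    return shapes


shapes = trace_shapes()
for shape in shapes:
    # None stands for the unspecified batch size, as in Table 1.
    print((None,) + shape)
```

With five pooling layers the 256 × 256 input is reduced to 8 × 8 under these assumptions.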
www.nature.com/scientificreports/
The loss function of MGMADS-CNN is the categorical_crossentropy loss function, which is used to evaluate the difference between the probability distribution obtained by training and the real one. It describes the distance between the actual output (probability) and the expected output (probability): the smaller the value of the cross entropy, the closer the two probability distributions are. The formula of the categorical_crossentropy loss function is:
$$C = -\frac{1}{n}\sum_{x}\left[\,y\ln a + (1-y)\ln(1-a)\,\right] \tag{1}$$
where C represents the loss function, x represents the sample, y represents the actual value, a represents the output value, and n represents the total number of samples. The gradient of weight W and bias B is derived as follows:
Depthwise separable convolution. Depthwise separable convolution is a special convolution method that operates on space and depth. The core concept is to decompose a complete convolution operation into two steps, namely depthwise convolution and pointwise convolution 22. The depthwise convolution applies one convolution kernel to each input channel for filtering. The pointwise convolution applies a 1 × 1 convolution to combine the outputs of the depthwise convolution. This decomposition drastically reduces the computation and the model size.
Pointwise convolution. Pointwise convolution uses the conventional 1 × 1 convolution kernel and projects the channels calculated by the depthwise convolution onto a new channel space. After the depthwise convolution, the pointwise convolution uses N convolution kernels of size 1 × 1 × M to convolve the M feature maps of size $D_G \times D_G$, and then performs a weighted combination in the depth direction to output N feature maps of size $D_G \times D_G \times 1$, that is, G ($D_G$, $D_G$, N). The pointwise convolution process is shown in Fig. 3, where the number of convolution kernels determines the number of output feature maps.
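The two steps described above can be sketched in plain Python on a toy input. This is an illustrative sketch only: it assumes stride 1 and no padding (so $D_G = D_F - D_K + 1$), and the function names `depthwise_conv` and `pointwise_conv` as well as the toy kernels are hypothetical.

```python
def depthwise_conv(feature_map, kernels):
    """Apply one D_K x D_K kernel per input channel (no padding, stride 1).

    feature_map: M channels, each a D_F x D_F list of lists.
    kernels:     M kernels, each a D_K x D_K list of lists.
    Returns M channels of size D_G x D_G with D_G = D_F - D_K + 1.
    """
    outputs = []
    for channel, kernel in zip(feature_map, kernels):
        d_f, d_k = len(channel), len(kernel)
        d_g = d_f - d_k + 1
        out = [[sum(kernel[i][j] * channel[r + i][c + j]
                    for i in range(d_k) for j in range(d_k))
                for c in range(d_g)]
               for r in range(d_g)]
        outputs.append(out)
    return outputs


def pointwise_conv(feature_map, weights):
    """Combine M channels with N kernels of size 1 x 1 x M.

    weights: N lists of M coefficients. Returns N channels of D_G x D_G.
    """
    d_g = len(feature_map[0])
    return [[[sum(w * channel[r][c] for w, channel in zip(kernel, feature_map))
              for c in range(d_g)]
             for r in range(d_g)]
            for kernel in weights]


# Toy input: M = 2 channels of size 3 x 3, D_K = 2, N = 3 output channels.
x = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
depth_kernels = [
    [[1, 0], [0, 1]],  # sums the main diagonal of each 2 x 2 window
    [[1, 1], [1, 1]],  # sums the whole 2 x 2 window
]
point_weights = [[1, 0], [0, 1], [1, 1]]  # N = 3 kernels of size 1 x 1 x 2

filtered = depthwise_conv(x, depth_kernels)        # M maps of size 2 x 2
combined = pointwise_conv(filtered, point_weights)  # N maps of size 2 x 2
```

The depthwise step never mixes channels, and the pointwise step never looks at neighbouring pixels; the number of 1 × 1 kernels alone sets the number of output maps.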
Efficiency analysis. The standard convolution uses a convolution kernel to filter and combine the features in one step to produce a new representation. The depthwise separable convolution decomposes filtering and combination into two steps, thereby breaking the interaction between the number of output channels and the kernel size, which greatly reduces the computational cost 24. Consider the input feature map F ($D_F$, $D_F$, M) and the standard convolution kernel K ($D_K$, $D_K$, M, N). The formula of the standard convolution is:  The formula of the depthwise separable convolution is: The computational cost of the depthwise separable convolution is: To illustrate the computational efficiency of the proposed model, we define a ratio of calculation consumption (RCC): From Eq. (5), it is easy to see that the RCC is far smaller than one and close to 1/N. This indicates that, as the feature map size increases, the proposed method has lower cost and higher efficiency. The explanation for the efficiency improvement is that the model can learn the spatial features and the channel features separately when executing the depthwise separable convolution. By reducing the number of parameters to be calculated, the depthwise separable convolution makes the detection faster and the model smaller.
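The cost comparison can be sketched numerically. The sketch assumes the commonly used multiply-add counts for a $D_F \times D_F$ feature map with stride 1 and size-preserving padding: $D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$ for a standard convolution, and $D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$ for a depthwise separable one, which gives a ratio of $1/N + 1/D_K^2$. These counts are an assumption about the cost model, and the layer sizes in the example are hypothetical.

```python
def standard_conv_cost(d_k, m, n, d_f):
    """Multiply-adds of a standard convolution: D_K*D_K*M*N*D_F*D_F."""
    return d_k * d_k * m * n * d_f * d_f


def separable_conv_cost(d_k, m, n, d_f):
    """Depthwise cost D_K*D_K*M*D_F*D_F plus pointwise cost M*N*D_F*D_F."""
    return d_k * d_k * m * d_f * d_f + m * n * d_f * d_f


def rcc(d_k, m, n, d_f):
    """Ratio of calculation consumption: separable cost / standard cost."""
    return separable_conv_cost(d_k, m, n, d_f) / standard_conv_cost(d_k, m, n, d_f)


# Hypothetical layer: 3 x 3 kernel, 64 -> 128 channels, 128 x 128 feature map.
ratio = rcc(d_k=3, m=64, n=128, d_f=128)
print(f"RCC = {ratio:.4f}")
```

Under this cost model the ratio does not depend on M or $D_F$, so the relative saving is the same at every feature-map size while the absolute saving grows with the map.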
Multi-scale gated multi-head attention mechanism (MGMA). In CNN models, the pooling layer and the convolution layer are often applied together to reduce the dimension of the input features and the calculation cost. However, too many pooling layers result in the loss of information about small targets in the deep feature maps. In this paper, a multi-scale gated multi-head attention mechanism (MGMA) is proposed to avoid this drawback.
The attention mechanism 23 is a signal processing mechanism of human vision. It makes full use of limited visual resources by focusing selectively on specific visual areas. The attention mechanism has recently been applied in deep learning and is widely used in image processing 25, speech recognition 26, natural language processing 27, and so on. The core task of the attention mechanism is to select critical information from a large amount of information quickly and accurately. Compared with the standard convolution calculation, the attention mechanism is characterized by high accuracy, fewer parameters and lower calculation cost.
Scaled dot-product attention. The essence of the attention mechanism is a process of weighted summation of 'Value' based on 'Query' and 'Key' , and redistribution of weights. From the formal point of view, the attention mechanism can be understood as a key-value query, which maps queries and key-values to the output. The output is the weighted summation of 'Value' , and the weights are the similarities of 'Query' and 'Key' 28 .
As shown in Fig. 4, the 'Query' and 'Key' of dimension $d_k$ and the 'Value' of dimension $d_v$ are input into the network. The similarity between 'Query' and 'Key' is calculated by the dot product operation and divided by $\sqrt{d_k}$. The weight of each 'Value' is then obtained through the softmax function.
Figure 4. Attention calculation process.
In practice, the 'Query', 'Key', and 'Value' are packed into matrices Q, K and V, so that the attention function is computed on a set of queries simultaneously. When a certain 'Query' is given, its similarity to multiple 'Keys' is calculated, and the weight coefficient of the 'Value' corresponding to each 'Key' is obtained. The dot product is carried out as in Eq. (6): Here $\sqrt{d_k}$ is a scale factor, which is mainly used to adjust the calculation result. The original scores are then converted and normalized by the softmax function, which turns them into a probability distribution in which the weights of all elements sum to 1. Meanwhile, the internal mechanism of the softmax function assigns larger weights to the more important elements. The following formula is used for the output $a_i$ of the softmax function.
The scaled dot-product attention can be expressed as: The whole attention calculation first computes the dot-product similarity of 'Query' and 'Key', then obtains the weight coefficients using the softmax function, and finally carries out the weighted summation of 'Value' according to the weight coefficients.
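The three steps just described (dot-product similarity scaled by $\sqrt{d_k}$, softmax, weighted summation of 'Value') correspond to the widely used form $\mathrm{softmax}(QK^{T}/\sqrt{d_k})\,V$. A minimal plain-Python sketch of that standard form follows; the function names and the toy matrices are assumptions for illustration and are not taken from the model's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax: the outputs are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # Step 1: dot-product similarity of the query with every key, scaled.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        # Step 2: turn the scores into weight coefficients.
        weights = softmax(scores)
        # Step 3: weighted summation of the values.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(d_v)])
    return outputs


# Toy example: one query that matches the first key more closely.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out = scaled_dot_product_attention(Q, K, V)
print(out)
```

Because the query is more similar to the first key, the output is pulled towards the first value while still containing a share of the second.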
Multi-head attention mechanism. The multi-head attention mechanism is a special scaled dot-product attention calculation approach. As shown in Fig. 5, the multi-head attention mechanism learns several different mappings through the model. First, linear transformations are applied to the $d_{model}$-dimensional 'Query', 'Key', and 'Value'; the scaled dot-product attention is then calculated in parallel on each transformed version to generate $d_v$-dimensional outputs. The multiple (h) outputs are integrated by the Concat function, and a linear transformation is performed again to obtain the final output value.
The multi-head attention mechanism does not increase the complexity of the algorithm, but enables the model to learn the correlation information in different representation subspaces and improves the model's perception ability. The calculation formula is shown as follows: In this paper, there are 8 parallel attention layers, that is, h = 8, and $d_k = d_v = d_{model}/h = 64$. Because the dimensionality of each attention layer is reduced, the total computational cost is similar to that of a full-dimensional single-head attention layer. Compared with the scaled dot-product attention mechanism, the multi-head attention mechanism has lower complexity and allows the model to learn different representation information, avoiding the loss of small target information caused by averaging.
MGMA transmits the transformed feature maps to each independent branch channel, and calculates the multi-head attention of each scale channel in the feature maps. Then, the weighted outputs of the feature maps from the multiple scale channels are integrated to obtain the multi-scale feature information. The structure is shown in Fig. 6.
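The statement that h reduced-dimension heads cost about as much as one full-dimension head can be checked with simple arithmetic. The sketch below assumes the usual arrangement in which each head has its own linear projections of size $d_{model} \times d_k$ for 'Query', 'Key' and 'Value' (biases ignored); $d_{model} = 512$ follows from h = 8 and $d_k = 64$. The function names are hypothetical.

```python
def head_dimension(d_model, h):
    """Per-head dimension d_k = d_v = d_model / h."""
    if d_model % h != 0:
        raise ValueError("d_model must be divisible by the number of heads")
    return d_model // h


def projection_weights(d_model, h):
    """Weights in the Q, K and V projections of h heads (biases ignored).

    Assumes one d_model x d_k matrix per head for each of Q, K and V.
    """
    d_k = head_dimension(d_model, h)
    return 3 * h * d_model * d_k


d_model = 512  # implied by h = 8 and d_k = 64
multi_head = projection_weights(d_model, h=8)
single_head = projection_weights(d_model, h=1)
print(head_dimension(d_model, 8), multi_head, single_head)
```

Splitting into heads redistributes the same number of projection weights across eight 64-dimensional subspaces instead of adding to it.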
In Fig. 6, 'H' represents the number of branches of the gated attention channel and 'A' represents the multi-head attention computing layer. The number of multi-head attention computing layers is the same as the number of channel branches H. In MGMA, when the size of the feature maps in the convolutional neural network changes (usually after a maximum pooling layer), the feature information of this layer is input into the 1 × 1 layer for dimensional transformation, and then input into MGMA to extract the features.
In MGMADS-CNN, the depthwise separable convolution effectively reduces the number of parameters the model has to compute. To further improve the classification accuracy of the model, MGMA uses a multi-head attention mechanism to integrate the results of attention operations in different subspaces of multiple branch channels. As a result, the MGMADS-CNN model pays more attention to targets of different sizes, especially small and medium-sized targets in large images, which improves its classification accuracy.
These datasets are labeled by hospital experts with scientific rigor. The image samples are shown in Fig. 7. According to the dataset distribution, we conduct four-class classification on the X-ray images and binary classification on the CT images, as shown in Fig. 7.
Dataset settings. The collected dataset is clearly imbalanced, which biases the classifier towards the classes with more samples and harms both the generalization of the model and the objective evaluation of the models.
Data augmentation is a way to address the problems of data shortage and imbalance. It is a popular and effective approach to avoiding network overfitting during training in current research. In this paper, augmentation methods such as affine transformation 20, image mirroring 30 and position transformation 31 are used to expand and enhance the dataset.
Affine transformation. Affine transformation includes rotation, translation, scaling, reflection and shearing, which can increase the amount of synthesized data and improve the robustness of the model. The principle of affine transformation can be described as follows.
The image is randomly rotated about the X axis and Y axis, and the enhancement matrix is: where θ is the rotation transformation angle, γ is the shearing factor, and µ is the scaling factor. The matrices A and B are the rotation transformation matrices, C is the shearing transformation matrix, and D is the random scaling matrix. The pixel coordinate (x, y, z) is converted to (x′, y′, z′).
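How four such matrices combine into one transformation can be sketched as follows. The concrete matrix forms, the use of the same angle θ for both rotations, and the composition order A·B·C·D are all assumptions made for illustration; they are textbook forms of rotation about the X axis, rotation about the Y axis, shearing and uniform scaling, not necessarily the exact matrices used for the augmentation.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def rotation_x(theta):
    """Rotation about the X axis (assumed form of A)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[1, 0, 0], [0, c, -s], [0, s, c]]


def rotation_y(theta):
    """Rotation about the Y axis (assumed form of B)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0, s], [0, 1, 0], [-s, 0, c]]


def shear(gamma):
    """Shear x in proportion to y (one possible form of C)."""
    return [[1, gamma, 0], [0, 1, 0], [0, 0, 1]]


def scale(mu):
    """Uniform scaling by mu (assumed form of D)."""
    return [[mu, 0, 0], [0, mu, 0], [0, 0, mu]]


def enhancement_matrix(theta, gamma, mu):
    """Compose A, B, C and D in that (assumed) order."""
    m = rotation_x(theta)
    for factor in (rotation_y(theta), shear(gamma), scale(mu)):
        m = matmul(m, factor)
    return m


def transform(point, matrix):
    """Map (x, y, z) to (x', y', z')."""
    return tuple(sum(matrix[i][k] * point[k] for k in range(3))
                 for i in range(3))


# No rotation, shear factor 0.5, scaling factor 2 (hypothetical values).
moved = transform((1.0, 2.0, 3.0), enhancement_matrix(0.0, 0.5, 2.0))
print(moved)
```

Drawing θ, γ and µ at random for every image yields a different synthesized sample each time, which is how the transformation enlarges the dataset.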
Image mirroring. Horizontal and vertical mirroring transformations are applied to the raw dataset. Horizontal mirroring takes the vertical center line of the image as the axis and swaps the pixels on either side, that is, it swaps the left half and the right half of the image. Vertical mirroring takes the horizontal center line of the image as the axis and swaps the upper half of the image with the lower half.
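Both mirroring operations can be sketched on an image stored as a list of pixel rows. The sketch assumes zero-based pixel indices, so a pixel in column $x_0$ moves to column width − 1 − $x_0$ under horizontal mirroring; with one-based indices the offset would differ by one. The function names are hypothetical.

```python
def mirror_horizontal(image):
    """Flip about the vertical centre line: swap the left and right halves."""
    return [row[::-1] for row in image]


def mirror_vertical(image):
    """Flip about the horizontal centre line: swap the upper and lower halves."""
    return [row[:] for row in image[::-1]]


def mirrored_coordinates(x0, y0, width, height, horizontal=True):
    """Return (x1, y1) for zero-based pixel coordinates (an assumption)."""
    if horizontal:
        return width - 1 - x0, y0
    return x0, height - 1 - y0


# Toy image with height 2 and width 3.
image = [
    [1, 2, 3],
    [4, 5, 6],
]
print(mirror_horizontal(image))  # rows reversed left to right
print(mirror_vertical(image))    # row order reversed
```

Applying either mirror twice returns the original image, so each operation adds exactly one new variant per source image.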
Let the width of the image be width and its height be height, let ($x_0$, $y_0$) be the coordinates in the original image, and let ($x_1$, $y_1$) be the transformed coordinates. The horizontal and vertical mirroring are given by:
After data augmentation with the above techniques, the datasets used in this paper are shown in Table 2.
In Table 2, there are 17,439 X-ray images, including 4832 normal, 4418 bacterial pneumonia, 4166 viral pneumonia and 4023 COVID-19 images. The lung CT images include 5683 normal and 5156 COVID-19 images. The augmented dataset provides more training data, which improves the generalization ability and reliability of the model. It also helps to enhance the robustness of the model and to overcome the imbalance between positive and negative samples.

Experimental result and analysis
The training, validation and testing experiments are carried out on a platform with an Intel Core i5-9400F CPU, the Windows 10 64-bit OS and an NVIDIA GeForce GTX 1660 GPU. Python 3.6 is used to code the model, and the deep learning frameworks TensorFlow GPU 1.8.0 and Keras 2.1.4, together with CUDA 9.0, are used to build the model structure. In addition, the models are developed with the PyCharm 2017 IDE and packages such as Numpy and Scikit-Learn.
In the evaluation metrics, TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. Generally speaking, high specificity means a low misdiagnosis rate, and high sensitivity means a low missed diagnosis rate. The higher the accuracy, the better the classification effect.
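The three metrics can be computed from the four counts as sketched below, using their standard definitions: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), and accuracy = (TP + TN) / (TP + TN + FP + FN). The function name and the example counts are hypothetical and are not results from the experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return sensitivity, specificity and accuracy from confusion counts."""
    return {
        # Share of actual positives that are detected (low missed diagnosis).
        "sensitivity": tp / (tp + fn),
        # Share of actual negatives that are rejected (low misdiagnosis).
        "specificity": tn / (tn + fp),
        # Share of all samples that are classified correctly.
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Hypothetical counts for a binary test set of 200 images.
metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For a multi-class task such as the four-class X-ray classification, the same counts can be taken per class by treating that class as positive and all others as negative.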
Tenfold cross-validation. To reduce the variability of the test results, this paper adopts k-fold cross-validation to conduct the experiments, and randomly divides the dataset D into k uniform and disjoint subsets $D_1$, $D_2$, …, $D_k$, such that: In this paper, k = 10. The validation process is executed in the following steps:
Step 1 Divide the dataset randomly into ten equal parts.
Step 2 Take one part as the test set, one part as the validation set, and the other eight parts as the training set.
Step 4 Change the data distribution randomly and execute Step 2 ten times. The dataset distribution at each cyclic execution is shown in Fig. 8.
After ten executions, the average accuracy is used to evaluate the performance, and the formula is as follows:
Training and validation comparisons. The eight models (LeNet-5, AlexNet, GoogLeNet, ResNet, VGGNet-16, MGMADS-1, MGMADS-2 and MGMADS-3) are trained on the declared platform and framework. During validation, each epoch yields one loss value, so 100 loss values are obtained after 100 epochs. To present the differences among the models directly and graphically, the validation loss (Val_loss) and validation accuracy (Val_acc) of ResNet, VGGNet-16 and MGMADS-3 are selected as representatives and shown in Fig. 9.
At epoch 100, ResNet, VGGNet-16 and MGMADS-3 have Val_loss values of 0.2069, 0.1891 and 0.0140, and Val_acc values of 93.19%, 92.40% and 96.25%, respectively, on the X-ray image dataset. On the CT image dataset, the corresponding Val_loss values are 0.1746, 0.1483 and 0.0136, and the Val_acc values are 95.53%, 93.75% and 98.09%. On both datasets, MGMADS-3 achieves the lowest Val_loss and the highest Val_acc of the three models. Its Val_acc curve also grows more steadily and oscillates less than those of the ResNet and VGGNet-16 models. This verifies the validity and feasibility of the MGMADS-3 model.
The detailed comparisons of the LeNet-5, AlexNet, GoogLeNet, ResNet, VGGNet, MGMADS-1, MGMADS-2 and MGMADS-3 models are listed in Columns 3-6 of Table 3. For the four-class classification of the X-ray images in Table 2 … It is easy to see from Table 3 that, in both training and validation, the MGMADS-3 model achieves higher accuracy than the compared typical models and has a good recognition effect on X-ray and CT images.
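The tenfold cross-validation procedure used for these experiments can be sketched with the standard library alone. The sketch splits the data into ten disjoint parts and, in each of the ten rounds, uses one part for testing, one for validation and the remaining eight for training, then averages the per-round accuracies. Rotating the test fold and taking the next fold for validation is an assumption about how the parts are reassigned between rounds, and the function names are hypothetical.

```python
import random


def tenfold_splits(samples, k=10, seed=0):
    """Yield (train, validation, test) lists for k rounds.

    In round i, fold i is the test set, the following fold is the
    validation set, and the remaining k - 2 folds form the training set.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k disjoint, near-equal parts
    for i in range(k):
        val_index = (i + 1) % k
        train = [s for j, fold in enumerate(folds)
                 if j not in (i, val_index) for s in fold]
        yield train, folds[val_index], folds[i]


def average_accuracy(accuracies):
    """Mean of the per-round accuracies."""
    return sum(accuracies) / len(accuracies)


splits = list(tenfold_splits(range(100)))
train, validation, test = splits[0]
print(len(train), len(validation), len(test))
```

Every sample is used for testing exactly once across the ten rounds, which is what makes the averaged accuracy less sensitive to any single split.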

MGMADS-CNN performance tests
The performance measures used to evaluate the models include the model size, the detection speed, the specificity, the sensitivity, and the test accuracy. Another vital evaluation measure, the receiver operating characteristic (ROC) curve, is presented afterwards. Finally, comparisons with the related published literature are listed.
Test performance. The comparative experimental results of the eight models above are shown in Columns 7-10 of Table 3.
From Table 3, the model sizes of the MGMADS-CNN models are about 40 MB, which is much smaller than those of the other networks, resulting in a more lightweight network structure. With the reduced model size, the performance remains stable and even improves slightly. In terms of classification, the MGMADS-CNN models have higher specificity, sensitivity and accuracy than the other models. For example, the MGMADS-3 model achieves a specificity, sensitivity and accuracy of 98.06%, 96.60% and 96.75% on the X-ray image data, and 98.17%, 98.05% and 98.25% on the CT image data.
To compare the detection speed of the models more intuitively, the ResNet and VGGNet-16 models are selected as the comparison networks. The detection speed histograms of the ResNet, VGGNet-16 and MGMADS-3 models on the X-ray image data and CT image data are shown in Fig. 10. On the X-ray image data, the detection times per image of the ResNet, VGGNet-16 and MGMADS-3 models are 12.24 ms, 10.09 ms and 6.09 ms, and on the CT image data they are 10.37 ms, 7.83 ms and 4.23 ms. MGMADS-3 is therefore the fastest of the three models on both X-ray and CT images, taking about half the time of ResNet per image or less. This further verifies that the proposed model can detect and classify COVID-19 faster.
Receiver operating characteristic curves. The Receiver Operating Characteristics (ROC) curve is the plot of True Positive Rate (TPR) against False Positive Rate (FPR). It represents the diagnostic ability of the model by measuring the degree of separability among different classes. The ROC curves on the X-ray images data and CT images data are shown in Fig. 11.
The higher the area under the curve (AUC), the better the model is at distinguishing among different classes. From Fig. 11, the AUC values of ResNet, VGGNet-16 and MGMADS-3 are 0.86, 0.69, and 0.97 on the X-ray image data, and 0.93, 0.70, and 0.98 on the CT image data.
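How an ROC curve and its AUC are obtained from classifier scores can be sketched for the binary case. The sketch lowers the decision threshold one score at a time, records the (FPR, TPR) point after each step, and integrates the curve with the trapezoidal rule. The function names, labels and scores are hypothetical and unrelated to the values read from Fig. 11.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the decision threshold.

    labels: 1 for positive, 0 for negative. scores: higher means more positive.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for index, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a tied score group.
        is_last = index == len(ranked) - 1
        if is_last or ranked[index + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Trapezoidal area under a list of (FPR, TPR) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical binary labels and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
area = auc(roc_points(labels, scores))
print(f"AUC = {area:.3f}")
```

An AUC of 1.0 means every positive sample is scored above every negative one, while 0.5 corresponds to random ranking, which is why values closer to 1 indicate better separability.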
Comparison to the related literature. The comparisons between the 2020 reports mentioned in the 'Introduction' section and our proposed models are summarized in Table 4. The number in brackets in the third column indicates the number of classes (2, 3, or 4).
As shown in Table 4, the MGMADS-3 network model proposed in this paper achieves better classification results than the other models, both for four-class classification on X-ray images and for binary classification on CT images.
There are two reasons for the better classification effect of the proposed model. First, the MGMA can effectively extract more feature information from different subspaces and learn the correlation information of small targets. Second, substituting depthwise separable convolution for standard convolution reduces the number of model parameters. Compared with the typical convolutional structures, the proposed MGMADS-CNN model has a lighter structure, faster detection speed, higher classification accuracy, and better efficiency.

Conclusion
In this paper, a new CNN model named MGMADS-CNN is proposed, which is based on the multi-head attention mechanism and depthwise separable convolution. The MGMADS-CNN model has the advantages of small size, fast detection speed and high accuracy. It can extract small target information from different subspaces through the multi-head attention mechanism, thus improving the accuracy of the model. Compared with standard convolution, the depthwise separable convolution reduces the number of parameters the model has to compute, thus reducing the size of the model and improving its detection speed. The model proposed in this paper not only improves the practicability of COVID-19 classification, but also provides a novel idea for computer-aided diagnosis (CAD).