Hybrid convolution neural network with channel attention mechanism for sensor-based human activity recognition

In the field of machine intelligence and ubiquitous computing, there has been a growing interest in human activity recognition using wearable sensors. Over the past few decades, researchers have extensively explored learning-based methods to develop effective models for identifying human behaviors. Deep learning algorithms, known for their powerful feature extraction capabilities, have played a prominent role in this area. These algorithms can conveniently extract features that enable excellent recognition performance. However, many successful deep learning approaches have been built upon complex models with multiple hyperparameters. This paper examines the current research on human activity recognition using deep learning techniques and discusses appropriate recognition strategies. Initially, we employed multiple convolutional neural networks to determine an effective architecture for human activity recognition. Subsequently, we developed a hybrid convolutional neural network that incorporates a channel attention mechanism. This mechanism enables the network to capture deep spatio-temporal characteristics in a hierarchical manner and distinguish between different human movements in everyday life. Our investigations, using the UCI-HAR, WISDM, and IM-WSHA datasets, demonstrated that our proposed model, which includes cross-channel multi-size convolution transformations, outperformed previous deep learning architectures with accuracy rates of 98.92%, 98.80%, and 98.45% respectively. These results indicate that the suggested model surpasses state-of-the-art approaches in terms of overall accuracy, as supported by the research findings.

1. We developed a hybrid CNN embedded with a channel attention mechanism, called ResNet-BiGRU-SE, to extract deep spatio-temporal features hierarchically and distinguish human activities in daily living.
2. Various CNN architectures have been employed as the underlying models for sensor-based HAR. To evaluate the ResNet-BiGRU-SE model, we compared its effectiveness with that of other CNN-based models on the HAR datasets. Additionally, we compared our proposed approach with state-of-the-art models on three benchmark HAR datasets (UCI-HAR, WISDM, and IM-WSHA) for a fair assessment.
The remainder of the study is arranged as follows: Section "Related works" reviews DL-based research on sensor-based HAR and current frameworks; Section "Research methodology" describes the hybrid DL framework presented in this study for sensor-based HAR; and Section "Experiments and results" describes the experimental setup, reports the experimental findings, and analyzes the outcomes. Section "Discussion" concludes the study and addresses future work.

Related works
HAR poses challenges as a time series classification problem, involving the prediction of an individual's movements using sensory input. Typically, it necessitates extensive domain knowledge and signal processing techniques to extract appropriate features from raw data that align with a machine learning algorithm. DL methods, such as CNNs and Long Short-Term Memory Neural Networks (LSTMs), have demonstrated their effectiveness by automatically learning relevant features from raw sensory input, thereby achieving state-of-the-art performance 14,15 . HAR aims to collect and recognize real-world actions performed by individuals or groups while considering the surrounding environmental factors. This field holds significant promise in the study of Human-Computer Interaction 16,17 as it has the potential to revolutionize how humans interact with technology in the present era. The objectives of HAR can be categorized into five main areas: identifying fundamental movements, detecting everyday motions, recognizing unique events, forecasting caloric expenditure, and performing individual biometric recognition 18 . To achieve these goals, a variety of sensors can be utilized, including environmental sensors and wearable video cameras. In practice, wearable sensors often take the form of smartphones or sensors integrated into wearable devices.
While camera sensors can provide unique information not obtainable from other sensor types, they come with certain drawbacks. Camera-based systems require constant monitoring of individuals, which demands significant storage capacity and computational capability. Additionally, continuous surveillance by camera systems may cause discomfort or unease among individuals 19. An example of a camera-based indoor human motion tracking system is presented by Zhou et al. 20. Another benefit of camera sensors is their ability to provide accurate data for human motion identification systems.
Ambient sensors offer the ability to monitor and record an individual's interactions with their environment. In the experimental context of Zhan et al.'s study 21, wireless Bluetooth acceleration and gyroscope sensors were employed to capture situational components and demonstrate their usage. Furthermore, room-side wired microphone arrays were utilized to detect ambient sound, while reed switches were placed on doors, drawers, and shelves to detect their operation and generate contextual information. However, environmental sensors are limited to specific conditions and architectural configurations, which makes the HAR system non-universal: a well-designed and trained HAR system cannot be directly applied to a different environmental setting. Additionally, the implementation cost associated with these sensors tends to be relatively high.
Wearable devices can capture the physical aspects and characteristics of individuals' everyday tasks. Inertial sensors such as accelerometers and gyroscopes, along with GPS and magnetic field sensors, are commonly used in applications for action identification. In specific studies, action identification is achieved by utilizing one or more accelerometers attached to various regions of the human body. Dong and Biawas 22 introduced a wearable sensor network designed for HAR. Additionally, Curone et al. 23 utilized a tri-axial accelerometer worn on the body for action recognition.
Given the significant advancements made by DL across various ML applications, and considering the inherent multi-class nature of DL techniques, our systematic review begins with a concise overview of DL for human activity detection. Wang et al. 24 conducted a comprehensive review of 56 publications from 2011 that utilized DL techniques, including deep neural networks, CNNs, RNNs, auto-encoders, and restricted Boltzmann machines, for sensor-based HAR. They found that no single model outperforms all others in every scenario, emphasizing the importance of selecting a model based on the specific application requirements. Additionally, they compared three benchmark datasets for HAR: the Opportunity dataset 25, the Skoda dataset 26, and the UCI-HAR dataset 27 (collected using smartphones with multiple inertial measurement units). Among these datasets, they identified studies 12,28-30 as representing the state-of-the-art in DL for HAR utilizing inertial measurement units (IMUs).
Sophisticated HAR models benefit from complex and deeper structures, leading to improved accuracy compared to previous feature learning methods. These models utilize CNNs for automatic feature extraction. In the context of object identification, the CNN feature extractor is often referred to as the backbone. This term emphasizes that the architecture of the feature extractor and the overall model construction are evaluated separately and independently.
Instead of relying on basic models, researchers have developed sophisticated backbone models to enhance performance. Dong et al. 31 introduced a combination of Hierarchical Cross-Filtering (HCF) and an inception module. Long et al. 32 proposed a method of independently learning large-scale and small-scale networks and subsequently joining them. This approach incorporates two different sizes of residual blocks as crucial components. Tuncer et al. 33 suggested utilizing a ResNet structure with multiple layers as feature extractors, with the extracted features cascaded to serve as the backbone. Ronald et al. 34 presented the iSPLInception backbone, which is based on Inception-ResNet and utilizes a multichannel-residual hybrid architecture for HAR research. Mehmood et al. 35 employed DenseNet as the backbone and leveraged dense connections for HAR purposes.

Research methodology
This research investigated sensor-based HAR using DL techniques to extract abstract characteristics from raw sensor data. As shown in Fig. 1, the explored HAR framework consists of four key process steps: data acquisition, data pre-processing, model training, and model assessment.
Data acquisition. This section highlights the HAR datasets utilized in the evaluation of this study. For assessment purposes, three public datasets were included: UCI-HAR, WISDM, and IM-WSHA. These datasets consist of inertial data collected from smartphone sensors.
IM-WSHA dataset. The IM-Wearable Smart Home Activities (IM-WSHA) dataset 37 is a comprehensive collection of signal data specifically designed to serve as a standard dataset for HAR. This dataset features three wearable Inertial Measurement Unit (IMU) sensors that capture three-axis accelerometer, gyroscope, and magnetometer data. The sampling frequency of the dataset is 100 Hz. To accurately capture individuals' movement patterns during their daily activities, the IMU sensors were positioned on different body parts, namely the thorax, femur, and wrist. The study involved ten participants, with an equal distribution of males and females, who performed a total of eleven distinct physical tasks within an indoor environment. These tasks encompassed common activities such as walking, exercising, cooking, drinking, talking on the phone, doing laundry, watching television, studying, brushing hair, using a laptop, and vacuum-cleaning.
Data pre-processing. The raw data acquired from sensors often contains measurement noise and additional unforeseen noise caused by the participant's dynamic movements during data collection. Noise distorts the usable information that the signal carries, so it is crucial to reduce its influence before further processing. Common filtering techniques used to address this issue include mean, low-pass, and wavelet filtering 38,39. In our work, we employed a third-order low-pass Butterworth filter with a cutoff frequency of 20 Hz across all three dimensions of the accelerometer, gyroscope, and magnetometer sensors for signal denoising. This choice of filter parameters is suitable for recording human motion since the energy content below 15 Hz accounts for 99.9% of the signal. Once the noise was removed, the filtered sensor data were transformed to prepare them for further analysis. In this phase, Min-Max normalization was employed to rescale each dataset's values to the range [0, 1]. This normalization is advantageous for learning techniques aiming to assess the effects of various factors.
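The denoising and normalization steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `denoise_and_normalize`, the zero-phase `filtfilt` application, the per-channel scaling, and the synthetic test signal are all assumptions. The 100 Hz sampling rate is the one stated for IM-WSHA.

```python
import numpy as np
from scipy.signal import butter, filtfilt


def denoise_and_normalize(signal, fs=100.0, cutoff=20.0, order=3):
    """Low-pass filter each channel, then rescale each channel to [0, 1].

    signal: array of shape (n_samples, n_channels).
    """
    # Third-order low-pass Butterworth filter; passing `fs` lets the cutoff be given in Hz.
    b, a = butter(order, cutoff, btype="low", fs=fs)
    # Assumption: the filter is applied forward and backward (zero phase).
    filtered = filtfilt(b, a, signal, axis=0)
    # Min-Max normalization, computed independently for every channel (assumption).
    lo = filtered.min(axis=0)
    hi = filtered.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against flat channels
    return (filtered - lo) / span


# Synthetic three-axis signal: a 1.5 Hz movement component plus white noise.
rng = np.random.default_rng(0)
t = np.arange(0, 5, 1 / 100.0)
raw = np.stack(
    [np.sin(2 * np.pi * 1.5 * t) + 0.3 * rng.standard_normal(t.size) for _ in range(3)],
    axis=1,
)
clean = denoise_and_normalize(raw)
```

In a cross-validated setting, the minimum and maximum would normally be estimated on the training portion only and reused for the test portion; the text does not say which convention was followed.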
During the data segmentation phase, the normalized data from all sensors is divided into equal-sized portions using fixed-size sliding windows. In this study, we chose a sliding window of 2.56 seconds, which resulted in sequences of sensory data with a specific length. These segmented portions are then used for model training.
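A fixed-size sliding window can be sketched as below. The 2.56 s window length comes from the text; the 50% overlap, the function name `sliding_windows`, and the nine-channel example array are assumptions made only for illustration.

```python
import numpy as np


def sliding_windows(data, fs, window_seconds=2.56, overlap=0.5):
    """Split (n_samples, n_channels) data into fixed-size windows.

    Returns an array of shape (n_windows, window_length, n_channels).
    """
    window = int(round(window_seconds * fs))
    # Assumption: consecutive windows overlap by `overlap` (the text does not state it).
    step = max(1, int(round(window * (1.0 - overlap))))
    starts = list(range(0, data.shape[0] - window + 1, step))
    if not starts:
        return np.empty((0, window, data.shape[1]))
    return np.stack([data[s:s + window] for s in starts])


# 10 s of hypothetical nine-channel data sampled at 100 Hz.
recording = np.zeros((1000, 9))
segments = sliding_windows(recording, fs=100.0)
```

At 100 Hz, a 2.56 s window holds 256 samples, so each segment has shape (256, 9).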
The proposed hybrid convolutional neural network. This research proposes an effective recognition model called ResNet-BiGRU-SE that uses motion signal data captured from smartphone sensors. The proposed method automatically extracts discriminative features from the sensor data inputs. ResNet-BiGRU-SE consists of a convolutional block and eight hybrid residual blocks, which extract standard spatial features. The model also includes a global average-pooling (GAP) layer, a flatten layer, and a fully connected layer, as illustrated in Fig. 2.
Convolutional block. CNNs typically employ a predefined set of elements and are commonly utilized for supervised learning. In these neural networks, each neuron is connected to every other neuron in the subsequent layers. The activation function converts the input value of a neuron into its output value. The effectiveness of the activation function is influenced by two important factors: sparsity and the network's ability to handle reduced gradient flow to its lower layers 40. In CNNs, pooling is often used for dimensionality reduction.
In this study, we utilized a convolutional block (ConvB) to process the raw sensor data and extract low-level features. The ConvB, as depicted in Fig. 2, consists of four layers: 1D-convolutional (Conv1D), batch normalization (BN), exponential linear unit (ELU), and max-pooling (MP). Conv1D employs multiple trainable convolutional kernels to capture different features, generating a feature map for each kernel. The BN layer stabilizes and accelerates the training process, while the ELU layer enhances the model's expressive capability. Finally, the MP layer reduces the size of the feature map while retaining the most significant characteristics.
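The four ConvB operations can be illustrated with a small NumPy forward pass. This is a sketch under stated assumptions, not the model's implementation: the kernel size of 3, the 16 filters, the pool size of 2, and the window shape are hypothetical, and `normalize_channels` only stands in for batch normalization.

```python
import numpy as np


def conv1d_valid(x, kernels, bias):
    """x: (time, in_ch); kernels: (k, in_ch, out_ch); bias: (out_ch,)."""
    k = kernels.shape[0]
    steps = x.shape[0] - k + 1
    out = np.empty((steps, kernels.shape[2]))
    for t in range(steps):
        # Each output step sums over the kernel window and the input channels.
        out[t] = np.tensordot(x[t:t + k], kernels, axes=([0, 1], [0, 1])) + bias
    return out


def normalize_channels(x, eps=1e-5):
    # Stand-in for BN: standardizes each channel with the statistics of this one
    # feature map (real BN uses batch statistics plus a learned scale and shift).
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)


def elu(x, alpha=1.0):
    # Identity for positive inputs, alpha * (exp(x) - 1) otherwise.
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))


def max_pool1d(x, pool=2):
    # Non-overlapping max-pooling along time; a trailing remainder is dropped.
    steps = x.shape[0] // pool
    return x[:steps * pool].reshape(steps, pool, x.shape[1]).max(axis=1)


def conv_block(x, kernels, bias):
    # Conv1D -> BN (stand-in) -> ELU -> MP, in the order given in the text.
    return max_pool1d(elu(normalize_channels(conv1d_valid(x, kernels, bias))))


rng = np.random.default_rng(0)
window = rng.standard_normal((128, 9))           # hypothetical window: 128 steps, 9 channels
kernels = 0.1 * rng.standard_normal((3, 9, 16))  # hypothetical: kernel size 3, 16 filters
features = conv_block(window, kernels, np.zeros(16))
```

The resulting feature map has one column per kernel and roughly half as many time steps as the input, which is the size reduction the MP layer is meant to provide.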
Structure of gated recurrent unit. The long short-term memory (LSTM) network was developed as an RNN-based approach to mitigate the exploding/vanishing gradient problem; however, its memory cells increase memory consumption 41. The gated recurrent unit (GRU) is a simpler variation of the LSTM that omits the separate memory cells 42. As seen in Fig. 3a, a GRU network has an update gate and a reset gate that control how each hidden state is updated, i.e., they determine which information flows to the next step and which does not. The GRU computes the hidden state h_t at time t from the output of the update gate z_t, the reset gate r_t, the current input x_t, and the prior hidden state h_{t−1}. In the governing equations, σ is the sigmoid function, ⊕ is element-wise addition, and ⊗ is element-wise multiplication.
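For reference, the standard GRU formulation that matches the symbols above can be written as follows. The weight matrices W and U, the biases b, and the candidate state g_t are notation assumed here; some references swap the roles of z_t and 1 − z_t in the last line, and the convention used in the original equations cannot be confirmed from the text.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t \oplus U_z h_{t-1} \oplus b_z\right) \\
r_t &= \sigma\left(W_r x_t \oplus U_r h_{t-1} \oplus b_r\right) \\
g_t &= \tanh\left(W_g x_t \oplus U_g \left(r_t \otimes h_{t-1}\right) \oplus b_g\right) \\
h_t &= \left(\left(1 - z_t\right) \otimes h_{t-1}\right) \oplus \left(z_t \otimes g_t\right)
\end{aligned}
```

Under this convention, an update gate close to zero carries the previous hidden state forward almost unchanged, while a reset gate close to zero makes the candidate state depend mainly on the current input.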
Schuster and Paliwal 43 introduced the bidirectional RNN (BiRNN) in 1997 to address a drawback of the conventional (unidirectional) RNN. In addition to the present input, the output at a given time step also incorporates past and future data. This is achieved by training the network in the forward and reverse time directions concurrently: the neurons of a regular RNN are divided into a portion responsible for the forward direction and a portion responsible for the reverse direction. Outputs of the forward portion are not connected to inputs of the reverse portion, and vice versa. This results in the general structure depicted in Fig. 3b, in which the forward and backward hidden states are computed separately and combined at each time step.
Hybrid residual block. Simple DL algorithms commonly employ convolution layers followed by fully connected layers for classification tasks, without incorporating shortcut connections. These architectures are known as sequential networks, where each layer passes data to the next layer. However, as the depth of a sequential network increases, vanishing or exploding gradients make such networks difficult to train effectively.
To overcome this problem, ResNet utilizes residual blocks, which add skip connections between blocks of convolutional layers. These skip connections improve gradient propagation and make it possible to train deeper CNNs by mitigating vanishing gradients. A residual layer can be represented as $R(x) = \mathrm{ELU}(f(x) + x)$, where x denotes the input, f(x) denotes the output of the stacked layers, ELU(·) denotes the exponential linear unit function, and R(x) denotes the residual block's output. The residual element f(x) is generated as two consecutive repetitions of three operations: convolution with a filter of size 3 × 1, batch normalization, and ELU activation. The f(x) feature map is then combined with the input x through the skip connection, and the ELU activation function is applied to the result.
In order to extract hybrid features hierarchically by incorporating both spatio-temporal and channel-wise data, we introduced the SEResidual block based on previous work by Muqeet et al. 44. As depicted in Fig. 4, this residual block consists of Conv1D layers, BN layers, ELU layers, SE modules, and shortcut connections with BiGRU. The inclusion of SE modules enhances the model's representational capacity by incorporating channel attention. Figure 4 illustrates the construction of an SE component. A convolution produces several feature maps, some of which may contain redundant information. The SE module performs feature recalibration to strengthen the discriminative feature maps and suppress the less useful ones. This module has two primary phases: squeezing and excitation.
First, the squeeze step aggregates the global information of each channel. Let U be the feature map of size C × H × W obtained after the convolution layer has been applied to X, so that H × W is the size of the map corresponding to one channel. A channel descriptor function, here global average pooling (GAP), compresses the feature map of each channel into a 1 × 1 feature map 45, producing one scalar that summarizes the channel globally. The procedure is given by Eq. (10), where U_c(i, j) is the feature map of channel c and F_squeeze is the channel descriptor function:
$$z_c = F_{squeeze}(U_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} U_c(i, j) \qquad (10)$$
Equation (11) describes the excitation stage, where z is the result of squeezing, W_i is the weight of the i-th FC layer, σ is the sigmoid function, and F_excite is the excitation mechanism. Because of the sigmoid, the output of the excitation stage lies between 0 and 1 and can therefore be used as a calibration weight. The current feature map U is multiplied by the newly derived weight s. The design of the squeeze and excitation stages in the SE block is shown in Fig. 4, along with the operation of the SE component implemented in this investigation.
In order to pass the activations to the side path network, the final step rescales the output U, giving X = [x_1, x_2, ..., x_n], where x_n = s_n U_n is the channel-wise multiplication of the scalar s_n by the feature map U_n. This procedure supplies adaptive weights to the feature channels, which is the basis of the SE block 46.
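The squeeze, excitation, and rescaling steps can be sketched for a one-dimensional feature map as follows. This is an illustrative sketch only: the function name `se_block`, the two-layer bottleneck with a reduction ratio of 4, the ReLU between the FC layers, and the random weights are assumptions, since the text specifies only GAP for squeezing and a sigmoid at the output of the excitation stage.

```python
import numpy as np


def se_block(u, w1, b1, w2, b2):
    """Recalibrate a feature map u of shape (time, channels).

    w1: (channels, hidden), w2: (hidden, channels) are the two FC layers.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    z = u.mean(axis=0)
    # Excitation: bottleneck FC layer with ReLU (assumed), then FC layer with sigmoid.
    hidden = np.maximum(0.0, z @ w1 + b1)
    s = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))
    # Rescale: every channel is multiplied by its weight s_c in (0, 1).
    return u * s, s


rng = np.random.default_rng(0)
u = rng.standard_normal((64, 16))               # hypothetical: 64 time steps, 16 channels
w1, b1 = rng.standard_normal((16, 4)), np.zeros(4)
w2, b2 = rng.standard_normal((4, 16)), np.zeros(16)
recalibrated, weights = se_block(u, w1, b1, w2, b2)
```

For sensor windows the spatial extent H × W reduces to the time axis, which is why the pooling here averages over axis 0 only.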
Hyperparameters. The DL process relies on the configuration of hyperparameters, which govern the learning procedure. For the ResNet-BiGRU-SE model, the following hyperparameters were set: (1) learning rate (α), (2) epochs, (3) batch size, (4) optimization method, and (5) loss function. The learning rate α was initially set to 0.001. Training ran for up to 200 epochs with a batch size of 128. If the validation loss did not improve for 30 consecutive epochs, training was stopped early. If the validation accuracy did not improve for six consecutive epochs, the learning rate was reduced to 75% of its initial value. To minimize errors, the Adam optimization algorithm 47 was employed with β1 = 0.9, β2 = 0.999, and ε = 1 × 10^−8. The categorical cross-entropy function 48 was used as the loss function, as it has demonstrated superior performance compared to classification error and mean squared error.
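The interaction between early stopping and learning-rate reduction can be illustrated with a framework-free simulation. This sketch makes one interpretive assumption: the 75% adjustment is applied multiplicatively to the current rate each time the validation accuracy stalls for six epochs, in the manner of reduce-on-plateau schedules; the wording in the text could also describe a single reduction. The function name and the validation curves are hypothetical.

```python
def simulate_training_schedule(val_losses, val_accuracies, initial_lr=0.001,
                               lr_factor=0.75, lr_patience=6, stop_patience=30):
    """Return (last epoch run, learning rate used in each epoch)."""
    lr = initial_lr
    best_loss, best_acc = float("inf"), float("-inf")
    epochs_since_loss, epochs_since_acc = 0, 0
    lr_history = []
    for epoch, (loss, acc) in enumerate(zip(val_losses, val_accuracies), start=1):
        lr_history.append(lr)  # rate in effect during this epoch
        if loss < best_loss:
            best_loss, epochs_since_loss = loss, 0
        else:
            epochs_since_loss += 1
        if acc > best_acc:
            best_acc, epochs_since_acc = acc, 0
        else:
            epochs_since_acc += 1
            if epochs_since_acc >= lr_patience:
                lr *= lr_factor        # assumed multiplicative reduction
                epochs_since_acc = 0
        if epochs_since_loss >= stop_patience:
            return epoch, lr_history   # early stopping
    return len(lr_history), lr_history


# Hypothetical curves: the first epoch is the best, then nothing improves.
losses = [0.5] + [0.6] * 199
accuracies = [0.9] + [0.8] * 199
stop_epoch, lr_history = simulate_training_schedule(losses, accuracies)
```

With these curves the run stops after epoch 31, well before the 200-epoch limit, and the learning rate has been reduced four times by the final epoch.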

Cross validation method. The k-fold cross-validation (k-CV) technique is a valuable method for estimating the performance of a classification model using multiple data subsets 49. This approach involves randomly dividing a dataset, obtained from either a single individual or multiple participants, into k non-overlapping subsets of approximately equal size. Each subset is then used to evaluate the classification model trained on the remaining k − 1 subsets. The overall effectiveness of the model is determined by computing the mean value of performance measures such as accuracy, precision, recall, and F-measure, obtained from the k-CV 50. This approach can be computationally demanding, particularly when dealing with large sample sizes or high values of k. In this study, we applied the k-CV technique with k set to 5, as depicted in Fig. 5, to assess the performance of the models.
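The partitioning described above can be sketched with NumPy. The function name `k_fold_indices`, the fixed random seed, and the sample count are assumptions for illustration; only k = 5 comes from the text.

```python
import numpy as np


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) for k non-overlapping, near-equal folds."""
    rng = np.random.default_rng(seed)
    # Shuffle once, then cut the permutation into k parts of nearly equal size.
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx


splits = list(k_fold_indices(103, k=5))
fold_sizes = [len(test_idx) for _, test_idx in splits]
```

Each sample appears in exactly one test fold, and the reported score is the mean of the k per-fold scores.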
Performance measurement. To evaluate the effectiveness of the proposed DL model, we employed a 5-CV procedure together with four widely used evaluation metrics: accuracy, precision, recall, and F-measure. These four indicators are defined as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F\text{-}measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
These four measures are commonly employed to evaluate the effectiveness of sensor-based HAR. For a considered category, a true positive (TP) is a sample of that category that is recognized correctly, while a true negative (TN) is a sample of any other category that is correctly not assigned to the considered category. A sample from another category that is misclassified as belonging to the considered category is a false positive (FP). Likewise, a sample of the considered category that is misclassified into another category is a false negative (FN). The pseudo-code for the HAR algorithm used in this study is described in Algorithm 1.
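As an illustration, the four measures can be computed from a multi-class confusion matrix as sketched below. The function name, the row/column convention (rows are true classes, columns are predicted classes), and the example counts are assumptions; accuracy is reported here as the overall fraction of correctly classified samples.

```python
import numpy as np


def classification_metrics(cm):
    """cm[i, j] = number of samples of true class i predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp  # predicted as the class, but belonging to another
    fn = cm.sum(axis=1) - tp  # belonging to the class, but predicted as another

    def safe_div(num, den):
        # Returns 0 where the denominator is 0 instead of raising a warning.
        return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f_measure = safe_div(2 * precision * recall, precision + recall)
    return {
        "accuracy": tp.sum() / cm.sum(),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical two-class confusion matrix.
metrics = classification_metrics([[8, 2], [1, 9]])
```

The per-class arrays can be averaged (for example, with an unweighted mean) to obtain a single figure per metric; the text does not state which averaging scheme was used.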

Experiments and results
In this section, we present the studies conducted to determine the most efficient CNN models for sensor-based HAR. Our research focused on three benchmark smartphone sensing datasets, namely UCI-HAR, WISDM, and IM-WSHA datasets, which are commonly used for HAR tasks. The performance of the DL models was evaluated using accuracy and F-measure, which are widely recognized metrics for assessing model effectiveness in HAR applications.
In the investigation, we compared the CNN backbone models VGG16 51, ResNet18 52, PyramidNet18 53, Inception-V3 54, Xception 55, and Inception-ResNet 34. These models were originally proposed for image recognition; consequently, we adapted their architectures for HAR. We then compared the recognition capabilities of these CNN models with those of our proposed model.
Experiment setting. This research utilized Google Colab Pro+ with a Tesla V100-SXM2-16GB graphics processor to accelerate the training of DL models. The ResNet-BiGRU-SE and the other DL models were implemented in Python with TensorFlow and CUDA backends. The experiments relied on the following Python libraries:
• Numpy and Pandas were used for managing data while retrieving, processing, and analyzing sensor data.
• Matplotlib and Seaborn were applied for charting and presenting the results of data exploration and model evaluation.
• Scikit-learn (Sklearn) was utilized for data sampling and data generation in the experiments.
• TensorFlow, Keras, and TensorBoard were used to build and train the DL models.
Experimental results. In this study, we assessed the proposed framework by comparing it to baseline DL algorithms using three publicly available datasets: UCI-HAR, WISDM, and IM-WSHA. The following subsections present the experimental findings of these DL methods trained on smartphone sensing data from these benchmark datasets.
In the first experiment, we evaluated the performance of the proposed ResNet-BiGRU-SE model on the UCI-HAR dataset. The results are summarized in Table 2. The findings indicate that the proposed model outperforms other CNN models, achieving an impressive average accuracy of 98.92% and an F-measure of 98.99%. It is noteworthy that the proposed model has a relatively small number of training parameters, with only 127,814 values. This demonstrates its efficiency despite its compact design.
The results presented in Table 3 are obtained from the second experiment, conducted on the WISDM dataset. These findings demonstrate that the proposed ResNet-BiGRU-SE model outperforms the other CNN models, achieving an average accuracy of 98.80% and an F-measure of 98.62%.
The results of the third experiment, conducted on the IM-WSHA dataset, are presented in Table 4. They show that the ResNet-BiGRU-SE model outperforms the other CNN models, achieving an accuracy of 98.45% and an F-measure of 97.60%. These results highlight the model's ability to accurately classify the activities in the IM-WSHA dataset.

Discussion
Comparison results with state-of-the-art models. We conducted a comprehensive comparison of our proposed model with state-of-the-art DL models in the field of sensor-based HAR. In Table 5, we compared our ResNet-BiGRU-SE network with several other DL techniques, namely 1D-CNN 56, Bidir-LSTM 57, CNN-LSTM 58, SDAE 59, and CNN-GRU 60. Each of these models was implemented according to the description in its respective study.
Notably, our ResNet-BiGRU-SE model achieved an accuracy of 98.92% on the UCI-HAR dataset, surpassing all the other models, as the comparative results in Table 5 show.
In our evaluation using the WISDM dataset, we compared our proposed model with state-of-the-art DL algorithms for sensor-based HAR. Table 6 provides a comprehensive comparison of our ResNet-BiGRU-SE network with several other DL approaches, including LSTM 61, CNN with statistical features 14, U-Net 62, and CNN-GRU 60. Each of these models was implemented based on the descriptions provided in their respective publications.
Remarkably, our suggested ResNet-BiGRU-SE model achieved an impressive accuracy of 98.80% on the WISDM dataset, outperforming all the other models. This outstanding performance further highlights the superiority of our proposed ResNet-BiGRU-SE model in accurately classifying activities based on the WISDM dataset.
The findings from our study support our hypothesis that our hybrid DL model, which combines local spatio-temporal characteristics with long-term contextual understanding, improves the comprehension of sensor data and ultimately enhances the performance of activity classification. Additionally, the results suggest that deep residual models perform well when applied to raw signals, and that the inclusion of BiGRU and SE modules further enhances the effectiveness of HAR for real-life human motion detection. These findings highlight the significance of incorporating both architectural enhancements and feature extraction techniques in order to achieve optimal results in HAR applications. Table 7 provides a comprehensive evaluation of various advanced techniques on the IM-WSHA dataset. One approach utilized a reweighted genetic algorithm (GA) to combine statistical and frequency features.
Influence of validation methods. Sensor-based HAR studies commonly employ three validation techniques: hold-out validation 65, k-CV, and Leave-One-Subject-Out cross-validation (LOSO) 50. Hold-out validation involves dividing the dataset into a training set (typically 70% of the data) and a test set (30% of the data).
On the other hand, k-CV repeatedly partitions the dataset into k subsets for training and testing, evaluating the algorithm's performance k times. LOSO validation creates a training set with n − p samples and a testing set with p samples, where p represents all data from a single subject. This approach ensures that there is no overlap between subjects in the training and testing sets.
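The subject-wise partitioning used by LOSO can be sketched as follows. The function name `loso_splits` and the toy subject labels are assumptions for illustration.

```python
import numpy as np


def loso_splits(subject_ids):
    """Yield (held_out_subject, train_idx, test_idx), one split per subject."""
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):
        test_mask = subject_ids == subject
        # All windows of the held-out subject go to the test set; none stay in training.
        yield subject, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# Hypothetical subject label for each of six segmented windows.
subjects = [1, 1, 2, 2, 2, 3]
folds = list(loso_splits(subjects))
```

Because the split is made on subject labels rather than on individual windows, windows cut from the same recording can never land on both sides of the split, which is what distinguishes LOSO from a random k-fold partition of segmented data.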
To assess the impact of these validation techniques, we conducted supplementary investigations using three HAR datasets: UCI-HAR, WISDM, and IM-WSHA. We evaluated the effectiveness of the ResNet-BiGRU-SE model and presented the results in Fig. 6. These evaluations allowed us to determine how different validation techniques influenced the performance of our proposed model in HAR tasks.
The findings presented in Fig. 6 clearly demonstrate the significant impact of validation approaches on the effectiveness of HAR. Among the three benchmark datasets, the ResNet-BiGRU-SE model, which utilized k-CV, achieved the highest levels of accuracy. However, it is important to note that this result may be influenced by the fact that the k-CV method does not consider scenarios where all samples come from a single study participant. This issue arises due to the time series segmentation used in the pre-processing phase. In a generalized HAR implementation, independently dividing the dataset can lead to instances where a participant's data appears in both the training and test sets simultaneously, causing data leakage that artificially inflates the classifier's accuracy.
On the other hand, when adopting the LOSO approach, which takes into account individual-specific data (i.e., subject labels), the accuracy of the classification model tends to decrease. The implementation of this improved assessment approach resulted in a 12% reduction in accuracy, indicating a preliminary overestimation of the results. It is important to consider these factors when selecting a validation approach to ensure accurate and reliable performance evaluation in HAR tasks.

Misclassification. To analyze the misclassification patterns of the suggested model, we conducted a comprehensive examination of the confusion matrices generated by the ResNet-BiGRU-SE model on three different HAR datasets: UCI-HAR, WISDM, and IM-WSHA. These datasets contain activity data collected from a variety of sensor categories, as summarized in Table 1. By studying the confusion matrices, we can gain insights into the specific activities that are frequently misclassified by the model and identify potential areas for improvement.
Regarding the UCI-HAR dataset, the categories of "sitting" and "standing" exhibited the highest frequency of misclassifications. This can be attributed to the similarity in linear acceleration patterns observed in these static actions 66. However, the ResNet-BiGRU-SE model performed well in accurately categorizing the other four activities, as shown in Fig. 7a. Moving on to the WISDM dataset, Fig. 7b presents the confusion matrix of the model. The classification of "walk upstairs" and "walk downstairs" resulted in the highest number of errors, likely because these two activities, although different forms of physical movement, produce similar sensor patterns.
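Reading off the most frequently confused activity pairs from a confusion matrix can be sketched as below. The function name and the counts are hypothetical and are not taken from Fig. 7; rows are assumed to be true classes and columns predicted classes.

```python
import numpy as np


def most_confused_pairs(cm, labels, top=3):
    """Return up to `top` (true label, predicted label, count) triples, largest first."""
    off_diagonal = np.asarray(cm).copy()
    np.fill_diagonal(off_diagonal, 0)  # ignore correct predictions
    order = np.argsort(off_diagonal, axis=None)[::-1][:top]
    pairs = []
    for flat_index in order:
        i, j = np.unravel_index(flat_index, off_diagonal.shape)
        if off_diagonal[i, j] > 0:
            pairs.append((labels[i], labels[j], int(off_diagonal[i, j])))
    return pairs


# Hypothetical counts for three activities.
labels = ["sitting", "standing", "walking"]
cm = [[95, 5, 0],
      [7, 93, 0],
      [0, 1, 99]]
pairs = most_confused_pairs(cm, labels)
```

Ranking the off-diagonal cells in this way makes patterns such as the sitting/standing confusion visible without inspecting the full matrix by eye.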