Introduction

Human action recognition plays a vital role in distinguishing particular behaviors of interest in videos. It has critical applications, including visual surveillance for detecting suspicious human activities to prevent fatal accidents1,2 and automated driving, where sensing and predicting human behavior enables safe navigation3,4. In addition, there is a wide range of non-trivial applications such as human–machine interaction5,6, video retrieval7, crowd scene analysis8 and identity recognition9.

In the early days, the majority of research in Human Activity Recognition relied on hand-crafted methods10,11,12. However, as deep learning technology evolved and gained increasing recognition in the research community, a multitude of new techniques have been proposed, achieving remarkable results.

Action recognition shares properties with image recognition, since both fields handle visual content. In addition, action recognition classifies not only still images but also the dynamic temporal information carried by a sequence of images. Built on these intrinsic characteristics, action recognition methods can be grouped into two main approaches, namely the recurrent neural network (RNN) based approach and the 3-D ConvNet based approach. Besides these main ones, other methods utilize both spatial and temporal content and are collectively known as the two-stream 2-D ConvNet based approach13.

Initially, action recognition was viewed as a natural extension of image recognition, and spatial features from still frames could be extracted using ConvNets, which are among the most efficient techniques in the image recognition field. However, traditional ConvNets are only capable of processing a single 2-D image at a time. To handle multiple 2-D images, the neural network architecture needs to be re-designed, including adding an extra dimension to operations such as convolution and pooling to accommodate 3-D inputs. Examples of such techniques include C3D14, I3D15, R3D16, S3D17, T3D18 and LTC19, among others.

Similarly, since a video is primarily a temporal sequence, techniques for sequential data, such as Recurrent Neural Networks and specifically Long Short-Term Memory, can be utilized to analyze the temporal information. Because raw frames are large, feature extraction is often employed first. Long-term Recurrent Convolutional Networks (LRCN)20 and Beyond-Short-Snippets21 were among the first attempts to extract feature maps from 2-D ConvNets and integrate them with LSTMs to make video predictions. Other works have adopted bi-directional LSTMs22,23, which are composed of two separate LSTMs, to explore both forward and backward temporal information.

To further improve performance, other researchers argue that videos usually contain repetitive or even hard-to-classify frames, which makes computation expensive. Selecting relevant frames can therefore improve action recognition both in efficiency and in accuracy24. A similar concept based on attention mechanisms has been the main focus of recent research aimed at boosting the overall performance of ConvNet-LSTM frameworks25,26.

While RNNs perform well in this field, they process data sequentially, meaning that information flows from one state to the next; this hinders the ability to speed up training through parallelization and causes the architectures to grow in size. These issues limit the application of RNNs to longer sequences. In light of these challenges, a new approach, the Transformer, emerged27,28,29,30,31.

There has been rapid advancement in action recognition in recent years, from 3-D ConvNets to 2-D ConvNet-LSTMs, two-stream ConvNets and, more recently, Transformers. While these advancements have brought many benefits, they have also exposed a critical issue: earlier techniques struggle to keep up with this rapidly changing pace. Although techniques such as evolutionary computation offer a crucial mechanism for architecture search in image recognition, and swarm intelligence provides a straightforward method to improve performance, they remain largely unexplored in the realm of action recognition.

In our recent research32, we developed a dynamic Particle Swarm Optimization (PSO) framework for image classification. In this framework, each particle navigates the loss landscape, exchanging information with neighboring particles about its current estimate of the geometry (such as the gradient of the Loss function) and its position. The overall goal of the framework is to create a distributed, collaborative algorithm that improves optimization performance by guiding some of the particles toward the best minimum of the loss function. We extend this framework to action recognition by incorporating state-of-the-art methods for temporal data (Transformer and RNN) with the ConvNet module in an end-to-end training setup.

Compared to our previous publication, we have supplemented a more comprehensive review of the literature on Human Action Recognition and have implemented the following enhancements and additions to our work:

  1. We have introduced an improved and novel network architecture that extends a PSO-ConvNet to a PSO-ConvNet Transformer (or PSO-ConvNet RNN) in an end-to-end fashion.

  2. We have expanded the scope of Collaborative Learning as a broader concept beyond its original application in image classification to include action recognition.

  3. We have conducted additional experiments on challenging datasets to validate the effectiveness of the modified model.

These improvements and additions contribute significantly to the overall strength and novelty of our research.

The rest of the article is organized as follows: In Sect. 2, we discuss relevant approaches in applying Deep Learning and Swarm Intelligence to HAR. The proposed methods, namely Collaborative Learning with Dynamic Neural Networks, the ConvNet Transformer architecture and the ConvNet RNN model, are introduced in Sects. 3.1, 3.2 and 3.3, respectively. The results of the experiments, the extension of the experiments and the discussion are presented in Sects. 4, 5 and 6. Finally, we conclude our work in Sect. 7.

Related works

In recent years, deep learning (DL) has achieved great success in computer vision fields, e.g., object detection, image classification and action recognition24,30,33. One consequence of this success has been a sharp increase in investment in the search for good neural network architectures. An emerging and promising approach is the shift from manual design to automatic Neural Architecture Search (NAS). As an essential part of automated machine learning, NAS automatically generates neural networks that have led to state-of-the-art results34,35,36. Among the various approaches for NAS already present in the literature, evolutionary search stands out as one of the most remarkable methods. For example, beginning with just a single-layer neural network, the model develops into a competitive architecture that outperforms contemporary counterparts34. Similarly, the efficacy of a proposed classification system for HAR on the UCF-50 dataset was demonstrated33 by initializing the weights of a convolutional neural network classifier with solutions generated by genetic algorithms (GA).

In addition to Genetic Algorithms, Particle Swarm Optimization—a population-based stochastic search method influenced by the social behavior of flocking birds and schooling fish—has proven to be an efficient technique for feature selection37,38. A novel approach that combines a modified Particle Swarm Optimization with Back-Propagation was put forth for image recognition by adjusting the inertia weight, acceleration parameters, and velocity39. This fusion allows for dynamic and adaptive tuning of the parameters between global and local search capability, and promotes diversity within the swarm. In catfish particle swarm optimization, the particle with the worst fitness is introduced into the search space when the fitness of the global best particle has not improved after a number of consecutive iterations40. Moreover, a PSO-based multi-objective method for discriminative feature selection was introduced to enhance classification41.

There have been several efforts to apply swarm intelligence to action recognition from video. One such approach employs a combination of binary histogram, Harris corner points, and wavelet coefficients as features extracted from the spatiotemporal volume of the video sequence42. To minimize computational complexity, the feature space is reduced through the use of PSO with a multi-objective fitness function.

Furthermore, another approach combining Deep Learning and swarm intelligence-based metaheuristics for Human Action Recognition was proposed43. Here, four different types of features extracted from skeletal data—Distance, Distance Velocity, Angle, and Angle Velocity—are optimized using the nature-inspired Ant Lion Optimizer metaheuristic to eliminate non-informative or misleading features and decrease the size of the feature set.

Ideas of applying pure Natural Language Processing techniques to Computer Vision have appeared in recent years29,30,44. By using sequences of image patches with a Transformer, such models29 can perform especially well on image classification tasks. Similarly, the approach was extended to HAR with sequences of frames30. In “Video Swin Transformer”31, the image is divided into regularly shaped windows and a Transformer block is applied to each one. The approach was found to outperform the factorized models in efficiency by taking advantage of the inherent spatiotemporal locality of videos, where pixels that are closer to each other in spatiotemporal distance are more likely to be related. In our study, we adopt a different approach by utilizing features extracted from a ConvNet rather than the original images. This choice allows us to reduce computational expense without compromising efficiency, as detailed in Sect. 3.2.

The Temporal Correlation Module (TCM)45 utilizes fast-tempo and slow-tempo information and adaptively enhances expressive features, and the Temporal Segment Network (TSN) was introduced to further improve the results of the two-stream architecture46. The spatiotemporal vector of locally aggregated descriptors (ActionS-ST-VLAD) approach is designed to aggregate relevant deep features over the entire video based on adaptive video feature segmentation and adaptive segment feature sampling (AVFS-ASFS), in which key-frame features are selected47. Moreover, the concept of using temporal differences can be found in several works48,49,50. The Temporal Difference Networks (TDN) approach targets both finer local and long-range global motion information, i.e., for local motion modeling, temporal differences over consecutive frames are utilized, whereas for global motion modeling, temporal differences across segments are integrated to capture long-range structure48. The SpatioTemporal and Motion Encoding (STM) approach proposes an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to represent spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features, in which a 2-D channel-wise convolution is applied to two consecutive frames and the results are subtracted to obtain an approximate motion representation49.

Other related approaches that can be mentioned include Zero-Shot Learning, Few-Shot Learning, and Knowledge Distillation Learning51,52,53. Zero-Shot Learning and Few-Shot Learning provide techniques for understanding domains with limited data availability. Similar to humans, who can identify similar objects within a category after seeing only a few examples, these approaches enable the model to generalize and recognize unseen or scarce classes. In our proposed approach, we introduce the concept of Collaborative Learning, where particles collaboratively train in a distributed manner.

Despite these advances, the field remains largely uncharted, especially with respect to recent and emerging techniques.

Proposed methods

Collaborative dynamic neural networks

Define \(\mathscr {N}(n,t)\) as the set of k nearest neighbor particles of particle n at time t, where \(k\in \mathbb {N}\) is some predefined number. In particular,

$$\begin{aligned} \mathscr {N}(n,t) = \left\{ (x^{(n)}(t),v^{(n)}(t)),\, (x^{(i_1)}(t),v^{(i_1)}(t)),\, (x^{(i_2)}(t),v^{(i_2)}(t)),\, \ldots ,\, (x^{(i_k)}(t),v^{(i_k)}(t))\right\} \end{aligned}$$
(1)

where \(i_1\), \(i_2, \ldots i_k\) are the k closest particles to n and \(x^{(i_k)}(t)\) and \(v^{(i_k)}(t)\in \mathbb {R}^D\) represent the position and velocity of particle \(i_k\) at time t. Figure 1 illustrates this concept for \(k=4\) particles.
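
For illustration, the neighborhood \(\mathscr {N}(n,t)\) can be computed directly from the particle positions. The following minimal NumPy sketch (variable names are ours and purely illustrative) returns the indices of the k closest particles to particle n; the pair \((x^{(n)}(t),v^{(n)}(t))\) itself is then appended to form the full set.

```python
import numpy as np

def nearest_neighbors(positions, n, k):
    """Indices of the k particles closest to particle n (Euclidean norm), excluding n."""
    dists = np.linalg.norm(positions - positions[n], axis=1)
    dists[n] = np.inf                      # exclude particle n itself from the ranking
    return np.argsort(dists)[:k]

# Example: 10 particles in a D = 5 dimensional weight space, k = 4 neighbors
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))               # row i holds x^{(i)}(t)
print(nearest_neighbors(X, n=0, k=4))
```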

Figure 1

A demonstration of the \(\mathscr {N}(n,t)\) neighborhood, consisting of the positions of four closest particles and particle n itself, is shown. The velocities of the particles are depicted by arrows.

Given a (continuous) function \(L\,:\,\mathbb {R}^D\longrightarrow \mathbb {R}\) and a (compact) subset \(S\subset \mathbb {R}^D\), define

$$\begin{aligned} \mathscr {Y}=\textsf{argmin}\left\{ L(y)\,:\,y\in S\right\} \end{aligned}$$
(2)

as the subset of points that minimize L in S, i.e., \(L(z)\le L(w)\) for any \(z\in \mathscr {Y}\subset S\) and \(w\in S\).

Dynamic 1 We investigate a set of neural networks that work together in a decentralized manner to minimize a Loss function L. The training process comprises two phases: (1) individual training of each neural network using (stochastic) gradient descent, and (2) a combined phase of SGD and PSO-based cooperation. The weight vector of each neural network is represented as the position of a particle in a D-dimensional phase space, where D is the number of weights. The evolution of the particles (or neural networks) is governed by Eq. (3), with the update rule specified by the following dynamics:

$$\begin{aligned} \begin{array}{ccl} \psi ^{(n)}(t+1) &{} = &{} -\eta \nabla L\left( x^{(n)}(t)\right) \\ \phi ^{(n)}(t+1) &{} = &{} x^{(n)}(t)+\psi ^{(n)}(t+1)\\ v^{(n)}(t+1) &{} = &{} \sum \limits _{\ell \in \mathscr {N}(n,t)} w_{n\ell }\, \psi ^{(\ell )}(t+1) + c_1 r(t)\left( P^{(n)}(t)-\phi ^{(n)}(t+1)\right) + c_2 r(t)\left( P_g^{(n)}(t)-\phi ^{(n)}(t+1)\right) \\ x^{(n)}(t+1) &{} = &{} x^{(n)}(t)+v^{(n)}(t) \end{array} \end{aligned}$$
(3)

where \(v^{(n)}(t)\in \mathbb {R}^{D}\) is the velocity vector of particle n at time t; \(\psi ^{(n)}(t)\) is an intermediate velocity computed from the gradient of the Loss function at \(x^{(n)}(t)\); \(\phi ^{(n)}(t)\) is the intermediate position computed from the intermediate velocity \(\psi ^{(n)}(t)\); \(r(t)\overset{i.i.d.}{\sim }\textsf{Uniform}\left( \left[ 0,1\right] \right)\) is randomly drawn from the interval \(\left[ 0,1\right]\) and we assume that the sequence r(0), r(1), r(2), \(\ldots\) is i.i.d.; \(P^{(n)}(t)\in \mathbb {R}^D\) represents the best position visited up until time t by particle n, i.e., the position with the minimum value of the Loss function over all previous positions \(x^{(n)}(0),\,x^{(n)}(1),\,\ldots ,\,x^{(n)}(t)\); \(P_{g}^{(n)}(t)\) represents its nearest-neighbors’ counterpart, i.e., the best position across all previous positions of the particle n jointly with its corresponding nearest-neighbors \(\bigcup _{s\le t} \mathscr {N}\left( n,s\right)\) up until time t:

$$\begin{aligned} \begin{array}{ccl} P^{(n)}(t+1) &{} \in &{} \textsf{argmin}\left\{ L(y)\,:\,y=P^{(n)}(t),\,x^{(n)}(t)\right\} \\ P_{g}^{(n)}(t+1) &{} \in &{} \textsf{argmin}\left\{ L(y)\,:\,y=P_{g}^{(n)}(t),\,x^{(k)}(t);\; k\in \mathscr {N}(n,t)\right\} \end{array}. \end{aligned}$$
(4)

The weights \(w_{n\ell }\) are defined as

$$\begin{aligned} w_{n\ell }= f\left( \left| \left| x^{(n)}(t)-x^{(\ell )}(t)\right| \right| \right) , \end{aligned}$$
(5)

with \(\left| \left| \cdot \right| \right|\) being the Euclidean norm and \(f:\,\mathbb {R}\rightarrow \mathbb {R}\) being a decreasing (or at least non-increasing) function. In Dynamic 1, we assume that

$$\begin{aligned} f(z)= \frac{M}{\left( 1+z\right) ^{\beta }}, \end{aligned}$$
(6)

for some constants \(M,\beta >0\). This strengthens the collaborative learning between any two particles: the closer two particles are, the more strongly each influences the other.
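
As a concrete reference, the following sketch performs one update of Eq. (3) for a single particle, using the weights of Eq. (6). It is a minimal NumPy transcription rather than the actual training code of our system: hyperparameter values are illustrative, and the position is advanced with the freshly computed velocity, following the usual PSO convention.

```python
import numpy as np

def dynamic1_step(x, P, Pg, grad, neighbors, n, eta, c1, c2, M, beta, rng):
    """One update of Eq. (3) for particle n.

    x, P, Pg, grad map particle indices to D-dimensional arrays (position, personal best,
    neighborhood best, current gradient estimate); `neighbors` is N(n, t) and includes n.
    """
    psi = {l: -eta * grad[l] for l in neighbors}                 # per-particle SGD step
    phi = x[n] + psi[n]                                          # intermediate position of n
    w = {l: M / (1.0 + np.linalg.norm(x[n] - x[l])) ** beta      # Eq. (6) weights
         for l in neighbors}
    r = rng.uniform()                                            # r(t) ~ Uniform([0, 1])
    v_new = (sum(w[l] * psi[l] for l in neighbors)
             + c1 * r * (P[n] - phi)
             + c2 * r * (Pg[n] - phi))
    x_new = x[n] + v_new        # position advanced with the new velocity (PSO convention)
    return x_new, v_new
```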

Dynamic 2 An alternative to Eq. (3) is to pull a particle back instead of pushing it in the direction of the gradient. In Dynamic 1, the assumption was that all particles were located on the same side of a valley of the loss function. However, if one particle is on the opposite side of the valley relative to the rest of the particles, the first dynamic will pull it further away from the minimum. To address this issue, we introduce a second dynamic (Dynamic 2) that pulls the particle back. The formula for this dynamic is as follows:

$$\begin{aligned} x_{i}(t+1) = x_{i}(t) + \sum _{j=1}^N \frac{M_{ij}}{\left( 1+\left| \left| x_i(t)-x_j(t)\right| \right| ^2\right) ^\beta } \left( x_j(t) - \nabla L(x_j(t))\right) + c\, r(t)\left( P_{nbest(i)}(t)-x_{i}(t)\right) \end{aligned}$$
(7)

where \(x_{i}(t)\in \mathbb {R}^{D}\) is the position of particle i at time t; M, \(\beta\) and c are constants set by experiment, with \(\left| \left| \cdot \right| \right|\) being the Euclidean norm; \(r(t)\overset{i.i.d.}{\sim }\textsf{Uniform}\left( \left[ 0,1\right] \right)\) is randomly drawn from the interval \(\left[ 0,1\right]\) and we assume that the sequence r(0), r(1), r(2), \(\ldots\) is i.i.d.; \(P_{nbest(i)}(t)\in \mathbb {R}^D\) represents the nearest-neighbors’ best, i.e., the best position across all previous positions of particle i jointly with its corresponding nearest neighbors \(\bigcup _{s\le t} \mathscr {N}\left( i,s\right)\) up until time t.
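
The sketch below is a literal transcription of Eq. (7), treating \(M_{ij}\) as the single constant M described in the text; as before, variable names and values are illustrative only.

```python
import numpy as np

def dynamic2_step(x, grad, P_nbest, i, M, beta, c, rng):
    """One update of Eq. (7) for particle i; x, grad, P_nbest are lists of D-dim arrays."""
    r = rng.uniform()
    pull = sum(M / (1.0 + np.linalg.norm(x[i] - x[j]) ** 2) ** beta
               * (x[j] - grad[j])                 # x_j(t) - grad L(x_j(t))
               for j in range(len(x)))
    return x[i] + pull + c * r * (P_nbest[i] - x[i])
```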

ConvNet transformer architecture for action recognition

In this section, we discuss a hybrid ConvNet-Transformer architecture that replaces the traditional ConvNet-RNN block for temporal input to classify human action in videos. The architecture is composed of several components, including a feature extraction module using ConvNet, a position embedding layer, multiple transformer encoder blocks, and classification and aggregation modules. The overall diagram of the architecture can be seen in Fig. 2. The goal of the architecture is to effectively capture the temporal information present in the video sequences, in order to perform accurate human action recognition. The hybrid ConvNet-Transformer design leverages the strengths of both ConvNets and Transformers, offering a powerful solution for this challenging task.

Features extraction via ConvNet and position embedding

In the early days of using Transformers for visual classification, especially for images, the images were typically divided into smaller patches that served as the primary input31,54,55. However, these inputs were often quite large, leading to high computational requirements for the Transformer. To balance efficiency and accuracy, a ConvNet can be utilized to extract crucial features from images, reducing the size of the input without sacrificing performance.

We assume that, for each frame, the extracted features from the ConvNet have a size of (w, h, c), where w and h are the width and height of a 2-D feature map and c is the number of filters. To further reduce the size of the features, global average pooling is applied, reducing the size from \(w \times h \times c\) to c.
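
As an illustration, the per-frame feature extractor can be written in a few lines of Keras. The sketch below assumes a DenseNet-201 backbone (c = 1920 filters) and a 224 × 224 input; in practice the corresponding preprocess_input function would be applied to the frames first.

```python
import tensorflow as tf

# Per-frame features: (w, h, c) maps reduced to a c-length vector by global average pooling.
backbone = tf.keras.applications.DenseNet201(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
extractor = tf.keras.Sequential([
    backbone,
    tf.keras.layers.GlobalAveragePooling2D(),   # (w, h, c) -> (c,)
])

frames = tf.random.uniform((16, 224, 224, 3))   # a 16-frame clip (dummy data)
features = extractor(frames)                    # shape (16, 1920)
```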

The position encoding mechanism in the Transformer is used to encode the position of each frame in the sequence. The position encoding vector, which has the same size as the feature vector, is added to the feature, and its values are computed using the following formulas. This differs from the sequential processing of data in an RNN block and allows all entities in the sequence to be handled in parallel.

$$\begin{aligned} \begin{array}{ccl} PE_{(pos,2i)} &{} = &{} sin(pos/{10000^{2i/{d_{model}}}})\\ PE_{(pos,2i+1)} &{} = &{} cos(pos/{10000^{2i/{d_{model}}}}) \end{array} \end{aligned}$$
(8)

where pos is the time step index of the input vector, i is the dimension index and PE is the positional encoding matrix; \(d_{model}\) refers to the length of the position encoding vector.
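
A compact NumPy implementation of Eq. (8) is shown below; the resulting (sequence length × \(d_{model}\)) matrix is added element-wise to the frame-feature matrix before the encoder.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position encoding of Eq. (8): one d_model-length vector per time step."""
    pos = np.arange(seq_len)[:, None]                     # (seq_len, 1)
    i = np.arange(d_model)[None, :]                       # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                 # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                 # odd dimensions
    return pe

pe = positional_encoding(seq_len=16, d_model=1920)        # added to the 16 x 1920 features
```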

Figure 2

Rendering end-to-end ConvNet-Transformer architecture.

Transformer encoder

The Transformer Encoder is a key component of the hybrid ConvNet-Transformer architecture. It consists of a stack of N identical layers, each comprising multi-head self-attention and position-wise fully connected feed-forward network sub-layers. To ensure the retention of important input information, residual connections are employed before each operation, followed by layer normalization.

The core of the module is the multi-head self-attention mechanism, which is composed of several self-attention blocks. This mechanism is similar to RNN, as it encodes sequential data by determining the relevance between each element in the sequence. It leverages the inherent relationships between frames in a video to provide a more accurate representation. Furthermore, the self-attention operates on the entire sequence at once, resulting in significant improvements in runtime, as the computation can be parallelized using modern GPUs.

Our architecture employs only the encoder component of a full transformer, as the goal is to obtain a classification label for the video action rather than an output sequence. The full transformer consists of both encoder and decoder modules; in our case, however, the encoder module alone suffices to achieve the desired result.

The input sequence (\(X={x_1,x_2, \ldots ,x_n}\)) is first projected onto three trainable weight matrices \(W_Q\), \(W_K\) and \(W_V\) to obtain the query (\(Q = XW_Q = q_1,q_2, \ldots ,q_n\)), the key (\(K = XW_K = k_1,k_2, \ldots ,k_n\)) of dimension \(d_k\), and the value (\(V = XW_V = v_1,v_2, \ldots ,v_n\)) of dimension \(d_v\). The output of self-attention is then computed as follows:

$$\begin{aligned} Attention(Q,K,V) = softmax\left( \frac{QK^T}{\sqrt{d_k}}\right) V. \end{aligned}$$
(9)

As the name suggests, multi-head attention is composed of several heads, whose outputs are concatenated and fed into another linear projection to produce the final output as follows:

$$\begin{aligned} MultiHead(Q,K,V) = Concat(head_1,head_2, \ldots ,head_h)W^O, \end{aligned}$$
(10)
$$\begin{aligned} \text {where } head_i = Attention(QW^Q_i,KW^K_i,VW^V_i), \end{aligned}$$
(11)

where the parameter matrices are \(W^Q_i\in \mathbb {R}^{d_{model} \times d_k}\), \(W^K_i\in \mathbb {R}^{d_{model} \times d_k}\), \(W^V_i\in \mathbb {R}^{d_{model} \times d_v}\) and \(W^O\in \mathbb {R}^{hd_v \times d_{model}}\) for \(i=1,2, \ldots ,h\), with h denoting the number of heads.
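
For reference, one encoder block of this kind can be sketched in Keras as follows. The layer sizes (key dimension, feed-forward width, dropout) are illustrative; only the number of heads follows the setting discussed later in the hyperparameter experiments.

```python
import tensorflow as tf
from tensorflow.keras import layers

def transformer_encoder(x, num_heads=6, key_dim=64, ff_dim=128, dropout=0.1):
    """Multi-head self-attention + position-wise feed-forward, each followed by a
    residual connection and layer normalization, as described above."""
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim,
                                     dropout=dropout)(x, x)
    x = layers.LayerNormalization(epsilon=1e-6)(x + attn)
    ff = layers.Dense(ff_dim, activation="relu")(x)
    ff = layers.Dense(x.shape[-1])(ff)
    return layers.LayerNormalization(epsilon=1e-6)(x + ff)

inputs = tf.keras.Input(shape=(16, 1920))       # 16 frames x 1920 ConvNet features
outputs = transformer_encoder(inputs)           # in practice, N such blocks are stacked
encoder = tf.keras.Model(inputs, outputs)
```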

Frame selection and data pre-processing

Input videos with varying numbers of frames can pose a challenge for a model that requires a fixed number of inputs. Put simply, to process a video sequence, we incorporate a time-distributed layer that requires a predetermined number of frames. To address this issue, we employ several strategies for selecting a smaller subset of frames.

One approach is the “shadow method,” where a maximum sequence length is established for each video. While this method is straightforward, it can result in longer videos being cut and information being lost, particularly when the desired length is not reached. In the second method, we utilize a step size to skip some frames, allowing us to cover the full length of the video while reducing the number of frames used. Additionally, the images are center-cropped to create square images. The efficacy of each method is evaluated in our experiments.
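
The step-size sampling strategy can be summarized by the short sketch below. The padding rule for clips shorter than the target length (repeating the last frame) is our own illustrative choice, not a detail specified above.

```python
def select_frames(frames, max_seq_len, num_frames):
    """Step-size sampling: span up to max_seq_len frames while keeping only num_frames."""
    step = max(1, max_seq_len // num_frames)   # step = (maximum sequence length)/(number of frames)
    picked = list(frames[:max_seq_len:step])[:num_frames]
    while picked and len(picked) < num_frames: # pad short clips (illustrative choice)
        picked.append(picked[-1])
    return picked

# Example: a 300-frame video reduced to 16 frames spread over its first 240 frames
print(select_frames(list(range(300)), max_seq_len=240, num_frames=16))
```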

Layers for classification

Assume we have a set of videos \(S(S_1,S_2, \ldots ,S_m)\) with corresponding labels \(y(y_1,y_2, \ldots ,y_m)\), where m is the number of samples. We select l frames from each video and obtain g features from the global average pooling 2-D layer. Each transformer encoder generates a set of representations by consuming the output from the previous block. After N transformer encoder blocks, we obtain the multi-level representation \(H^N(h^N_1,h^N_2, \ldots ,h^N_l)\), where each representation is a 1-D vector of length g (see Fig. 2 block (A) \(\rightarrow\) (D)).

The classification module incorporates traditional layers, such as fully connected and softmax, and also employs global max pooling to reduce network size. To prevent overfitting, we include Gaussian noise and dropout layers in the design. The ConvNet-Transformer model is trained using stochastic gradient descent and the categorical cross entropy loss is used as the optimization criterion.
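
A minimal sketch of this classification head is given below; the noise level, dropout rate and dense size are illustrative (a dense size of 64 is one of the values explored in the hyperparameter experiments), and the head is compiled with SGD and categorical cross-entropy as stated above.

```python
import tensorflow as tf
from tensorflow.keras import layers

encoded = tf.keras.Input(shape=(16, 1920))          # output of the last encoder block (l x g)
x = layers.GlobalMaxPooling1D()(encoded)            # aggregate over the l frame positions
x = layers.GaussianNoise(0.1)(x)                    # regularization against overfitting
x = layers.Dropout(0.5)(x)
x = layers.Dense(64, activation="relu")(x)
probs = layers.Dense(101, activation="softmax")(x)  # 101 action classes (UCF-101)
head = tf.keras.Model(encoded, probs)
head.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
```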

ConvNet-RNN

Recent studies have explored the combination of ConvNets and RNNs, particularly LSTMs, to take into account temporal data of frame features for action recognition in videos20,21,22,23,56,57.

To provide a clear understanding of the mathematical operations performed by ConvNets, the following is a summary of the relevant formulations:

$$\begin{aligned}&{\left\{ \begin{array}{ll} O_{i}=X &{} \text {if }i=1\\ Y_{i}=f_{i}(O_{i-1},W_{i}) &{} \text {if }i>1\\ O_{i}=g_{i}(Y_{i}) \end{array}\right. } \end{aligned}$$
(12)
$$\begin{aligned}&{\left\{ \begin{array}{ll} Y_{i}=W_{i} \circledast O_{i-1} &{} i\textrm{th}\text { layer is a convolution}\\ Y_{i}=\boxplus _{n,m} O_{i-1} &{} i\textrm{th}\text { layer is a pool}\\ Y_{i}=W_{i}*O_{i-1} &{} i\textrm{th}\text { layer is a FC} \end{array}\right. } \end{aligned}$$
(13)

where X represents the input image; \(O_{i}\) is the output of the \(i\textrm{th}\) layer; \(W_{i}\) indicates the weights of the layer; \(f_{i}(\cdot )\) denotes the weight operation for convolution, pooling or FC layers; \(g_{i}(\cdot )\) is an activation function, for example, sigmoid, tanh, rectified linear (ReLU) or, more recently, Leaky ReLU58. The symbol (\(\circledast\)) denotes a convolution operation, which uses shared weights to reduce expensive matrix computation59; the window (\(\boxplus _{n,m}\)) denotes an average or max pooling operation, which computes average or maximum values over a neighboring region of size \(n \times m\) in each feature map. Matrix multiplication of the weights between the \(i\textrm{th}\) layer and the \((i-1)\textrm{th}\) layer in an FC layer is represented by (\(*\)).

The last layer of the ConvNet (the FC layer) acts as a classifier and is usually discarded when transfer learning is used. Thereafter, the ConvNet outputs for the frames of a video sequence are fed as inputs to the RNN layer.
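
A minimal Keras sketch of this ConvNet-RNN pipeline in the transfer-learning setting is shown below; the backbone, clip length and LSTM width are illustrative choices.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Backbone without its FC classifier; frozen here, as in the transfer-learning setting.
backbone = tf.keras.applications.ResNet152(include_top=False, weights="imagenet",
                                            pooling="avg", input_shape=(224, 224, 3))
backbone.trainable = False

clips = tf.keras.Input(shape=(16, 224, 224, 3))   # 16 frames per video
x = layers.TimeDistributed(backbone)(clips)       # per-frame features, shape (16, 2048)
x = layers.LSTM(256)(x)                           # temporal modelling over the sequence
probs = layers.Dense(101, activation="softmax")(x)
model = tf.keras.Model(clips, probs)
```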

Considering a standard RNN with a given input sequence \({x_1, x_2, \ldots ,x_T}\), the hidden cell state is updated at a time step t as follows:

$$\begin{aligned} h_t=\sigma (W_h h_{t-1}+W_x x_t+b), \end{aligned}$$
(14)

where \(W_h\) and \(W_x\) denote weight matrices, b represents the bias, and \(\sigma\) is a sigmoid function that outputs values between 0 and 1.

The output of a cell, for ease of notation, is defined as

$$\begin{aligned} y_t=h_t, \end{aligned}$$
(15)

but can also be shown using the softmax function, in which \(\hat{y}_t\) is the output and \(y_t\) is the target:

$$\begin{aligned} \hat{y_t}=softmax(W_y h_t+b_y). \end{aligned}$$
(16)

A more sophisticated RNN, the LSTM, which includes the concept of a forget gate, can be expressed by the following equations:

$$\begin{aligned} f_t=\sigma (W_{fh} h_{t-1}+W_{fx} x_t+b_f), \end{aligned}$$
(17)
$$\begin{aligned} i_t=\sigma (W_{ih} h_{t-1}+W_{ix} x_t+b_i), \end{aligned}$$
(18)
$$\begin{aligned} c'_t=tanh(W_{c'h} h_{t-1}+W_{c'x} x_t+b'_c), \end{aligned}$$
(19)
$$\begin{aligned} c_t=f_t \odot c_{t-1}+i_t \odot c'_t, \end{aligned}$$
(20)
$$\begin{aligned} o_t=\sigma (W_{oh} h_{t-1}+W_{ox} x_t+b_o), \end{aligned}$$
(21)
$$\begin{aligned} h_t=o_t \odot tanh(c_t), \end{aligned}$$
(22)

where the \(\odot\) operation represents an elementwise vector product, and f, i, o and c are the forget gate, input gate, output gate and cell state, respectively. Information is retained when the forget gate \(f_t\) becomes 1 and eliminated when \(f_t\) is set to 0.

For optimization purposes, an alternative to LSTMs, the gated recurrent unit (GRU), can be utilized due to its lower computational demands. The GRU merges the input gate and forget gate into a single update gate, and the mathematical representation is given by the following equations:

$$\begin{aligned} r_t=\sigma (W_{rh} h_{t-1}+W_{rx} x_t+b_r), \end{aligned}$$
(23)
$$\begin{aligned} z_t=\sigma (W_{zh} h_{t-1}+W_{zx} x_t+b_z), \end{aligned}$$
(24)
$$\begin{aligned} h'_t=tanh(W_{h'h} (r_t \odot h_{t-1})+W_{h'x} x_t+b_z), \end{aligned}$$
(25)
$$\begin{aligned} h_t=(1-z_t) \odot h_{t-1}+z_t \odot h'_t. \end{aligned}$$
(26)

Finally, it’s worth noting that while traditional RNNs only consider previous information, bidirectional RNNs incorporate both past and future information in their computations:

$$\begin{aligned} h_t=\sigma (W_{hx} x_t+W_{hh} h_{t-1} +b_h), \end{aligned}$$
(27)
$$\begin{aligned} z_t=\sigma (W_{ZX} x_t+W_{HX} h_{t+1} +b_z), \end{aligned}$$
(28)
$$\begin{aligned} \hat{y_t}=softmax(W_{yh} h_t+W_{yz} z_t+b_y), \end{aligned}$$
(29)

where \(h_{t-1}\) and \(h_{t+1}\) indicate hidden cell states at the previous time step (\(t-1\)) and the future time step (\(t+1\)).

Results

Benchmark datasets

The UCF-101 dataset, introduced in 2012, is one of the largest annotated video datasets available60, and an expansion of the UCF-50 dataset. It comprises 13,320 realistic video clips collected from YouTube and covers 101 categories of human actions, such as punching, boxing, and walking. The dataset has three distinct official splits (rather than a pre-divided training set and testing set), and the final accuracy in our experiments is calculated as the arithmetic average of the results across all three splits.

HMDB-5161 was released around the same time as UCF-101. The dataset contains roughly 5k videos belonging to 51 distinct action classes, and each class holds at least 100 videos. The videos are collected from multiple sources, for example, movies and online videos.

Kinetics-40062 was made available more recently, in 2017. The dataset consists of 400 human action classes with at least 400 video clips for each action. The videos were assembled from realistic YouTube footage, with each clip lasting around 10 s. In total, the dataset contains about 240k training videos and 20k validation videos and is one of the largest well-labeled video datasets used for action recognition.

Downloading individual videos from the Kinetics-400 dataset poses a significant challenge due to the large number of videos and the fact that the dataset only provides links to YouTube videos. Therefore, we utilize Fiftyone63, an open-source tool specifically designed for constructing high-quality datasets, to address this challenge. In our experiment, we collected the top-20 categories with the highest reported accuracy according to the original work62, including “riding mechanical bull”, “presenting weather forecast”, “sled dog racing”, etc. Eventually, we obtained 7114 files for training and 773 files for validation; a significant number of files could not be collected because the videos had been deleted, made private, etc. In the same manner, we gathered all categories from HMDB-51 and obtained 3570 files for training and 1530 files for validation. The tool provides a single split for HMDB-51, but its documentation does not specify which split.

Our experiments were conducted using Tensorflow-2.8.264, Keras-2.6.0, and a powerful 4-GPU system (GeForce® GTX 1080 Ti). We used Hiplot65 for data visualization. Figure 3 provides a snapshot of samples from each of the action categories.

Figure 3

A snapshot of samples of all actions from UCF-101 dataset60.

Evaluation metric

For evaluating our results, we employ the standard classification accuracy metric, which is defined as follows:

$$\begin{aligned} Accuracy=\frac{\text{ Number } \text{ of } \text{ correct } \text{ predictions }}{\text{ Total } \text{ numbers } \text{ of } \text{ predictions } \text{ made }}. \end{aligned}$$
(30)

Implementation

Training our collaborative models for action recognition involves building a new, dedicated system, as these models require real-time information exchange. To the best of our knowledge, this is the first such system ever built for this purpose. To accommodate the large hardware resources required, each model is trained in a separate environment. After one training epoch, each model updates its current location, previous location, estimate of the gradient of the loss function, and other relevant information, which is then broadcast to neighboring models. To clarify the concept, we provide a diagram of the collaborative system and provide a brief description in this subsection.

Our system for distributed PSO-ConvNets is designed based on a web client-server architecture, as depicted in Fig. 4. The system consists of two main components: the client side, which is any computer with a web browser interface, and the server side, which comprises three essential services: cloud services, app services, and data services.

The cloud services host the models in virtual machines, while the app services run the ConvNet RNN or ConvNet Transformer models. The information generated by each model is managed by the data services and stored in a data storage. To calculate the next positions of the particles, each particle must wait for all other particles to finish the training cycle so that it can obtain their current information.

The system is designed to be operated through a web-based interface, which facilitates the advanced development process and allows for easy interactions between users and the system.
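
To make the exchange concrete, the sketch below shows the kind of state a particle might report after an epoch and the barrier-style wait described above. The endpoints and payload fields are hypothetical placeholders, not the actual API of our services.

```python
import json
import time
import urllib.request

SERVER = "http://localhost:8000"   # hypothetical data-service address

def broadcast_state(particle_id, position, gradient, best_loss):
    """Report this particle's state after an epoch (payload fields are illustrative)."""
    payload = json.dumps({
        "particle_id": particle_id,
        "position": position,             # current location x(t)
        "gradient_estimate": gradient,    # estimate of the gradient of the loss
        "best_loss": best_loss,
    }).encode()
    req = urllib.request.Request(f"{SERVER}/particles/{particle_id}", data=payload,
                                 method="PUT", headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

def wait_for_swarm(num_particles, epoch, poll_seconds=10):
    """Block until every particle has reported the current epoch (barrier synchronization)."""
    while True:
        with urllib.request.urlopen(f"{SERVER}/particles?epoch={epoch}") as resp:
            if len(json.load(resp)) == num_particles:
                return
        time.sleep(poll_seconds)
```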

Figure 4

Dynamic PSO-ConvNets System Design. The system is divided into two main components, client and server. The client side is accessed through a web browser interface while the server side comprises cloud, app, and data services. The cloud stores virtual machine environments where the models reside. The app service is where the ConvNet-RNN or ConvNet-Transformer runs, and the information generated by each model is managed and saved by the data service. The particles in the system update their positions based on shared information, including current and previous locations, after completing a training cycle.

Effectiveness of the proposed method

Table 1 presents the results of Dynamic 1 and Dynamic 2 on action recognition models. The experiment settings are consistent with our previous research for a fair comparison. As shown in Fig. 2, we consider two different ConvNet architectures, namely DenseNet-201 and ResNet-152, and select eight models from the Inception66, EfficientNet67, DenseNet68, and ResNet69 families. In the baseline action recognition methods (DenseNet-201 RNN, ResNet-152 RNN, DenseNet-201 Transformer, and ResNet-152 Transformer), features are first extracted from ConvNets using transfer learning and then fine-tuned. However, in our proposed method, the models are retrained in an end-to-end fashion. Pretrained weights from the ImageNet dataset70 are utilized to enhance the training speed. Our results show an improvement in accuracy between \(1.58\%\) and \(8.72\%\). Notably, Dynamic 2 for the DenseNet-201 Transformer achieves the best result. We also report the time taken to run each method. Fine-tuning takes less time, but the technique can lead to overfitting after a few epochs.

Table 1 Three-fold classification accuracy (%) on the UCF-101 benchmark dataset.

The experiments described above were conducted using the settings outlined in Tables 2 and 3. The batch size, input image size, and number of frames were adjusted to maximize GPU memory utilization. However, it is worth noting that in Human Activity Recognition (HAR), the batch size is significantly reduced compared to image classification, as each video consists of multiple frames. Regarding the gradient weight M, a higher value indicates a stronger attractive force between particles.

Table 2 Hyper-parameter settings for the proposed method.
Table 3 Settings of gradient weight M.

Comparison with state-of-the-art methods

Table 4 Comparisons of the proposed method and previous methods on the UCF-101 benchmark dataset.

The comparison between our method (Dynamic 2 for the ConvNet Transformer) and previous approaches is shown in Table 4. The second method (Transfer Learning and Fusions) trains the models on the Sports-1M YouTube dataset and uses the features for UCF-101 recognition. However, the transfer learning procedure is slightly different, as their ConvNet architectures were designed specifically for action recognition. While it might have been preferable to use weights pretrained on action recognition datasets, such weights are not readily available since the models differ. Also, training a video dataset with millions of samples within a reasonable time is a real challenge for most research centers. Despite these limitations, the use of Transformer and RNN seems to provide a better understanding of temporal characteristics compared to fusion methods. Shuffle & Learn experiments with two distinct models using 2-D images (AlexNet) and 3-D inputs (C3D), which are essentially series of 2-D images. The accuracy is improved, though 3-D ConvNets require much more computing power than their 2-D counterparts. There are also attempts to redesign well-known 2-D ConvNets for 3-D data (C3D is built from scratch on a typical ConvNet design14), e.g., the DPC approach, and/or to pretrain on larger datasets, e.g., the 3D ST-puzzle approach. Besides, VideoMoCo utilizes Contrastive Self-supervised Learning (CSL) based approaches to handle unlabeled data. The method extends the image-based MoCo framework to video representation by strengthening the temporal robustness of the encoder as well as modeling the temporal decay of the keys. Our Dynamic 2 method outperforms VideoMoCo by roughly \(9\%\). SVT is a self-supervised method based on the TimeSformer model that employs various self-attention schemes79. When pre-trained on the entire Kinetics-400 dataset and evaluated on UCF-101, SVT achieves 90.8% and 93.7% in the linear evaluation and fine-tuning settings, respectively. When pre-trained on a subset of Kinetics-400 with 60,000 videos, the accuracy drops to 84.8%. Moreover, the TSN method applies ConvNets and achieves an accuracy of 84.5% using RGB images (2% less than our method) and 92.3% using a combination of three networks (RGB, Optical Flow and Warped Flow). Similarly, the STM approach employs two-stream networks and is pre-trained on Kinetics, which enhances its performance significantly. Designing two-stream or multi-stream networks would require larger resources; due to these limitations, we have not pursued this approach at this time. Furthermore, using optical flow80 and pose estimation81 on the original images may improve performance, but these techniques are computationally intensive and time consuming, especially during end-to-end training. The concept of Collaborative Learning, on the other hand, is based on a general formulation of the gradient of the loss function and could be used as a plug-and-play module for any approach. Finally, the bag-of-words method was originally used as a baseline for the dataset and achieved the lowest recognition accuracy (\(44.5\%\)).

Hyperparameter optimization

In these experiments, we aimed to find the optimal settings for each model. Table 5 presents the results of the DenseNet-201 Transformer and ResNet-152 Transformer using transfer learning, where we varied the maximum sequence length, number of frames, number of attention heads, and dense size. The number of frames represents the number of frames extracted from the sequence, with the sampling step calculated as \(step=\text{(maximum } \text{ sequence } \text{ length)/(number } \text{ of } \text{ frames) }\). The results indicate that longer sequences of frames lead to better accuracy, but having a large number of frames is not necessarily the best strategy; a balanced approach yields higher accuracy. Furthermore, we found that models performed best with 6 attention heads and a dense size of either 32 or 64 neurons.

Figures 5 and 6 show the results for ConvNet RNN models using transfer learning. In the experiments, we first evaluated the performance of eight ConvNets (Inception-v3, ResNet-101, ResNet-152, DenseNet-121, DenseNet-201, EfficientNet-B0, EfficientNet-B4, and EfficientNet-B7). The two best performers, the DenseNet-201 and ResNet-152 ConvNet architectures, were selected for further experimentation. The results of varying the number of frames showed a preference for longer maximum sequence lengths.

Table 5 Three-fold classification accuracy results (%) on the UCF-101 benchmark dataset for DenseNet-201 transformer and ResNet-152 transformer with transfer learning training.
Figure 5

Hyperparameter optimization results for ConvNet RNN models with transfer learning. The models are numbered as follows: 1. Inception-v3, 2. ResNet-101, 3. ResNet-152, 4. DenseNet-121, 5. DenseNet-201, 6. EfficientNet-B0, 7. EfficientNet-B4, 8. EfficientNet-B7. The abbreviations acc, gn, and lr stand for accuracy, Gaussian noise, and learning rate, respectively.

Figure 6

Impact of varying the number of frames on the three-fold accuracy of DenseNet-201 RNN and ResNet-152 RNN using transfer learning on the UCF-101 benchmark dataset.

Extension

Table 6 Comparison of Collaborative Learning and Individual Learning on classification accuracy (%).

In this section, we extend our experiments to more challenging datasets, i.e., Kinetics-400 and HMDB-51. In our methods, ConvNets are retrained to improve accuracy compared to Transfer Learning32,44, but this process can take a long time on the entire Kinetics-400 dataset. As a result, we decided to use only a portion of the dataset in order to demonstrate our concept. As shown in Table 6, our main focus in this study is to compare Non-Collaborative Learning (or Individual Learning) and Collaborative Learning approaches. In each experiment, we conduct two repetitions and record both the mean accuracy and the best accuracy (Max) achieved. All settings are the same as in the experiments with the UCF-101 dataset. The learning rate range is obtained by running a scan from a low to a high learning rate. As a consequence, the learning rates of particles PSO-1, PSO-2 and PSO-3 are set at \(10^{-2}\), \(10^{-3}\) and \(10^{-4}\), respectively, whereas the learning rate of the wilder particle PSO-4 ranges over \([10^{-5},10^{-1}]\). The results show a preference for the Collaborative Learning methods, as Dynamic 1 and Dynamic 2 outperform Individual Learning on both datasets; e.g., an improvement of 0.7% can be seen on Kinetics-400 using the DenseNet-201 Transformer. The results obtained in our experiments clearly demonstrate the superiority of our proposed Collaborative Learning approach for video action recognition.

Discussion

The performance of action recognition methods such as ConvNet Transformer and ConvNet RNN is largely dependent on various factors, including the number of attention heads, the number of dense neurons, the number of units in RNN, and the learning rate, among others. Collaborative learning is an effective approach to improve the training of neural networks, where multiple models are trained simultaneously and both their positions and directions, as determined by the gradients of the loss function, are shared. In our previous research, we applied dynamics to ConvNets for image classification and in this study, we extend the concept to hybrid ConvNet Transformer and ConvNet RNN models for human action recognition in sequences of images. We first aim to identify the optimal settings that lead to the highest accuracy for the baseline models. As seen in Table 1, the ConvNet Transformer models did not perform as well as the ConvNet RNN models with transfer learning, which could be due to the limited data available for training, as transformers typically require more data than RNN-based models. However, our proposed method, incorporating dynamics and end-to-end training, not only outperforms the baseline models, but also results in the ConvNet Transformer models outperforming their ConvNet RNN counterparts. This can be attributed to the additional data provided to the transformer models through data augmentation and additional noise.

Conclusion

Recognizing human actions in videos is a fascinating recognition problem, and while Convolutional Neural Networks provide a powerful method for image classification, their application to HAR can be complex, as temporal features play a critical role.

In this study, we present a novel video action recognition framework that leverages collaborative learning with dynamics. Our approach explores the hybridization of the ConvNet RNN and the recent advanced method, the Transformer, which has been adapted from Natural Language Processing to video sequences. The experiments include the exploration of two dynamics models, Dynamic 1 and Dynamic 2. The results demonstrate an improvement of around 2–9% in accuracy over baseline methods, such as an \(8.72\%\) increase in accuracy for the DenseNet-201 Transformer using Dynamic 2 and a \(7.26\%\) increase in accuracy for the ResNet-152 Transformer using Dynamic 1. Our approach outperforms the previous methods, offering significant improvements in video action recognition.

In summary, our work makes three key contributions: (1) We incorporate Dynamic 1 and Dynamic 2 into a hybrid model that combines ConvNet with two popular sequence modeling techniques—RNN and Transformer. (2) We extend the distributed collaborative learning framework to address the task of human action recognition. (3) We conducted extensive experiments on challenging datasets, including UCF-101, Kinetics-400 and HMDB-51, over a period of 2–3 months to thoroughly evaluate our approach. To validate its effectiveness, we compared our method against state-of-the-art approaches in the field.