SCB-YOLOv5: a lightweight intelligent detection model for athletes’ normative movements

Intelligent detection of athlete behavior is beneficial for guiding sports instruction. Existing mature target detection algorithms provide significant support for this task. However, large-scale target detection algorithms often encounter more challenges in practical application scenarios. We propose SCB-YOLOv5, to detect standardized movements of gymnasts. First, the movements of aerobics athletes were captured, labeled using the labelImg software, and utilized to establish the athlete normative behavior dataset, which was then enhanced by the dataset augmentation using Mosaic9. Then, we improved the YOLOv5 by (1) incorporating the structures of ShuffleNet V2 and convolutional block attention module to reconstruct the Backbone, effectively reducing the parameter size while maintaining network feature extraction capability; (2) adding a weighted bidirectional feature pyramid network into the multiscale feature fusion, to acquire precise channel and positional information through the global receptive field of feature maps. Finally, SCB-YOLOv5 was lighter by 56.9% than YOLOv5. The detection precision is 93.7%, with a recall of 99% and mAP value of 94.23%. This represents a 3.53% improvement compared to the original algorithm. Extensive experiments have verified that our method. SCB-YOLOv5 can meet the requirements for on-site athlete action detection. Our code and models are available at https://github.com/qingDu1/SCB-YOLOv5.

• We collect images of aerobic athletes, classify their actions anew, and establish the ANBD dataset for recog- nizing their behavior.• In terms of model optimization, the SCB-YOLOv5 model is proposed.It features a more lightweight backbone and incorporates a weighted BiFPN to enhance the performance of the original model.• We compared and analyzed comprehensive experiments with several approaches.The quality metrics of the target detectors were evaluated to validate the effectiveness of our approach.

Target detection algorithms
Target detection algorithms rely on convolutional operations 9 .Based on their framework structure, these algorithms can be categorized into two types: one-stage and two-stage models, as depicted in Fig. 2. One-stage algorithms generally prioritize high real-time performance and simplicity, often at the expense of detection accuracy.Conversely, two-stage algorithms utilize the regional proposal network (RPN) to generate suggestions and then employ a fully connected layer to produce category predictions and bounding boxes, resulting in higher detection accuracy.Despite having fewer network layers, some one-stage algorithms [10][11][12] have recently outperformed two-stage networks in both accuracy and speed, and are widely employed in automated detection applications.

Actions recognition
Deep learning-based algorithms for recognizing athletes' normative actions, encompassing classification and detection.These capture video sequence images of athletes' movements and employ convolutional neural networks for model training to predict their actions.Numerous scholars have delved into human action recognition from a kinesiology perspective.Zhang et al. 23 proposed a method that relies on multimodal sequence fitting to detect the behavior of college basketball players.This method combines motion patterns and visual motion features captured by cameras.Integrating global and local motion patterns can significantly improve the performance of group behavior recognition.Julian Fritsch et al. 24 introduced an intelligent detection algorithm for recognizing the post-scoring emotions of volleyball players, achieving a precision rate of 80.09%.Zhao et al. 25 focused on deep video analysis, extracting frame sequences as inputs for a 3D convolution-based deep neural network.4][15][16][17] , SSD 18 , RetinaNet 19 , etc. and two-stage networks include Faster-RCNN 20 , Mask-RCNN 21 , Cascad-RCNN 22 , etc.

Method SCB-YOLOv5 model
All the images used in this research were obtained from the Hunan Institute of Engineering in Xiangtan, China, with 216 individuals, comprising 19 teachers and 197 students.All volunteers who participated in the photo shoots were informed about the data usage and provided consent for the research presented in this paper.
The overall structure of the SCB-YOLOv5 is derived from YOLOv5 and mainly consists of five components: input, Backbone, Neck, Head, and Predict.The specific structure is shown in Fig. 4. ShuffleNet v2 27 servers as the backbone to achieve a more lightweight design, integrating the CBAM at the base layer to capture additional feature information.The Neck component comprises BiFPN, which integrates semantic information from the deep network into the shallow network.Finally, the output predicts image features and generates the bounding box with the highest confidence based on the size of the target.

Mosaic-9
The mosaic data enhancement in Yolov4 14 randomly selects four images from the training set and combines their contents to create synthesized images directly used for training.This data augmentation method could enhance the ability of YOLOv4 to recognize objects in complex backgrounds.Therefore, we employ the Mosaic-9 17 enhancement in YOLOv5, as illustrated in Fig. 5. Initially, a batch of images is randomly selected from the dataset, followed by the random selection of nine images from the extracted set.These pieces are then cut and stitched together to create a new image.This process is repeated batch size times (batch size refers to the number of images extracted from the dataset), resulting in an enhanced image of the specified batch size.

ShuffletNet V2
ShuffleNet V2 27 introduces new enhancements to the structure of ShuffleNet V1 28 .The ShuffleNet V2 network structure is illustrated in Table 1.A 1 × 1 convolutional layer is incorporated to blend the features before the global average pooling.Efficient utilization of each stage enables an increase in feature channels and enhances network capacity.Notably, half of the feature channels in each block are directly transmitted to the subsequent one.This mechanism resembles feature reuse, akin to the concepts of DenseNet 29 and CondenseNet 30 .Such a structure enables information communication between different channel groups and enhances reliability.

Convolutional block attention module
Replacing the YOLOv5 backbone with the lightweight ShuffleNet V2 27 results in the loss of certain image feature information.To preserve more high-level semantic information, CBAM 18 is added after the ShuffleNet V2, thereby directing more attention towards the significant aspects of the image.The CBAM attention mechanism is illustrated in Fig. 6.CBAM comprises a channel attention module and a spatial attention module.It can effectively prioritize information crucial to the current task goal, enhancing the relevance of the extracted features from convolutional layers, capturing more comprehensive high-level semantic information, and improving target recognition.The calculation formula is as follows: (1)

Fusing the neck part of the BiFPN
The significance of analyzing images at multiple scales arises from the inherent complexity of images.Real-world scene images encompass a multitude of large and small target objects, each bearing diverse information such as size, position, color, and other attributes.Hence, relying solely on the bottom-up FPN pyramid structure has the potential to overlook information across different scales.To address this issue, Mingxing Tan et al. 10 proposed BiFPN, a straightforward and efficient feature pyramid network, as depicted in Fig. 7.
The multi-scale feature fusion of BiFPN aims to aggregate features with different resolutions.Because the input features have different resolutions, BiFPN uses a band-weighted feature fusion method (Fast Normalized Fusion).
where ∈ = 0.0001 is used to avoid numerical instability, and w is a learned parameter, similar to an attention mechanism, used to distinguish the significance of various features in the feature fusion process.

Experiment settings
All our experiments were conducted using the

Model performance evaluation metrics
Average precision (AP) and mean average precision (mAP) are commonly used in target detection to evaluate the detection algorithms.The calculation formulas are presented in Eqs. ( 4) and ( 5): where AP is the average precision of a single category, mAP is the mean of the AP values of all categories, F1 is the reconciled mean of P and R, P is the precision rate, R is the recall rate, and k is the number of detected categories.
AP is a common metric for evaluating the overall performance of a detector.However, excessive emphasis on labeled positive samples while pursuing AP can result in a high number of false detections.In practical evaluations, the F1-Score is employed as the evaluation criterion, offering a more balanced and effective measure of overall performance.Following 100 epochs of training, a relatively high confidence threshold (confidence = 0.5) is usually set to filter out a large number of false detection frames.Subsequently, the performance is analyzed using the F1-SCORE.As shown in Fig. 8.

Comparison with detectors results
Extensive experiments were conducted, including multi-group visual quality comparisons and image quality assessment, as shown in Table 1.(1) Among the original models, we found the YOLOv8 delivers the best detection www.nature.com/scientificreports/performance.However, it comes with a significantly larger number of model parameters compared to YOLOv5, albeit offering a comparable detection performance at a slightly highter cost.Therefore, the experiment aims to enhance YOLOv5.( 2) SCB-YOLOv5 greatly reduces the network parameters and minimizes hardware computation.The mAP reaches 94.23%, which is 3.53% higher than YOLOv5.

Detection demos
The detection demos are shown in Fig. 9, corresponding to the data in Table 2.In the SSD algorithm figures a-ii, b-ii, c-ii, f-ii, and g-ii miss the detection of "wronghand" behavior.In developing the SCB-YOLOv5, we meticulously documented the impact of each adjustment in the experiment, with a specific focus on the changes in Precision, Recall, and mAP.As shown in Fig. 10, the black line graph demonstrates that SCB-YOLOv5 outperforms the other methods in each performance metric after training stabilization.

Ablation study
To evaluate the effectiveness of the improved module on the SCB-YOLOv5, the ablation study was performed on our dataset.The results are shown in Table 3.After replacing the original backbone with the ShuffleNet V2 network, there was a decrease due to the reduced complexity of the model.This was done to maintain the detection performance of the model.Consider incorporating an attention mechanism after the ShuffleNet V2 backbone, and subsequently integrating BiFPN across multi-scale features to improve SCB-YOLOv5.Through extensive experiments, it has been proven that the adopted optimization strategy can enhance the accuracy of detecting.The mAP value has increased by 3.53 percentage points compared to the original.

Conclusion
In this study, we introduce a dataset for detecting the actions of aerobic athletes.A lightweight algorithm SCB-YOLOv5 is designed to recognize and regulate actions.To innovate the application of digital sports teaching processes.
The results of multiple sets of experiments show that the enhanced model has a more significant impact on recognizing athletes' irregular hand and leg movements, outperforming other detectors.This finding holds major significance in promoting the sustainable and healthy development of "Internet + Education".

Figure 1 .
Figure 1.Our intelligent detection system for athlete normative behavior, including RGB cameras, clients, and alarms.

Figure 3 .
Figure 3. Classifying the movement behaviors of aerobics players, we categorized the common movements of the participants as follows: (a) the red box in the middle indicates that the athlete's arms and legs are straight, demonstrating standard movements; (b) the green box in the middle indicates that the athlete's hands are bent at the elbows, which is an incorrect movement that does not meet the normative requirements; and (c) the blue box in the middle indicates that the athlete's calves are bent, which is an incorrect movement that does not meet the standard movements.

Figure 4 .
Figure 4.In the SCB-YOLOv5 structure, we primarily updated the backbone network and the Neck part of the YOLOv5.

Figure 5 .
Figure 5. Mosaic-9.Using this strategy can effectively increase the amount of data, and our experiments verify that it has a positive effect on the subsequent model training.
Windows 10 system, utolizing the PyTorch deep learning framework.The processor employed is Intel(R) Core(TM) i5-10400F CPU @ 2.90 GHz, with 16 GB of RAM, and the GPU model is NVIDIA GeForce RTX 1650 graphics card with 4 GB of memory.The dataset utilized for the experiments is ANBD, which consists of 2121 images as detailed in our dataset.The dataset is divided into training and validation sets in an 8:2 ratio.Model training encompasses 100 epochs, with a batch size set to 2, and the initial learning rate set to 0.01.

Fig. 7 .
Fig.7.BiFPN Node Diagram.The purple curve connects input nodes and output nodes within the same layer.The blue curve conveys semantic information of high-level features, while the red curve conveys positional information of bottom-level features.
Figure a-iv has the highest sensitivity with a detection confidence of 100%.Conversely, the YOLOX and RetinaNet algorithms are less effective in detecting the "wrongleg" behavior.Figures b-iii and e-iii demonstrate the mission of "wrongleg".Finally, SCB-YOLOv5 achieves accurate detection results.

Figure 8 .
Figure 8.The F1-SCORE of detectors, where SCB-YOLOv5 has the highest F1-SCORE, while the other detectors exhibit less favorable detection results for the classification of wronghand, again confirming Table2above.

Figure 9 .
Figure 9. Detection demos with SSD, YOLOv5, YOLOX, YOLOv7, and YOLOv8 are shown in the figures i-iv depict diagrams of athlete behavior in various scenarios.Hand irregularities are marked in the red circles, foot irregularities in the yellow circles, and iv showcases correct behavior demonstrations.In figures a-i to f-iv, each detection box displays the detection confidence level, where higher values indicate greater confidence in the results of that detection.

Figure 10 .
Figure 10.Line graph comparing the performance of each improvement experiment, where the horizontal coordinate is the 100epoch of the training of each algorithm model and the vertical coordinate is the index value.

Table 1 .
ShuffleNet V2 structure, for each stage, its first block is required to be doubled, and the step size strips are all equal to 2.

Table 2 .
Comparison of the detector results.The best experimental results are labeled in bold, and the secondbest results are indicated in italic.

Table 3 .
Results of ablation experiments.