Abstract
Intelligent detection of athlete behavior is beneficial for guiding sports instruction. Existing mature target detection algorithms provide significant support for this task. However, large-scale target detection algorithms often encounter challenges in practical application scenarios. We propose SCB-YOLOv5 to detect the standardized movements of gymnasts. First, the movements of aerobics athletes were captured, labeled using the labelImg software, and used to establish the athlete normative behavior dataset, which was then augmented using Mosaic-9. Then, we improved YOLOv5 by (1) incorporating the structures of ShuffleNet V2 and the convolutional block attention module to reconstruct the Backbone, effectively reducing the parameter size while maintaining the network's feature extraction capability; (2) adding a weighted bidirectional feature pyramid network to the multiscale feature fusion, to acquire precise channel and positional information through the global receptive field of feature maps. Finally, SCB-YOLOv5 is 56.9% lighter than YOLOv5. Its detection precision is 93.7%, with a recall of 99% and an mAP of 94.23%, a 3.53% improvement over the original algorithm. Extensive experiments have verified the effectiveness of our method; SCB-YOLOv5 can meet the requirements for on-site athlete action detection. Our code and models are available at https://github.com/qingDu1/SCB-YOLOv5.
Introduction
The movements of aerobic gymnasts must be standardized, as this directly impacts their safety. Scientific and standardized movements can reduce or even eliminate the risk of injury. The new development stage of digital sports embodies the deep integration and interaction of “digital” and “sports”1. Detecting and analyzing the behavior of aerobic athletes can help promote innovation in sports education, including actively promoting sports image recognition2, 3D motion modeling analysis3, and live streaming4.
In recent years, human action recognition based on deep learning5,6,7,8 has found extensive applications in smart cities, industrial production, intelligent transportation systems, and other fields. The analysis of automated video content holds the potential to significantly advance monitoring capabilities, encompassing action recognition, target tracking, and pedestrian re-identification. Such advancements offer a practical approach to recognizing athletes’ actions.
Precise categorization and assessment of athlete behavior through visual data involves leveraging computer vision technology to predict action categories and evaluate action quality. Existing object detectors serve as important references for recognizing athlete behavior. However, deep neural models have a large number of parameters and complex computations, imposing high demands on hardware computing capability, memory bandwidth, and data storage. This makes them costly to use for practical sports education. Given the specialized and standardized nature of athletes’ movements, further research is needed on existing recognition methods.
In summary, we propose a lightweight intelligent detection model, SCB-YOLOv5. This model is capable of analyzing human behavior in video images, identifying various types of actions, and responding promptly to specific circumstances, as illustrated in Fig. 1.
The main contributions of this work are summarized as follows:
-
We collect images of aerobic athletes, classify their actions anew, and establish the ANBD dataset for recognizing their behavior.
-
In terms of model optimization, the SCB-YOLOv5 model is proposed. It features a more lightweight backbone and incorporates a weighted BiFPN to enhance the performance of the original model.
-
We conducted comprehensive comparative experiments against several approaches and evaluated the quality metrics of the target detectors to validate the effectiveness of our approach.
Related works
Target detection algorithms
Target detection algorithms rely on convolutional operations9. Based on their framework structure, these algorithms can be categorized into two types: one-stage and two-stage models, as depicted in Fig. 2. One-stage algorithms generally prioritize high real-time performance and simplicity, often at the expense of detection accuracy. Conversely, two-stage algorithms utilize the regional proposal network (RPN) to generate suggestions and then employ a fully connected layer to produce category predictions and bounding boxes, resulting in higher detection accuracy. Despite having fewer network layers, some one-stage algorithms10,11,12 have recently outperformed two-stage networks in both accuracy and speed, and are widely employed in automated detection applications.
Actions recognition
Deep learning-based algorithms for recognizing athletes’ normative actions encompass classification and detection. They capture video sequence images of athletes’ movements and employ convolutional neural networks for model training to predict their actions. Numerous scholars have delved into human action recognition from a kinesiology perspective.
Zhang et al.23 proposed a method that relies on multimodal sequence fitting to detect the behavior of college basketball players. This method combines motion patterns and visual motion features captured by cameras. Integrating global and local motion patterns can significantly improve the performance of group behavior recognition. Fritsch et al.24 introduced an intelligent detection algorithm for recognizing the post-scoring emotions of volleyball players, achieving a precision rate of 80.09%. Zhao et al.25 focused on deep video analysis, extracting frame sequences as inputs for a 3D convolution-based deep neural network. This algorithm automatically captures spatio-temporal features of athlete behavior, thereby enhancing the accuracy of recognizing body movements.
Our dataset
Deep learning-based target detectors necessitate a substantial number of pre-labeled samples to enhance accuracy and generalization capability26. We categorized movement classification based on fundamental body postures, basic techniques, and coordination. To create the athlete normative behaviors dataset (ANBD), we utilized pictures and video footage captured by members of a university aerobics team in Hunan Province, China. These videos were edited to extract one frame every 5 s, resulting in a collection of 2121 images showcasing various athletes in different scenes and angles.
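The one-frame-every-5-seconds sampling described above reduces to a simple index computation. The helper below is an illustrative sketch (the name `sample_frame_indices` is ours, not from the paper) that, given a clip's frame count and frame rate, returns which frames to keep:

```python
def sample_frame_indices(total_frames: int, fps: float, interval_s: float = 5.0) -> list:
    """Return the indices of frames to keep when extracting one frame
    every `interval_s` seconds from a clip recorded at `fps`."""
    step = max(1, round(fps * interval_s))  # frames between two kept samples
    return list(range(0, total_frames, step))
```

For example, a 60-second clip recorded at 30 fps (1800 frames) yields 12 sampled images, consistent with the one-frame-per-5-seconds rate used to build ANBD.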
The determination of whether a movement constitutes a standardized action relies on the varying amplitudes of the athlete’s arm, elbow, leg, and other movements, as illustrated in Fig. 3. Among these, (a) depicts the standardized correct action, labeled “correct”; (b) illustrates the wrong hand action, labeled “wronghand”; (c) portrays the wrong leg action, labeled “wrongleg”.
Method
SCB-YOLOv5 model
All the images used in this research were obtained from the Hunan Institute of Engineering in Xiangtan, China, with 216 individuals, comprising 19 teachers and 197 students. All volunteers who participated in the photo shoots were informed about the data usage and provided consent for the research presented in this paper.
The overall structure of SCB-YOLOv5 is derived from YOLOv5 and mainly consists of five components: input, Backbone, Neck, Head, and Predict. The specific structure is shown in Fig. 4. ShuffleNet V227 serves as the backbone to achieve a more lightweight design, integrating the CBAM at the base layer to capture additional feature information. The Neck component comprises BiFPN, which integrates semantic information from the deep network into the shallow network. Finally, the output predicts image features and generates the bounding box with the highest confidence based on the size of the target.
Mosaic-9
The mosaic data augmentation in YOLOv414 randomly selects four images from the training set and combines their contents into synthesized images used directly for training. This augmentation method enhances the ability of YOLOv4 to recognize objects in complex backgrounds. We therefore employ the Mosaic-917 enhancement in YOLOv5, as illustrated in Fig. 5. First, a batch of images is randomly selected from the dataset; then nine images are randomly drawn from the extracted set, cropped, and stitched together to create a new image. This process is repeated batch-size times (the batch size is the number of images extracted from the dataset), yielding the specified number of enhanced images.
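The crop-and-stitch step can be sketched in NumPy. This is a minimal illustration of the geometry only: the name `mosaic9` is ours, bounding-box label remapping is omitted, and each source image is assumed to be at least one grid cell large:

```python
import numpy as np

def mosaic9(images, out_size=640, rng=None):
    """Stitch nine images into a 3x3 grid, randomly cropping each to fit its cell.
    (Sketch only: real Mosaic-9 must also remap the box labels of each crop.)"""
    rng = np.random.default_rng(rng)
    cell = out_size // 3
    canvas = np.zeros((cell * 3, cell * 3, 3), dtype=np.uint8)
    for idx, img in enumerate(images[:9]):
        h, w = img.shape[:2]
        # random top-left corner of a cell-sized crop (assumes h, w >= cell)
        y0 = rng.integers(0, h - cell + 1)
        x0 = rng.integers(0, w - cell + 1)
        r, c = divmod(idx, 3)  # fill cells row by row
        canvas[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell] = img[y0:y0 + cell, x0:x0 + cell]
    return canvas
```

Repeating this once per image in the batch produces the batch of enhanced images described above.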
ShuffleNet V2
ShuffleNet V227 introduces new enhancements to the structure of ShuffleNet V128. The ShuffleNet V2 network structure is illustrated in Table 1. A 1 × 1 convolutional layer is incorporated to blend the features before the global average pooling. Efficient utilization of each stage enables an increase in feature channels and enhances network capacity. Notably, half of the feature channels in each block are directly transmitted to the subsequent one. This mechanism resembles feature reuse, akin to the concepts of DenseNet29 and CondenseNet30. Such a structure enables information communication between different channel groups and enhances reliability.
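The channel communication described above rests on the channel-shuffle operation. A NumPy sketch of just that data movement (the convolutions of a full ShuffleNet V2 unit are omitted) looks as follows:

```python
import numpy as np

def channel_shuffle(x: np.ndarray, groups: int) -> np.ndarray:
    """Permute channels so information mixes across groups:
    reshape (N, C, H, W) -> (N, g, C/g, H, W), swap the two group axes,
    then flatten back to (N, C, H, W)."""
    n, c, h, w = x.shape
    assert c % groups == 0, "channel count must divide evenly into groups"
    return (x.reshape(n, groups, c // groups, h, w)
             .transpose(0, 2, 1, 3, 4)
             .reshape(n, c, h, w))
```

In a ShuffleNet V2 block, half of the channels bypass the convolutions; after concatenation, shuffling with `groups=2` interleaves the two halves, so the next block sees a mix of the processed and the reused features.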
Convolutional block attention module
Replacing the YOLOv5 backbone with the lightweight ShuffleNet V227 results in the loss of certain image feature information. To preserve more high-level semantic information, CBAM18 is added after the ShuffleNet V2, thereby directing more attention towards the significant aspects of the image. The CBAM attention mechanism is illustrated in Fig. 6.
CBAM comprises a channel attention module and a spatial attention module. It can effectively prioritize information crucial to the current task goal, enhancing the relevance of the features extracted by convolutional layers, capturing more comprehensive high-level semantic information, and improving target recognition. The calculation formulas are as follows:

\({F}^{\prime}={M}_{C}(F)\otimes F\)

\({F}^{\prime\prime}={M}_{S}({F}^{\prime})\otimes {F}^{\prime}\)

where \(F\in {R}^{C*H*W}\) is the input feature, \({M}_{C}\in {R}^{C*1*1}\) is the channel attention map, \({M}_{S}\in {R}^{1*H*W}\) is the spatial attention map, \(\otimes\) denotes element-wise multiplication, \({F}^{\prime}\) is the output feature after the channel attention module, and \({F}^{\prime\prime}\) is the final output feature.
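A simplified NumPy sketch of the CBAM computation (F′ = M_C(F) ⊗ F, then F″ = M_S(F′) ⊗ F′) is shown below. Two simplifications are ours, not the paper's: `w1` and `w2` stand in for the shared two-layer MLP of the channel attention, and the 7 × 7 spatial convolution is replaced by a plain sum of the two pooled maps:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cbam(F, w1, w2):
    """Simplified CBAM sketch on a (C, H, W) feature map.
    w1: (hidden, C) and w2: (C, hidden) are the shared MLP weights."""
    C, H, W = F.shape
    # channel attention: shared MLP over avg- and max-pooled channel descriptors
    avg = F.mean(axis=(1, 2))                       # (C,)
    mx = F.max(axis=(1, 2))                         # (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)    # two layers with ReLU hidden
    Mc = sigmoid(mlp(avg) + mlp(mx)).reshape(C, 1, 1)
    Fp = Mc * F                                     # F' = M_C(F) ⊗ F
    # spatial attention: channel-wise avg and max maps (7x7 conv simplified away)
    Ms = sigmoid(Fp.mean(axis=0) + Fp.max(axis=0))  # (H, W)
    return Ms * Fp                                  # F'' = M_S(F') ⊗ F'
```

Because both attention maps pass through a sigmoid, the output magnitudes never exceed those of the input feature; CBAM rescales, rather than amplifies, the features.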
Fusing the neck part of the BiFPN
The significance of analyzing images at multiple scales arises from the inherent complexity of images. Real-world scene images encompass a multitude of large and small target objects, each bearing diverse information such as size, position, color, and other attributes. Hence, relying solely on the bottom-up FPN pyramid structure has the potential to overlook information across different scales. To address this issue, Tan et al.10 proposed BiFPN, a straightforward and efficient feature pyramid network, as depicted in Fig. 7.
The multi-scale feature fusion of BiFPN aims to aggregate features with different resolutions. Because the input features have different resolutions, BiFPN uses a weighted feature fusion method (fast normalized fusion):

\(O=\sum_{i}\frac{{w}_{i}}{\epsilon +\sum_{j}{w}_{j}}\cdot {I}_{i}\)

where \(\epsilon = 0.0001\) is used to avoid numerical instability, and each \({w}_{i}\) is a learned weight, similar to an attention mechanism, used to distinguish the significance of the various input features \({I}_{i}\) in the fusion process.
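Fast normalized fusion, O = Σᵢ wᵢ·Iᵢ / (ε + Σⱼ wⱼ) with ReLU-clipped weights, can be written directly in a few lines. This sketch assumes the input feature maps have already been resampled to a common resolution:

```python
import numpy as np

def fast_normalized_fusion(features, w, eps=1e-4):
    """BiFPN weighted fusion: O = sum_i(w_i * I_i) / (eps + sum_j w_j)."""
    w = np.maximum(np.asarray(w, dtype=float), 0.0)  # ReLU keeps weights non-negative
    num = sum(wi * f for wi, f in zip(w, features))
    return num / (eps + w.sum())
```

Unlike softmax-based fusion, this normalization needs no exponentials, which is part of why EfficientDet reports it as faster with comparable accuracy.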
Experimental results and discussion
Experiment settings
All our experiments were conducted on Windows 10, utilizing the PyTorch deep learning framework. The processor is an Intel(R) Core(TM) i5-10400F CPU @ 2.90 GHz with 16 GB of RAM, and the GPU is an NVIDIA GeForce GTX 1650 with 4 GB of memory.
The dataset utilized for the experiments is ANBD, which consists of 2121 images, as detailed in the “Our dataset” section. The dataset is divided into training and validation sets in an 8:2 ratio. Model training runs for 100 epochs, with a batch size of 2 and an initial learning rate of 0.01.
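The 8:2 split can be reproduced with a seeded shuffle. The helper below is illustrative (the name `split_dataset` and the seed are our assumptions, not from the paper):

```python
import random

def split_dataset(items, train_frac=0.8, seed=0):
    """Shuffle deterministically, then split into train and validation lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed makes the split reproducible
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]
```

Applied to the 2121 ANBD images, this yields 1696 training and 425 validation images.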
Model performance evaluation metrics
Average precision (AP) and mean average precision (mAP) are commonly used in target detection to evaluate detection algorithms. The calculation formulas are presented in Eqs. (4) and (5):

\(AP={\int }_{0}^{1}P(R)dR\)  (4)

\(mAP=\frac{1}{k}\sum_{i=1}^{k}A{P}_{i}\)  (5)

where AP is the average precision of a single category, mAP is the mean of the AP values over all categories, F1 is the harmonic mean of P and R, P is the precision rate, R is the recall rate, and k is the number of detected categories.
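These metrics can be sketched as follows; the AP integral of Eq. (4) is approximated here by trapezoidal integration over a sampled precision–recall curve (an approximation we introduce for illustration):

```python
import numpy as np

def precision_recall_f1(tp, fp, fn):
    """P = TP/(TP+FP), R = TP/(TP+FN), F1 = harmonic mean of P and R."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

def average_precision(recalls, precisions):
    """AP (Eq. 4): area under the precision-recall curve, trapezoidal rule."""
    order = np.argsort(recalls)
    r = np.asarray(recalls, dtype=float)[order]
    p = np.asarray(precisions, dtype=float)[order]
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2))

def mean_average_precision(ap_per_class):
    """mAP (Eq. 5): mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

For ANBD, k = 3 (the “correct”, “wronghand”, and “wrongleg” categories), so the reported mAP is the mean of three per-class AP values.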
AP is a common metric for evaluating the overall performance of a detector. However, excessive emphasis on labeled positive samples while pursuing AP can result in a high number of false detections. In practical evaluations, the F1-score is employed as the evaluation criterion, offering a more balanced and effective measure of overall performance. Following 100 epochs of training, a relatively high confidence threshold (confidence = 0.5) is usually set to filter out a large number of false detection frames; the performance is then analyzed using the F1-score, as shown in Fig. 8.
Comparison with detectors results
Extensive experiments were conducted, including multi-group visual quality comparisons and image quality assessment, as shown in Table 2. (1) Among the original models, YOLOv8 delivers the best detection performance, but it has significantly more parameters than YOLOv5, offering comparable detection performance at a higher cost. The experiment therefore aims to enhance YOLOv5. (2) SCB-YOLOv5 greatly reduces the network parameters and minimizes hardware computation. Its mAP reaches 94.23%, which is 3.53% higher than YOLOv5.
Detection demos
The detection demos are shown in Fig. 9, corresponding to the data in Table 2. In the SSD algorithm, figures a-ii, b-ii, c-ii, f-ii, and g-ii miss the detection of “wronghand” behavior, while figure a-iv shows the highest sensitivity, with a detection confidence of 100%. Conversely, the YOLOX and RetinaNet algorithms are less effective at detecting the “wrongleg” behavior; figures b-iii and e-iii show missed detections of “wrongleg”. Finally, SCB-YOLOv5 achieves accurate detection results.
In developing the SCB-YOLOv5, we meticulously documented the impact of each adjustment in the experiment, with a specific focus on the changes in Precision, Recall, and mAP. As shown in Fig. 10, the black line graph demonstrates that SCB-YOLOv5 outperforms the other methods in each performance metric after training stabilization.
Ablation study
To evaluate the effectiveness of each improved module in SCB-YOLOv5, an ablation study was performed on our dataset. The results are shown in Table 3. After replacing the original backbone with the ShuffleNet V2 network, detection accuracy decreased slightly because of the reduced model complexity. To maintain the model's detection performance, we then incorporated an attention mechanism after the ShuffleNet V2 backbone and subsequently integrated BiFPN across the multi-scale features. Extensive experiments show that the adopted optimization strategy enhances detection accuracy: the mAP value increased by 3.53 percentage points compared to the original model.
Conclusion
In this study, we introduce a dataset for detecting the actions of aerobic athletes and design a lightweight algorithm, SCB-YOLOv5, to recognize and regulate those actions, innovating the application of digital technology in sports teaching processes.
The results of multiple sets of experiments show that the enhanced model has a more significant impact on recognizing athletes’ irregular hand and leg movements, outperforming other detectors. This finding holds major significance in promoting the sustainable and healthy development of “Internet + Education”.
Data availability
Datasets generated and/or analyzed during the current study are available from the corresponding author upon reasonable request.
Code availability
Our code and models are available at https://github.com/qingDu1/SCB-YOLOv5. The model code is open source; the dataset used in this article is not publicly available, but the authors can be contacted if necessary.
References
Al-Emran, M., Malik, S. I. & Al-Kabi, M. N. A survey of Internet of Things (IoT) in education: Opportunities and challenges. In Toward Social Internet of Things (SIoT): Enabling Technologies, Architectures and Applications (eds Hassanien, A. E. et al.) 197–209 (Springer, Cham, 2020).
Li, G. & Zhang, C. Automatic detection technology of sports athletes based on image recognition technology. EURASIP J. Image Video Process. 2019, 1–9 (2019).
Ghosh, P., Song, J., Aksan, E., Hilliges, O. Learning human motion models for long-term predictions. In Proceedings of International Conference on 3D Vision, 458–466 (IEEE, 2017).
Levallet, N. et al. Enhancing the fan experience at live sporting events: The case of stadium Wi-Fi. Case Stud. Sport Manag. 8(1), 6–12 (2019).
Chen, D. D. Image recognition of sports athletes’ high-intensity sports injuries based on binocular stereo vision. Comput. Intell. Neurosci. 2022, 4322597 (2022).
Batty, M. Big data, smart cities and city planning. Dialogues Hum. Geogr. 3(3), 274–279 (2013).
Baines, T., Lightfoot, H., Smart, P. & Fletcher, S. Servitization of manufacture: Exploring the deployment and skills of people critical to the delivery of advanced services. J. Manuf. Technol. Manag. 24(4), 637–646 (2013).
Zhu, K., Wang, R., Zhao, Q., Cheng, J. & Tao, D. A cuboid CNN model with an attention mechanism for skeleton-based action recognition. IEEE Trans. Multimed. 22(11), 2977–2989 (2020).
Cao, D., Chen, Z. & Gao, L. An improved object detection algorithm based on multi-scaled and deformable convolutional neural networks. Hum. Cent. Comput. Info. 10(1), 1–22 (2020).
Tan, M., Pang, R., Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 10781–10790 (2020).
Liu, Z., Lin, Y., Cao, Y., Hu, H., Zhang, P., Lin, S., Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Nashville, USA, 10012–10022 (2021).
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirill, A., Zauyko, S. End-to-end object detection with transformers. In European Conference on Computer Vision, 213–229 (Glasgow, 2020).
Redmon, J., Divvala, S., Girshick, R., Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 779–788 (2016).
Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. In Computing Research Repository. https://arxiv.org/abs/2004.10934 (2020).
Ge, Z., Liu, S., Wang, F. et al. YOLOX: Exceeding YOLO series in 2021. Preprint at https://arxiv.org/abs/2107.08430 (2021).
Li, C., Li, L., Jiang, H. et al. YOLOv6: A single-stage object detection framework for industrial applications. Preprint at https://arxiv.org/abs/2209.02976 (2022).
Wang, C.Y., Bochkovskiy, A. & Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. Preprint at https://arxiv.org/abs/2207.02696 (2022).
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y. and Berg, A.C., Ssd: Single shot multibox detector. In Proceedings of Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands (2016).
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P. Focal loss for dense object detection. In Proc. IEEE International Conference on Computer Vision, Honolulu, USA, 2980–2988 (2017).
Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 28, 1 (2015).
He K., Gkioxari G, Dollár P, et al. Mask r-cnn. In Proc. IEEE International Conference on Computer Vision, 2961–2969 (2017).
Cai, Z., Vasconcelos, N., Cascade R-CNN: Delving into high quality object detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 6154–6162 (Salt Lake City, 2018).
Zhang, L. Behaviour detection and recognition of college basketball players based on multimodal sequence matching and deep neural networks. Comput. Intell. Neurosci. https://doi.org/10.1155/2022/7599685 (2022).
Fritsch, J., Ebert, S. & Jekauc, D. The recognition of affective states associated with players’ non-verbal behavior in volleyball. Psychol. Sport Exerc. 64, 102329 (2023).
Zhao, X. P. Research on athlete behavior recognition technology in sports teaching video based on deep neural network. Comput. Intell. Neurosci. https://doi.org/10.1155/2022/7260894 (2022).
Heffington, C., Park, B. B. & Williams, L. K. The “most important problem” dataset (MIPD): A new dataset on American issue importance. Confl. Manag. Peace Sci. 36(3), 312–335 (2019).
Ma, N., Zhang, X., Zheng, H.T. & Sun, J. ShuffleNet V2: Practical guidelines for efficient CNN architecture design, 116–131. https://arxiv.org/abs/1807.11164 [cs.CV] (2018).
Zhang, X., Zhou, X., Lin, M., Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices, 6848–6856. https://arxiv.org/abs/1707.01083 [cs.CV] (2018).
Iandola, F., Moskewicz, M., Karayev, S., Girshick, R., Darrell, T., Keutzer, K. Densenet: Implementing efficient convnet descriptor pyramids. https://arxiv.org/abs/1404.1869 [cs.CV] (2014).
Huang, G., Liu, S., Van der Maaten, L., Weinberger, K.Q. Condensenet: An efficient densenet using learned group convolutions, 2752–2761. https://arxiv.org/abs/1711.09224 [cs.CV] (2018).
Funding
This work was supported in part by the Research Foundation of Hunan Province Innovation Foundation for Postgraduate, China Grant QL20230233.
Author information
Authors and Affiliations
Contributions
Q.D.: conceptualization, methodology, software, writing—original draft; L.T.: conceptualization, writing—review and editing; Y.L.: data curation, writing—original draft.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Du, Q., Tang, L. & Li, Y. SCB-YOLOv5: a lightweight intelligent detection model for athletes’ normative movements. Sci Rep 14, 8624 (2024). https://doi.org/10.1038/s41598-024-59218-w