Introduction

Different studies have argued that anthropometric characteristics1, genetic factors2, environmental factors3, psychosocial and psychological factors4 influence basic motoric traits and are naturally characteristics that should be taken into account when selecting talent. In recent years, the connection between morphometric traits and athletic performance has drawn a lot of attention in the fields of sports science and physical fitness testing 5. Thus, by analyzing the morphometric characteristics that affect basic motoric characteristics, appropriate branch choices can be made at an early age6.

Speed is one of the basic motoric traits, and sprint performance is a fundamental measure of speed and agility, which are important components of many sports and physical activities7. Understanding the factors that influence sprint performance is crucial for improving training plans, identifying potential, and fostering overall athletic development8,9. Gunnar and Pettersen examined the relationship between anthropometric characteristics and sprint performance in youth aged 10–16 years. Body mass index (BMI) in 10–12 year olds and height in 13–16 year olds were among the parameters affecting 10 m and 20 m sprint performance10. Bonato et al.11 examined the effect of anthropometric characteristics on sprint performance in 200 m and 400 m athletes and concluded that body mass and BMI were negatively related to sprint performance. Although these studies suggest that anthropometric characteristics affect sprint performance, this finding cannot be separated from two different perspectives on talent selection. One perspective holds that only individuals with innate talent can reach the level of an elite athlete, while the other holds that every individual can reach the elite level with practice and frequent training12. Both views converge on the point that talent identification should be done early and that children should be directed to branches appropriate to their abilities.

Talent screening is practiced in many countries using many different methods. In recent years, Machine Learning (ML) and Artificial Intelligence (AI) algorithms have been used in sports sciences for analyses such as performance analysis, injury prevention and rehabilitation, competition analysis, athlete tracking and data collection, and sports management13,14,15, and different ML and AI methods have also been utilized in talent prediction16. ML applications try to optimize future predictions and decisions by identifying hidden patterns in a data set, capturing the complexity in the data set with multilayer neural networks17, and making predictions and building models based on the findings obtained from the data set. These capabilities suggest that ML applications are well suited to the identification of motor characteristics and to talent screening18.

When the literature is examined, it is seen that many different test batteries and methods are applied in early childhood talent selection for different branches. In recent years, many studies have been conducted with ML and AI algorithms. However, it has been determined that the number of comprehensive studies examining the anthropometric characteristics that determine speed performance, which is one of the basic motoric characteristics, with ML algorithms is limited. In this context, in our research, we aimed to predict the morphological features that determine 20-m performance in children between the ages of 6–11 years with ML applications. In this way, it is thought that the morphological features that need to be developed and analyzed for maximum sprint performance will be determined and accordingly, children's sprint performance will be predicted and directed to appropriate branches. For this reason, the hypothesis of our research was determined as “Some morphometric characteristics affect 20 m performance”.

Methods

In conducting this investigation, we adhered to the conceptual framework shown in Fig. 1. The primary components of the study are the collection of data on 20-m sprint performance, preprocessing of the collected data, division of the data into training and testing samples, selection and reduction of relevant features, construction of regression models using traditional machine learning (ML) algorithms, and assessment of the predictive models' performance.

Figure 1

The proposed framework for prediction of Sprint Performance using ML technology.

Participants

A total of 282 children aged between 6 and 11 years who were attending the first level of primary education participated in this study: 130 males (age: 8.26 ± 1.84 years, height: 136.51 ± 16.74 cm, weight: 36.61 ± 14.16 kg, BMI: 17.77 ± 3.28 kg/m2) and 152 females (age: 8.74 ± 1.83 years, height: 135.26 ± 12.83 cm, weight: 32.73 ± 10.32 kg, BMI: 17.48 ± 2.99 kg/m2). We focused on participants with normal development. Participants reported that they did not experience any anxiety or insomnia during the test. G*Power (version 3.1.9.7, Heinrich Heine University, Düsseldorf, Germany) was used to determine the minimum sample size19. According to this analysis, with α err prob = 0.05, minimum effect size = 0.30, and power (1−β err prob) = 0.80, at least 270 participants were required, with an actual power of 80.0%.

In this study, voluntary participants between the ages of 6 and 11, studying at the first level of primary education, showing normal physical, cognitive and affective development were included. Participants were selected from sedentary individuals who did not actively compete in any sports branch other than physical education and sports lessons. Participants who had developmental disorders, needed special education through inclusive education, had chronic diseases, had any disability, or were taking growth hormone, glucocorticoids, antipsychotic or corticosteroid drugs that could change the body structure were not included in the study. Participants who had difficulty understanding the instructions of the researcher during the tests and who persistently refused to perform the tests were excluded from the study.

All participants, parents and teachers were informed about the purpose, rationale and contribution of the research to the literature. Informed consent forms were signed by all researchers and participants and they were informed that they could withdraw from the study at any time. The necessary permissions were obtained from the Health Sciences Non-Interventional Research Ethics Committee (Approval Number: 2024/4892). The data obtained in the study and all test procedures were performed in accordance with the principles set out in the Declaration of Helsinki. The proposed model can be observed in Fig. 1.

Data collection

Anthropometric measurements

An electronic scale (Tanita BC 420 SMA, Tanita Europe GmbH, Sindelfingen, Germany) was used to measure weight to the nearest 0.1 kg. The children wore only underwear and a T-shirt. A telescopic height-measuring device (Seca 225 stadiometer, Birmingham, UK) was used to measure the children's height to the nearest 0.1 cm while they were barefoot. Skinfold thickness (mm) was measured twice on the right side of the body to the nearest 0.2 mm with a skinfold caliper (Holtain, Holtain Ltd, Pembrokeshire, United Kingdom, range 0–40 mm). Skinfold measurements were taken at the following sites: (1) triceps, between the acromion and the olecranon process on the back of the arm; (2) biceps, at the same level as the triceps skinfold, just above the center of the cubital fossa; (3) subscapular, approximately 20 mm below the tip of the scapula, at a 45° angle to the lateral side of the body; (4) suprailiac, approximately 20 mm above the iliac crest and 20 mm toward the medial line; (5) abdominal, midway between the spina iliaca anterior superior and the umbilicus; (6) quadriceps, vertically at the superior third of the quadriceps muscle; (7) gastrocnemius, at the midpoint of the medial side of the muscle. Circumferences and lengths of the participants were measured with a tape measure with a precision of 1 cm.
In this context, the following were measured: (1) head circumference, around the frontal and occipital regions; (2) neck circumference, just below the larynx; (3) shoulder circumference, just below the acromion, at the end of expiration where the deltoid is most bulging; (4) chest circumference, at the end of expiration, at the 4th rib (costa) in front and the 6th rib on the side; (5) abdominal circumference, at the level of the umbilicus, and trunk circumference, at the subcostal level on the sides; (6) thigh circumference, at mid-thigh; (7) gastrocnemius circumference, where the gastrocnemius muscle is most bulging; (8) fathom length (arm span), between the fingertips while standing against the wall with the arms in 90 degrees of abduction; (9) leg length, between the spina iliaca anterior superior and the medial malleolus; (10) thigh length, between the spina iliaca anterior superior and the medial condyle of the femur; (11) foot length, between the posterior calcaneus and the 2nd phalanx.

20-m sprint performance test

Before the speed tests, 2 pairs of photocells (Smart Speed, Fusion Equipment, AUS) were placed along the running track at 0 and 20 m distances. Participants sprinted twice, starting on their own from a semi-crouched position 0.3 m behind the starting line. The sprint tests were performed on an indoor running track to avoid being affected by weather conditions. The temperature of the area was 22 degrees Celsius. After the familiarization phase, each participant was given two trials. Participants were tested in groups of 10 and after each participant finished the first test, the first participant was taken again for the second test. A rest interval of at least 5 min was ensured between both tests and the best trial was recorded20.

Dataset preprocessing and exploration

The subsequent phase involved the preparation of the dataset, which contains 282 records. Table 1 presents a comprehensive statistical summary of the combined dataset, including the number of observations (N), mean, standard deviation, minimum value, 1st quartile, median, 3rd quartile, and maximum value. The variation of the individual variables is illustrated by scatterplots of the target variable (sprint performance) against each input variable, as shown in Fig. 2. To facilitate analysis of the data records, the entries were normalized using the z-score method, so that the normalized values are centered around zero with a standard deviation of one. The z-score of a value \(x\) of a random variable \(X\) with mean \(M\) and standard deviation \(sd\) is given by Eq. (1).

$$Z\text{-score} = \frac{x - M}{sd}$$
(1)
Table 1 Statistical summary of the sprint performance dataset.
Figure 2

Scatter plots of sprint performance against each feature.
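As a minimal sketch of the z-score normalization in Eq. (1), the following Python snippet standardizes a single variable. The height values are hypothetical and are not study data, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not state which estimator was applied.

```python
import statistics


def z_scores(values):
    """Standardize a list of numbers as (x - M) / sd, following Eq. (1)."""
    mean = statistics.fmean(values)
    # Assumption: sample standard deviation (n - 1 denominator).
    sd = statistics.stdev(values)
    return [(x - mean) / sd for x in values]


# Hypothetical heights in cm (illustrative only, not study data).
heights_cm = [120.0, 128.5, 135.0, 141.0, 150.5]
standardized = z_scores(heights_cm)
```

After standardization the values have a mean of zero and a standard deviation of one, so variables recorded in different units (cm, kg, mm) are placed on a comparable scale.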

The effectiveness of forecasting models is significantly affected by the relevance of the input features. The Pearson correlation coefficient is a widely used approach for assessing the relationship between variables and determining the extent to which the outcome is influenced by the feature space. Figure 3 shows the correlation matrix between sprint performance (the output variable) and the other variables (the input variables). The Pearson correlation coefficient is the standard statistical measure of the linear association between two random variables. The correlation score (CS) of two vectors indicates how strongly they depend on one another. For any two vectors \(x_1\) and \(x_2\), the correlation coefficient is given by Eq. (2), where \(\mathrm{cov}(x_1, x_2)\) is the covariance between \(x_1\) and \(x_2\), and \(\sigma(x_1)\) and \(\sigma(x_2)\) are their variances.

$$cs = \frac{\mathrm{cov}(x_1, x_2)}{\sqrt{\sigma(x_1)\,\sigma(x_2)}}$$
(2)
Figure 3

The correlation matrix between Sprint performance (the output variable) and the other variables (the input variables).

The \(cs\) takes a value between −1 and +1, depending on how strongly and in which direction the input variables \(x_1\) and \(x_2\) are linearly correlated; if they are linearly unrelated, it equals zero. As depicted in Fig. 3, the response variable (sprint performance) exhibits a stronger correlation with the following input variables: age, height, waist circumference, hip circumference, leg length, thigh length, and foot length.
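The correlation score in Eq. (2) can be illustrated with a short Python sketch. The function name `pearson_cs` and the age and sprint-time values are hypothetical and chosen only for illustration; they are not taken from the study dataset.

```python
import math


def pearson_cs(x1, x2):
    """Correlation score of Eq. (2): cov(x1, x2) / sqrt(var(x1) * var(x2))."""
    if len(x1) != len(x2) or len(x1) < 2:
        raise ValueError("both vectors need the same length (at least 2)")
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / (n - 1)
    var1 = sum((a - m1) ** 2 for a in x1) / (n - 1)
    var2 = sum((b - m2) ** 2 for b in x2) / (n - 1)
    if var1 == 0 or var2 == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var1 * var2)


# Hypothetical data: age in years and 20-m sprint time in seconds.
age_years = [6, 7, 8, 9, 10, 11]
sprint_s = [5.1, 4.9, 4.6, 4.4, 4.3, 4.0]
cs = pearson_cs(age_years, sprint_s)
```

In this made-up example the score is strongly negative, because a shorter sprint time means better performance: older children in the example run faster, so time falls as age rises.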

Dataset splitting

The dataset was partitioned into training and testing samples using random allocation, with 80% of the records used for training and 20% for testing. In the second stage, the experiments were repeated using k-fold cross-validation with k = 5. The training samples are used to construct the prediction models, whereas the testing samples are used to evaluate the accuracy of their predictions. The training and testing samples are subsequently passed to a feature selection step in order to identify the features that could affect the accuracy of the prediction.
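The two partitioning schemes described above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the procedure actually used: the function names, the fixed random seed, and the choice to run the 5-fold split over all 282 records are hypothetical.

```python
import random


def train_test_split_indices(n_records, test_fraction=0.2, seed=0):
    """Randomly allocate record indices to training and testing samples."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_test = round(n_records * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(n_records, k=5, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, folds[i]


# 282 records, as in the dataset described in the text.
train_idx, test_idx = train_test_split_indices(282)
cv_folds = list(k_fold_indices(282, k=5))
```

With 282 records the hold-out split yields 226 training and 56 testing records, and in the 5-fold scheme every record is used for validation exactly once.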

Feature space

Three distinct tests were undertaken to determine the optimal technique. The initial experiment involved utilizing the entire feature space for training the machine learning prediction models. In the second experiment, we exclusively employed significant features for the purpose of training and testing the regression models. The significant features that have been studied are those extracted using correlation analysis, namely with a correlation score (CS) greater than or equal to 0.4. These features have a higher Pearson Correlation Coefficient with the outcomes. In the third experiment, Principal Component Analysis (PCA) was employed to reduce the feature space. This allowed us to obtain only the primary vectors that were deemed statistically significant for training and testing the prediction models. PCA is a technique that effectively lowers the dimensionality of the feature space by identifying and extracting the most significant patterns that capture the essential information contained within the input features. The result of performing Principal Component Analysis is the identification of the principal components inside the feature space.
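The correlation-based selection used in the second experiment can be sketched as below. The feature names and values are hypothetical, and applying the 0.4 threshold to the absolute value of the correlation score is an assumption made here so that strongly negative correlations with sprint time are also retained; the text states only that the score was greater than or equal to 0.4.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (no constant columns)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def select_features(features, target, threshold=0.4):
    """Keep features whose |correlation| with the target meets the threshold."""
    selected = {}
    for name, column in features.items():
        cs = pearson(column, target)
        if abs(cs) >= threshold:  # assumption: threshold on the absolute value
            selected[name] = cs
    return selected


# Hypothetical feature columns and sprint times (not study data).
features = {
    "age_years": [6, 7, 8, 9, 10, 11],
    "leg_length_cm": [58, 62, 66, 69, 74, 78],
    "head_circumference_cm": [52, 51, 53, 51, 52, 52],
}
sprint_s = [5.1, 4.9, 4.6, 4.4, 4.3, 4.0]
selected = select_features(features, sprint_s)
```

In this made-up example, age and leg length pass the threshold while head circumference does not, which mirrors the idea of training the regression models only on features that are strongly related to the outcome.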

ML Prediction models construction and evaluation

Regression is a widely employed supervised machine learning methodology utilized for the purpose of forecasting continuous quantitative outcomes. During the process of regression analysis, the estimation of the relationship between an outcome (also known as the response variable) and a number of input variables (also known as predictors) is conducted using a labelled dataset. Different types of regression analysis can be chosen based on various factors, such as the qualities of the variables, the objective variables being examined, or the specific characteristics and form of the regression curve that represents the relationship between the dependent and independent variables. Linear regression, stepwise regression, decision trees, support vector machines, ensembles, and Gaussian process regressors are illustrative examples of conventional machine learning regression methodologies. A regression model that fits well is characterized by projected values that closely match the actual data values. The mean model, which employs the mean value for each projected outcome, is typically employed in cases where there are no informative predictor variables available. The adequacy of a proposed regression model should thus be superior to that of the mean model.

Several performance indicators can be used to assess the accuracy of forecasting models, including R squared (R2), the Mean Squared Error (MSE), and the Root-Mean-Square Error (RMSE). The coefficient of determination, R2, quantifies the degree to which the forecasting model captures the variability observed in the results; it is computed according to Eq. (3). The second metric, given by Eq. (4), is the MSE, which is the average of the squared differences between the observed and predicted values of the target variable. A lower MSE indicates better model performance, since it reflects a smaller average difference between the predicted and actual values. The last metric is the RMSE, the square root of the MSE, which expresses the differences between the observed and predicted values in the units of the target variable. It is computed according to Eq. (5). In Eqs. (3)–(5), \(k\) represents the sample size, \(x_i\) denotes the actual values, \(\tilde{x}_i\) represents the forecasted values, and \(\bar{x}\) signifies the mean of the actual values.

$$R^{2} = 1 - \frac{\sum_{i=1}^{k} (x_i - \tilde{x}_i)^{2}}{\sum_{i=1}^{k} (x_i - \bar{x})^{2}}$$
(3)
$$MSE = \frac{1}{k} \sum_{i=1}^{k} (x_i - \tilde{x}_i)^{2}$$
(4)
$$RMSE = \sqrt{\frac{1}{k} \sum_{i=1}^{k} (x_i - \tilde{x}_i)^{2}}$$
(5)
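The three metrics in Eqs. (3)–(5) can be computed with a few lines of Python. The sketch below uses hypothetical sprint times and predictions, not study results, and also evaluates the mean model mentioned earlier, which predicts the mean of the actual values for every record.

```python
import math


def mse(actual, predicted):
    """Mean squared error, Eq. (4)."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root-mean-square error, Eq. (5): the square root of the MSE."""
    return math.sqrt(mse(actual, predicted))


def r_squared(actual, predicted):
    """Coefficient of determination, Eq. (3)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical sprint times (s) and model predictions (illustrative only).
actual = [4.0, 4.4, 4.6, 5.0]
predicted = [4.1, 4.3, 4.7, 4.9]
# Mean model: the same prediction (the mean of the actual values) everywhere.
mean_model = [sum(actual) / len(actual)] * len(actual)

model_r2 = r_squared(actual, predicted)
baseline_r2 = r_squared(actual, mean_model)
```

The mean model gives an R2 of zero by construction, so any useful regression model should score above it, while its MSE and RMSE should be lower than those of the mean model.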

Ethics approval and consent to participate

The study was conducted in accordance with the principles of the Declaration of Helsinki and was approved by the Ethics Committee of the Institute of Health Sciences of Inonu University under registration number 2024/4892.

Results

Based on the proposed framework, all input features underwent preprocessing, and several feature extraction strategies were employed to generate different combinations of features. In the first experiment carried out under the proposed framework, all of the input features listed in Table 1 were used to model sprint performance. Table 2 shows how well various machine learning regression methods, namely linear regression, decision trees, support vector machines, ensembles, and Gaussian process regression, predicted 20-m sprint performance after each was trained using all of the available input features. The root mean squared error and R squared values obtained for each regression method are presented in Table 2. Figure 4 shows a comparison between the predicted and actual values of the response variable for the various regression methods. The findings in Table 2 show that, when all input features were used, the highest prediction performance was achieved by the Gaussian process regression model.

Table 2 Attained performance of the Sprint prediction model using all input features.

Table 2 presents the results of the analysis, revealing that the Gaussian process regression model outperformed the other regression approaches when trained on all input features. The metrics used to assess prediction performance are the Root Mean Square Error (RMSE), R-squared (R2), and Mean Squared Error (MSE). Notably, Gaussian process regression demonstrated the highest accuracy, with an RMSE of 0.114 and an R2 of 0.96. In comparison, Ensembles (Bagged Trees) exhibited a higher RMSE (0.224) and a lower R2 (0.85), indicating comparatively weaker predictive capability. Linear regression and Support Vector Machines (SVM) with a linear kernel also performed well but fell slightly short of the Gaussian process regression model in terms of accuracy. These findings underscore the effectiveness of the Gaussian process regression approach in predicting the target variable from the full feature set. Figure 4 shows the predicted values of each method against the actual values. Figure 4c and e show that the predicted values fit the regression line less closely than those of the other methods. Table 3 shows that, after dimensionality reduction with the PCA technique, the highest prediction performance was again achieved by the Gaussian process regression model.

Figure 4

A comparison between the predicted value of the response variable (all features).

Table 3 Attained performance of the Sprint prediction model trained on selected features using PCA approach.

To confirm the findings, the PCA technique was used to identify the principal components in the feature space (Fig. 5), and the same set of regression approaches was applied to the identified components. Table 3 compares the RMSE, R2 and MSE values of the applied regression approaches. The Gaussian process regression model achieves the lowest RMSE and MSE values of 0.405 and 0.164, respectively, and the best R2 value of 0.51. In contrast, the Decision Trees and Support Vector Machines (SVM) techniques give the worst MSE value of 0.21, while the lowest R2 value of 0.38 and the highest RMSE value of 0.453 are obtained with the Decision Trees regression approach. These results support the previous results shown in Table 2. Figure 5 shows the predicted values of each method against the actual values. Figure 5a and c show that the linear regression and SVM models have outlier data points that affected the RMSE, R2 and MSE values. Table 4 shows that, using the features selected by correlation analysis, the highest prediction performance was achieved by the linear regression model.

Figure 5

A comparison between the predicted value of the response variable after reducing dimensionality using the PCA technique.

Table 4 Attained performance of the Sprint prediction model trained on selected features using correlation approach.

According to Table 4, the R2 and MSE values for the Linear Regression, Decision Trees, SVM, and Gaussian process regression models are 0.96 and 0.012, respectively. The Linear Regression model achieves a slightly better RMSE value of 0.111 compared to the Decision Trees, while the Gaussian process regression model has an RMSE of 0.112 and the SVM model has an RMSE of 0.113. The best performance is obtained using the Ensembles (Bagged Trees) approach. These results indicate that the correlation-selected features have been accurately identified, as examined through hypothesis testing for the entire population. In multiple linear regression with p input predictors, the null hypothesis is formulated as H0: β1 = β2 = β3 = … = βp = 0, which states that there is no relationship between the outcome and the p input predictors. The alternative hypothesis is Ha: at least one βj ≠ 0 (j = 1, …, p), and the model is considered effective if any βj ≠ 0. In this study, we used the ANOVA F-test for hypothesis testing, excluding variables with zero variance. In regression-based problems, this test measures the significance of the estimated coefficients of linear regression models. The significance of each coefficient is assessed using five quantities: the sum of squares (SS), the degrees of freedom (DF), the mean sum of squares (MS), the F-statistic, and the p-value. Table 5 shows the results of the test for sprint performance. Based on the F-value and p-value, the null hypothesis is rejected, indicating a linear association between the predicted sprint performance and the variables of gender, age, height, weight, hip circumference, thigh circumference, gastrocnemius circumference, leg length, thigh length, and foot length. The findings presented in Fig. 6 show the relationship between age and 20-m sprint performance: children's sprint performance improves as their age increases.
In particular, the effect of age on sprint time becomes evident starting from early childhood. This finding supports the view that age is an important determinant of sprint performance and that training programs should be customized according to age.
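To make the ANOVA quantities concrete, the sketch below computes SS, DF, MS and the F-statistic for a regression of sprint time on a single predictor. It is a simplified illustration, not the analysis reported in Table 5: the data are hypothetical, only one predictor is used, and the p-value is omitted because it requires the cumulative F distribution, which is not part of the Python standard library.

```python
def anova_f_simple(x, y):
    """ANOVA table entries for a least-squares fit of y on one predictor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    fitted = [intercept + slope * a for a in x]
    ss_reg = sum((f - my) ** 2 for f in fitted)              # explained SS
    ss_res = sum((b - f) ** 2 for b, f in zip(y, fitted))    # residual SS
    df_reg, df_res = 1, n - 2
    ms_reg, ms_res = ss_reg / df_reg, ss_res / df_res
    return {
        "SS": (ss_reg, ss_res),
        "DF": (df_reg, df_res),
        "MS": (ms_reg, ms_res),
        "F": ms_reg / ms_res,
    }


# Hypothetical data: age in years and 20-m sprint time in seconds.
age_years = [6, 7, 8, 9, 10, 11]
sprint_s = [5.1, 4.9, 4.6, 4.4, 4.3, 4.0]
table = anova_f_simple(age_years, sprint_s)
```

A large F-statistic means that the predictor explains far more variance than is left in the residuals, which is the evidence used to reject the null hypothesis that its coefficient is zero.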

Table 5 Hypothesis testing using ANOVA test for the target variable, Sprint, and the input predictors.
Figure 6

A comparison between the predicted value of the response variable based on correlation analysis.

Discussion

The study aimed to explore the relationship between morphometric characteristics and 20-m sprint performance in children at the primary education level, utilizing machine learning algorithms. The inclusion of 282 volunteer participants, comprising 130 males and 152 females aged between 6 and 11 years, allowed for a comprehensive analysis. Demographic details, skinfold thickness, diameter and circumference measurements, along with 20-m sprint performance data, were collected, preprocessed, and subsequently divided into training and test datasets. Employing machine learning algorithms, a predictive model was generated, revealing that gender, age, and various morphometric measurements, including height, weight, circumferences, lengths, and skinfold thickness, predicted 20-m sprint performance with an R2 of up to 0.96, referred to below as 96% predictive accuracy. These findings underscore the significant influence of morphometric features on a child’s 20-m sprint performance. The high predictive accuracy, particularly concerning variables like gender, age, and a range of morphometric measurements, suggests their pivotal role in determining sprint capabilities at the primary education stage. The machine learning algorithms effectively captured the complex interplay between these morphometric traits and athletic performance, demonstrating promising potential for practical application in assessing and enhancing children's physical capabilities.

Several studies have examined the morphometric characteristics of children1,21,22,23,24,25. The study by Aerenhouts et al. (2011) provides valuable insights into sprint start and acceleration performance among junior and senior athletes. Its findings underscore the crucial role of acceleration in achieving higher running velocities, particularly in senior athletes1. The study highlights the differences in the developmental progression of sprint capabilities between age categories and emphasizes the impact of muscularity on acceleration performance, shedding light on the intricacies of sprint performance in athletic contexts. In contrast, our study focused on morphometric features predicting 20-m sprint performance in children at the primary education level, utilizing machine learning algorithms. We aimed to create a predictive model for educators and healthcare professionals that offers insights into potential predictors of sprint performance within an educational context; it therefore differs notably in its predictive approach and in its focus on children’s physical education rather than athlete-specific performance dynamics. The study by Papaiakovou et al.21 examined how chronological age and gender influence sprint performance in children and adolescents. Boys and girls both displayed progressive speed development across sprinting phases, and older boys and girls exhibited significantly higher speeds than younger participants. Gender disparities were notable from 15 years onward, suggesting that gender and age jointly affect sprint performance21. Conversely, our study aimed to predict 20-m sprint performance in 6–11-year-old children using morphometric features and machine learning algorithms, achieving a 96% predictive accuracy based on various anthropometric and demographic data.
These studies differ in their age ranges, their focus on specific sprint distances, and the methodologies employed for performance prediction. The study by Mendez-Villanueva et al. (2011) explored age-related differences in sprint qualities and their relationships in highly trained young male soccer players. It examined 10-m sprint (acceleration), flying 20-m sprint (maximum speed), and 10 × 30-m sprint (repeated-sprint) times, investigating the impact of anthropometry and biological maturation on these performances. The study revealed age-based differences in sprint qualities, with performances varying among age groups. In contrast, our study aimed to predict 20-m sprint performance in 6 to 11-year-old children utilizing morphometric features and machine learning algorithms, achieving 96% predictive accuracy. These studies differ in their populations (soccer players versus children) and in their methodologies for performance prediction. The study by Ramos et al.24,25 aimed to describe and establish normative data for morphological and fitness attributes in 281 young male Portuguese basketball players aged 12–16 years. It encompassed measures of chronological age, maturity parameters, morphological features (body mass, height, skinfolds), and fitness components (sprint, change of direction ability, jump, upper body strength). The study provided descriptive and normative values for these attributes, stratified by age and maturity status, offering a framework for evaluating and adjusting athletes’ performance over time24. In contrast, our study focused on predicting 20-m sprint performance in children aged 6–11 using specific morphometric features and machine learning algorithms, achieving a 96% predictive accuracy. These studies differ in their primary objectives; while Ramos et al. 
aimed to establish normative benchmarks for a range of attributes in young basketball players, our study concentrated on predicting sprint performance in primary education children using machine learning techniques. The study by Zapartidis et al.22 conducted a comparative analysis of physical fitness and selected anthropometric characteristics between selected (SP) and non-selected (NSP) young handball players for the Greek preliminary national teams, encompassing both male and female players. Their findings highlighted several distinctions between the two groups. In male players, SP individuals exhibited superior ball velocity, standing long jump, 30-m sprint, estimated VO2max, greater height, and arm span. Female SP players had higher values in ball velocity and standing long jump. Moreover, the study revealed variances within different playing positions, such as male backs, wings, and pivots, showcasing the significance of physical and anthropometric attributes in player selection22. In contrast, our study aimed to predict 20-m sprint performance in primary education children using machine learning techniques, focusing on morphometric features. It achieved a 96% predictive accuracy based on gender, age, and various morphometric measurements, catering to a distinct research objective. The study conducted by Lago-Peñas et al.23, focused on establishing the anthropometric and physiological profiles of young soccer players according to their playing positions and evaluating their relevance for competitive success. It involved 321 young male soccer players and detailed various physical attributes and performance tests for different player positions. Notably, differences in physical characteristics were observed among player positions, correlating with performance in specific tests. 
Additionally, the study attempted to link these profiles to team success, noting slight differences in physiological performance between players from successful and unsuccessful teams, although these were not statistically significant24. Conversely, our study aimed to predict 20-m sprint performance in children aged 6–11 using machine learning algorithms, achieving a 96% predictive accuracy based on a range of morphometric features. The difference lies in our focus on predicting sprint performance in young children with machine learning techniques, as distinct from Lago-Peñas et al.’s emphasis on the physical and performance profiles of soccer players and their relationship to team success23.

A further study by Ramos et al.24,25 examined the relationship between morphological attributes, physical fitness, and basketball performance in elite young Portuguese players. The authors evaluated 416 male and female under-14 basketball athletes, examining maturational parameters, morphological attributes, and fitness parameters, and linked these attributes with game performance. The research revealed correlations between specific physical attributes and basketball performance for both genders, identifying key predictors such as height, body composition, strength, agility, and maturation status. Notably, separate regression models were constructed for males and females, offering predictions of basketball performance based on these attributes22. Conversely, our research centered on using machine learning algorithms to predict 20-m sprint performance in primary education children aged 6–11, achieving a high accuracy rate of 96%. Unlike Ramos et al.’s emphasis on the relationship between physical attributes, maturation, and basketball performance in elite young players, our study focused on forecasting sprint performance in young children using machine learning and specific morphometric features. The difference in focus highlights the distinct aims and methodologies of the two studies: one explores performance predictors in elite youth basketball, and the other employs machine learning for sprint prediction in primary education children.

The comprehensive scope of variables considered in this study sheds light on the multifaceted nature of the factors influencing sprint performance in children. The accuracy of the predictive model supports the feasibility of using morphometric data to evaluate, and potentially improve, children's sprinting abilities, offering useful insights for educators, coaches, and healthcare professionals working with this age group. However, although the predictive model achieved a high accuracy rate, certain limitations need consideration. The study was confined to a specific age range and may not generalize to other age groups. Moreover, external factors such as nutritional habits, genetic predispositions, and other lifestyle components were not incorporated, which limits a holistic understanding of the determinants of sprint performance.

Conclusion

In conclusion, this study aimed to investigate the morphometric features influencing 20-m sprint performance in children at the first level of primary education using machine learning (ML) algorithms. The analysis included 282 volunteer participants aged between 6 and 11 years, comprising 130 males and 152 females. Three distinct approaches were compared to determine the optimal ML technique for predicting sprint performance. First, the entire feature space was used to establish a baseline performance. Second, correlation analysis was used to identify significant features, focusing the model on relevant predictors. Third, Principal Component Analysis (PCA) was used to reduce model complexity while retaining data variance. The results indicated that the correlation-based selected features (age, height, waist circumference, hip circumference, leg length, thigh length, and foot length) yielded the lowest Mean Squared Error (MSE), 0.012, for predicting sprint performance in children. This successful application of correlation analysis underscores the strong linear associations between the selected features and the target variable and supports their reliability as predictors. The findings of this study provide a solid foundation for further exploration of the relationship between morphometric characteristics and 20-m sprint performance in children. Addressing the limitations through larger and more diverse samples, the inclusion of additional variables, and longitudinal studies could strengthen the generalizability and depth of understanding. Ultimately, this research offers a promising starting point for future investigations in the domain of childhood physical performance, guiding the development of tailored interventions and training programs to optimize children's athletic capabilities.
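
To make the comparison between the full feature space and the correlation-selected features concrete, the following is a minimal sketch under stated assumptions, not the pipeline used in this study. The data are synthetic, the regressor is plain ordinary least squares used as a stand-in for the ML models, the PCA variant is omitted for brevity, and the names `pearson`, `select_by_correlation`, `fit_ols`, `demo`, the `noise` feature, and the 0.4 correlation threshold are all hypothetical choices made for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def select_by_correlation(columns, target, threshold):
    """Keep the features whose absolute correlation with the target meets the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= threshold]


def fit_ols(rows, y):
    """Least-squares fit with an intercept, solving the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * t for x, t in zip(X, y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(weights, rows):
    """Apply the fitted intercept and coefficients to each row."""
    return [weights[0] + sum(w * v for w, v in zip(weights[1:], r)) for r in rows]


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def demo(seed=0, n=200, threshold=0.4):
    rng = random.Random(seed)
    # Synthetic data for illustration only: NOT the study's measurements.
    age = [rng.uniform(6, 11) for _ in range(n)]
    height = [100 + 5 * a + rng.gauss(0, 4) for a in age]
    noise = [rng.gauss(0, 1) for _ in range(n)]  # feature unrelated to the target
    sprint = [6.5 - 0.2 * a - 0.005 * h + rng.gauss(0, 0.15)
              for a, h in zip(age, height)]
    columns = {"age": age, "height_cm": height, "noise": noise}
    split = int(0.75 * n)  # first 75% for training, remainder held out

    def evaluate(names):
        rows = list(zip(*(columns[m] for m in names)))
        weights = fit_ols(rows[:split], sprint[:split])
        return mse(sprint[split:], predict(weights, rows[split:]))

    # Select features on the training portion only, to avoid leakage.
    train_cols = {name: values[:split] for name, values in columns.items()}
    selected = select_by_correlation(train_cols, sprint[:split], threshold)
    return selected, evaluate(list(columns)), evaluate(selected)


if __name__ == "__main__":
    selected, mse_full, mse_selected = demo()
    print("selected features:", selected)
    print("held-out MSE, all features:", round(mse_full, 4))
    print("held-out MSE, selected features:", round(mse_selected, 4))
```

In this sketch the correlation filter is computed on the training portion only, so that the held-out MSE is not influenced by the selection step. Whether the filtered model outperforms the full model depends on the data; the sketch only shows how the two feature sets can be compared on the same metric.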