
Reliability of health-related physical fitness tests in European adolescents. The HELENA Study



Objective: To examine the reliability of a set of health-related physical fitness tests used in the European Union-funded Healthy Lifestyle in Europe by Nutrition in Adolescence (HELENA) Study on lifestyle and nutrition among adolescents.

Design: A set of physical fitness tests was performed twice in a study sample, 2 weeks apart, by the same researchers.

Subjects: A total of 123 adolescents (69 males and 54 females, aged 13.6±0.8 years) from 10 European cities participated in the study.

Measurements: Flexibility, muscular fitness, speed/agility and aerobic capacity were tested using the back-saver sit and reach, handgrip, standing broad jump, Bosco jumps (squat jump, counter movement jump and Abalakov jump), bent arm hang, 4 × 10 m shuttle run, and 20-m shuttle run tests.

Results: ANOVA showed neither systematic bias nor sex differences for any of the studied tests, except for the back-saver sit and reach test, in which a borderline significant sex difference was observed (P=0.044). The Bland–Altman plots graphically showed the reliability patterns, in terms of systematic errors (bias) and random error (95% limits of agreement), of the physical fitness tests studied. The observed systematic error for all the fitness assessment tests was nearly 0.

Conclusion: Neither a learning nor a fatigue effect was found for any of the physical fitness tests when repeated. The results also suggest that reliability did not differ between male and female adolescents. Collectively, it can be stated that the reliability of the set of physical fitness tests examined in this study is acceptable. The data provided contribute to a better understanding of physical fitness assessment in young people.


Introduction

Health-related physical fitness includes the characteristics of functional capacity and is affected by the physical activity level and other lifestyle factors. Maintaining an appropriate level of health-related physical fitness allows a person to meet emergencies, reduce the risk of disease and injury, work efficiently, participate in and enjoy physical activity (sports, recreation, leisure) and look one's physical best. A high level of health-related physical fitness supports optimum health and helps prevent the onset of disease and problems associated with inactivity at all ages.1, 2, 3, 4

The HELENA (Healthy Lifestyle in Europe by Nutrition in Adolescence) Study5 includes a thorough assessment of health-related physical fitness. For this purpose, a set of standardized tests has been chosen, and the scientific rationale for their selection has been published elsewhere.6

Reliability can be defined as the consistency of measurements. Terms that have been used interchangeably with reliability in the literature are ‘repeatability’, ‘reproducibility’, ‘consistency’, ‘agreement’, ‘concordance’ and ‘stability’. Another related but different concept is validity. Validity is the ability of the measurement tool to measure what it is designed to measure. The validity of a tool is judged by comparison with a ‘gold standard’ method. Definitions and detailed discussions about reliability issues in sport sciences-related research and general science can be found in the reviews published by Atkinson and Nevill,7 Rothwell8 and Bruton et al.9

Realistically, some amount of error is always present when collecting data. The main components of measurement error are systematic bias (for example, general learning on the tests) and random error due to biological or mechanical variation. Several statistical methods have been used to evaluate certain aspects of reliability. Correlation analysis has been commonly used, but it has limitations that are discussed later in this paper. Studying the agreement between two measurements by means of the Bland–Altman approach appears to be a more appropriate and useful method for reliability analyses.7, 8, 9

In this paper, we report the outcome of reliability testing, on a test-retest basis, of the set of health-related physical fitness tests used in the HELENA Study.6 The outcome is discussed and compared with the outcome of an extensive overview of published data on reliability testing.


Methods

Study design

The HELENA Study is a European Union-funded project on lifestyle and nutrition among adolescents from 10 European cities: Athens and Heraklion (Greece), Dortmund (Germany), Ghent (Belgium), Lille (France), Pécs (Hungary), Rome (Italy), Stockholm (Sweden), Vienna (Austria) and Zaragoza (Spain). The study was approved by the Research Ethics Committees of each city involved. Written informed consent was obtained from the parents of the adolescents and the adolescents themselves.

From a total sample of 204 adolescents who participated in the HELENA pilot study and performed all the physical fitness tests, a subsample of 123 adolescents (69 males and 54 females, aged 13.6±0.8 years) were asked to undergo the tests again 2 weeks later. The same inter-trial period has been used earlier in similar reliability studies carried out on healthy young people.10 The two physical fitness measurements were performed at the same time of day by the same researchers. Those adolescents who took part in the retest study did not differ in age, height, weight or body mass index (BMI) (P>0.05) from those adolescents who did not do so.

Anthropometric measurements

Anthropometric measurements were made with the participants barefoot and in their underwear. Weight was measured using an electronic scale (Type SECA 861) and recorded to the nearest 100 g. Height was measured using a telescopic height measuring instrument (Type SECA 225). The instrument was calibrated before the measurements with a metal calibrating rod. Height was recorded to the last complete 1 mm.

Physical fitness assessment

An extended and detailed manual of operations was designed for and thoroughly read by every researcher involved in fieldwork before the data collection started. In addition, a workshop training week was carried out in Zaragoza (Spain) in January 2006, to standardize and harmonize the measurement of physical fitness. The field workers were asked to always perform the same fitness tests so that they would become specialized in a single fitness measurement, and to minimize the potential inter-rater variability within each centre. The instructions given to the participants in every test were standardized for all the cities and were translated into the local language. In this way, the same verbal information was given to all participants in the HELENA Study.

The health-related physical fitness components, that is, flexibility, muscular strength, speed/agility and aerobic capacity (hereafter called cardiorespiratory fitness), were assessed by the physical fitness tests described below. The scientific rationale for the selection of all of these tests has been published earlier.6

  1)

    Back-saver sit and reach test (flexibility assessment): a standard box with a small bar, which has to be pushed by the participant, was used to perform the test. The adolescent bends his/her trunk and reaches forward as far as possible from a seated position, with one leg straight and the other bent at the knee. The test is performed once again with the opposite leg. The farthest position of the bar reached by each leg was scored in centimetres and the average of the distances reached by both legs was used in the analysis.

  2)

    Handgrip test (maximum handgrip strength assessment): a hand dynamometer with adjustable grip was used (TKK 5101 Grip D; Takey, Tokyo, Japan). The participant squeezes gradually and continuously for at least 2 s, performing the test with the right and left hands in turn, using the optimal grip span. The handgrip span was adjusted according to hand size using the equation that we have developed specifically for adolescents.11 The maximum score in kilograms for each hand was recorded. The average of the scores achieved in both handgrip tests was used in the analysis.

  3)

    Standing broad jump test (lower limb explosive strength assessment): from a starting position immediately behind a line, standing with feet approximately shoulder's width apart, the adolescent jumps as far as possible with feet together. The result was recorded in centimetres. A non-slip hard surface, chalk and a tape measure were used to perform the test.

  4)

    The Bosco protocol is composed of three different jumps: (4.1) Squat jump (lower limb explosive strength assessment): the adolescent performs a vertical jump without rebound movements starting from a half-squat position, keeping both knees bent at 90°, the trunk straight and both hands on hips. Previous counter movements are not allowed. (4.2) Counter movement jump (lower limb explosive strength and elastic component assessment): in a standing position, with legs straight and both hands on hips, the adolescent performs a vertical jump with an earlier fast counter movement. (4.3) Abalakov jump (lower limb explosive strength, elastic component and intermuscular coordination capacity assessment): the Abalakov jump is similar to the counter movement jump, but now the adolescent is allowed to freely coordinate the arms and trunk movements to reach the maximum height. The jump height is recorded in centimetres. The Infrared Platform ERGO JUMP Plus—BOSCO SYSTEM (Byomedic, SCP, Barcelona, Spain) was used for the jump assessment.

  5)

    Bent arm hang test (upper limb endurance strength assessment): the adolescent hangs from a bar for as long as possible, with the arms bent at 90 degrees. The palms face forward and the chin must be over the bar's plane. The time spent in this position, to the nearest tenth of a second, is recorded. A cylindrical horizontal bar and a stopwatch were used to perform the test.

  6)

    4 × 10 m shuttle run test (speed of movement, agility and coordination assessment): two parallel lines are drawn on the floor 10 m apart. The adolescent runs as fast as possible from the starting line to the other line and returns to the starting line, crossing each line with both feet every time. This is performed twice, covering a distance of 40 m (4 × 10 m). Every time the adolescent crosses any of the lines, he/she should pick up (the first time) or exchange (second and third time) a sponge that has earlier been placed behind the lines. The stopwatch is stopped when the adolescent crosses the end line with one foot. The time taken to complete the test is recorded to the nearest tenth of a second. A slip-proof floor, four cones, a stopwatch and three sponges were used to perform the test.

  7)

    20-m shuttle run test (cardiorespiratory fitness assessment): the adolescents perform the test as described earlier by Léger et al.12 Participants are required to run between two lines 20 m apart, while keeping pace with audio signals emitted from a pre-recorded CD. The initial speed is 8.5 km h⁻¹, which is increased by 0.5 km h⁻¹ every minute (1 min equals one stage). Participants are instructed to run in a straight line, to pivot on completing a shuttle, and to pace themselves in accordance with the audio signals. The test is finished when the participant fails to reach the end lines concurrent with the audio signals on two consecutive occasions. Otherwise, the test ends when the participant stops because of fatigue. All measurements were carried out under standardized conditions in an indoor rubber-floored gymnasium. The participants were encouraged to keep running as long as possible throughout the course of the test. The last completed stage or half-stage at which the participant dropped out was scored. A gymnasium or space large enough to mark out a 20 m track, a 20 m tape measure, a CD player and a CD with the audio signals recorded were used to perform the test.

All the tests were performed twice and the best score was retained, except for the bent arm hang and the 20-m shuttle run tests, which were performed only once.
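The scoring rules described for the tests can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, stage numbering is assumed to start at 1 (running at 8.5 km h⁻¹), and for timed tests such as the 4 × 10 m shuttle run the "best" score is assumed to be the shortest time, which the text does not state explicitly.

```python
def shuttle_run_speed_kmh(stage):
    """Running speed during a given 1-min stage of the 20-m shuttle run.

    Assumes stage 1 is run at the initial 8.5 km/h and that each
    subsequent stage adds 0.5 km/h.
    """
    if stage < 1:
        raise ValueError("stage numbering is assumed to start at 1")
    return 8.5 + 0.5 * (stage - 1)


def best_of(attempts, lower_is_better=False):
    """Retain the best of the repeated attempts.

    lower_is_better=True is an assumption for timed tests, where a
    shorter time is taken to be the better performance.
    """
    return min(attempts) if lower_is_better else max(attempts)


def mean_of_sides(left, right):
    """Average of the two body sides (both legs or both hands)."""
    return (left + right) / 2


# Hypothetical scores for one participant (not HELENA data).
broad_jump_cm = best_of([182.0, 190.5])
shuttle_4x10_s = best_of([12.4, 12.1], lower_is_better=True)
handgrip_kg = mean_of_sides(best_of([27.5, 28.0]), best_of([30.0, 29.0]))
print(broad_jump_cm, shuttle_4x10_s, handgrip_kg, shuttle_run_speed_kmh(5))
```

Keeping the direction of "best" as an explicit argument avoids silently retaining the slowest time for tests scored in seconds.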

Review of fitness reliability studies

The search strategy for identifying the fitness reliability studies was based on combinations of the following terms: fitness, reliability, repeatability, reproducibility and measurement error. The databases used were Medline, PubMed and SportDiscus. The electronic search identified 112 publications that concerned the reliability of fitness assessment. The inclusion criteria were studies involving healthy children and/or adolescents aged 18 years or younger, and those published since 1990. In the end, 22 studies met the inclusion criteria and were selected. An additional search was carried out to find fitness reliability studies that used the Bland–Altman approach, including healthy or unhealthy people at any age.

Statistical analysis

The data are presented as means±s.d., unless otherwise stated. Both the potential systematic bias (H0: mean inter-trial difference = 0; H1: mean inter-trial difference ≠ 0) and sex differences on the studied physical fitness tests were analysed by one-way analysis of variance (ANOVA) on the inter-trial difference (test 2 − test 1, hereafter called T2−T1) with sex as a fixed factor. As no sex-specific effect on reliability of the studied physical fitness tests was found, the analyses were performed for both males and females together. The agreement between the corresponding fitness variables obtained during the two successive measurements was also examined graphically by plotting the difference between each pair of measurements against their mean, according to the Bland and Altman approach.13, 14 The 95% limits of agreement for all the physical fitness variables were calculated as the inter-trial mean difference ± 1.96 s.d. (of the inter-trial differences).
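As an illustration of the calculation just described, the following Python sketch computes the bias (mean of T2−T1), the 95% limits of agreement (bias ± 1.96 s.d. of the differences) and the points that would be plotted. It is a minimal sketch, not the SPSS procedure used in the study; the function name and the handgrip values are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(t1, t2):
    """Return bias, 95% limits of agreement and plot points for paired trials."""
    if len(t1) != len(t2) or len(t1) < 2:
        raise ValueError("need two equally long series with at least two pairs")
    diffs = [b - a for a, b in zip(t1, t2)]        # T2 - T1 for each participant
    means = [(a + b) / 2 for a, b in zip(t1, t2)]  # x-axis of the Bland-Altman plot
    bias = mean(diffs)                             # systematic error
    sd = stdev(diffs)                              # sample s.d. of the differences
    return {
        "bias": bias,
        "lower": bias - 1.96 * sd,                 # lower 95% limit of agreement
        "upper": bias + 1.96 * sd,                 # upper 95% limit of agreement
        "points": list(zip(means, diffs)),
    }


# Hypothetical handgrip scores in kg (not HELENA data).
trial_1 = [24.0, 30.0, 27.0, 35.0, 22.0]
trial_2 = [25.0, 29.0, 28.0, 34.0, 23.0]
result = bland_altman(trial_1, trial_2)
print(result["bias"], result["lower"], result["upper"])
```

A bias close to 0 with narrow limits indicates good agreement; whether the limits are narrow enough remains a judgement about the intended use of the test.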

As the standard deviation for a sample of two observations can be written as |T2−T1|/√2, the presence of heteroscedasticity can then be analysed in line with the Bland–Altman approach by using the Kruskal–Wallis test, a non-parametric one-way ANOVA. A significant P-value would confirm heteroscedasticity, which means that the inter-trial variability, |T2−T1|, of a physical fitness test would differ across the physical fitness level groups. Sex-specific quartiles were estimated for every test performed and were used to classify the adolescents into different fitness levels. Distribution of the residuals for the inter-trial difference variables (T2−T1), but not for the absolute difference variables |T2−T1|, showed a satisfactory pattern. Therefore, a parametric approach (ANOVA) was used for the former and a non-parametric approach (Kruskal–Wallis) for the latter.
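The heteroscedasticity check can be sketched as follows: compute the within-pair s.d. for each participant, group participants by quartile of fitness level, and compare the groups with the Kruskal–Wallis H statistic. Several points are assumptions made only for this illustration: the fitness level is taken as the mean of the two trials, quartiles are computed on the whole sample rather than per sex, the H statistic omits the correction for ties, and the data are hypothetical. The P-value would come from comparing H with a chi-square distribution with (number of groups − 1) degrees of freedom, which is not computed here.

```python
from bisect import bisect_left
from math import sqrt
from statistics import quantiles


def within_pair_sd(t1, t2):
    """s.d. of a sample of two observations: |T2 - T1| / sqrt(2)."""
    return abs(t2 - t1) / sqrt(2)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction)."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


def variability_by_quartile(t1, t2):
    """Within-pair s.d. values grouped by quartile of the mean of both trials."""
    level = [(a + b) / 2 for a, b in zip(t1, t2)]
    cuts = quantiles(level, n=4)                   # three quartile cut points
    groups = [[], [], [], []]
    for a, b, m in zip(t1, t2, level):
        groups[bisect_left(cuts, m)].append(within_pair_sd(a, b))
    return [g for g in groups if g]                # drop any empty group


# Hypothetical bent arm hang times in seconds (not HELENA data).
trial_1 = [5.0, 8.0, 12.0, 15.0, 20.0, 26.0, 33.0, 41.0]
trial_2 = [5.2, 7.7, 12.5, 14.2, 21.5, 24.0, 36.0, 37.0]
h = kruskal_wallis_h(variability_by_quartile(trial_1, trial_2))
print(round(h, 3))
```

In this made-up example the inter-trial variability grows with the score, so the rank sums rise from the lowest to the highest quartile and H is large relative to its null distribution.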

All calculations were performed using SPSS v.15.0 software for Windows. For all analyses, the significance level was 5%.


Results

The physical characteristics of the study sample are shown in Table 1, together with the mean values and standard deviations for the two trials and the mean inter-trial difference for the physical fitness tests in male and female adolescents. Neither systematic bias nor sex differences were found for any of the studied tests, except for the back-saver sit and reach test, in which a borderline significant sex difference was observed (P=0.044).

Table 1 Reliability of physical fitness tests (mean±s.d.) in male (n=69) and female (n=54) adolescents

The Bland–Altman plots (Figures 1, 2 and 3) graphically showed the reliability patterns, in terms of systematic errors (bias or mean inter-trial differences) and random error (95% limits of agreement), of the physical fitness tests studied. It can be observed that the systematic error when fitness assessment was performed twice was nearly 0 for all the tests.

Figure 1

Bland–Altman plot of the back-saver sit and reach, handgrip and standing broad jump tests in adolescents. The central dotted line represents the mean differences between the second trial (T2) and the first trial (T1); the upper and lower dotted lines represent the upper and lower 95% limits of agreement (mean differences±1.96 s.d. of the differences), respectively.

Figure 2

Bland–Altman plot of the Bosco jumps, that is, squat jump, counter movement jump and Abalakov jump in adolescents. The central dotted line represents the mean differences between the second trial (T2) and the first trial (T1); the upper and lower dotted lines represent the upper and lower 95% limits of agreement (mean differences±1.96 s.d. of the differences), respectively.

Figure 3

Bland–Altman plot of the bent arm hang, 4 × 10 m shuttle run and 20-m shuttle run tests in adolescents. The central dotted line represents the mean differences between the second trial (T2) and the first trial (T1); the upper and lower dotted lines represent the upper and lower 95% limits of agreement (mean differences±1.96 s.d. of the differences), respectively.

The heteroscedasticity analysis showed that the higher the bent arm hang score (quartiles), the higher was the inter-trial difference (P<0.001). Moreover, it was observed that adolescents who scored high in the back-saver sit and reach test had a better inter-trial agreement compared with those adolescents who scored lower (P<0.01).


Discussion

Review of fitness reliability studies and methodological discussion

Table 2 summarizes the fitness reliability studies carried out in healthy young people since 1990. The most frequently used statistical approaches to assess overall agreement between measurements were correlation methods (used in 95% of the reviewed studies). However, correlation is a measure of the strength of association between two variables but not necessarily a measure of agreement. Its use is considered inappropriate for that purpose because, first, it is not possible to assess systematic bias, and second, it depends on the range of the values in the sample.7, 14 For example, if an observer always overestimates (a positive systematic bias) the 4 × 10 m shuttle run test score by 20% compared with another observer, the correlation between the measurements would be perfect, but they would never agree. Moreover, the more heterogeneous the study sample, the greater the correlation. The intraclass correlation coefficient is an appropriate overall summary measure of agreement between measurements, which reflects both systematic bias and random error in test scores.8 However, it does not give any information on any variation in agreement with the size of the measurement, and it is also affected by the sample range.7
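The 4 × 10 m shuttle run example above can be checked numerically. In this minimal sketch with hypothetical times, a second observer records every time 20% higher than the first: the Pearson correlation is 1, yet the mean difference between observers is far from 0. The function name and data are illustrative assumptions.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical 4 x 10 m shuttle run times in seconds (not HELENA data).
observer_a = [11.0, 11.5, 12.0, 12.5, 13.0, 14.0]
observer_b = [t * 1.2 for t in observer_a]        # constant 20% overestimation

r = pearson_r(observer_a, observer_b)
bias = mean(b - a for a, b in zip(observer_a, observer_b))
print(round(r, 6), round(bias, 2))
```

The correlation cannot reveal the overestimation because it is unchanged by any linear rescaling of one series, whereas the mean difference exposes it directly.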

Table 2 Review of fitness reliability studies (n=22) published since 1990 in healthy young people

Several reviews have proposed the Bland–Altman approach as an appropriate descriptive method for a meaningful and useful interpretation of reliability.7, 8, 9 According to the review performed, only two (9%) of the physical fitness reliability studies carried out in healthy young people used the Bland–Altman approach.10, 35 When the search was extended to healthy or unhealthy people of any age, only eight additional studies using the Bland–Altman approach were found.36, 37, 38, 39, 40, 41, 42, 43 Given that physical fitness can already be considered a marker of health in this period of life,3, 6, 44 information about the reliability of health-related physical fitness tests in young people is of interest. In addition, the review also shows that both cardiorespiratory fitness and muscular strength were the most studied physical fitness qualities (in 41 and 50%, respectively, of the reviewed studies), whereas data about speed-agility, coordination and flexibility in young people are lacking (used in 4–14% of the reviewed studies).

Collectively, the review and the methodological discussion above suggest that methods such as correlation or regression have important limitations and are not sufficient for studying reliability. In addition, the decision about what is ‘acceptable’ agreement is a scientific judgement; statistics alone cannot answer this question, as measurements that agree well enough for one purpose may not agree well enough for another.45 For instance, if blood glucose concentration is measured twice, minutes apart, the acceptable error will be much lower than if handgrip strength is measured 2 weeks apart.

Physical fitness reliability analysis

One of the main hypotheses tested in this study was whether a learning effect (positive systematic bias) exists among the physical fitness tests studied when repeated measurements are performed. Li et al.10 examined the reliability of a 6-min walk test in adolescents. They found a bias of 15 m (95% limits of agreement: −35 to 65), although no significant difference was found between the two measurements. Johnston et al.35 studied the test-retest reliability of several physiological variables during a maximal cardiopulmonary exercise test in children. The peak oxygen consumption showed a bias of 1.4 ml kg⁻¹ min⁻¹ (95% limits of agreement: −3 to 5 ml kg⁻¹ min⁻¹), but no significant difference was found between test and retest scores. In our study, the bias for the physical fitness variables studied was close to 0 in most of the tests. The results suggest that neither learning nor fatigue (negative systematic bias) effects occurred when physical fitness was assessed with the tests used in this study, on a test-retest basis, in adolescents.

Results from the heteroscedasticity analyses and Bland–Altman plots indicate that the better the performance in the bent arm hang test (longer time), the poorer the agreement, whereas the better the performance in the back-saver sit and reach test (farther reach), the better the agreement.

In addition, the reliability of the physical fitness tests analysed is similar between male and female adolescents. This result is in accordance with data reported in men and women who performed a cardiopulmonary test in laboratory conditions.39

The wide variety of physical fitness tests examined in this study, the relatively large number of participants involved in the study and the use of adolescents from 10 European cities are the notable strengths of this study.

In conclusion, our study provides reference values for reliability of a wide set of physical fitness tests in European adolescents. Neither a learning nor a fatigue effect was found for any of the physical fitness tests when repeated. The results also suggest that reliability did not differ between male and female adolescents. Collectively, it can be stated that the reliability of the physical fitness tests examined in this study is acceptable. The data provided contribute to a better understanding of physical fitness assessment in young people.


References

  1. Myers J, Prakash M, Froelicher V, Do D, Partington S, Atwood JE. Exercise capacity and mortality among men referred for exercise testing. N Engl J Med 2002; 346: 793–801.
  2. Gulati M, Pandey DK, Arnsdorf MF, Lauderdale DS, Thisted RA, Wicklund RH et al. Exercise capacity and the risk of death in women: the St James Women Take Heart Project. Circulation 2003; 108: 1554–1559.
  3. Carnethon MR, Gulati M, Greenland P. Prevalence and cardiovascular disease correlates of low cardiorespiratory fitness in adolescents and adults. JAMA 2005; 294: 2981–2988.
  4. Andersen LB, Harro M, Sardinha LB, Froberg K, Ekelund U, Brage S et al. Physical activity and clustered cardiovascular risk in children: a cross-sectional study (The European Youth Heart Study). Lancet 2006; 368: 299–304.
  5. Moreno LA, González-Gross M, Kersting M, Molnár D, de Henauw S, Beghin L et al. Healthy lifestyle in Europe by nutrition in adolescence. The HELENA Study. Public Health Nutr 2008; 11: 288–299.
  6. Ruiz JR, Ortega FB, Gutierrez A, Meusel D, Sjöström M, Castillo MJ. Health-related fitness assessment in childhood and adolescence: a European approach based on the AVENA, EYHS and HELENA studies. J Public Health 2006; 14: 269–277.
  7. Atkinson G, Nevill AM. Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine. Sports Med 1998; 26: 217–238.
  8. Rothwell PM. Analysis of agreement between measurements of continuous variables: general principles and lessons from studies of imaging of carotid stenosis. J Neurol 2000; 247: 825–834.
  9. Bruton A, Conway JH, Holgate ST. Reliability: what is it, and how is it measured? Physiotherapy 2000; 86: 94–99.
  10. Li AM, Yin J, Yu CC, Tsang T, So HK, Wong E et al. The six-minute walk test in healthy children: reliability and validity. Eur Respir J 2005; 25: 1057–1060.
  11. Ruiz JR, Espana-Romero V, Ortega FB, Sjostrom M, Castillo MJ, Gutierrez A. Hand span influences optimal grip span in male and female teenagers. J Hand Surg [Am] 2006; 31: 1367–1372.
  12. Leger LA, Mercier D, Gadoury C, Lambert J. The multistage 20 metre shuttle run test for aerobic fitness. J Sports Sci 1988; 6: 93–101.
  13. Altman DG, Bland JM. Measurement in medicine: the analysis of method comparison studies. Statistician 1983; 32: 307–317.
  14. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986; 1: 307–310.
  15. Erbaugh SJ. Reliability of physical fitness tests administered to young children. Percept Mot Skills 1990; 71: 1123–1128.
  16. Cotten DJ. An analysis of the NCYFS II Modified Pull-up Test. Res Q Exerc Sport 1990; 61: 272–274.
  17. Atwater SW, Crowe TK, Deitz JC, Richardson PK. Interrater and test-retest reliability of two pediatric balance tests. Phys Ther 1990; 70: 79–87.
  18. Engelman ME, Morrow Jr JR. Reliability and skinfold correlates for traditional and modified pull-ups in children grades 3–5. Res Q Exerc Sport 1991; 62: 88–91.
  19. Kollath JA, Safrit MJ, Zhu W, Gao LG. Measurement errors in modified pull-ups testing. Res Q Exerc Sport 1991; 62: 432–435.
  20. Rikli RE, Petray C, Baumgartner TA. The reliability of distance run tests for children in grades K-4. Res Q Exerc Sport 1992; 63: 270–276.
  21. Liu NY, Plowman SA, Looney MA. The reliability and validity of the 20-meter shuttle test in American students 12–15 years old. Res Q Exerc Sport 1992; 63: 360–365.
  22. Pate RR, Burgess ML, Woods JA, Ross JG, Baumgartner T. Validity of field tests of upper body muscular strength. Res Q Exerc Sport 1993; 64: 17–24.
  23. McManis BG, Wuest DA. Stability reliability of the modified push-up in children. Res Q Exerc Sport 1994; 65 (Suppl): A58–A59 (abstract).
  24. Patterson P, Wiksten DL, Ray L, Flanders C, Sanphy D. The validity and reliability of the back saver sit-and-reach test in middle school girls and boys. Res Q Exerc Sport 1996; 67: 448–451.
  25. Mahar MT, Rowe DA, Parker CR, Mahar FJ, Dawson DM, Holt JE. Criterion-referenced and norm-referenced agreement between the mile run/walk and PACER. Meas Phys Educ Exerc Sci 1997; 1: 245–258.
  26. Anderson EA, Zhang JJ, Rudisill ME, Gaa J. Validity and reliability of a timed curl-up test: development of a parallel form for the FITNESSGRAM abdominal strength test. Res Q Exerc Sport 1997; 68 (Suppl): A-51.
  27. Patterson P, Rethwisch N, Wiksten D. Reliability of the trunk lift in high school boys and girls. Meas Phys Educ Exerc Sci 1997; 1: 145–151.
  28. McSwegin PJ, Plowman SA, Wolff GM, Guttenberg GL. The validity of a one-mile walk test for high school age individuals. Meas Phys Educ Exerc Sci 1998; 2: 47–63.
  29. McManis BG, Baumgartner TA, West DA. Objectivity and reliability of the 90° pushup test. Meas Phys Educ Exerc Sci 2000; 4: 57–67.
  30. Figueroa-Colon R, Hunter GR, Mayo MS, Aldridge RA, Goran MI, Weinsier RL. Reliability of treadmill measures and criteria to determine VO2max in prepubertal girls. Med Sci Sports Exerc 2000; 32: 865–869.
  31. Patterson P, Bennington J, De La Rosa T. Psychometric properties of child- and teacher-reported curl-up scores in children aged 10–12 years. Res Q Exerc Sport 2001; 72: 117–124.
  32. Tong TK, Fu FH, Chow BC. Reliability of a 5-min running field test and its accuracy in VO2max evaluation. J Sports Med Phys Fitness 2001; 41: 318–323.
  33. Romain BS, Mahar MT. Norm-referenced and criterion-referenced reliability of the push-up and modified pull-up. Meas Phys Educ Exerc Sci 2001; 5: 67–80.
  34. Alricsson M, Harms-Ringdahl K, Werner S. Reliability of sports related functional tests with emphasis on speed and agility in young athletes. Scand J Med Sci Sports 2001; 11: 229–232.
  35. Johnston KN, Jenkins SC, Stick SM. Repeatability of peak oxygen uptake in children who are healthy. Pediatr Phys Ther 2005; 17: 11–17.
  36. de Greef MH, Sprenger SR, Elzenga CT, Popkema DY, Bennekers JH, Niemeijer MG et al. Reliability and validity of a twelve-minute walking test for coronary heart disease patients. Percept Mot Skills 2005; 100: 567–575.
  37. Taylor S, Frost H, Taylor A, Barker K. Reliability and responsiveness of the shuttle walking test in patients with chronic low back pain. Physiother Res Int 2001; 6: 170–178.
  38. Buckley JP, Sim J, Eston RG, Hession R, Fox R. Reliability and validity of measures taken during the Chester step test to predict aerobic power and to prescribe aerobic exercise. Br J Sports Med 2004; 38: 197–205.
  39. Bingisser R, Kaplan V, Scherer T, Russi EW, Bloch KE. Effect of training on repeatability of cardiopulmonary exercise performance in normal men and women. Med Sci Sports Exerc 1997; 29: 1499–1504.
  40. Lamb KL, Eston RG, Corns D. Reliability of ratings of perceived exertion during progressive treadmill exercise. Br J Sports Med 1999; 33: 336–339.
  41. van ′t Hul A, Gosselink R, Kwakkel G. Constant-load cycle endurance performance: test-retest reliability and validity in patients with COPD. J Cardiopulm Rehabil 2003; 23: 143–150.
  42. Balfour-Lynn IM, Prasad SA, Laverty A, Whitehead BF, Dinwiddie R. A step in the right direction: assessing exercise tolerance in cystic fibrosis. Pediatr Pulmonol 1998; 25: 278–284.
  43. Wallman K, Goodman C, Morton A, Grove R, Dawson B. Test-retest reliability of the aerobic power index component of the tri-level fitness profile in a sedentary population. J Sci Med Sport 2003; 6: 443–454.
  44. Ruiz JR, Ortega FB, Meusel D, Harro M, Oja P, Sjöström M. Cardiorespiratory fitness is associated with features of metabolic risk factors in children. Should cardiorespiratory fitness be assessed in a European health monitoring system? The European Youth Heart Study. J Public Health 2006; 14: 94–102.
  45. Bland JM, Altman DG. Measuring agreement in method comparison studies. Stat Methods Med Res 1999; 8: 135–160.

Acknowledgements


The HELENA Study was carried out with the financial support of the European Community Sixth RTD Framework Programme (Contract FOOD-CT-2005-007034). It is also being supported by grants from CSD in Spain (109/UPB31/03 and 13/UPB20/04), the Spanish Ministry of Education (EX-2007-1124; AP-2004-2745; AP2005-4358) and the ALPHA study, a European Union-funded study, in the framework of the Public Health Programme (Ref: 2006120). The researchers from the University of Zaragoza, Spain (GVR, JPRL) are complementarily supported by FUNDACIÓN MAPFRE (Spain). The content of this paper reflects only the authors’ views, and the European Community is not liable for any use that may be made of the information contained therein. Finally, we acknowledge all participating children and adolescents, as well as their parents and teachers for their collaboration. We also acknowledge our staff members for their efforts and great enthusiasm during the fieldwork. The authors thank Professor Olle Carlsson for his assistance with the statistical analysis of the data.

Author information




Corresponding author

Correspondence to F B Ortega.

Additional information

Conflict of interest

The authors state no conflict of interest.


About this article

Cite this article

Ortega, F., Artero, E., Ruiz, J. et al. Reliability of health-related physical fitness tests in European adolescents. The HELENA Study. Int J Obes 32, S49–S57 (2008).

Keywords


  • fitness
  • reliability
  • Bland–Altman
  • adolescents
