Physical fitness in third grade of primary school: A mixed model analysis of 108,295 children and 515 schools

Children’s physical fitness development and related moderating effects of age and sex are well documented, especially boys’ and girls’ divergence during puberty. The situation might be different during prepuberty. As girls mature approximately two years earlier than boys, we tested a possible convergence of performance with five tests representing four components of physical fitness in a large sample of 108,295 eight-year-old third-graders. Within this single prepubertal year of life and irrespective of the test, performance increased linearly with chronological age, and boys outperformed girls to a larger extent in tests requiring muscle mass for successful performance. Tests differed in the magnitude of age effects (gains), but there was no evidence for an interaction between age and sex. Moreover, “physical fitness” of schools correlated at r = 0.48 with their age effect, which might imply that “fit schools” promote larger gains; expected secular trends from 2011 to 2019 were replicated.


Introduction
Children's development of physical fitness as well as the effects of moderating variables such as age [...]

[Figure caption, beginning and end missing: [...] followed from age 9 to 12 years for endurance = cardiorespiratory endurance (i.e., 9 min run test), coordination (i.e., running in a star-like pattern), speed (i.e., 50-m linear sprint test), powerLOW = power of lower limbs (i.e., triple hop test), and powerUP = power of upper limbs (i.e., ball push test). The insets show for each test score the regression on age for the first assessment when children were between 9.00 and 9.99 years old. Also shown are the means for groups of boys and girls binned into three age groups (i.e., 9.00-9.33; 9.34-9.66; 9.67-9.99). Error [...] Panel titles: Endurance, Coordination, Speed, PowerLOW, PowerUP.]

[...] physical fitness over a period of four years, it may come as a surprise that none of the cross-sectional age differences and none of the interactions with sex were significant within the respective [...]

[...] contrasts between tests) no directed hypotheses were formulated given the usually low reliability [...] within-school and within-cohort factors, providing us with the opportunity to detect reliable variance components and correlation parameters for these factors. No directed hypotheses were formulated for schools. However, secular trends are well documented, and we expected to replicate them.

Sex-related effects
The difference between lines in Figure 2 displays the expected differences between boys and girls for the performance in the five physical fitness tests; the overall sex effect was estimated with b = 0.40, z = 86.6. The third block of Table 1 lists statistics for the interactions between sex and the tests

Variance components and correlation parameters
Test scores
Table 2 lists estimates of VCs and CPs for the five test scores from a re-parameterized version of LMM m2 with the same goodness of fit and the same estimates for fixed effects. The test-related VCs were large for children (0.69 to 0.77), medium-sized for schools (0.23 to 0.36), and small for cohorts (0.03 to 0.06). VCs for the age-related gains (slopes, 0.09) and the sex effect (0.05) were also small for schools. It is noteworthy that the differences between schools in the age-related gain of their children are larger than the differences between cohorts. [...] Table 2 for all CPs).

Third, child-related CPs (Table 2, top panel, below diagonal) were larger than child-based ZOCs (top, above diagonal). This was a rather striking pattern because one might expect the opposite given that ZOCs were confounded with large effects of age and sex. Conversely, CPs were larger despite adjustment for sex and age differences in the fixed effects and for differences due to schools and

Effects of test contrasts
In the random-effect structure of LMM m2, estimates were returned for child-, school-, and cohort-related VCs for GM and the four test contrasts; VCs of age and sex were also estimated for school.

CPs for child and school reflect correlations between the contrasts (i.e., effect correlations). The results are shown in Table 3. As in Table 2, CPs are reported below the diagonals and corresponding ZOCs above the diagonals.

VCs for test contrasts were larger (0.48 to 0.72) for children and somewhat smaller, but still highly reliable (0.29 to 0.38) for schools, especially when compared to VCs estimated for school-related age (0.09) and sex (0.05) effects, and especially when compared to cohort-related effects (0.04 to 0.09).

CPs and ZOCs of effects are smaller than CPs and ZOCs based on test scores because, with the exception of those involving GM, they are all based on difference scores.

There were two results of theoretical relevance. First, there was a negative CP for rGM.H4 for children

Overparameterization was observed only for the most complex LMM m4. Thus, with this exception, the LMMs were supported by the data. Finally, we carried out residual-based diagnostics (e.g., q-q plot, standardized residuals over fitted values, etc.) for the reference LMM m2 with CPs for effects (Table 1, Table 3). These tests did not reveal any problems.

The aim of this study was to examine short-term ontogenetic cross-sectional developmental

The sex effect was also significantly stronger for powerLOW than for speed. PowerLOW is determined much more by muscle mass where boys usually outperform girls 2,3,5. In contrast, speed is less influenced by muscle mass than by motor coordination where sex differences are comparatively small 30 or were not found at all 15. Therefore, the sex effect in powerLOW might be larger than in speed. Obviously, the demand of coordination relative to power and cardiorespiratory endurance is even larger in the star run test than in the 20-m linear sprint test 31-33 and this could be a reason why the sex effect is smaller for the star run test than for speed.

(e.g., motor units) within the nervous system 34. The more a test engages the brain, the less relevant sex is as a performance-limiting factor. In summary, the decrease of size of sex effects across tests can

As far as the differential age effects between the components of physical fitness are concerned, tests of coordination, speed, and powerLOW share the highest correlations among the five tests. These three tests share the relevance of muscle mass yielding power, but they differ in the relevance of [...] linear sprint test < standing long jump test; see differences between lines in Figure 2) [...] with respect to its lower correlations with other tests, but also with respect to the size of the age effect, by far the largest of the five tests (see Figure 2).

Obviously, the performance in powerUP was influenced by factors other than physical fitness. We [...] within a year would be desirable, but such longitudinal data are not without their own problems. For example, how would we separate learning effects due to repeated exposure from growth?

Motivational factors may also play a role. The longitudinal cardiorespiratory endurance data in Figure 1 suggest no further growth or even a decline in performance for 12-year-old children 6. This is

Physical fitness tests were carried out by physical education teachers during regular school hours.

The Brandenburg School Law requires that parents are comprehensively informed prior to the start of the study. Consent is not needed given that tests are obligatory for both children and schools.

We started with data from 144,045 children. Of those, we included only healthy children who had [...] school enrolment they were at least 6.00 and at most 6.99 years old on September 30th and, therefore, varied between 8.00 and 8.99 years in the third grade (n = 110,669). In addition to early-entry (n = 2,664), late-entry (n = 30,457) and children without information about birthdate (n = 255), we did not include children with signs of emotional (e.g., autism) and/or physical disorders (e.g., [...]

Coordination under time pressure was tested with the star run test (see Figure 5). Children had to

Power of lower limbs (PowerLOW)
PowerLOW was tested using the standing long jump test. From a standing frontal posture the children had to jump as far as they could. The participants had to land with both feet together. They were allowed to swing their arms prior to and during the jump, but after landing the hands were not allowed to touch the floor. The distance in meters to the nearest centimeter between toes at take-off and heels at landing was determined using a measuring tape; the better of two test trials was used in the analysis. The standing long jump test was reliable (test-retest) in children aged 6 to 12 years with an ICC of 0.94 40.

Power of upper limbs (PowerUP)
PowerUP was assessed through the ball push test. From a standing position the children had to push a 1 kg medicine ball, starting in front of the chest with both hands, as far as they could; they did this twice, and the better of the two test trials (i.e., the longest pushing distance) was used in the analysis. The maximal ball push distance in meters to the nearest ten centimeters was determined with a measuring tape and used as dependent variable in the analysis. The ball push test was reliable (test-retest) in children aged 8 to 10 years with an ICC of 0.81 39.

For each test, we determined the ± 3 SD boundary separately for boys and girls. Measurements outside these boundaries were usually implausible (i.e., recording errors) or extreme outliers. They were treated as missing values (3%). Finally, we converted scores within tests (aggregated over boys and girls) to z-scores to facilitate comparison of test, age and sex effects.
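The preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the record layout (dicts with hypothetical keys `sex` and `score`), the function name, the use of the sample standard deviation, and the toy data are all assumptions made for the example.

```python
from statistics import mean, stdev

def trim_and_standardize(records, k=3.0):
    """Return z-scores (or None) for one test, in the order of `records`.

    records: list of dicts with hypothetical keys 'sex' and 'score'
             (a score may already be None if nothing was measured).
    """
    # Step 1: +/- k SD boundaries, computed separately for each sex.
    bounds = {}
    for sex in {r["sex"] for r in records}:
        vals = [r["score"] for r in records
                if r["sex"] == sex and r["score"] is not None]
        m, s = mean(vals), stdev(vals)  # sample SD is an assumption
        bounds[sex] = (m - k * s, m + k * s)

    # Step 2: scores outside the boundaries are treated as missing.
    kept = []
    for r in records:
        lo, hi = bounds[r["sex"]]
        x = r["score"]
        kept.append(x if x is not None and lo <= x <= hi else None)

    # Step 3: z-scores within the test, pooled over boys and girls.
    valid = [x for x in kept if x is not None]
    m, s = mean(valid), stdev(valid)
    return [None if x is None else (x - m) / s for x in kept]


# Toy data (made up): one implausible value among the boys.
boys = [{"sex": "boy", "score": float(v)} for v in [9, 10, 11] * 7 + [100]]
girls = [{"sex": "girl", "score": float(v)} for v in [8, 9, 10] * 7]
z = trim_and_standardize(boys + girls)
```

Because the boundaries are sex-specific but the z-transformation is pooled, the mean difference between boys and girls is preserved in the standardized scores, which is what allows the sex effect to be estimated later.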

As fixed effects, we specified four sequential-difference contrasts for the five tests: (H1) coordination vs. cardiorespiratory endurance, (H2) speed vs. coordination, (H3) powerLOW vs. speed, and (H4) [...] and sex. Given the large number of observations, children, and schools, we adopted a two-sided z-value > 3.0 as significance criterion for the interpretation of fixed effects.
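To make the contrast coding concrete, the following sketch builds a sequential-difference contrast matrix for five levels and checks that each contrast effect equals the difference between the means of two neighboring levels. It is an illustration only: the function names, the ordering of the tests, and the effect values are assumptions, and the fourth comparison is assumed to follow the same neighboring-level pattern as H1 to H3.

```python
from fractions import Fraction

def sdif_contrasts(k):
    """k x (k-1) sequential-difference contrast matrix (rows = levels)."""
    return [[Fraction(-(k - j), k) if i <= j else Fraction(j, k)
             for j in range(1, k)]
            for i in range(1, k + 1)]

def cell_means(grand_mean, effects):
    """Level means implied by a grand mean and k-1 contrast effects."""
    C = sdif_contrasts(len(effects) + 1)
    return [grand_mean + sum(c * b for c, b in zip(row, effects)) for row in C]

# Assumed level order; effect values are made up for the illustration.
tests = ["endurance", "coordination", "speed", "powerLOW", "powerUP"]
effects = [Fraction(1, 10), Fraction(-2, 10), Fraction(3, 10), Fraction(4, 10)]

means = cell_means(Fraction(0), effects)
diffs = [means[i + 1] - means[i] for i in range(len(effects))]
column_sums = [sum(col) for col in zip(*sdif_contrasts(len(tests)))]
```

With this coding every column sums to zero, so the intercept is the grand mean (GM) of the five level means, and `diffs` reproduces `effects` exactly.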

Child, school, and cohort were included as random factors. With three random factors there was a need for selecting a random-effect structure that included theoretically relevant and reliable variance components (VCs) and correlation parameters (CPs), but was also still supported by the data (i.e., was not overparameterized). Tests varied within children, schools, and cohorts; age and sex varied between children, but within schools and within cohorts. Therefore, in principle, VCs and CPs of linear effects of age and sex could be estimated for schools and cohorts, but not for children.
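The within/between distinction in this paragraph can be checked mechanically: a predictor can carry a random slope for a grouping factor only if it takes more than one value inside at least some levels of that factor. The sketch below illustrates this necessary (not sufficient) condition on made-up long-format data; the column names, the helper function, and all values are assumptions.

```python
from collections import defaultdict

def varies_within(rows, factor, variable):
    """True if `variable` takes more than one value inside any level of `factor`."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[factor]].add(row[variable])
    return any(len(values) > 1 for values in seen.values())

# Toy data (made up): four children, two schools, two cohorts, two tests each.
children = [
    ("c1", "s1", 2011, "girl", 8.2),
    ("c2", "s1", 2011, "boy", 8.7),
    ("c3", "s2", 2012, "girl", 8.5),
    ("c4", "s2", 2012, "boy", 8.9),
]
rows = [
    {"child": c, "school": s, "cohort": y, "sex": sx, "age": a, "test": t}
    for (c, s, y, sx, a) in children
    for t in ("speed", "powerLOW")
]

# For each grouping factor, which predictors vary within its levels?
candidate = {
    (factor, variable): varies_within(rows, factor, variable)
    for factor in ("child", "school", "cohort")
    for variable in ("test", "age", "sex")
}
```

In the toy data, `test` varies within all three factors, whereas `age` and `sex` vary within `school` and `cohort` but not within `child`, matching the pattern described in the text.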

Parsimonious model selection occurred in two major steps without knowledge or consideration of fixed-effect estimates 46; details are provided in Supplement A. We started with a model including Grand Mean (varying intercepts) for all three random factors and, given the large numbers of 108,926 children and 515 schools and the small number of nine cohorts, included also test-related VCs and CPs for child and school and age-related and sex-related VCs and CPs for school, but not for cohort. This LMM m1 was well supported by the data. In the second major step, we increased the complexity of the random-effect structure for cohort by adding test-related VCs (LMM m2), then test-related CPs (LMM m3), and finally age- and sex-related VCs and CPs (LMM m4).
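Comparisons among nested models such as m1 to m4 are commonly based on likelihood-ratio tests and BIC. As a hedged sketch of the arithmetic only (not the software used in the study), the code below converts a χ² statistic and its degrees of freedom into a p-value with the standard library and picks the model with the smaller BIC. The function name is an assumption; the numeric inputs are the statistics reported for the model comparisons in this section.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    if x < 0 or df < 1:
        raise ValueError("x must be >= 0 and df a positive integer")
    half = x / 2.0
    # Closed-form starting values for df = 1 (odd) and df = 2 (even).
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1
    else:
        q, k = math.exp(-half), 2
    # Recurrence: Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
    while k < df:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q

# Likelihood-ratio statistics and degrees of freedom as reported in the text.
p_m4_vs_m3 = chi2_sf(14.11, 13)
p_m3_vs_m2 = chi2_sf(48.45, 10)
p_m2_vs_m1 = chi2_sf(1489.57, 4)

# BIC penalizes model complexity; the smaller value is preferred.
bic = {"m2": 1.27609e6, "m3": 1.27617e6}
preferred_by_bic = min(bic, key=bic.get)
```

The likelihood-ratio test and BIC can disagree, as they do for m3 versus m2: with very large samples the test flags even small improvements as significant, whereas BIC charges each extra parameter a penalty that grows with the logarithm of the number of observations.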

LMM m4 was not supported by the data (i.e., the fit was singular) and did not significantly improve the goodness of fit over LMM m3; Δχ²(13) = 14.11, p = 0.37. LMM m3 improved the goodness of fit over LMM m2 according to the likelihood ratio test, χ²(10) = 48.45, p < 0.001, but not when the increase in model complexity is penalized according to BIC (i.e., LMM m2 = 1.27609e6 and LMM m3 = 1.27617e6). As we had no directed hypotheses relating to test-related CPs for the factor cohort, we stayed with LMM m2 which represented a very large improvement in goodness of fit relative to LMM m1; χ²(4) = 1489.57, p < 0.001. We also estimated LMM m2 with two alternative parameterizations that did not change the goodness of fit, but yielded information about CPs between test scores