Cardiorespiratory fitness assessment using risk-stratified exercise testing and dose–response relationships with disease outcomes

Cardiorespiratory fitness (CRF) is associated with mortality and cardiovascular disease, but assessing CRF in the population is challenging. Here we develop and validate a novel framework to estimate CRF (as maximal oxygen consumption, VO2max) from heart rate response to low-risk personalised exercise tests. We apply the method to examine associations between CRF and health outcomes in the UK Biobank study, one of the world’s largest and most inclusive studies of CRF, showing that risk of all-cause mortality is 8% lower (95%CI 5–11%, 2670 deaths among 79,981 participants) and cardiovascular mortality is 9% lower (95%CI 4–14%, 854 deaths) per 1-metabolic equivalent difference in CRF. Associations obtained with the novel validated CRF estimation method are stronger than those obtained using previous methodology, suggesting previous methods may have underestimated the importance of fitness for human health.


Development and validation of CRF estimation method for the UKB-CRF test
We recruited 105 female (mean age: 54.3y ± 7.3) and 86 male (mean age: 55.0y ± 6.5) validation study participants (Supplemental Fig. 1).Participants completed a series of UKB-CRF tests and a submaximal steady-state test to characterise exercise HR response across different test protocols.VO 2 max was directly measured during an independent maximal exercise test (Fig. 1A).All participants contributed submaximal exercise test data.Some maximal exercise test data were excluded due to missing HR and VO 2 response data (n = 25) and failure to achieve predefined maximal exercise threshold criteria (n = 33).Participant characteristics were generally similar in these subsamples.
HR response features derived from the UKB-CRF test vary with ramp rate of the assigned protocol and, if unaccounted for, will result in biased CRF estimates (Fig. 1B).HR response features may also be of poor quality or missing, a common situation in exercise testing.We addressed these issues by integrating UKB-CRF test and steady-state test HR response features (Fig. 1C-E) into a multilevel CRF estimation framework (Supplemental Table 1).The framework uses these features to estimate the HR-WR relationship that would be established if a longer steady-state test had been completed instead of the short ramped UKB-CRF test.Maximal WR is then estimated by extrapolation to age-predicted maximal HR (HRmax).This framework approach minimises bias introduced by protocol assignment of ramp rate.It also enables the application of different estimation models to different data availability scenarios.We derived several estimation models, notated as M1 through M5 in order of comprehensiveness.The highest-level model (M1) used more HR response features to estimate CRF, while lower-level models (M2 through M5) used fewer.
An accepted method for estimating VO 2 max from submaximal cycle ergometry is to estimate maximal steadystate WR from HR response and convert the estimated maximal steady-state WR to VO 2 max using the American College of Sports Medicine (ACSM) metabolic equation for cycling 9 .Before applying this equation, we verified that maximal WR estimated from the UKB-CRF test by the multilevel framework corresponded to maximal steady-state WR by comparing WR estimates with WR measured at the respiratory compensation point (RCP) from the maximal exercise test.The WR at RCP is equivalent to maximal steady-state WR and is the WR above which anaerobic metabolism is needed to sustain exercise until exhaustion.Figure 2 shows agreement with WR at RCP using the most comprehensive model for each test type (M1: ramp tests; M4: flat tests).Agreement for remaining estimation models is shown in Supplemental Table 2. Across estimation models (M1 through M5), estimated WR were strongly correlated with WR at RCP (Pearson's r range 0.81-0.86)with no significant mean bias (Bias range in women: − 3.7 to 3.8 W; in men: − 5.2 to 0.1 W).Mean bias also did not differ between low and high ramped tests, but root-mean-square error was generally lower for flat tests.Correlations were higher when WR was computed using features from ramp-and recovery-phase data (models M1 through M3) compared to using only flat-phase data (models M4 and M5), although all models were relatively precise.We also compared estimated WR with WR measured at the lactate threshold (Supplemental Table 3) and WR measured at VO 2 max (Supplemental Table 4).
After confirming that WR computed from the multilevel framework corresponded to WR at RCP, we estimated VO 2 max by applying the ACSM metabolic equation.We then compared these VO 2 max estimates with VO 2 max directly measured from the maximal exercise test.Figure 3 demonstrates VO 2 max agreement using the most comprehensive estimation model for each test type (M1: ramp tests; M4: flat test).Agreement for remaining estimation models is shown in Supplemental Table 5.Estimated VO 2 max was correlated with measured VO 2 max (Pearson's r range 0.68-0.74)with no significant mean bias (Bias range in women: − 0.8 to 0.4 ml O 2 kg −1 min −1 ; in men: − 0.3 to 0.3 ml O 2 kg −1 min −1 ), establishing the multilevel framework's absolute validity.We also quantified the proportion of bias emerging from uncertainty when using age-predicted versus measured HRmax in the multilevel framework (Supplemental Table 6); agreement improved only slightly when using measured HRmax.As a sensitivity analysis, we also evaluated agreement between estimated and directly measured VO 2 max in all participants with usable maximal test data and relaxing the criteria for maximal effort (n = 178).Estimated and directly measured VO 2 max were not statistically significantly different across all comparisons in this analysis.
We next addressed whether the multilevel framework would yield similar CRF estimates from different UKB-CRF test protocols completed by the same validation study participant (i.e.internal validity).Within-participant, we compared VO 2 max estimates from low and high ramp tests across estimation models M1-M3 and M5, as well as between flat tests across M4 and M5 (Supplemental Table 7).VO 2 max estimates from different UKB-CRF test protocols were highly correlated across estimation models (Pearson's r range: 0.91-0.99).While mean bias was minimal across all comparisons (Bias range: − 0.6 to 0.0 ml O 2 kg −1 min −1 ), some were statistically significantly different from zero mean bias.We also examined differential bias by protocol ramp rate (from 0 W min −1 for flat www.nature.com/scientificreports/tests and 7.5-25 W min −1 for ramp tests), finding mean bias to be minimal across all ramp rates tested (Fig. 3, lower panel).
For method-comparison purposes, we also assessed the absolute and internal validity of a simple linear regression CRF estimation method used previously by our group and a two-point CRF estimation method used by other groups working with UKB data.These latter two methods relate exercise HR response to WR without accounting for protocol ramp rate.Both methods demonstrated overestimation bias and low precision when applied to ramped tests but had low bias when applied to flat tests (Supplemental Fig. 2).
To assess measurement consistency of the UKB-CRF test, we evaluated short-term test-retest reliability in validation study participants and long-term test-retest reliability in UKB participants (Fig. 4).In validation study participants with short-term repeat tests (within 2 weeks, n = 87), estimated VO 2 max values from the first and second UKB-CRF tests were highly correlated (r = 0.91) with no mean difference.Agreement was nearly as strong (λ = 0.79) in UKB participants with long-term repeat tests (about 2.8 years between tests, n = 2877).

Application of VO 2 max estimation method in UKB cohort
After establishing the absolute validity, internal validity, and test-retest reliability of the multilevel framework for interpreting UKB-CRF test data, we applied the framework to estimate CRF in UKB (Supplemental Fig. 3 and Supplemental Table 8) and examined prospective associations with health outcomes.In total, 42,351 women and 37,650 men from UKB were considered in this analysis.Baseline participant characteristics are shown by sex-specific and age-adjusted CRF tertiles, across half-decade age groups, in Supplemental Table 9.Estimated VO 2 max was higher in men compared to women, and in younger versus older adults.Participants in the middle and higher CRF tertiles also had better baseline measures of heart and lung function, lower body weight, and better self-perceived health than participants in the lower tertile.
To examine associations between CRF and prospective health outcomes in UKB, we used Cox proportional hazards regression to estimate hazard ratios per 1-metabolic equivalent difference in CRF (METs; 1 MET = 3.5 ml O 2 kg −1 min −1 ) for fatal and nonfatal events, excluding those experiencing the event in question in the first 2 years of follow-up.In total, 2670 participants died during a median 9.9 years (interquartile range 9.7-10.0) of follow-up (746,377 person-years).After adjustment for potential confounders, each 1-MET difference in CRF was associated with approximately 8% (95%CI 5-11%) lower all-cause mortality (Fig. 5); this equates to a difference in mortality of about 23% between top and bottom CRF tertiles.Associations were stronger for deaths from respiratory disease (RD; 14% lower per 1-MET, 95%CI 8-19%) and similar for deaths from cardiovascular disease (CVD; 9%, 95%CI 4-14%), and cancers (8%, 95%CI 5-12%).Higher CRF was more strongly associated with lower mortality risk in obese compared to non-obese participants (Supplemental Fig. 4).
For method comparison purposes, associations were also examined for CRF computed using the simple linear regression CRF estimation method.Associations were generally shallower and estimated with less uncertainty using simple linear regression, an analysis which also included fewer participants compared to the analysis using the multilevel framework.
We used cubic spline regression to examine natural variation and potential nonlinearity in dose-response associations between CRF and prospective health outcomes (Fig. 6; obesity-stratified result in Supplemental Fig. 5).For method comparison purposes, separate sets of dose-response curves were modeled for CRF when computed with the multilevel framework and when using the simple linear regression method.For mortality outcomes, CRF was inversely associated with death from all causes, CVD, RD, and cancer in the CRF range of 3-11 METs.Relationships were steeper at the low end of that range and shallower at higher CRF levels.The shape of estimated CRF dose-response relationships varied considerably across incident disease outcomes.In Differences between these associations and those observed using the simple linear regression to estimate CRF were most evident at the tails of the distributions.We performed several sensitivity analyses to determine whether health associations with CRF computed from the multilevel framework were altered by restricting the analytical sample or by using different estimation models in the multilevel framework to compute CRF.Compared to the main analysis, health associations with CRF were slightly weaker but estimated dose-response curves were qualitatively similar when the analytic sample was restricted so as to be matched with the sample used when the simple linear regression method was applied (Supplemental Figs. 6 and 7).To examine differences in health associations by estimation models in the multilevel framework, we split the analytic sample into two separate sets of analyses for those who either performed a ramp test (Supplemental Figs. 8 and 9) or a flat test (Supplemental Fig. 10).In the ramp test analysis, associations were generally stronger using more comprehensive estimation models (M1-3) compared to less comprehensive models (M5).Estimated dose-response curves were qualitatively similar for all outcomes.The association with incident AF, however, was a positive monotonic relationship at all estimation models in the ramp test subsample (event rate 217 per 100,000 person-years), rather than the U-shaped relationship found in the main analysis.In the subsample allocated a flat test (Supplemental Fig. 10), the association with incident AF was positive, non-significant, and at a higher event rate (460 per 100,000 person-years).The analysis in the flat test subsample is primarily a comparison of associations for CRF estimated by M4 and M5.Associations were generally similar between the two estimation models but non-significant due to the small sample size.We did not estimate dose-response curves in the flat test subsample for the same reason.

Discussion
We have developed a novel multilevel CRF estimation method based on HR response to short, individualised submaximal exercise tests and applied it in one of the largest and most inclusive population-based studies of CRF.We establish the validity and reliability of this new method in an independent validation study and demonstrate advantages over other methods applied in previous population studies.
The multilevel framework method estimates VO 2 max by modeling maximal steady-state exercise capacity from HR response to the UKB-CRF test.This approach minimises CRF estimation bias that may be introduced by the UKB-CRF test individualisation process.We demonstrate that for each 1-MET difference in CRF estimated using the multilevel method, all-cause mortality was 8% lower and CVD mortality was 9% lower; associations nearly twice as strong as those estimated when using a method that did not use external data to map between ramped and steady-state exercise 15 .Improvements in CRF estimation validity and disease outcome characterisation may have broad implications for future research in UKB.
We have reported on a range of associations between CRF and prospective health outcomes in UKB, demonstrating the protective effects of CRF on all-cause and cause-specific mortality and morbidity.Our findings are largely in agreement with numerous other populations-based studies 1-3 , however associations between CRF and non-fatal incidence rates of IHD, stroke, and AF were not significant.In previous work, dose-response associations were J-shaped for stroke 22 and U-shaped for AF 23 .Another study with longer follow-up time and more incident events reported inverse relationships between CRF and AF 24 .As for cancer mortality, we found an inverse relationship between CRF and all-cancer mortality, in agreement with previous studies 25,26 .Additional follow-up time is warranted to investigate CRF associations with site-specific cancers in UKB.Additional follow-up would, however, also increase the probability that participants may change their CRF level over time, which would dilute observed associations between a single baseline measure of CRF and health outcomes.The Cox proportional hazards model estimates the difference in hazards of the outcome in question by baseline exposure level, and if this exposure level is very stable over time, the estimated association will be a more accurate reflection of the importance of that exposure than if it were more variable.We show in this work that CRF has a long-term reliability of around 0.80, indicating good stability.However, if we were to formally apply correction for the element of instability using regression dilution bias methods, the point estimate of hazard ratios would be about 20% stronger than what we report here but with wider confidence intervals.
We have developed a novel multilevel estimation framework that optimises the validity of VO 2 max estimates from the UKB-CRF test.The key strengths of the framework are that it: (1) uses HR response features across all test phases to infer the relationship between HR and exercise intensity; (2) is flexible to the availability or absence of HR response features due to data quality issues; and (3) harmonises the inferential modeling of HR response features to the within-person invariant relationship between steady-state HR and exercise intensity.For these reasons, predicted CRF values and health associations we have presented diverge from previous reports of CRF in UKB.In a previous analysis 15 , we estimated CRF by using simple linear regression to establish the relationship between ramp test HR response and WR.This approach is currently the most common in the field but does not account for the protocol individualisation process of ramp rates used in UKB.Using external validation data, we show that this approach overestimates VO 2 max differentially by test protocol, thus limiting the ability to validly compare VO 2 max estimates from different UKB-CRF tests.Lacking protocol comparison data, previous work could only partially address the impact of this by a meta-analytic approach.Our new approach resolves this issue at the individual level and demonstrates stronger associations with all-cause and cause-specific mortality and morbidity compared to this previous non-validated method.We can have more confidence in these associations because validity of the exposure is now documented.Other approaches also use simple linear regression, but establish the relationship between HR and exercise capacity by relating resting HR to only a single measurement of HR during the test 16,17,[19][20][21][27][28][29][30] . HR meaurement noise will greatly decrease precision of this approach, and resulting CRF estimates are still subject to bias that will differ by protocol ramp rate.Another reported approach 18,[31][32][33] is to use the maximally achieved WR to infer CRF.As most participants completed their test, this approach merely reflects the protocol that participants were assigned according to age, sex, body size, resting HR, and exertional chest risk, with the latter feature most indicative of test risk-stratification.While prospective associations of such an exposure measure with CVD endpoints do to a degree validate the stratification of risk, it is not possible to interpret these results solely as associations with CRF.At best, it is a composite score of exercise capacity and cardiac risk.
Our CRF estimation approach may also have implications for exercise prescription in clinical environments.CRF testing in UKB has demonstrated that it is safe to obtain valid VO 2 max estimates in a population setting while including some individuals with contraindications to exercise.In practice, such individuals would be prescribed a less strenuous exercise test that could contain less information about their physiological state compared to a more strenuous test.The multilevel framework approach we describe in this work for interpreting such test results yields unbiased estimates of CRF, addressing a well-recognised limitation of exercise testing 34 .Future research is warranted to investigate whether the multilevel framework can be generalised to other ramp-style exercise tests.
The strengths of our study include independent validation work for our VO 2 max estimation approach prior to estimating dose-response relationships with disease outcomes in the UK Biobank.There are also some limitations.The exercise capacity of validation study participants was slightly higher than the average capacity of UKB participants.Furthermore, the comparatively relaxed testing conditions in the validation study may not directly match those in UKB, where testing was conducted in large testing centres and where a variety of additional exposures were examined under stringent time constraints.We also did not directly evaluate the validity of UKB-CRF test protocols with ramp rates at 2.5 and 5.0 W min −1 .Agreement for ramp rates above and below these untested rates were unbiased, however.Finally, we examined non-fatal health outcomes using only hospitalisation data; this does not necessarily capture all disease events in a given category.

Conclusions
We have demonstrated the absolute validity, internal validity, and test-retest reliability of a novel VO 2 max estimation method for individualised ramped exercise tests that can be safely and efficiently applied in population studies.Our analytic approach uses a generalised multilevel modeling framework that bridges the gap between steady-state and ramped incremental exercise, addressing a persistent problem in exercise physiology www.nature.com/scientificreports/and prescription.CRF estimated in this way is more strongly associated with mortality and other disease endpoints than previous methodology, strengthening the case for promoting CRF in the general population.

UKB-CRF test description
The UKB-CRF test protocol design and individualisation process are described in detail by the most recent test manual 35 .Briefly, participants were categorised into separate risk levels according to questions adapted from the Rose-Angina questionnaire.Participants with "minimal" and "small" risk completed an individualised ramp test, those with "medium" risk completed a flat test, and those with "high" risk did not complete an exercise test.Ramped tests began with a 2-min flat-phase at a single WR (30 W for females, 40 W for males) followed by a 4-min ramp-phase where WR increased continuously to a pre-specified target WR.The target WR was calculated as a risk-adjusted percentage (50% for those with "minimal" risk, 35% for "small" risk) of the maximal WR predicted from an equation derived from maximal exercise (cycle ergometer) testing data collected in the Danish Health Examination Survey 2007-2008 36 .The computed value for target WR was combined with participant sex ("F" for female, "M" for male) to notate different exercise protocols.For example, a male participant with "minimal" risk and predicted WR at VO 2 max of ~ 200 W would have a target work rate of 100 W and be individualised to UKB protocol "M100".Flat tests consisted of a single 6-min flat-phase.Participants cycled at a 60-rpm cadence while WR and heart rate (HR) were monitored.All tests ended with a 1-min recovery-phase where participants sat quietly and motionless on the ergometer.No adverse events were reported when this test was applied in UKB.

Validation study participants
We recruited a subsample of participants from the Fenland study, a population-based study in Cambridgeshire, UK 37 , using an age-, sex-and BMI-stratified random sampling procedure (Supplemental

Experimental procedure and equipment
Validation study participants were screened according to standardised procedures used for the UKB-CRF test 35 .Then, participants completed the UKB flat test, two UKB ramped tests at different ramp rates, a steady-state test (unique to the validation study), and another ramped test (validation only) to elicit VO 2 max (Fig. 1A).Tests were conducted consecutively, separated by at least 15 min of rest, and were specified according to the test that the participant would have been assigned had they been part of UKB (see Supplemental Table 11).The target (highest) WR for the second ramped test was at least 30 W greater than the first; thus, each participant completed a "low" and "high" ramped UKB test.The steady-state test consisted of four incremental 4-min flat-phases with each WR increment ranging from 10 to 20 W. For the ramped max test, participants were fitted with a face mask to measure respiratory ventilation and gas exchange and cycled while WR increased until exhaustion.VO 2 max was considered reached if two of the following criteria were met: a respiratory exchange ratio exceeding 1.20; no VO 2 increase despite increasing WR (< 2.5 ml O 2 kg −1 min −2 ); and no HR increase despite increasing WR.During data analysis, the levelling-off criterion was confirmed by inspecting whether the first differential of HR and VO 2 data approached zero over the last 1-min period of the maximal test.VO 2 max was expressed as the average of the two highest VO 2 measurements in the last forty-five seconds of the maximal exercise test.WR values were measured at VO 2 max (i.e.maximal work rate achieved on the test, WRmax), at the lactate threshold (LT), and at the respiratory compensation point (RCP).The work rate at LT was measured at the point when both ventilatory equivalent of oxygen (V E /VO 2 ) and end-tidal pressure of oxygen (P ET O 2 ) increased with no increase in ventilatory equivalent of carbon dioxide (V E /VCO 2 ).The work rate at RCP was measured at the point when both V E /VO 2 and V E /VCO 2 increased and end-tidal pressure of carbon dioxide (P ET CO 2 ) decreased (see Supplemental Fig. 11) 38 .Directly measured WR at LT and RCP were determined visually by three independent investigators, blinded to all other measures except the variables above needed for making direct measurements.The median value among investigators was considered the final value.
Cycling was performed on an electromagnetically-braked stationary bike (eBike ergometer, GE) while electrocardiography (ECG) was recorded using 4-lead ECG (Cardiosoft) on the forearms and a Actiwave Cardio device (CamNtech, Papworth, UK) on the chest with sampling frequency of 128 Hz.The 4-lead ECG leads were placed on the cubital fossa and ventral wrist of the left and right arms (mimicking the UKB protocol).Cycling WR were controlled by computer software.Respiratory gas measurements were conducted using a computerised metabolic system with Hans Rudolph face masks (Oxycon Pro, Erich Jaeger GmbH, Hoechberg, Germany) as validated elsewhere 39 .
All ECG signals were processed using the Physionet Toolkit implementation of the SQRS algorithm 40 , which applies a digital filter to the measured ECG and identifies the downward slopes of the QRS complexes 41 .The resulting inter-beat-intervals were converted to beats-per-minute values using the "ihr" package in the PhysioNet Toolkit, as described previously 15 .Pulmonary gas exchange data were sampled breath-by-breath.All data were linearly interpolated to derive quasi-continuous HR response and respiratory measures at 1 s time resolution.Sections of linearly interpolated HR data greater than 1 s in duration were removed prior to analysis.www.nature.com/scientificreports/CRF estimation framework: conceptual and modeling framework Our approach for estimating VO 2 max from UKB-CRF test HR response is illustrated in Fig. 1B-E.Here we first describe a VO 2 max estimation method for HR response to steady-state exercise.We then adapt this method to the UKB-CRF test by using a multilevel hierarchical framework of linear models to harmonise HR response features extracted from flat and ramped UKB-CRF tests to those extracted from steady-state exercise.
Conceptual framework.VO 2 max can be estimated from HR response to exercise at steady-state WR increments using linear extrapolation of the submaximal HR-to-WR relationship 42 .For this approach, an individual exercises at two or more submaximal WR increments while HR is recorded.The steady-state HR response at each test increment is then regressed against WR to establish a line-of-best fit for the observed HR-to-WR relationship (W bpm −1 ).This relationship can be represented as: where WR t and HR t are paired measurements at several test increments, β 1 ss is the linear regression slope repre- senting the steady-state HR-to-WR relationship, and β 0 ss is the intercept of that regression.The regression line is extrapolated to age-predicted maximal HR (HRmax) 43 to estimate the maximal steady-state WR that would be achieved if the exercise test was completed to exhaustion.VO 2 max is then estimated by converting the extrapolated maximal steady-state WR value to net VO 2 using a caloric equivalent of oxygen and adding an estimate of resting VO 2 plus the VO 2 required for unloaded cycling 44 .
The HR-to-WR linear extrapolation approach presents challenges when applied to ramped exercise HR response.Assuming HR and VO 2 responses are linearly related, the principal methodological issues are [45][46][47] : (1) within-participant, the VO 2 -to-WR relationship and total time delay for VO 2 response to achieve linearity after ramped exercise onset will vary across ramped tests as a function of ramp rate; (2) The ramped VO 2 -to-WR relationship decreases asymptotically with ramp rate and, as ramp rate approaches zero, becomes similar to values determined from steady-state exercise; (3) the VO 2 -to-WR relationship has high test-retest variability; and (4) the VO 2 -to-WR relationship diverges from linearity above RCP.Thus, the HR-to-WR linear extrapolation approach will induce VO 2 max overestimation bias as a function of ramp rate, demonstrate low test-retest reliability, and have poor precision if the WR computed at age-predicted HRmax is greater than the WR at RCP. Modeling framework.We addressed these methodological issues by constructing a multilevel CRF estimation framework that computes a participant's steady-state HR-to-WR relationship using features extracted from HR response to flat or ramp UKB-CRF test protocols.The framework was derived using a three-stage hierarchical linear model.The first stage equates WR computed from steady-state test HR response (Eq. 1) with WR computed from dynamic regression coefficients that vary between and within individual participants as a function of lower-stage hierarchical features.Within every ith individual participant, each having completed a set of p exercise protocols: Stage-1 (base-stage equating steady-state test HR response with UKB-CRF flat and ramped HR response): where (1) β 0 p[ss]i and β 1 p[ss]i are linear regression coefficients estimated from the steady-state protocol (  www.nature.com/scientificreports/ of HR-response and protocol features (P x p[UKB]i ) as well as different sets of participant characteristics (I x i ) .We leveraged this adaptability to derive five WR estimation models (notated as M1-M5; Supplemental Table 1), each using different combinations of HR response feature sets, so that our approach was robust to different data quality scenarios encountered when analysing HR response data in UKB.Additional details regarding the extraction of feature sets included in P x p[UKB]i and I x i are provided in Supplemental Methods.
Application of multilevel framework VO 2 max was estimated from the multilevel framework by extrapolating the linear fit defined by β 0 p[UKB]i and β 1 p[UKB]i to age-predicted HRmax 43 and converting the resultant maximal steady-state WR value to VO 2 max using the American College of Sports Medicine metabolic equation for cycling 9 .We also estimated WR and VO 2 max values using a simple linear regression approach 15 , a two-point estimation method 19 , and an approach for steadystate tests 15 (see 'Prediction of VO2max using alternative methods' in Supplemental Materials).

Agreement analyses
We used Bland-Altman analysis 48 to quantify agreement between estimated WR and VO 2 max values with those directly measured during the maximal exercise test.Correlations between estimated and directly measured values were quantified using Pearson's r and Spearman's rho.One-sample t-tests were performed to determine whether mean biases were statistically significantly different from zero mean bias.Estimation model precision was expressed as the root mean square error (RMSE) between estimated and directly measured values.ANOVA repeated measures were used to test differences between estimated and directly measured values across estimation models.

Short-term test-retest reliability
To assess short-term test-retest reliability, a subsample of 87 validation study participants completed a second UKB-CRF test within 2 weeks after main testing, identical to either the low or high ramped test at the main visit.Estimated VO 2 max values from first and second tests were compared using agreement analysis.

UKB participants
The UKB is a prospective cohort study of 502,625 older adults.Baseline data collection was conducted between 2006 and 2010 where a variety of physical measurements, biological samples, and health questionnaires were administered; repeat-measures visits were conducted between 2012 and 2013.The UKB-CRF test was offered approximately 100,000 times (last 79,209 participants from baseline and 20,218 from the repeat-measures visit).
The study was approved by the North West Multicentre Research Ethics Committee and participants provided written informed consent.

Implementation of CRF estimation in UKB
VO 2 max values were estimated in UKB participants, largely as described above for the validation study.Supplemental Fig. 12 describes specific criteria used to assign the multilevel framework estimation models for the primary analysis.Age-predicted HRmax was reduced by 20 beats-per-minute in those taking beta-blockers 49 .

Long-term test-retest reliability of CRF
To assess long-term test-retest reliability, we compared estimated VO 2 max values at baseline and the first followup test in those UKB participants with repeat tests (n = 2877, mean follow-up time 2.8 years).The follow-up UKB-CRF test protocol was re-individualised at the time of testing and therefore may have differed from the baseline protocol.
Health characteristics across CRF levels in UKB Health characteristics were described across age-adjusted and sex-specific CRF categories 50 .We age-stratified the UKB cohort in half-decades as < 50, 50-54, 55-59, 60-64, and ≥ 65 years, defined CRF categories by tertiles ("lower", "middle", and "higher") of estimated VO 2 max levels from each age group, and combined CRF categories from each age group to form CRF categories for the entire UKB cohort.Health characteristics were compared across CRF tertiles for men and women separately.

Survival analyses
Cox regression with age as the underlying timescale was used to estimate log-linear associations between estimated VO 2 max levels (in METs; 1 MET = 3.5 ml O 2 kg −1 min −1 ) and mortality and incident disease outcomes.We compared prospective associations between two VO 2 max estimation approaches: the multilevel framework developed in this study and the previously described method using simple linear regression.Vital status and hospital episodes of UKB participants were established by linkage to national registry data obtained from the Health and Social Care Information Centre (now NHS Digital) for England and Wales and the Information Services Department (ISD) for Scotland.The censoring date for mortality outcomes was 31st March 2020.Censoring dates for incident disease outcomes were 31st January 2018 in England and Wales, and 30th November 2016 in Scotland.International Classification of Diseases (ICD) 10th edition (ICD-10) codes were used to define health outcomes; heart failure (I50, I11.0, I13.0, I13.2), stroke (I60-166), ischaemic heart disease (IHD; I20-I25), atrial fibrillation (AF; I48), and chronic obstructive pulmonary disease (COPD; J44).Fatal outcomes were all-cause

Figure 1 .
Figure 1.Conceptual framework and design for validation study.(A) Overview of the five exercise tests performed by validation study participants (3 UKB-CRF tests (flat protocol, low ramp protocol, high ramp protocol), 1 steady-state test unique to the validation study, and 1 maximal exercise test to measure VO 2 max).X-axes: Time; Y-axes: Work rate (WR).Tests were completed consecutively, and work rates were individualised according to standardised criteria (See `Experimental procedure and equipment' in Methods).UKB-CRF test and steady-state test data were used for method development.Maximal exercise test data were withheld from method development and used for validation purposes only.(B) Conceptual plot of WR-to-VO 2 response during steady-state and ramped exercise tests.VO 2 increases linearly at a rate proportional to the rate of change in WR (i.e.ramp rate) until VO 2 max is reached (in an exhaustive test).The WR-to-VO 2 relationship (line slope) changes depending on the ramp rate of the test.As ramp rate decreases, the WR when VO 2 max is achieved approaches the maximal WR for an exhaustive steady-state test.Note that VO 2 is extrapolated to maximal values for demonstrative purposes, but in the validation study ramped and steady-states tests were non-exhaustive.(C) Exemplar HR data (blue scatter and grey line; upper panel), WR data (red line; lower panel), and test phase annotation for ramp test.(D,E) Feature extraction for ramp phase using simple linear regression model and for recovery phase using first-order exponential decay model (see Supplemental Methods).

Figure 2 .
Figure 2. Scatterplots (top row) and Bland-Altman plots (bottom row) demonstrating agreement between work rates measured at the respiratory compensation point (RCP) and work rates estimated from flat ramp tests (left column), low ramp tests (middle column), and high ramp tests (right column) using the most comprehensive prediction equation from the multilevel framework (M1 for ramp tests; M4 for flat test).r: Pearson's correlation coefficient, rho: Spearman's rank correlation coefficient.RMSE: Root-mean-square error.

Figure 3 .
Figure 3. Scatterplots (top row) and Bland-Altman plots (second row) demonstrating agreement between directly measured VO 2 max and VO 2 max estimated from flat tests (left column), low ramp tests (middle column), and high ramp tests (right column) using the most comprehensive equation from the multilevel framework (M1 for ramp tests; M4 for flat test).Below these (bottom row), the box plot demonstrates agreement across all ramp rates tested using estimates from the multilevel framework, the simple linear regression method, and the two-point estimation method.r: Pearson's correlation coefficient, rho: Spearman's rank correlation coefficient.RMSE: Root-mean-square error.

Figure 5 .
Figure 5.Hazard ratio and 95% confidence interval (CI) for prospective log-linear associations (Cox regression) between fatal and non-fatal outcomes in the UK Biobank with cardiorespiratory fitness in metabolic equivalents (METs, per 3.5 ml O 2 •kg −1 min −1 ) computed from the multilevel framework and simple linear regression methods.Event-rate per 100,000 person-years.AF: atrial fibrillation; COPD: chronic obstructive pulmonary disease; CVD: cardiovascular disease; IHD: ischaemic heart disease; RD: respiratory disease.COPD incidence mostly reflects severe COPD since only ~ 25% of cases end up in hospital.Mortality and incidence event-rates differ between fitness prediction methods owing to different inclusion criteria at the estimation level.

Figure 6 .
Figure 6.Hazard ratio and 95% confidence interval (CI) for nonlinear associations (cubic splines, Cox regression) between fatal and non-fatal outcomes in the UK Biobank with cardiorespiratory fitness in metabolic equivalents (METs, per 3.5 ml O 2 •kg −1 •min −1 ) computed from the multilevel framework and simple linear regression methods.Hazard ratios were computed relative to a fitness reference point of 8.0 METs.AF: atrial fibrillation; COPD: chronic obstructive pulmonary disease; CVD: cardiovascular disease; IHD: ischaemic heart disease; RD: respiratory disease.Mortality and incidence counts (superimposed histograms) differ between fitness estimation methods owing to different inclusion criteria at the estimation level. https://doi.org/10.1038/s41598-021-94768-3

Table 10
). Exclusion criteria were: heart pacemaker; unable to walk without aid; history of angina pectoris; blood pressure greater than 180/110 mm Hg; musculoskeletal injury that would impair cycling on the ergometer; pregnancy; and currently taking cardioactive drugs (e.g.beta-blockers, aspirin).Ethical approval was obtained by the University of Cambridge Human Biology Research Ethics Committee (Ref: HBREC/2015.16).All participants provided written informed consent.
WR tp[ss]i is a sequence of t steady-state WR values computed with β 0 p[ss]i , β 1 p[ss]i , and HR tp[ss]i (thus, a matrix representation of the line defined by Eq. 1); and (4) β 0 p[UKB]i and β 1 p[UKB]i are dynamic regression coefficients that, while unique to each UKB protocol (p[UKB]) and individual, converge to the values of β 0 p[ss]i and β 1 p[ss]i by their linkage with WR tp[ss]i .β0p[UKB]i and β 1 p[UKB]i are estimated at the second stage using combina- tions of HR-response and protocol-based features:Stage-2 (HR-response and protocol features extracted from flat and ramped UKB-CRF tests):where (1) γ 0x i and γ 1x i are sets of a fixed regression coefficients for HR-response and protocol-level features P x p[UKB]i ; and (2) γ 00 i and γ 10 i are the mean intercept and slope for the ith individual participant.γ 00 i and γ 10 i are estimated at the third stage: Stage-3 (pretest participant characteristics): p[ss]);(2) HR tp[ss]i is a sequence of t simulated steady-state HR values, equally spaced and spanning the submaximal intensity range;(3)where (1) δ 00x and δ 10x are sets of b fixed regression coefficients for participant characteristics I x i ; and (2) δ 000 and δ 100 are the model-invariant intercept and slope.β 0 p[UKB]i and β 1 p[UKB]i can be estimated using different sets (1)