Temporal convolutional networks predict dynamic oxygen uptake response from wearable sensors across exercise intensities

Oxygen consumption (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\dot{\,{{\mbox{V}}}}{{{\mbox{O}}}}_{2}$$\end{document}V˙O2) provides established clinical and physiological indicators of cardiorespiratory function and exercise capacity. However, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\dot{\,{{\mbox{V}}}}{{{\mbox{O}}}}_{2}$$\end{document}V˙O2 monitoring is largely limited to specialized laboratory settings, making its widespread monitoring elusive. Here we investigate temporal prediction of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\dot{\,{{\mbox{V}}}}{{{\mbox{O}}}}_{2}$$\end{document}V˙O2 from wearable sensors during cycle ergometer exercise using a temporal convolutional network (TCN). Cardiorespiratory signals were acquired from a smart shirt with integrated textile sensors alongside ground-truth \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\dot{\,{{\mbox{V}}}}{{{\mbox{O}}}}_{2}$$\end{document}V˙O2 from a metabolic system on 22 young healthy adults. Participants performed one ramp-incremental and three pseudorandom binary sequence exercise protocols to assess a range of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\dot{\,{{\mbox{V}}}}{{{\mbox{O}}}}_{2}$$\end{document}V˙O2 dynamics. A TCN model was developed using causal convolutions across an effective history length to model the time-dependent nature of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\dot{\,{{\mbox{V}}}}{{{\mbox{O}}}}_{2}$$\end{document}V˙O2. Optimal history length was determined through minimum validation loss across hyperparameter values. The best performing model encoded 218 s history length (TCN-VO2 A), with 187, 97, and 76 s yielding <3% deviation from the optimal validation loss. TCN-VO2 A showed strong prediction accuracy (mean, 95% CI) across all exercise intensities (−22 ml min−1, [−262, 218]), spanning transitions from low–moderate (−23 ml min−1, [−250, 204]), low–high (14 ml min−1, [−252, 280]), ventilatory threshold–high (−49 ml min−1, [−274, 176]), and maximal (−32 ml min−1, [−261, 197]) exercise. Second-by-second classification of physical activity across 16,090 s of predicted \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\dot{\,{{\mbox{V}}}}{{{\mbox{O}}}}_{2}$$\end{document}V˙O2 was able to discern between vigorous, moderate, and light activity with high accuracy (94.1%). This system enables quantitative aerobic activity monitoring in non-laboratory settings, when combined with tidal volume and heart rate reserve calibration, across a range of exercise intensities using wearable sensors for monitoring exercise prescription adherence and personal fitness.


I. INTRODUCTION
Cardiorespiratory fitness is an established risk factor for cardiovascular disease and all-cause mortality [1] and is an important determinant for endurance exercise performance [2]. Cardiorespiratory fitness is conventionally assessed by measuring the rate of oxygen consumption (VO 2 ) and its dynamic response to exercise. Biomarkers such as peak oxygen uptake (VO 2peak ) and the rate of adaptation to changes in exercise intensity provide important information about the integrative responses of the pulmonary, cardiovascular, and muscular systems [3], which can provide insights into different disease R. Amelard and E. T. Hedge  states [4]. Accordingly,VO 2 monitoring has become a crucial objective measure for advanced clinical therapies [5].
VO 2peak is often considered the gold standard metric of cardiorespiratory function. In heart failure,VO 2peak is a strong predictor of 1-year mortality [5], and is clinically used to select patients for advanced therapies. In cases where maximal exercise is infeasible, the dynamic response to sub-maximal exercise also provides important indicators of health [6], [7], [8], [9], [10] and fitness status [11], [12], [13], [14]. Despite its established importance, monitoringVO 2 in nonlaboratory settings remains challenging. Direct measurement ofVO 2 requires a metabolic cart and trained technician, which limits its applicability to laboratory assessment. Heart rate (HR) has traditionally been used as an inexpensive and nonintrusive proxy toVO 2 response to activity and estimate energy expenditure under the assumption that HR varies linearly withVO 2 [15], [16]; however, dynamicVO 2 and HR responses do not always have a direct correspondence with each other, such as following prior exercise [17]. Thus, ambulatory physiological monitoring using wearable sensors may provide early detection of sub-clinical biomarkers of disease and enable more widespread assessment of cardiorespiratory function [18].
Recent advances in wearable technologies and artificial intelligence have led to new developments in non-intrusive cardiorespiratory monitoring. These approaches are generally modeled as regression problems, where a machine learning model learns a transformation function between physiological inputs from wearable sensors andVO 2 measured using a gas analyzer system. Earlier work primarily used a combination of HR and activity-related inputs to predictVO 2 during various forms of exercise [19], [20]. Sensors embedded in textile fabrics have enabledVO 2 prediction in low-to moderateintensity exercise during activities of daily living [21], [22]. Recent studies have leveraged the time-dependent nature oḟ VO 2 by modeling the regression as a time-series, or sequential, prediction, where previous physiological states were used to predictVO 2 [23] and ventilatory threshold [24] during stationary cycling on a cycle ergometer. Although sequential prediction has been shown to model the temporal nature of oxygen uptake well, further investigations are needed to assess model efficiency and performance across a range of exercise intensities, different days, and demographics.
In this paper, we propose and evaluate a sequential deep learning model based on temporal convolutional networks (TCN) [25] for predictingVO 2 from physiological inputs arXiv:2105.09987v2 [cs.LG] 13 Oct 2021 Fig. 1: Network architecture. Temporal features (heart rate, heart rate reserve, breathing frequency, minute ventilation, and work rate) are processed through a series of residual blocks with causal convolutions and dilations for feature extraction, followed by a fully connected layer and linear activation to predictVO 2 at time t.
derived from smart textiles and a cycle ergometer. The model used causal convolutions to incorporate only past and present physiological response, and the system architecture was designed to provide a tunable effective history length, or receptive fields. We assessed the effect of receptive field and model complexity on prediction accuracy to investigate the temporal relationship between physiological inputs andVO 2 response to provide guidance on optimal model design and assessment across a range of different exercise intensities. Fig. 1 shows the data flow through the TCN network architecture. The four cardiorespiratory biosignals derived from the smart shirt (heart rate (HR), heart rate reserve, breathing frequency, and minute ventilation (V E )) and the work rate (WR) profile were used as inputs into a chain of residual blocks, followed by a dense layer and linear activation to predictVO 2 at each time point. Results of the TCN networks were compared against a stacked long short-term memory (LSTM) network [26] and random forest (RF) [27] prediction models. The stacked LSTM model was trained using the originally proposed features [23], as well as adding heart rate reserve andV E to the feature set, with a sequence length of 140 s approximating 70 breaths at low intensity exercise. The RF model was built using the optimal number of trees according to the validation loss (30 trees; see Supplementary Materials).VO 2 data were converted to ml·min −1 ·kg −1 to compute metabolic equivalent of task (METs) for quantifying physical activity levels (METs=VO 2 /3.5 [28]). METs were classified as light (<3.0 METs), moderate (3.0 to 5.9 METs), or vigorous (≥6.0) intensity exercise according to established guidelines [28].    Table I). Trend lines were generated using exponential weighting moving average (α = 0.5) for visualization purposes. Validation loss decreased to convergence with larger receptive fields, apart from the two-filter models which were insufficient for learning the prediction function, as evidenced by a flat and highly variable loss curve over all receptive fields. Performance increase was marginal beyond 8 filters. The optimal hyperparameters that produced the smallest hold-out validation loss were 24 filters and a receptive field of 218 s using a kernel size of 8 s and 5 dilations (2 residual blocks). However, a smaller receptive field and/or a smaller set of model parameters may be preferable to a marginal loss increase. Smaller receptive fields allow for reduced "cold start" time, and fewer parameters result in decreased computational load. Thus, we assessed the accuracy of models that were within 5% of the best (minimum) validation loss below.

C. Network Performance
We assessed the performance of models exhibiting validation loss within 5% of the minimum validation loss, as well as models with no history (TCN-VO2 NH) and using only HR as input (TCN-VO2 A(HR)), and compared the results to existingVO 2 prediction methods (Table II). There were eight models within the 5% performance threshold, of which we reported four with varied receptive fields and parameter sets. The best performing model according to hold-out validation loss (TCN-VO2 A) had a 218 s receptive field and 19921 parameters. TCN-VO2 B had marginally higher validation loss with 187 s receptive field, indicating minimal performance gain for additional history beyond 187 s. TCN-VO2 C and D had near identical validation loss (3% increase over model A), but required a receptive field of only 76 s and 91 s respectively, thus providing a reduced cold start period for initial prediction.
Repeated measures Bland-Altman analysis showed no systematic error between predicted and trueVO 2 across all TCN-VO2, LSTM and RF models, and no proportional error (Fig. 4). The limits of agreement (LoA) remained relatively constant across TCN-VO2 models during L-M, VT-H, and MAX data, but increased with decreased model complexity in L-H and combined data. Model bias on all combined data was less than 25 ml·min −1 in the best performing TCN-VO2 models and 72 ml·min −1 for 1 s receptive field (model NH), and the equality line fell within the confidence interval of the mean difference. TCN-VO2 models exhibited smaller LoA compared to LSTM in all exercise protocols, indicating a stronger overall fit and smaller error variance. Similarly, mean error ofVO 2peak was smallest in TCN-VO2 A (18 ml·min −1 ), strongly outperforming models with no history (RF: 134 ml·min −1 , TCN-VO2 NH: 223 ml·min −1 ) and HRonly (−220 ml·min −1 ), as well as exhibiting smaller LoA than LSTM models. Furthermore, TCN-VO2 models required substantially fewer network parameters to achieve comparable bias and lower LoA, requiring 1.1× and 4.1× fewer parameters in the TCN-VO2 A and TCN-VO2 C, respectively, and 60.5× fewer in TCN-VO2 NH compared to the stacked LSTM model. TCN-VO2 A(HR) exhibited high error bias and variance across protocols, showing that HR is not a robust estimator ofVO 2 .
TCN-VO2 with no history ("NH"), comprising 1 s receptive field, was compared against RF which performs point-wise predictions without input from previous states. TCN-VO2 NH and RF performed comparably, although the TCN-VO2 NH parameter set contained 825.8× fewer parameters than RF (quantified as the number of split nodes in the forest). Both TCN-VO2 NH and RF exhibited larger bias in L-H data com-pared to the other protocols due to overestimatingVO 2 during low WR transients (see Fig. 3 inset).VO 2 prediction during exercise with relatively small WR transitions was not as severely affected. Top performing TCN models were the only models to correctly predict the lowestVO 2 values during the off-transients, making them good candidates for assessing kinetics.

D. Quantifying Physical Activity Levels
PredictedVO 2 were used in accordance with globally established guidelines on exercise prescription for health [28] to classify second-by-second activity levels. These guidelines are often used in practice, alongside other biomarkers, for optimizing cardiovascular health during rehabilitation exercise prescription. Fig. 5 shows the METs data derived from the true (measured) and predictedVO 2 data, as well as the confusion matrix. All test data (4 protocols/participant across the 5participant test set) were classified on a second-by-second basis and visually concatenated into a single plot spanning a total of 16090 s of predicted metabolic activity across 20 individual exercise trials. Overall, 15147 s (94.1%) of the 1 Hz data were correctly classified into appropriate physical activity categories. The data spanned across light (2.7%), moderate (33.8%), and vigorous (63.5%) intensity exercise. During moderate-intensity exercise (5445 s), 4882 s (89.7%) were correctly classified, while 153 s (2.8%) and 410 s (7.5%) were classified as light and vigorous activity. During vigorousintensity exercise (10216 s), 10004 s (97.9%) were correctly classified, while 212 s (2.1%) were classified as moderate activity. Less than 3% of the data were in the "light" category (429 s), and of those data, 261 s (60.8%) were correctly classified and 168 s (39.2%) were classified as moderate activity. 90% of these data were within 0.5 METs of the threshold.

III. DISCUSSION
This work has shown that the complex dynamicVO 2 response to changes in exercise intensity can be accurately predicted using sequential deep learning models across a range of low-, moderate-, and high-intensity exercise, as well as maximal aerobic exercise. The baseline model TCN-VO2 NH (with 1 s receptive field) and point-wise RF regression [21] were both blind to previous states, using only the current time point for prediction. In both cases, we observed a large and  [21], and stacked long short-term memory network (LSTM) [23] across the four exercise protocols. Dashed gray line represents the participant's ventilatory threshold. similar bias in the L-H data, albeit being two fundamentally different model types. The source of error was largely due to overestimation during recovery in the off-transients. The overestimation is likely a result of the point-wise predictor's blindness to the previous state, as the models were unable to adequately learn the relationships between model inputs and outputs during on-and off-transients. This is an important consideration, as differences in HR dynamics have been observed between on-and off-transients, with the slower off-transients being amplified following high-intensity exercise [17], [29]. HR kinetics are also reported to be slower thanVO 2 kinetics during recovery [17]. Similarly, the ventilation response has been observed to be slower during recovery compared to exercise onset [30], with the rate of change in ventilation being markedly slowed when recovering from higher intensity exercise [29]. Furthermore, the ventilatory response is slower than theVO 2 response [31], [32]. Accordingly, without knowing the previous history of the system, these point-wise models are naive to status of the system leading to erroneouṡ VO 2 predictions. Thus, it appears that the temporal models were able to more accurately learn the relationships between HR and ventilation, andVO 2 during exercise onset as well as recovery.
The best performing models, when ranked by validation loss, were consistently those with receptive fields of 218 s, 187 s, 76 s, and 91 s. We observed marginal difference between the test errors of these models, although there were some differences in predictingVO 2 minima at the end of offtransients, which did not significantly affect the global loss function. Considering that the standardVO 2 time constant is typically 20-30 s in healthy populations [33], these receptive fields range from approximately 3-11 time constants, which suggests that the best models tend to use a receptive field that includes most, if not all, of the transient phase for a sustained step change in WR to achieve a new constantVO 2 . Receptive Both classes of sequential deep learning models (TCN and LSTM) exhibited strong predictive power. TCN architectures have become popular alternatives to recurrent neural network instances largely due to their data parallelism, flexible receptive field size, and stable gradients [25]. In this work, top performing TCN models were much smaller than the stacked LSTM models, yielding more computationally efficient prediction networks.
TheVO 2 predictions were converted to METs, which is an established metric for quantifying physical activity levels [28]. Quantifying activity levels is helpful for exercise prescription in cardiovascular disease management [34]. However, traditional patient recall may be affected by recall and/or social desirability bias [35], providing uncertainty in an important biomarker for cardiovascular health. This work showed strong accuracy across a range of METs categories, which provides supporting evidence for quantitative and objective at-home activity monitoring using wearable sensors. The errors in the light activity category were mainly during category transition, not sustained activity. In these data, the majority of light data was close to threshold. It would be expected that incorporating data from resting of very low-intensity exercise would lead to a more representative error profile.
The primary limitations impacting the widespread generalizability of these results stem from the dataset's constrained demographic (young healthy adults) and structured exercise protocol. We selected the three-stage PRBS protocol because it challenged the dynamic response of aerobic metabolism across a wide range of intensities that could occur in reallife situations. PRBS exercise provides an opportunity to directly quantify an index of physical fitness [36]; however, PRBS cycling exercise is a controlled laboratory protocol. Accordingly, our optimization results may not be directly applicable to unstructured exercise or different exercise pro-tocols (i.e., constant load exercise). Further investigations in different exercise situations with a more diverse participant sample, including cardiovascular-related diseases, are needed to assess its generalizability to different kinds of physical activity and out-of-sample populations. Additionally, the exercise in this study was performed in a temperature controlled environment. Further investigations are needed to evaluate model performance in specialized environments that may alter cardiovascular response to exercise (e.g., hypoxia, heat stress, etc.).
To conclude, using causal convolutions in a temporal deep learning model, the effect of receptive field (i.e., effective history) onVO 2 prediction was assessed by performing a grid search across hyperparameter values. The best performing models, according to validation loss, comprised receptive fields of 218 s, 187 s, 97 s, and 76 s. Results showed low prediction error across a wide range of exercise intensities with drastically reduced parameter sets compared to existing methods. Using HR as the only input feature into the same model architecture yielded substantially larger errors, reinforcing that HR alone is insufficient for predictingVO 2 , thus necessitating more complex approaches. Using the temporal prediction outputs, physical activity levels were quantified to provide a breakdown of time spent in light, moderate, and vigorous activity according to global health guidelines. These results suggest that cardiorespiratory function may be assessed in non-laboratory settings, when combined with tidal volume calibration, across a wide range of activity levels using wearable sensors and smart textiles.

A. Data Collection and Preprocessing
Twenty-two young healthy adults (13 males, 9 females; age: 26 ± 5 yr; height: 1.71 ± 0.08 m; mass: 70 ± 11 kg;VO 2peak : 42 ± 6 ml·min −1 ·kg −1 ) with no known musculoskeletal, respiratory, cardiovascular, or metabolic conditions volunteered to participate in the study. The study was approved by a University of Waterloo Research Ethics committee (ORE #32164) and conducted in accordance with the Declaration of Helsinki. All participants signed an informed consent before participating.
Participants visited the laboratory on four separate occasions to perform a ramp-incremental exercise test, and three different pseudorandom binary sequence (PRBS) exercise tests [37]. Each exercise session was separated by at least 48 h, and participants were instructed to arrive for testing at least two hours postprandial, and abstain from alcohol, caffeine and vigorous exercise in the 24 hours preceding each test. All exercise tests were performed in an environmentally controlled laboratory on an electronically braked cycle ergometer (Lode Excalibur Sport, Lode B.V., Groningen, Netherlands). Participants were instructed to maintain cadence at 60 revolutions per minute for all exercise tests.
On the first visit, 5 min of seated resting data were collected to determine each participant's resting HR. After the resting period, participants performed a ramp-incremental exercise test to exhaustion (25 W baseline for 4 min followed by a 25 W·min −1 ramp) to determine each participant's ventilatory threshold (VT) [38],VO 2peak , and the work rates (WR) for the PRBS exercise tests (Fig. 6). The test was terminated when the cadence dropped below 55 revolutions per minute despite strong verbal encouragement.VO 2peak was defined as the highestVO 2 computed from a 20 s moving average during the exercise test.VO 2 at VT was estimated by visual inspection using standard ventilatory and gas exchange indices, and their ratios, as previously described [38]. WRs at 90% VT, VT, and the midpoint between VT andVO 2peak (referred to as ∆50%) were estimated by left-shifting theVO 2 response by each individual's mean response time to align theVO 2 and WR profiles. Mean response time was determined by fitting a double-linear model to the ramp-incremental data, and finding the point of intersection between the forward extrapolation of the averageVO 2 during the 25 W baseline cycling in the 2 min prior to ramp onset, and the backwards extrapolation of the linear portion of the rampVO 2 response below VT [39].
In visits 2-4, participants performed one of three different PRBS exercise tests in a randomized order. WRs systematically alternated in the three PRBS exercise test between 25 W and 90% of the VT (low-to-moderate; L-M), 25 W and ∆50% (low-to-high; L-H), or VT and ∆50% (VT-to-high; VT-H). The time series for the changes in WR for PRBS protocols were generated by a digital shift register with an adder module feedback [40], [41], [42], [37]. This process pseudo-randomized the changes in WR and ensured that there would be sufficientVO 2 signal amplitude while performing non-constant load exercise. A single PRBS was composed of 15 units, each of 30 s in duration, totaling 7.5 min. Each complete PRBS testing session consisted of a 3.5 min warmup (the last 3.5 min of the 7.5 min PRBS), and then two full repetitions of the PRBS for a total of 18.5 min of continuous cycling per session. A portable metabolic system (MetaMax 3B-R2, CORTEX Biophysik, Leipzig, Germany) was used to measure gas exchange during all exercise tests. Participants breathed through a mask (7450 SeriesV2 Mask, Hans Rudolph Inc., Shawnee, KS, USA), and inspired and expired flow were measured using a bi-directional turbine. The turbine was calibrated before each testing session using a 3L syringe. Oxygen and carbon dioxide gas concentrations were continuously sampled at the mouth and were analyzed using a chemical fuel cell and nondispersive infrared sensor, respectively. Precision-analyzed gas mixtures were used to calibrate the oxygen and carbon dioxide gas concentrations.VO 2 and carbon dioxide output were calculated using standard breath-by-breath algorithms [3].VO 2 data were filtered using a sliding 5-breath median filter to correct spurious outlier breaths.
Participants wore a HR monitor (Polar H7, Polar Electro Oy, Kempele, Finland) that wirelessly communicated with the portable metabolic system, such that HR data were logged synchronously with the gas exchange data. Participants were also fitted with a wearable integrated sensors shirt that was sized to each participant based on manufacturer guidelines (Hexoskin, Carre Technologies, Montreal, Canada). The shirt contained a textile electrocardiogram to measure HR, and thoracic and abdominal respiration bands to obtain estimates of breathing frequency and minute ventilation (V E ) via respiratory inductance plethysmography. Estimates ofV E provided by the smart shirt were calibrated by linear regression to the knowṅ V E measured throughout each protocol with the bi-directional turbine. The estimates of HR,V E , and breathing frequency from the smart shirt have been previously validated [43]. Data recorded by the metabolic system and smart shirt were timealigned by cross-correlating the two different HR signals. After processing, all data were interpolated to 1 Hz to ensure signal synchronization and a constant sampling rate. The final set of physiological features were: work rate (W),V E (L·min −1 ), breathing frequency (breath·min −1 ), HR (bpm), and heart rate reserve (%). Heart rate reserve at time t was calculated as (HR t − HR rest )/(HR max − HR rest ), where HR rest and HR max were determined during 5 min rested seated baseline and the peak HR during ramp-incremental test, respectively.

B. Network Architecturė
VO 2 kinetics have a systematic, albeit complex, temporal response to exercise [44]. Exponential models for quantify-ingVO 2 kinetics during controlled exercise protocols have demonstrated that the intensity of the exercise being performed strongly influences the dynamicVO 2 response [45], [46], [36]. Thus, as an alternative to conventional convolutional neural networks that use a symmetric kernel about the current time (or space) location, we developed a sequential convolutional model using causal convolutions [25]. Specifically, given a sequence of time series inputs x 1 , x 2 , . . . , x T ∈ R n extracted from wearable sensors and a cycle ergometer, the goal was to predict theVO 2 at time t, denotedVO 2,t . Specifically, given a prediction model M : X T → Y,VO 2,t is predicted using historical temporal features x i ∈ R n only up to the current time point: where w is the effective history. Our model incorporated a temporal feature extraction network and a regression network. The feature extraction network Fig. 6: Schematic showing the incremental ramp and three pseudorandom binary sequence (PRBS) cycling tests. PRBS protocols were designed using work rates at 90% ventilatory threshold (VT), VT, and the midpoint between VT andVO 2peak (∆50%) using participant-specific VT andVO 2peak determined from the ramp test.
was implemented as a temporal convolutional network (TCN) with a tunable receptive field through multiple sequential layers with kernel dilation for multi-scale aggregation of the input data [25], [47], [48]. The TCN was implemented as a sequence of stacked residual blocks comprised of repetitions of dilated causal 1D convolution, layer normalization, rectified linear activation, and dropout (see Fig. 1). The input into the residual block x was added to the residual function F (x) through identity mapping (or 1x1 convolution in the first residual block when the number of channels did not match the residual function shape) to encourage learning of the residual modifications of the input data, which has been shown to improve the performance of deep networks [49]. Each residual block is composed of successive pairs of dilations, rather than applying the same dilation twice as in the original TCN description [25], for better control of the receptive field. In models with an odd number of dilations, the first three were grouped into a single residual block, inspired by ResNet, which learns a function over two or three layers [49]. Thus, the (causal) receptive field was determined by the kernel size k and the number of exponential dilations N : A splice function was used to extract the features at the last known time point t for input into the regression network. This network was defined simply as a fully connected dense layer and linear activation to predictVO 2,t .

C. Training and Hyperparameter Optimization
The data across all 22 participants were split into train (40 protocols), test (20 protocols), and validation (19 protocols) datasets. Each participant's whole data were only comprised within one of train, test, or validation, with no participant data spread to encourage generalizability. Data of specific sequence lengths (according to the receptive field) were extracted from whole exercise protocols through a sliding window method, where the first sequence of sequence length T was extracted from the onset of exercise to time T , and subsequent sequences were extracted until the end of the exercise was reached. This yielded a dataset of size R N ×T ×F , where N is the number of sequences, T is the sequence length, and F is the number of feature signals. The sequence length was determined by the network's receptive field. Each feature, except WR, was standardized to zero mean and unit variance according to the training data statistics. WR was normalized to [0, 1] due to its non-normal distribution.
The hyperparameters of importance are the number of convolutional filters and the receptive field, which is defined by kernel size, dilation rate and network depth (Eq. (2)). This has practical implication in that having a very long receptive field results in a delayed initial prediction (or "cold start") of that length of time. Hyperparameter search (number of filters, kernel size, dilation depth) was performed on the hold-out validation set. Table I lists the hyperparameter values that were searched. A grid search was performed on a distributed computing cluster using 108 cores and 36 NVIDIA T4 Turing GPUs across 9 compute nodes.
The network was trained using the Adam optimizer, 0.2 dropout rate, 32 minibatch size, and a learning rate of 0.0005 over 100 epochs. For each hyperparameter combination, the epoch with the lowest validation loss was saved. The network hyperparameters producing the lowest mean squared error in the validation set was chosen as the final network architecture.

D. Data Analysis
Signals were analyzed in MATLAB (2020b, MathWorks, Portola Valley, CA, USA). Statistical analyses were conducted in R (version 3.5.1). The agreement between the predicted and directly measuredVO 2 were assessed using repeated measures Bland-Altman analysis, which accounts for the within-participant variance of the repeated measures data [50].VO 2peak agreement was assessed using standard Bland-Altman.