Generalisable machine learning models trained on heart rate variability data to predict mental fatigue

A prolonged period of cognitive performance often leads to mental fatigue, a psychobiological state that increases the risk of injury and accidents. Previous studies have trained machine learning algorithms on Heart Rate Variability (HRV) data to detect fatigue in order to prevent its consequences. However, the results of these studies cannot be generalised because of various methodological issues, including the use of only one type of cognitive task to induce fatigue, which makes any predictions task-specific. In this study, we combined the datasets of three experiments, each of which applied a different cognitive task for fatigue induction, and trained algorithms that detect fatigue and predict its severity. We also tested different time window lengths and compared algorithms trained on resting and task-related data. We found that classification performance was best when the support vector classifier was trained on task-related HRV calculated for a 5-min time window (AUC = 0.843, accuracy = 0.761). For the prediction of fatigue severity, CatBoost regression showed the best performance when trained on 3-min HRV data and self-reported measures (R² = 0.248, RMSE = 17.058). These results indicate that both the detection and prediction of fatigue based on HRV are effective when machine learning models are trained on heterogeneous, multi-task datasets.


Cross-modal switching task (task switching experiment)
A modified version of the cross-modal switching task developed by Lukas et al. (2014) was used (Matuz et al., 2019). To indicate the relevant modality of the stimulus, each trial started with a visual cue (a white cross, 1.5 cm x 1.5 cm, visual angle of 1.25°) or an auditory cue (600 Hz tone, 45 dB). The cues were presented for 200 ms. The number of consecutive trials repeating the same cue modality (repetition trials) varied between 2 and 5. After the cue, visual and auditory stimuli were presented simultaneously with a presentation time of either 100 ms (short) or 300 ms (long). Trials were congruent if the durations of the auditory and visual stimuli were the same, and incongruent if the durations of the two stimuli differed. The visual stimulus was a white diamond (1.5 cm x 1.5 cm, visual angle of 1.25°) presented centrally, and the auditory stimulus was a 400 Hz tone (45 dB). Participants were asked to indicate the duration of the cued stimulus (short or long) by key press on a response pad. The stimulus-response mapping was counterbalanced across participants. A trial was terminated when a response was given or after 2500 ms. The response-cue interval was a constant 1500 ms. We emphasized the equal importance of speed and accuracy to the participants.

Gatekeeper task (2-back experiment)
A modified version of the Gatekeeper task developed by Heathcote et al. (2014, 2015) was used (Matuz et al., 2021). The Gatekeeper task is a dual 2-back task with visual and auditory stimuli, and it has a game-like character given by the task instructions. We instructed participants that "they were in a training to become a nightclub doorperson, and that their task was to allow in only cool patrons. A patron tries to gain access through one of the three doors, as indicated by the door being highlighted, and by saying one of the three password letters" (see Heathcote et al., 2015, p. 976).
On each trial, the visual (i.e. door images) and auditory stimuli (i.e. spoken letters) were presented simultaneously. For the visual stimulus, an image of 3 doors (5.58° x 7.65° visual angle) was shown in the center of the screen; one of the doors was always highlighted in light red. For the auditory stimulus, one of three vowel letters was spoken by regular speakers (A, E, I; phonetic symbols: ɒ, ε, i:). Four stimulus conditions were prepared: Dual target, Single visual target, Single auditory target, and No target. In the Dual target condition, both the visual and the auditory stimulus matched the stimuli shown two trials earlier (2-back match). In the two single target conditions, the 2-back match occurred in only one stimulus modality: either the auditory stimulus (Single auditory target condition) or the visual stimulus (Single visual target condition). In the No target condition, neither stimulus had a 2-back match. Half of the trials were target trials (i.e. Dual target, Single auditory target, and Single visual target trials). A trial was terminated by the response or after 2.5 s without a response, and the next trial began 2.5 s after the response.
Participants were required to indicate by key press on a response pad whether to block the patron (in case of a 2-back match in either stimulus modality) or allow entrance (no 2-back match in either modality). The order of the keys was counterbalanced across participants. It was emphasized that speed and accuracy were equally important.

Semantic Stroop test (Stroop experiment)
In the semantic Stroop test, two modality conditions were introduced. In the auditory condition, participants had to attend to the auditory stimulus and ignore the visual stimulus, while in the visual condition, they had to attend to the visual stimulus and ignore the auditory one. The modality condition changed after every 12 consecutive trials in an alternating fashion (i.e. after 12 trials, participants always had to switch attention to the other modality). A visual cue accompanied by an auditory warning signal indicated the modality to be attended in the next 12 trials. The visual cue was either the word "Auditory" or "Visual" presented in the center of the screen for 1000 ms. The auditory warning signal was an 800 Hz tone presented for 100 ms with an intensity of approximately 45 dB. The auditory and visual stimuli were spoken and written names of animals (birds and mammals), respectively, presented for 700 ms. The two stimuli were presented simultaneously. Participants were asked to judge whether the attended written or spoken animal name in the current trial referred to a bird or a mammal. Participants responded within a time window of 1500 ms by pressing one key on the response box for birds and another key for mammals. The inter-trial interval varied between 500 and 3000 ms.

Sleep duration measurement in the experiments
In each experiment, participants were asked to sleep well during the night prior to the experiment. Sleep duration was measured by self-report and by an actigraph (except for the Stroop experiment). The mean duration of sleep prior to the experiments was 7.7 hours (SD = 1.56) based on self-reports and 7.79 hours (SD = 1.5) based on the actigraph measurements. Thus, the participants were well rested before the fatigue induction.

Feature selection for classification models
Feature selection was performed on the training set in three steps. First, the importance of each variable was computed by random forest classifier (number of estimators = 200).
Second, for highly correlated features (i.e. a Pearson's r greater than .7), the one with the lower importance was removed. Third and finally, recursive feature elimination with 5-fold cross-validation (5-CV) was applied to select the best set of features. Importantly, this feature selection procedure was performed separately for each classification problem (i.e. training on task-related vs. resting HRV data) and each time window (i.e. 1-5 minutes).
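The three-step procedure could be sketched with scikit-learn as follows; the synthetic dataset and all feature counts are illustrative stand-ins for the HRV features, not the study's data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Illustrative stand-in for the HRV training set
X, y = make_classification(n_samples=200, n_features=12, n_informative=5,
                           random_state=0)

# Step 1: variable importance from a random forest (200 estimators, as in the text)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = rf.feature_importances_

# Step 2: among feature pairs with Pearson's r > .7, drop the less important one
corr = np.corrcoef(X, rowvar=False)
n_features = X.shape[1]
drop = set()
for i in range(n_features):
    for j in range(i + 1, n_features):
        if abs(corr[i, j]) > 0.7:
            drop.add(i if importances[i] < importances[j] else j)
keep = [i for i in range(n_features) if i not in drop]

# Step 3: recursive feature elimination with 5-fold CV on the surviving features
rfecv = RFECV(RandomForestClassifier(n_estimators=200, random_state=0), cv=5)
rfecv.fit(X[:, keep], y)
selected = [keep[i] for i, kept in enumerate(rfecv.support_) if kept]
```

In the study this whole pipeline would be re-run per classification problem and per time window; here a single run is shown for brevity.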

Hyperparameters tuned for classification models
The hyperparameter space of the support vector machine algorithm consisted of linear and radial basis function kernels, the set {10⁰, 10¹, 10²} for C and the set {10⁰, 10⁻¹, 10⁻²} for γ.
For the k-nearest neighbors algorithm, k values from 1 to 20 were examined to identify the optimal one. Finally, for random forest, the optimized parameters were the maximum depth (ranging from 3 to 6) and the number of estimators (10, 50, 100 and 200).
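The search spaces above could be expressed as scikit-learn grids as sketched below; the synthetic data and the choice of exhaustive grid search with 5-fold CV are illustrative assumptions, not a description of the study's exact tuning code:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Illustrative stand-in for the HRV training set
X, y = make_classification(n_samples=150, n_features=8, random_state=0)

# Grids mirroring the hyperparameter spaces described in the text
svc_grid = {"kernel": ["linear", "rbf"], "C": [1, 10, 100], "gamma": [1, 0.1, 0.01]}
knn_grid = {"n_neighbors": list(range(1, 21))}
rf_grid = {"max_depth": [3, 4, 5, 6], "n_estimators": [10, 50, 100, 200]}

searches = {
    "svc": GridSearchCV(SVC(), svc_grid, cv=5),
    "knn": GridSearchCV(KNeighborsClassifier(), knn_grid, cv=5),
    "rf": GridSearchCV(RandomForestClassifier(random_state=0), rf_grid, cv=5),
}
for name, gs in searches.items():
    gs.fit(X, y)
    print(name, gs.best_params_)
```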

Procedure for permutation tests
To conduct permutation tests for the classification models, the procedure described in Boeke et al. (2020) was followed. On each iteration, a model was trained on the training dataset with shuffled class labels (i.e. predictors and class labels were mismatched), and an AUC score was calculated from the performance of this model on the (unshuffled) testing dataset. We thus generated the null distribution of AUC scores, and a p-value was obtained by dividing the number of iterations that yielded a higher AUC score than the actual model by the total number of iterations.
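A minimal sketch of this label-shuffling permutation test, using a support vector classifier on synthetic data (the classifier, dataset, and iteration count are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# AUC of the actual model on the untouched test set
actual_auc = roc_auc_score(y_te, SVC().fit(X_tr, y_tr).decision_function(X_te))

# Null distribution: retrain on shuffled training labels, score on the real test set
n_iter = 100  # illustrative; real permutation tests typically use many more iterations
null_aucs = [
    roc_auc_score(y_te, SVC().fit(X_tr, rng.permutation(y_tr)).decision_function(X_te))
    for _ in range(n_iter)
]

# p-value: share of permuted models that beat the actual model's AUC
p_value = sum(a > actual_auc for a in null_aucs) / n_iter
```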
For regression, we followed the same procedure. On each iteration, the model was trained on the shuffled training dataset (i.e. where the predictors and the outcome variable did not match) and the level of subjective fatigue was predicted in the (unshuffled) testing set. From the observed R² values, we generated the null distribution of R², and a p-value was obtained by dividing the number of iterations that yielded an R² higher than the actual R² by the total number of iterations.
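The regression variant differs only in the model and score; a sketch using GradientBoostingRegressor as a stand-in for CatBoost, with synthetic data and an illustrative iteration count:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for CatBoost
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def fit_score(train_labels):
    """Fit on (possibly shuffled) training labels, return R2 on the real test set."""
    reg = GradientBoostingRegressor(n_estimators=50, random_state=0)
    return r2_score(y_te, reg.fit(X_tr, train_labels).predict(X_te))

actual_r2 = fit_score(y_tr)

# Null distribution of R2 from models trained on shuffled outcome values
n_iter = 50  # illustrative iteration count
null_r2 = [fit_score(rng.permutation(y_tr)) for _ in range(n_iter)]

# p-value: share of permuted models whose R2 exceeds the actual R2
p_value = sum(r > actual_r2 for r in null_r2) / n_iter
```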