Accuracy of heart rate variability estimated with reflective wrist-PPG in elderly vascular patients

Optical heart rate monitoring (OHR) with reflective wrist photoplethysmography is a technique mainly used in the wellness application domain for monitoring heart rate levels during exercise. In the absence of motion, OHR technique is also able to estimate individual beat-to-beat intervals relatively well and can therefore also be used, for example, in monitoring of cardiac arrhythmias, stress, or sleep quality through heart rate variability (HRV) analysis. HRV analysis has also potential in monitoring the recovery of patients, e.g. after a medical intervention. However, in order to detect subtle changes, the calculated HRV parameters should be sufficiently accurate and very few studies exist that asses the accuracy of OHR derived HRV in non-healthy subjects. In this paper, we present a method to estimate beat-to-beat-intervals (BBIs) from reflective wrist PPG signal and evaluated the accuracy of the proposed method in estimating BBIs in a cross-sectional study with 29 hospitalized patients (mean age 70.6 years) in 24-h recordings performed after peripheral vascular surgery or endovascular interventions. Finally, we evaluate the accuracy of more than 30 commonly used HRV parameters and find that the accuracy of certain metrics, for example SDNN and triangular index, shown in the literature to be associated with the deterioration of the status of the patients during recovery from surgical intervention, could be adequate for patient monitoring. On the other hand, the parameters more affected by the high-frequency content of the HRV and especially the LF/HF-ratio should be used with caution.

Unobtrusive continuous monitoring and automatic analysis of physiological variables is an emerging area that has the potential to improve the effectiveness of healthcare delivery by providing early indications in the changes of the patients' status, whether being treated in a hospital or staying at home. However, in order to be usable in practice, the data used by the automatic analysis algorithms needs to be reliable and accurate.
Reflective photoplethysmography (PPG) measured with a wrist-worn device, also called optical heart rate (OHR) monitoring, is a technique traditionally used mainly in the wellness application domain for monitoring heart rate level during exercise. In the absence of motion, the OHR technique is also able to estimate individual beat-to-beat intervals (BBIs) relatively accurately and has therefore recently emerged as an unobtrusive method for detecting cardiac arrhythmias [1][2][3] . Besides arrhythmias, the performance of wrist-worn OHR monitoring has also been studied, for example, in the assessment of psychological stress 4 and in sleep staging through heart rate variability (HRV) and movement analysis in healthy subjects 5,6 . Studies evaluating the performance in beat-tobeat heart rate monitoring and accuracy of HRV parameters have usually been performed in controlled situations during selected activities or at rest as in 7,8 or as reviewed in 9 . Further, studies evaluating the applicability of wrist-worn OHR technology in estimating HRV in hospitalised patients have been scantly reported.
A recent study reported poor performance of HRV estimation with commercial wrist OHR device in uncontrolled conditions 10 highlighting the need for improvements in the measurement technology or signal analysis methods. The most significant limitation of OHR technology is its high sensitivity to movement artefacts, which poses challenges for the signal processing algorithms to choose only those heartbeats or heartbeat intervals that are not affected by movements. In addition, factors such as poor superficial blood perfusion and skin color affect the quality of the obtained signal and consequently, the accuracy of beat-to-beat heart rate 11 www.nature.com/scientificreports/ The predictive value of HRV parameters measured through gold standard electrocardiography (ECG) in identifying patients at risk of post-surgical complications has been studied in various patient groups such as in hip fracture 13 , digestive surgery 14 , abdominal aortic surgery 15 , and cardiac surgery patients. In cardiac patients, a popular topic has also been the evaluation of the relation of long-term mortality and HRV after myocardial infarction. A number of studies focusing on cardiac patients and HRV have been aggregated in the reviews by Nenna et al. 16 and Huikuri and Stein 17 . HRV analysis has also been proposed for early detection of infections related to communicable diseases 18 , development of septic shock 19 , and also for several other purposes related to anesthesia and intensive care 20 .
Large number of HRV parameters have been found in the aforementioned studies to indicate a high risk of perioperative complications and post-surgery mortality. Short descriptions of these parameters and their abbreviations are found in Table 1. Ernst et al. found significant association with decreased RMSSD and total power of pre-operatively measured HRV and increased probability of post-operative complications as well as decreased VLF power and HF/LF-ratio and post-operative infections in hip fracture patients 13 . They, however, did not find association between pre-operative SDNN and post-operative complications. On the other hand, SDNN and HRV triangular index measured on post-operative day 1 were found to be statistically significantly lower in digestive surgery patients developing post-operative complications 14 . Both Nenna et al. as well as Huikuri and Stein identified scaling exponent α 1 of detrended fluctuation analysis as a non-linear HRV parameter with high prognostic value in predicting long-term cardiac mortality 16,17 . This parameter was also found to be the best predictor of complicated recovery after coronary artery bypass grafting 21 . In the studies that have also evaluated the heart rate level as a potential indicator, increased post-operative heart rate has been found to predict or be associated with post-operative complications.
In all studies presented above, HRV analysis for monitoring of post-surgery patients and detecting complications has been performed using ECG. However, in order to be suitable for continuous monitoring for the duration of several days, the monitoring method should be as unobtrusive as possible. ECG electrodes are usually attached to the body with adhesives and may cause medical adhesive related skin injuries 22,23 , especially if worn for long periods of time, and are thus not a suitable approach.
The OHR technology would provide a convenient and unobtrusive solution for the task. However, typically the absolute amount of heart rate variation is decreased in aged and fragile patients 24 . As these patients exhibit the highest risk of developing complications, even better accuracy is needed from the measurement method in order to obtain adequate relative accuracy for the HRV parameters of interest. Inaccuracies in estimated heartbeat intervals also affect different HRV parameters differently 25 . Therefore, it is an interesting question whether those HRV parameters that have earlier been found to predict the development of post-treatment complications, can be estimated with adequate accuracy with OHR technology. We hypothesize that not all commonly used HRV parameters can be estimated with the same accuracy. We further hypothesize that some parameters may exhibit error levels small enough to render them possible candidates for future studies.
We performed 24-h monitoring with 29 subjects who had undergone vascular surgery or endovascular treatment. Typically, these patients have several comorbidities such as diabetes, hypertension, dyslipidemia, coronary artery disease, cerebrovascular disease and may have reduced cardiac function. For all patients, longterm recordings (mean 17.72 h, range 4.64-22.96 h) including day-and nighttime were acquired. The accuracy of a large set of HRV parameters was evaluated based on 5-min segments of BBIs estimated from reflective wrist PPG and reference ECG data.

Methods
Heartbeat interval and data quality estimation algorithm. For BBI estimation, we used continuous local interval estimation (CLIE) algorithm 26 augmented with an iterative estimation approach (adaptive prior) 27,28 . The algorithm was originally developed for BBI estimation from ballistocardiography (BCG) data, which is often noisy and exhibits changes in morphology, rendering standard peak-detection strategies suboptimal. In a previous study, we have shown that the methodology is suitable for accurate, unbiased estimation of BBI intervals from clean (finger clip) PPG 29 : In short, intervals are estimated based on self-similarity of the underlying signal. For this, three estimators, namely short-term autocorrelation, maximum amplitude pairs, and mean absolute differences are fused. Estimation is performed iteratively, with the first iteration resulting in a prior signal that is used in the second pass. For details, the interested reader is referred to 28 .
As in the original approach, in addition to the estimated interval, a quality metric q is reported for each estimated interval that quantifies the level of self-similarity detected. In Fig. 1, three signal excerpts are shown: while the first row presents an excerpt of a signal with a mean quality of 0.2, the second row shows an excerpt with q = 0.3 , and the last with q = 0.4.
If a technology is used to detect slow changes in health status, it is often less critical to get continuous estimations, but more important that the accuracy of those is sufficient for the given task. Thus, a common approach is to exclude data based on a quality metric, thereby creating a tradeoff between the so-called (temporal) coverage and the error. In the particular case of HRV estimation, measurement protocols usually require the subject to be as calm as possible. As motion artifacts are the main source of error, it is particularly advantageous to exclude segments with low quality in such a scenario. In this work, four different quality metrics were used. First, we used a fixed threshold of q th = 0.3 to only accept intervals with q > q th in the subsequent analysis. In addition, for each 5-min window, the following rules were applied: www.nature.com/scientificreports/ • Only if the median quality q med of the window is above a threshold value q th,med , the window is accepted: q med > q th,med • Only if the quality inside the window exhibits small variability (i.e. a small relative standard deviation), the window is accepted: SD(q)/q < q th,var • Only if the ratio of accepted intervals inside the window N OK relative to the estimated mean interval B BI is large enough, the window is accepted: The threshold values can be chosen by the user based on priorities of the application i.e. large data coverage or accuracy (as will be shown in the results section, Figs. 4 and 5). The overall process is visualized in Fig. 2.  Overall flow of the BBI estimation approach. The BBI-time series is estimated using an iterative approach 28 . Only if three criteria (sufficient median quality q th,median , low variation in quality q th,var , ratio of intervals estimated R th,OK ) are met, the 5-min window is analyzed using the Kubios HRV analysis software.  30 and has beed used in more than 4500 scientific publications according to the manufacturer. In this study, the parameters presented in Table 1 were evaluated. In particular, the analysis includes several time-domain parameters (SDNN, SDHR, RMSSD), frequencydomain parameters associated with low-frequency components (LF Abs., LF Log.), frequency-domain parameters associated with high-frequency components (HF Abs., HF Log.), as well as relative (Poincaré SD 2 /SD 1 , LF/HF), statistical (NN50, pNN50), and nonlinear (ApEn, SampEn, PC SD 1,2 , DFA α 1,2 ) parameters. The 5-min analysis window was shifted in steps of 60 seconds and HRV parameter calculation repeated. HRV parameters estimated with Kubios from PPG-based BBI and corresponding ECG-based reference RR-interval windows were exported and comparisons were performed in MATLAB.
For each 5-min window of the data of each subject, a ground-truth HRV parameter and an estimated value exist. For evaluation, we performed per-subject analyses (Figs. 6 and 8) as well as combined gross analyses (all other figures and tables). In the per-subject analysis, error metrics are calculated for each subject individually. This implies that the relative error is calculated by normalization with the average of the ground truth of all windows of that subject. In the gross analysis, all data of all subjects are aggregated. Here, the relative metrics are based on the average ground truth values of all windows of all subjects. Thus, the gross analysis is biased Table 1. HRV parameters calculated using the "Kubios HRV Premium" software. www.nature.com/scientificreports/ towards subjects with more accepted datapoints. By comparing gross and per-subject analysis, information about inter-subject variability is obtained.

HRV parameter description Abbreviation
To provide information on the distribution of the error, several error metrics are used. For all metrics, the difference of estimation and ground truth ( = estimation − ground truth) of the respective HRV parameter forms the basis. The "Bias" is defined as the average of all , i.e. it indicates over-and under-estimation in absolute units. The relative bias "Bias (%)" is the bias normalized by the average of the ground truth. "P05" and "P95" are defined as the 5th and 95th percentile of and give an indication of the spread of the error in absolute units. Similarly, the standard deviation of , "SD" is a metric for the spread of the error assuming a Gaussian distribution. The mean absolute error "MAE" is defined as the mean of the absolute values of , while the root mean square error "RMSE" is determined by calculating the root of the mean value after squaring all individual . As the RMSE is more sensitive to outliers, the comparison of MAE and RMSE gives information about their presence. The relative quantities "MAE (%)" and "RMSE (%)" are calculated analogously to the relative bias, with the selection of the normalization depending on per-subject or gross analysis as described above. Finally, we analyze the "Relative Error", which is divided by the mean of the ground truth. While "Bias (%)" is equivalent to the mean of "Relative Error", we additionally provide median, 25th, and 75th percentile to obtain further information of the spread of the error in relative terms (Fig. 7).
Evaluation data. The patient recordings were performed at the vascular surgery ward of Tampere University Hospital between April and October 2018. Inclusion criteria for the study were at least 50 years of age and admission to peripheral arterial bypass operation, endarterectomy, aortic surgery, or carotid surgery. Patient with cardiac pacemaker were excluded from the study. The study was a descriptive pilot study for gaining initial knowledge about the performance and suitability of new sensor technology for patient monitoring and 30 successful patient recordings was determined as a suitable sample size. Altogether 36 patients were recruited for the study but seven subjects had to be discarded due to technical problems or due to short duration of the recording. In four cases the problem was with the reference device (poor quality ECG), in one case with the study device (recording had not started), and in two cases the patient was admitted to re-operation shortly after the beginning of the recording. Thus, the data of 29 postoperative vascular patients were included in the analysis. The subjects were monitored for approximately 24 h with a wrist-worn OHR prototype device manufactured by PulseOn Oy, Finland, Fig. 3. Long-term recordings were obtained to eliminate bias potentially arising from, for example, monitoring only sleeping subjects.
The PulseOn wrist device uses green color LEDs with peak wavelength of 573 nm and 25 Hz sampling rate. Before interval estimation, the PPG data were upsampled to 200 Hz using linear interpolation and bandpassfiltered using a 2nd-order Butterworth bandpass-filter with a passband of 0.5 to 15 Hz.
The reference ECG was recorded with a Faros 360 five-lead Holter monitor manufactured by Bittium Biosignals using 1 kHz sampling frequency. Ambu Blue sensor L-00-S electrodes were used for the ECG recording. The average age of the subjects was 70.6 years (SD: 8.5 years, range 50-87 years). Seven of the subjects were female. The subjects had undergone different vascular and endovascular procedures such as lower limb percutaneous transluminal angioplasty and/or stenting, abdominal aortic aneurysm endovascular repair, carotid or femoral artery endarterectomy, or femoropopliteal bypass surgery. Approval for this study was obtained from the Regional Ethical Committee of Pirkanmaa Hospital District (R17027). Informed consent was obtained from all subjects. The guidelines of the Declaration of Helsinki were followed in the study. The study was registered at ClinicalTrials.gov, identifier NCT03572751.
In this study, ECG and PPG were recorded with two independent, unconnected wearable devices. Devices that have independent system clocks and are not synchronized can exhibit drifts in sampling rate 31 . In longterm recordings, even small drifts can amount to large offsets, which would lead to the comparison of noncorresponding windows in this study. Thus, we adopted the same alignment process as in our previous work on BCG data 28 : The algorithm calculates a time-varying offset that minimizes the median BBI estimation error in a moving window with the size of 1500 beats, which corresponds to approximately 25 min. The offset-vector was additionally median filtered with the same filter size to remove outliers. Finally, an offset-vector is obtained that ensures that each window of the PPG data is compared to the matching ECG window. As expected, the coverage decreases monotonically with an increase in R th,OK , an increase in q th,median , and a decrease in q th,var (Fig. 4). The same general tendency holds for the estimation error. Fig. 5 shows the mean absolute error for SDNN as one example. Note that the distribution of the coverage is the same for all HRV parameters, whereas we observed varying dependencies on the three thresholding parameters for the different error metrics and the different HRV parameters (not shown). In the following, a target coverage of 30% was arbitrarily set. Out of the several combinations of thresholds that would result in this target coverage, the following set was chosen: • R th,OK = 171.22 • q th,median = 0.46 • q th,var = 2.11 This choice of thresholds is visualized with a red dot in Figs. 4 and 5. Note that while these parameters lead to an average coverage of 30 %, the coverage for individual subjects may vary. Further note that this choice is fixed for all following calculations and arbitrary in the sense that different combinations of thresholds would also lead to the same average coverage of 30% as can be inferred from Fig. 4. At the same time, it would lead to different error levels, as can be seen in Fig. 5. As a consequence, the arbitrarily chosen parameters mark the upper bounds of the errors of the individual HRV parameters. Table 2 shows the numeric comparison of 38 different HRV parameters calculated for all accepted 5-min windows and aggregated for all 29 (out of the original 36) patients included into the analysis (i.e. so-called "gross analysis", see Sect. 2.2). In Fig. 6, per-subject results are presented as boxplots, where each datapoint presents results for one individual subject. Note that the graph is clipped at 100 % for better readability.
Several observations can be made from the results. For one, the (relative) RMSE tends to be significantly higher than the (relative) MAE for many of the parameters, indicating the tendency for large outliers in the estimation results. Moreover, estimation quality varies greatly from patient to patient and strongly depends on the parameter. While mean, maximum, and minimum heart rate can be estimated with an average relative error below 1.2 %, other parameters show inferior results. In particular, the patient-wise relative error in NN50 and pNN50 has a 75th percentile at about 280 % (not shown in Fig. 6 due to clipping). For one, the nature of this parameter makes it relatively susceptible to outliers. For another, the population in this study group has extremely low NN50/pNN50 values close to zero. As can be seen in Table 2, the mean absolute error in pNN50 is only 1.55 percentage points (which would obviously still result in an infinite error for patients with a true pNN50 of 0 %).  On the other hand, the components RMSSD and Poincaré SD1 exhibit a systematic overestimation in the range of 10 % with an interquartile range of approximately ±10 % . This tendency is largely a result of arbitrary/sporadic large beat-to-beat interval estimation errors, which tend to affect parameters containing differentiations in a more severe way. Note that the same systematic overestimation can also be seen in the frequency-domain parameters associated with high-frequency components.
In the frequency domain, both absolute as well as relative power are estimated with large biases and spreads. If comparison is performed in the logarithmic domain, however, biases/spreads are small for VLF, LF, and HF components ( −4 ± 5 % , 0 ± 3 % , 3 ± 7 % ). As a general tendency, we observe that high-frequency associated parameters HF Abs. and HF Log. as well as PC SD 1 and RMSSD are systematically overestimated. While these biases may be small in absolute numbers ( 4.86 ms 2 and 0.27 log, 1.71 ms and 2.42 ms, respectively), they do lead to a comparatively large relative bias in the range of 10 % for our patient group. Consequently, parameters that analyze the ratio of LF to HF components (LF/HF, PC SD 2 /SD 1 ) show severe systematic underestimation.
Note that the per-subject analysis (Fig. 6) calculates the relative error as "average of the error of subject n divided by average of the ground truth of subject n". The aggregated gross analysis, on the other hand, calculates "average of the error of all windows of all subjects divided by average of the ground truth of all windows of all subjects" (Table 2). Interesting observation can be made comparing the two: First, strong inter-individual differences can be observed in Fig. 6 and the optical measurement clearly seems to be more suitable for certain individuals than for others. For the "best" 25 % of the subjects, the relative MAE of most of the HRV parameters is less than 10 %. Second, as the relative per-subject errors far exceed those of the gross analysis in Table 2, we can assume that estimation errors in subjects with very low HRV parameter values have a strong negative impact on the relative accuracy seen in Fig. 6.
These assumptions can be supported further by Fig. 8. Here, each 'x' marks the mean over all windows where an estimation from PPG was available ('Estimation' , right) and the mean over all corresponding reference windows obtained from ECG ('Ground Truth' , left).
The graph shows data from 28 patients, as the coverage for one patient was zero (see also Fig. 6, rightmost column). Indeed, several patients exhibit a pNN50 of (close to) zero. Nevertheless, five patients show pNN50-values way above 10 % that can clearly be distinguished also in the PPG-based estimation. The same tendency holds for SDNN and LF Log. and, in parts, for RMSSD and HF Log., although for these parameters the aforementioned over-estimation of small values becomes obvious. Finally, neither LF/HF nor PC SD 2 /SD 1 can be estimated with confidence due to strong and, more importantly, varying under-estimations as indicated by several crossing lines.
As can be seen in our HRV parameter estimation accuracy evaluation, from the aforementioned parameters, SDNN was estimated on average with 9 % and triangular index with 12 % relative MAE from the wrist device Based on the results obtained in the present study, it is uncertain whether or not these two groups could be separated using OHR: For DFA α 1 , we obtained an MAE of 0.150 and an RMSE of 0.204 across all data, which lies in the range of the differences between the two groups. On the other hand, Ushiyama et al. found mean ± standard deviation of SDNN for the "complicated" group being 48.7 ± 24.4 ms , whereas 71.2 ± 19.6 ms was found for the Table 2. Accuracy of the estimated HRV parameters aggregated over all analyzed 5-min windows of all subjects. The same argument can be made for the Triangular Index ( 13.3 ± 6.7 "complicated" group, 19.9 ± 6.5 "uncomplicated" group 14 ), for which we found an MAE of 0.659 and an RMSE of 0.866. In 19 , the development of septic shock was most clearly predicted by RMSSD but also by absolute and normalized LF power as well as HF power and LF/HF-ratio. However, median values of (as well as differences between) both groups were extremely low: For example, comparing "No septic shock" with "Septic shock", the median values of RMSSD for both groups were 3.8 ms and 7.3 ms, respectively 19 . It thus remains questionable if the accuracy obtained in the present study with OHR (3.55 ms MAE and 4.55 ms RMSE for RMSSD) would suffice for this application scenario. Approximate entropy has also been found to predict the onset of atrial fibrillation (AFib) after coronary artery bypass grafting 32 . The "control patients" exhibited mean ± standard deviation values   Although the presented study is, to the best of our knowledge, the most comprehensive one in terms of assessing the accuracy of HRV estimations obtained via OHR in hospitalized patients, all-encompassing statements about the most reliable parameters may not be possible. Nevertheless, we believe our findings in terms of parameter recommendations can be summarized as follows: • The time-domain parameter SDNN can be estimated with a relative error/absolute error/relative bias of 9%/2 ms/2% and we thus recommend its use. We also recommend the Triangular Index with 12%/0.66/1%. The parameters RMSSD and pNN50 exhibit large relative errors in this cohort (17% and even 28%) and systematic biases in the range of 12%, but also low absolute estimation errors (4 ms and 2%). The visualization in Fig. 8 further suggests that patients with low and high RMSSD/pNN50 might very well be separated using OHR technology. Thus, we believe these parameters should be investigated further. • In the frequency domain, the relative errors for LF Abs. and HF Abs. were 16% and 19%, while the corresponding values for LF Log. and HF Log. were only 5% and 9%. We thus recommend the use of the logarithmic quantities, although one needs to analyze whether they possess the same discriminative power as the absolute ones. Again, Fig. 8 suggests that separation of patients would be possible. • In their current implementation, the relative quantities (LF/HF, PC SD 2 /SD 1 ) show strong relative errors (36% and 19%) and biases (− 30% and − 18%) and cannot be recommended without improving their estimation.
Finally, the patient cohort consisted predominantly of white caucasian subjects. Thus, the generalisability of results has to be validated with a more diverse group. Although we do not expect a fundamentally different outcome in terms of parameter feasibility, overall estimation errors may be higher in subjects with darker skin tones.

Conclusion
In conclusion, the accuracy of HRV parameters estimated from the PPG signal of wrist-worn OHR monitoring device varies significantly between parameters and subjects. The accuracy of certain parameters, for example SDNN and triangular index, shown in the literature to be associated with the deterioration of the status of the patients during recovery from surgical intervention, could be adequate for patient monitoring. On the other hand, the parameters more affected by the high-frequency content of the HRV and especially the LF/HF-ratio should be used with caution. It should also be emphasized that the proposed data analysis method tries to discard such segments of PPG signal that likely produce less reliable beat-to-beat interval data resulting the HRV parameters being obtained only for approximately 30% of the time. This may limit the usability of the approach in some applications. To further improve the applicability of wrist-worn OHR monitoring in patient surveillance through HRV, more robust methods for beat-to-beat interval estimation and especially methods for mitigating the effect of estimation uncertainty on the HRV parameter values, e.g. robust spectral estimation techniques, should be investigated. Each 'x' marks the mean over all windows where an estimation from PPG was available (Estimation) and the mean over all corresponding reference windows obtained from ECG (Ground Truth).