Introduction

The COVID-19 pandemic has highlighted the acute need for technology to support remote health care1,2. Consultancy McKinsey3 reported a 40-fold increase in the use of telehealth services and a 40% increase in consumer interest in virtual health solutions when compared to pre-COVID-19 statistics. To provide an example, the ability to estimate vital signs from sensors available in smartphones and wearable devices could have a significant impact on the effective management of diseases (e.g., COVID-19, hypertension, diabetes). Frequent measurement of physiological parameters can help in managing medication dosages and understanding the effects of lifestyle changes on health.

The estimation of vital signs traditionally relies on customized sensors that measure physical or chemical properties of the body. For example, digital sphygmomanometers use sensors to measure oscillations in the arteries to quantify blood pressure. Although accurate, such medical devices are far from ubiquitous, are often hard to access, and are uncomfortable to use for extended periods of time. An alternative approach, promoted by the field of ubiquitous computing, is to leverage sensors already present in everyday devices to estimate health parameters. For example, heart rate can be measured using a smartphone camera by analyzing subtle changes in skin color as the heart pumps blood around the body4,5. This technology is now available on billions of devices through Google Fit (https://www.google.com/fit/). Recent work has presented proof-of-concept measurement of oxygen saturation6, blood pressure7, and hemoglobin levels8 via smartphones.

Existing research work can be broadly divided into two categories: (1) approaches that are developed from first principles to imitate an established medical method for measurement or diagnosis9,10, and (2) approaches where input (sensor) data and corresponding gold-standard data are collected using a medical grade device and machine learning models are trained to discover a relationship between the input and output11,12. In this paper, we focus on the latter category. Although well-intentioned, such data-driven approaches ignore a principled analysis of whether the input data have the necessary information to predict the desired health measure. As a result, numerous human and compute hours are wasted in developing and training deep learning models for prediction tasks that may be ill-posed or not feasible.

We consider the task of predicting blood pressure (BP) non-invasively. Blood pressure is the pressure exerted on arterial walls as blood circulates through the body. It depends on multiple factors, including blood volume, blood viscosity, and the stiffness of blood vessels. Abnormally high or low blood pressure can result in heart attack, stroke, and diabetes13,14; it is therefore recommended to measure BP frequently.

The methods to measure blood pressure non-invasively can be broadly categorized into two approaches. (i) The pulse transit time (PTT) method15,16,17 is a popular, non-invasive technique for measuring blood pressure based on the time delay for a pressure wave to travel between proximal and distal arterial sites. The PTT approach has strong theoretical underpinnings based on the Bramwell-Hill equation18, which relates PTT to pulse wave velocity and arterial compliance. The Wesseling model captures the relationship between arterial compliance and blood pressure19. However, it is important to note that PTT can change independently of BP due to factors such as aging-induced arteriosclerosis and smooth muscle contraction. Hence, it needs to be recalibrated from time to time. (ii) Pulse Wave Analysis (PWA) estimates blood pressure by extracting features from an arterial waveform, typically a photoplethysmography (PPG) waveform. PPG is an optical signal obtained by illuminating the skin (common sites are the finger, earlobe, or toe20) with an LED and measuring the amount of transmitted, or reflected, light using a photodiode. PPG detects blood volume changes in the microvascular bed of tissue, as the blood volume directly impacts the amount of light transmitted/reflected. Unlike PTT, PWA has weaker theoretical underpinnings, as the small arteries interrogated by PPG are viscoelastic15. Calibration is invariably necessary for PWA methods to obtain reasonable results.
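For reference, the Bramwell-Hill relationship mentioned above can be written as follows (a standard form, reproduced here for context; the notation is ours). With blood density \(\rho\), vessel volume \(V\), pressure \(P\), and an arterial path of length \(L\):

\[ \text{PWV} = \frac{L}{\text{PTT}} = \sqrt{\frac{V}{\rho}\,\frac{dP}{dV}} \]

A stiffer (less compliant) artery has a larger \(dP/dV\), hence a higher pulse wave velocity (PWV) and a shorter PTT over the same path, which is what PTT-based BP estimation exploits.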

In this study, we concentrate on PWA measurement of BP. This method is attractive because it requires only a single sensor, making it a more accessible solution. Predicting BP by analyzing PPG waveforms is an active area of research7,21,22,23,24,25 and is already used in consumer products (https://www.samsung.com/global/galaxy/what-is/blood-pressure/). However, we should note that “while these methods (PTT and PWA) have been extensively studied and cuff-calibrated devices are now on the market, there is no compelling proof in the public domain indicating that they can accurately track intra-individual BP changes”20,26. Therefore, although the features extracted from the PPG signal correlate with blood pressure, the signal’s adequacy for accurately predicting blood pressure remains unclear.

The discrepancy between recent research27,28,29 claiming promising results on evaluation benchmarks for blood pressure, and other observational studies20,26 indicating a lack of a concrete theory for measuring blood pressure from PPG signals via PWA, raises important questions. To help resolve this apparent contradiction, we conduct a comprehensive examination of the existing PWA techniques in the literature (Table 1). Our analysis reveals that a significant portion of prior papers contain one or more of four common pitfalls: (a) Data Leakage: data samples from the same patient are present in both the train and test sets; (b) Overconstraining: data far from the normal range is discarded as outliers, which statistically simplifies the task; (c) Unreasonable Calibration: the calibration method is not tested over longer (e.g., > 1 day) time scales; and (d) Unrealistic Preprocessing: a significant portion of the dataset is filtered out as noisy. We analyze these pitfalls in detail in our results section.

Our analysis reveals a somewhat surprising lack of improvement (modulo the pitfalls above) in PPG-based blood pressure prediction. This is in contrast to the substantive improvements in non-invasive prediction of other vitals, such as heart rate, during this time. This raises the question of whether there is a ceiling on the achievable prediction accuracy. To answer this, we propose tools to examine whether an input sensor signal (x) (e.g., PPG) can be a good predictor of the output health label (y) (e.g., BP). We want to evaluate whether an underlying function f exists that captures the relationship between x and y, such that \(y=f(x)\). We also want to measure the conditioning of this underlying function, i.e., whether small changes in x lead to small or large changes in y. It is important to ensure that (minor) noise in the sensor measurement, which is inevitable in a real-world setting, does not lead to significant error in the outputs. Our tool is based on the information-theoretic notions of mutual information and multi-valued mappings. Using our proposed tool, we find that BP prediction from PPG has a high multi-valued mapping factor of 33.2% and low mutual information of 9.8%. In comparison, heart rate prediction from PPG, a well-established task, has a very low multi-valued mapping factor of 0.75% and high mutual information of 87.7%. This confirms that estimating BP from PPG is a challenging, ill-conditioned problem, and that a more principled approach is needed in the future for framing such health-measure prediction tasks.

Figure 1

When designing end-to-end machine learning models, researchers often use techniques such as: (A) providing the model with observations from similar patients, (B) constraining the task (e.g., limiting the distribution of labels), (C) calibrating models using data from a participant, or (D) preprocessing to filter out problematic samples (e.g., noisy inputs). When doing so, it can often be difficult to identify how these steps impact the integrity of a model.

Table 1 The table summarizes the limitations of previous research and indicates whether each study exhibits specific pitfalls. The pitfalls fall into four categories: (a) Data split: domain overlap (denoted D.O), data overlap (denoted C.O), or small test set (denoted S.T); (b) Over-constraining: SBP values and standard deviation (if provided); (c) Unrealistic pre-processing: percentage of the dataset remaining after pre-processing (if provided); (d) Calibration: correctly employed and justified for longer periods. The columns denote the presence or absence of each limitation: Y (yes) indicates that the study has the limitation, N (no) that it does not, U (unknown) that there is not enough information available, and “-” that the pitfall is not applicable to the study. For additional information, please see Section Review of the results and limitations of prior work.

Results

In this section, we present a systematic review of prior work predicting BP via PPG PWA (Figure 1), followed by a principled analysis using our proposed tools.

Review of the results and limitations of prior work

To motivate our work, we analyzed recent research21,22,23,27,29,34,48,49,50,51 that reported results predicting BP via PPG PWA (see Table 1). These works relied on the MIMIC52 dataset (Appendix C.1), which contains continuous PPG signals and the corresponding arterial BP values. They evaluated their performance against the AAMI53 and/or BHS54 standards (Appendix C.2). We found that they fell prey to some common pitfalls, which resulted in misleading claims and over-optimistic results. For simplicity, we focus on the prediction of systolic BP (SBP) rather than diastolic BP (DBP), as SBP has a wider statistical range.

Before we begin, we should note that not all work (e.g.,35,36,37,38,50) followed the AAMI/BHS standards accurately. For example, some reported results on a test set of fewer than 85 subjects. Moreover, although these works use the same MIMIC dataset, we found a lack of standardization in the train-test data splits and differing BP ranges used for evaluation (due to differences in how the data were filtered) across the literature27,29. In the absence of official source code, it was difficult to reproduce prior results and compare different methods. Hence, we trained our own reference deep learning model (Figure 2), similar to the methods presented in prior research27,34,49. The reference network takes a three-channel input consisting of the original PPG waveform along with its first and second derivatives, and outputs the predicted SBP value. The model consists of an eight-layer residual CNN55 with 1D convolutions and is trained using a mean squared error loss. We also explored 2D-convolution-based CNN models, such as DenseNet-16128 and ResNet-10155, taking a spectrogram of the 1D PPG signal27 and/or the raw waveform as input. Among these, we found that the 1D CNN architecture performed best.

Figure 2

Our reference network is used to evaluate the impact on performance of the issues described in Section "Review of the results and limitations of prior work". The network has 28M trainable parameters, takes a 3-channel input (PPG, VPG, APG), and outputs the SBP prediction. The model is optimized using a mean squared error loss.
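For concreteness, the sketch below shows a 1D residual CNN of this general shape in PyTorch. The channel width, kernel size, and pooling head are illustrative assumptions on our part; only the 3-channel input, the eight residual 1D-convolution layers, the scalar SBP output, and the MSE loss come from the description above.

```python
import torch
import torch.nn as nn

class ResBlock1D(nn.Module):
    """1D convolutional residual block: Conv-BN-ReLU-Conv-BN plus a skip connection."""
    def __init__(self, channels: int, kernel_size: int = 7):
        super().__init__()
        pad = kernel_size // 2
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm1d(channels),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.body(x) + x)

class SBPRegressor(nn.Module):
    """Maps a 3-channel window (PPG, VPG, APG) to a scalar SBP estimate."""
    def __init__(self, channels: int = 64, n_blocks: int = 8):
        super().__init__()
        self.stem = nn.Conv1d(3, channels, kernel_size=7, padding=3)
        self.blocks = nn.Sequential(*[ResBlock1D(channels) for _ in range(n_blocks)])
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(channels, 1)
        )

    def forward(self, x):  # x: (batch, 3, time)
        return self.head(self.blocks(self.stem(x)))

model = SBPRegressor()
loss_fn = nn.MSELoss()              # mean squared error, as in the reference network
x = torch.randn(8, 3, 1250)         # e.g., a 10-second window at 125 Hz
loss = loss_fn(model(x), torch.randn(8, 1))
```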

Figure 3

Every participant (P) has multiple data records (R), and each record is divided into multiple overlapping windows (W). Each window forms a data sample. In No-Overlap, the train and test data are split at the participant level, while in Domain-Overlap, the split happens at the record level, and in Data-Overlap, the split happens at the window level.

Data leakage

The goal of any machine learning model is to generalize well to test data that will be seen in real-world settings56. Even with a large training set, it is very unlikely that identical samples to those seen in the training set will appear at test time, thus generalization is crucial. Unfortunately, good performance on a training dataset does not always translate to good performance on a test set, as models can overfit. This is especially true for modern deep neural networks, which are highly over-parameterized and can easily memorize the training data57. Thus, evaluating test performance accurately is an important step in understanding how a model will function in the real world. For this, the test data needs to be pristine, i.e., without any contamination from the training data. Unfortunately, contamination can and does happen in several ways.

We observed two types of overlap between training and testing splits (Figure 1A): data-overlap and domain-overlap.

Data-overlap corresponds to direct overlap of signal segments between the train and test sets. Domain-overlap is more subtle: although no samples overlap directly, leakage may occur due to similarities between the train and test data. In our case, it corresponds to records from the same patient appearing in both the train and test sets (Figure 3).

Here, we consider a particular example from the literature, PPG2ABP21, in which the authors propose a U-Net based architecture to predict the ABP (arterial BP) waveform from PPG. They obtain impressive results on the SBP prediction task, with a bias of \(-1.19\) mmHg and error standard deviation (SD) of 8.01 mmHg (Table 2), close to the AAMI standard. (Note: there is an error in the computation of the standard deviation in the PPG2ABP21 evaluation script; we report the corrected results here.) However, while analyzing their source code, we found both data- and domain-overlap.

Data-Overlap: The PPG2ABP21 data processing pipeline divides each PPG record (\(\sim\)6 mins long) into 10-second windows with an overlap of 5 seconds (URL: github.com/nibtehaz/PPG2ABP/blob/master/codes/data_processing.py) (Figure 3). Using overlapping windows helps, as it increases the size of the training data. However, the problem arises when these 10-second samples are randomly split into train and test sets. Since the overlapping windows are generated before the random train-test split, the train and test sets can have samples with the same overlapping regions (Figure 3). A deep learning model can memorize values based on these overlapping portions, leading to artificially high accuracy on the test set.

Domain-Overlap: Due to the physiological differences between individuals, person-dependent models often outperform person-independent models58. For example, for the BP prediction task, a model can learn the normal range of an individual’s BP and leverage that to provide more accurate predictions. Since the knowledge of an individual’s identity can impact a model’s accuracy, it is important that the identity of the subject is not leaked (even implicitly) between test and train sets, especially while building person-independent models. Since the PPG signature has been shown to identify an individual59, the presence of PPG signals from the same individual in both train and test data can thus leak identity. This turns out to be the case in the PPG2ABP work21, as they randomly split PPG records into test and train sets, resulting in different windows from the same patient present in both test and train sets (Figure 3).
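To make the contrast concrete, the sketch below builds overlapping windows per record but partitions at the patient level, so no window or patient identity crosses the train-test boundary. The synthetic `records` list and the window/hop sizes are illustrative assumptions; PPG2ABP's actual pipeline instead splits after windowing, producing the overlaps described above.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def make_windows(signal, win=1250, hop=625):
    """10 s windows with 5 s overlap at 125 Hz (win and hop are in samples)."""
    return [signal[s:s + win] for s in range(0, len(signal) - win + 1, hop)]

# Synthetic stand-in for (patient_id, ppg_record, sbp_label) tuples.
rng = np.random.default_rng(0)
records = [(pid, rng.standard_normal(12_500), rng.uniform(90, 180))
           for pid in range(20) for _ in range(3)]   # 20 patients x 3 records

X, y, groups = [], [], []
for patient_id, ppg, sbp in records:
    for w in make_windows(ppg):
        X.append(w); y.append(sbp); groups.append(patient_id)
X, y, groups = np.array(X), np.array(y), np.array(groups)

# No-Overlap split: every window of a given patient lands on one side only.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups))
assert set(groups[train_idx]).isdisjoint(groups[test_idx])  # no identity leakage
```

Splitting on `groups` (patient identity) rules out both pitfalls at once: windows sharing samples stay on the same side, and no patient's PPG signature appears in both sets.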

To quantitatively evaluate the impact of data leakage, we compare the performance of the PPG2ABP network on three splits (Figure 3): (1) No-Overlap: the dataset is partitioned at the patient level with an 80-20% train-test split; (2) Domain-Overlap: each patient has multiple records (\(\sim\)6 mins long), and these records are randomly split 80-20% between the train and test sets, i.e., records from the same patient can be present in both; and (3) Data-Overlap: we use the split provided by PPG2ABP21, which divides the records into overlapping windows followed by an 80-20% train-test split. All splits consist of 10-second windows with an overlap of 5 seconds, to maintain consistency with the split proposed in PPG2ABP. Table 2 shows the performance of the PPG2ABP network over the three splits. Domain-Overlap significantly inflates the apparent accuracy of the PPG2ABP network, reducing the error standard deviation from 23.1 to 16.2 mmHg; Data-Overlap further reduces it to 8.01 mmHg. This analysis clearly shows that leakage, however subtle, can lead to seemingly high but artificial improvements. Note that for all analyses in the rest of this paper, we use the No-Overlap split.

Table 2 Performance of PPG2ABP21 on different test-train splits with varying degrees of dataset overlap. Even subtle leakages can result in large (but artificial) accuracy improvements.

Overconstraining the task

Health-related data typically follow non-uniform, approximately Gaussian distributions, with the highest data density near the “normal” (healthy) range and density falling off exponentially away from it. We observe a similar trend for BP data in both the Aurora-BP60 (Appendix C.1) and MIMIC datasets (see Figure 4). While points far from normal are rare, they often correspond to crucial events (abnormally low or high BP) indicating serious health issues that require medical attention.

Figure 4

The distribution of systolic BP values in the: (left) Aurora-BP dataset and (right) MIMIC dataset. In the MIMIC dataset, the SBP values lie in the range 65–200 mmHg; however, prior works ignore samples with SBP values outside the range of 75–165 mmHg.

However, we found that researchers often discard so-called “outliers”22,27,29 (Figure 1B), arguing that such samples are unlikely or have occurred due to noise in the data collection process. For example, the MIMIC dataset has SBP values ranging between 65 and 200 mmHg (75–220 mmHg in Aurora-BP), but Schlesinger et al.27 ignored samples outside the range of 75–165 mmHg, referring to the discarded values as “improbable”. Similarly, Cao et al.22 and Hill et al.29 use a constrained range of 75–150 mmHg, even though, according to British Hypertension Society literature, 140–159 mmHg is Grade-1 (mild) hypertension, 160–179 mmHg is Grade-2 (moderate) hypertension, and \(\ge\)180 mmHg is Grade-3 (severe) hypertension54.

Table 3 Performance of the reference network on different SBP ranges on the MIMIC dataset. Constraining the data range can result in significant (but artificial) accuracy improvements.

Constraining the data range has two problems. First, it leads to an incomplete evaluation, as the model is neither trained nor tested on samples from the discarded ranges. Second, since the statistical range of the output is reduced, the prediction task becomes artificially “easier” (i.e., a lower error can be achieved more easily), which may produce promising but misleading results. To quantitatively study the impact of constraining data ranges, we conducted an experiment using our reference network with different filterings of the data range. Table 3 shows the performance of our network when trained with three different SBP ranges: 65–200, 75–165 and 75–150 mmHg. Even small restrictions in the output range can lead to a significant (perceived) improvement in accuracy; e.g., reducing the SBP upper limit from 165 to 150 mmHg results in an \(\sim\)11.4% improvement in the standard deviation. This is because samples at the extremes often incur the highest prediction errors: models tend to predict values close to the mean of the distribution, making predictions on samples with very high or low ground-truth BP the least accurate.
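This effect can be reproduced without any learned model. The sketch below uses an illustrative Gaussian SBP distribution (not the actual MIMIC one) and a trivial predict-the-mean baseline: truncating the label range alone shrinks the baseline's error SD.

```python
import numpy as np

rng = np.random.default_rng(0)
sbp = rng.normal(125, 20, 100_000)   # illustrative SBP distribution (mmHg)

for lo, hi in [(65, 200), (75, 165), (75, 150)]:
    kept = sbp[(sbp >= lo) & (sbp <= hi)]
    # Always predicting the mean gives an error SD equal to the SD of the kept labels.
    print(f"range {lo}-{hi} mmHg: mean-predictor error SD = {kept.std():.1f} mmHg, "
          f"{100 * len(kept) / len(sbp):.1f}% of samples kept")
```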

Excluding samples with SBP measurements above 165 mmHg or below 75 mmHg during the training of machine learning models may overlook crucial physiological features, potentially concealing serious health conditions and introducing bias into the model. This practice not only limits the scope of the developed models but also hinders conclusions about their generalizability and real-world applicability, as they become less representative of the diverse patient populations they are intended to serve.

Unreasonable calibration

The relationships between health measures (e.g., PPG and BP) are often person-dependent. For example, blood pressure (bp) depends on the patient’s heart rate (hr), blood viscosity (visc), stiffness of blood vessels (stif), etc., i.e., \(bp = f(hr, visc, stif, ...)\). While the PPG signal might capture heart rate well, it may not be able to capture viscosity- and stiffness-related information. To address this, it is common to propose a calibration step, wherein a few PPG samples from each patient, along with gold-standard BP values, are used to calibrate the function f for that patient (Figure 1C). The model then learns a calibrated function, \({\hat{f}}\), for a specific patient, i.e., \(bp={\hat{f}}(hr)\), where the patient-specific parameters (visc, stif, ...) are folded into \({\hat{f}}\).

The literature does not offer a universally effective calibration strategy. Cao et al.’s22 method must be calibrated before every BP prediction to find the optimal fit of the watch on the wrist, while Schlesinger et al.’s27 model needs to be calibrated once to find the offset between the model’s prediction and the true value. As blood pressure may not change drastically within minutes (at rest), and significant trends might be observed only over the course of a few months owing to lifestyle changes or the influence of medication61, it becomes important to ask: How frequently is re-calibration needed? Is the calibration approach robust to changes in other environmental factors? We believe that the calibration approaches reported in prior work risk over-fitting by memorizing patient-level local temporal characteristics, and that their evaluation is incomplete given that they do not evaluate BP prediction over longer time scales.

To understand the influence of calibration, we evaluate the prediction performance under different calibration strategies. Naïve Calibration simply predicts a constant calibrated value for the entire record; the constant is computed as the mean of the ground-truth values of the first three windows of a record. Offset Calibration uses our reference network but adds an offset to the predicted value; the offset is computed in the calibration step as the difference between the predicted and ground-truth BP of the test record’s first window. We found Naïve Calibration to perform very well (Table 4), with a standard deviation of 8.61 mmHg, close to the AAMI standard. However, predicting a constant BP value for a patient is clearly incorrect. This inconsistency underscores problems with the evaluation methodology. Since typical records in MIMIC span short time intervals (average length = 6 minutes) compared to the time scales at which BP changes, predicting a constant value gives deceptively good accuracy. An appropriate evaluation of calibration methods should consider time scales spanning the intended re-calibration duration. For example, if re-calibration is planned every six months, the method should be evaluated on patients tracked over at least a six-month period. To demonstrate that calibration systems can quickly deteriorate over time, we analyzed the performance of Offset Calibration as the time from the calibration window increases. Although the method performs well for the first few days, the error rates increase dramatically after that (Figure 5A).
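The two strategies can be summarized in a few lines; this is a minimal sketch with simplified window handling (arrays of per-window SBP values), not the exact evaluation code.

```python
import numpy as np

def naive_calibration(sbp_true):
    """Predict one constant for the whole record: the mean SBP of its first three windows."""
    return np.full(len(sbp_true), np.mean(sbp_true[:3]))

def offset_calibration(model_preds, sbp_true):
    """Shift the reference network's predictions by its error on the record's first window."""
    offset = sbp_true[0] - model_preds[0]
    return model_preds + offset
```

Note that Naïve Calibration never looks at the PPG signal at all, which is precisely why its strong numbers in Table 4 expose the weakness of short-record evaluation.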

Table 4 Performance of different calibration methods on the MIMIC dataset. The incorrect Naïve Calibration method performs very well, underscoring problems with the evaluation methodology.

Unrealistic preprocessing or filtering

The MIMIC dataset comprises data from ICU patients, with artifacts due to patient movement, sensor degradation, transmission errors from bedside monitors, and human errors in post-processing data alignment. The impact of these artifacts is visible in both the PPG and ABP waveforms as missing data, noisy data, and sudden changes in amplitude and frequency (Figure 6). To clean the signal, researchers27,29 have used band-pass filters to remove noise in the high-frequency (\(\ge\)16 Hz) and low-frequency (\(\le\)0.5 Hz) ranges, followed by auto-correlation to filter out signals that are not strongly correlated with themselves. The auto-correlation step removes samples with uneven amplitude and/or frequency. After cleaning the MIMIC dataset (Figure 1D), Schlesinger et al.27 used less than 5% of the total data for training their neural network, while Hill et al.29 and Slapnicar et al.34 used less than 10% of the total MIMIC data. This suggests that “clean” data is rare. Although filtering a dataset to remove some noise is often an essential step in training a machine learning model56, excessive filtering can result in overfitting. Models trained on such clean data might achieve high performance on a clean test set; however, they might fail in practice, as it is difficult to obtain such clean signals in a real-world scenario.
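A typical pipeline of this kind might look as follows; the sampling rate, filter order, and the use of a lag-of-one-cardiac-period autocorrelation as the quality score are our assumptions, not the exact settings of the cited works.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 125  # Hz, typical MIMIC waveform sampling rate (assumed)

def bandpass(ppg, low=0.5, high=16.0, fs=FS, order=4):
    """Remove baseline wander (<0.5 Hz) and high-frequency noise (>16 Hz)."""
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, ppg)

def autocorr_quality(ppg, lag):
    """Normalized autocorrelation at roughly one cardiac period; near 1 for clean PPG."""
    x = ppg - ppg.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

def keep_window(ppg, lag, threshold=0.8):
    """Keep only windows whose filtered signal is strongly self-correlated."""
    return autocorr_quality(bandpass(ppg), lag) >= threshold
```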

Figure 5

(A) The offset calibration method’s performance falls off quickly after the first few days. (B) Performance of our reference network with different auto-correlation thresholds on the MIMIC dataset.

Figure 6

Examples of poor-quality photoplethysmography signals from the MIMIC dataset.

To understand the impact of filtering on a dataset, we measure the performance of our reference network at different auto-correlation thresholds. Figure 5(B) plots the performance of our reference network in predicting SBP, together with the percentage of data filtered out, at each auto-correlation threshold. As we increase the auto-correlation threshold from 0 to 0.8, the performance of the network improves by 29.7% while the dataset size decreases by 63%.

Our proposed principled approach

We propose and utilize two tools, based on multi-valued mappings and on mutual information (Appendix B), to estimate whether an input signal is a good predictor of the output. Using these tools, we performed a principled analysis of the relationship between PPG and BP. For comparison, we also applied them to heart rate (HR) and reflected wave arrival time (RWAT) estimation, for which the PPG signal is known to be a strong predictor.

Checking for Multi-Valued Mappings: We use Algorithm 1 to find multi-valued mappings, i.e., data samples that are close in the input space but distant in the output space. As discussed in Section B.1, to compute the distance between two PPG inputs, we first align them using cross-correlation and then compute their Euclidean distance. We divide the dataset records into non-overlapping two-second windows and treat them as individual inputs. We set an input distance threshold of 1.0, which corresponds to a per-time-sample threshold of \(4\times 10^{-3}\) (each 2 s PPG window has 250 samples). For the output, we set thresholds of 8 mmHg, 8 bpm, and 0.02 s for the BP, HR and RWAT prediction tasks, respectively. We found very few multi-valued mappings for the HR and RWAT tasks, but a large number for the SBP task (Table 5). In the MIMIC dataset, for 33.2% of the 2-second windows, we found another window from the same patient that was close in the input PPG space but had a significantly different SBP output. When limiting the search to different patients, we could still find such matches for 15.0% of the windows. This implies that the task of predicting BP from PPG is ill-conditioned. Figure 7 shows examples of such multi-valued mappings, with highly similar input PPG waveforms but significantly different output arterial BP waveforms. In comparison, for the HR and RWAT tasks, the number of such matches is much smaller, at 0.02% and 0.08% intra-patient, respectively, suggesting much better conditioning.
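The sketch below captures the core of this search: align two windows at the cross-correlation peak, compare them with a Euclidean distance, and flag pairs that are close in input space but far apart in SBP. It is a brute-force O(n²) scan with a circular shift for alignment, a simplification of Algorithm 1 rather than a reproduction of it.

```python
import numpy as np

def aligned_distance(p, q):
    """Align q to p at the cross-correlation peak, then return the L2 distance."""
    lag = np.argmax(np.correlate(p - p.mean(), q - q.mean(), mode="full")) - (len(q) - 1)
    return np.linalg.norm(p - np.roll(q, lag))   # circular shift: a simplification

def multivalued_fraction(windows, labels, d_in=1.0, d_out=8.0):
    """Fraction of windows with a near-identical input whose SBP differs by >= d_out mmHg."""
    flagged = 0
    for i, (w_i, y_i) in enumerate(zip(windows, labels)):
        for j, (w_j, y_j) in enumerate(zip(windows, labels)):
            if i != j and abs(y_i - y_j) >= d_out and aligned_distance(w_i, w_j) <= d_in:
                flagged += 1
                break
    return flagged / len(windows)
```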

Table 5 Multi-valued mapping matches for the BP, HR and RWAT prediction tasks. For the BP task, the match rate was high both within the same patient's records and across patients, suggesting an ill-conditioned problem. For the HR and RWAT tasks, the match rates were much lower. Ground truth for RWAT is only available for the Aurora-BP dataset.

When searching for multi-valued mappings, it is essential to consider the specificity of the sensors and the preprocessing methodologies applied to the input data. Our analysis focuses on intra-patient and inter-patient multi-valued mappings within each dataset (MIMIC and Aurora-BP) rather than across datasets. This ensures that our findings are not confounded by variations in sensor quality or the nuances of measurement techniques. Additionally, it enables us to apply preprocessing steps that preserve amplitude information.

Figure 7

Multi-valued mappings. Examples of PPG waveforms (PPG\(_{i}\) and PPG\(_{j}\)) that are very similar and have corresponding arterial blood pressure waveforms (ABP\(_{i}\) and ABP\(_{j}\)) that are quite different. This highlights the existence of similar features that map to different targets, which makes the task of blood pressure prediction via PPG pulse wave analysis ill-conditioned.

Evaluating Mutual Information: To estimate the mutual information (MI) between the PPG signal and the target output (BP/HR/RWAT), we use the k-nearest-neighbour approach proposed by Kraskov et al.62. We leverage dimensionality reduction to make MI estimation tractable, using handcrafted and auto-encoder-learned feature representations. We report the mutual information between the input features and the target variable, as well as the entropy of the target variable. Note that the target variable’s entropy is the maximum achievable mutual information. Thus, the ratio of the MI to the target variable’s entropy represents the fraction of the target information encoded by the input, which we call the Info-Fraction. We found the Info-Fraction to be a more intuitive measure than absolute MI values, and we use it to compare the predictive power of PPG across the different tasks.
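For reference, a compact sketch of both estimators follows: estimator #1 of Kraskov et al.62 for the MI, and the companion Kozachenko-Leonenko k-NN estimate for the target entropy, whose ratio gives the Info-Fraction. The placeholder data and the tie-breaking jitter are our additions; the actual feature matrices are described in the surrounding text.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mi(X, y, k=3):
    """Kraskov et al. estimator #1 for I(X; y) in nats (max-norm, k-NN based)."""
    y = y.reshape(-1, 1)
    n = len(y)
    joint = np.hstack([X, y])
    # Distance to the k-th neighbour in the joint space (k+1 because the query
    # set equals the data set, so the nearest "neighbour" is the point itself).
    eps = cKDTree(joint).query(joint, k=k + 1, p=np.inf)[0][:, -1]
    # Neighbour counts strictly inside eps in each marginal space, minus self.
    nx = cKDTree(X).query_ball_point(X, eps - 1e-12, p=np.inf, return_length=True) - 1
    ny = cKDTree(y).query_ball_point(y, eps - 1e-12, p=np.inf, return_length=True) - 1
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))

def kl_entropy(y, k=3):
    """Kozachenko-Leonenko k-NN entropy estimate (nats) for a scalar target."""
    y = y.reshape(-1, 1)
    n = len(y)
    dist = cKDTree(y).query(y, k=k + 1)[0][:, -1]   # distance to the k-th neighbour
    return -digamma(k) + digamma(n) + np.mean(np.log(2.0 * dist))

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5))                 # placeholder feature matrix
y = X[:, 0] + 0.5 * rng.standard_normal(2000)      # placeholder target
y += 1e-10 * rng.standard_normal(2000)             # tiny jitter to break ties
print("Info-Fraction:", ksg_mi(X, y) / kl_entropy(y))
```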

Table 6 Descriptions of the handcrafted features used for the Mutual Information analyses.

Handcrafted Features: As suggested by Takazawa63 and Elgendi et al.64, we calculate handcrafted features (see Table 6) from the PPG waveform (Figure 8). Due to the absence of a time-aligned ECG waveform in the MIMIC dataset, we extracted the relevant handcrafted features only from the PPG waveform. Table 7 presents the MI of these individual features with respect to the BP prediction task for both the MIMIC and Aurora-BP datasets, along with the MI when all features are combined and treated as a single multi-dimensional input. We found that even the combined feature set encodes only a small fraction of the total target entropy. For example, in the MIMIC dataset, the combined features’ Info-Fraction is just 9.5%, while heart rate alone contributes an Info-Fraction of 4.1%. Similar observations hold for the Aurora-BP dataset. This hints that the PPG signal does not carry enough information to predict BP in this dataset, and moreover that the prediction is highly dependent on heart rate.

For the Aurora-BP dataset, we have the demographic data (age, weight, height) of the subjects, as well as time-aligned PPG and ECG waveforms. This allows us to calculate additional features, e.g., radial pulse arrival time (rPAT) and other derived features60. Prior work7 has used PAT to estimate blood pressure. Moreover, the Aurora-BP dataset has multiple readings for each subject in different positions (e.g., sitting, at rest, and supine), which lets us add delta features reflecting the difference between features in two conditions. Despite this, we found the entropy results for the Aurora-BP dataset to be similar to those for the MIMIC dataset, with the handcrafted features able to capture only 9.8% of the entropy of blood pressure (Table 8). On the other hand, for the HR and RWAT prediction tasks, the handcrafted features captured 87.7% and 64.6% of the entropy, respectively (ground truth for HR is derived from the ECG sensor data and for RWAT from the tonometric sensor data). This further strengthens our finding that the PPG signal, even with additional information from the ECG waveform, carries limited information for predicting BP.

Figure 8

A visual description of the hand-crafted features calculated from the PPG and ECG waveforms. The systolic ramp (\(\frac{dp}{dt}\)) is defined as \(\frac{y_{2}-y_{1}}{t_{2}-t_{1}}\).

Auto-encoder Features: As an alternative to handcrafted features, we train an auto-encoder on the raw PPG waveform to obtain a set of low-dimensional features. We use a five-layer perceptron (MLP) auto-encoder with ReLU activations and a bottleneck layer of 20 neurons. The model was trained with the Adam optimizer (learning rate 0.001) and a mean squared error loss, stopping when the loss saturated below 0.1. Training time on a single NVIDIA P100 was under an hour. Table 9 shows the MI of the combined bottleneck features with respect to the BP, HR and RWAT prediction tasks. Although the auto-encoder features are more comprehensive and have higher MI than the hand-crafted features, the Info-Fraction for BP prediction (12.9% for MIMIC and 8.7% for Aurora-BP) is still much lower than that for the HR (92.2% for MIMIC and 93.1% for Aurora-BP) and RWAT (70.1% for Aurora-BP) prediction tasks.
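A minimal PyTorch sketch of such an auto-encoder follows. The 250-sample window length and the layer widths are illustrative assumptions; the 20-neuron bottleneck, ReLU activations, Adam optimizer (lr = 0.001), and MSE loss follow the description above.

```python
import torch
import torch.nn as nn

WIN = 250  # samples per PPG window (illustrative)

class PPGAutoEncoder(nn.Module):
    """MLP auto-encoder; the 20-unit bottleneck supplies the features for the MI analysis."""
    def __init__(self, win: int = WIN, bottleneck: int = 20):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(win, 128), nn.ReLU(),
            nn.Linear(128, bottleneck), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 128), nn.ReLU(),
            nn.Linear(128, win),
        )

    def forward(self, x):
        z = self.encoder(x)          # bottleneck features
        return self.decoder(z), z

model = PPGAutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(32, WIN)             # a batch of PPG windows (placeholder data)
recon, features = model(x)           # `features` feed the mutual-information analysis
loss = loss_fn(recon, x)
opt.zero_grad(); loss.backward(); opt.step()
```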

There are two possible implications of these findings. First, they may suggest that the PPG signal lacks adequate information for accurate BP prediction. Alternatively, they could imply a limitation of current sensor technology, which leaves sensors susceptible to confounding factors such as external noise and environmental variations, thereby hindering the accuracy of BP prediction.

Conclusion

Our results reveal that BP prediction via pulse wave analysis of the PPG signal is still an unsolved task, far from meeting the AAMI and BHS standards. Through a systematic review and accompanying experiments, we found several issues overlooked in prior work that have led to seemingly over-optimistic results. These pitfalls can be categorized as: data splits that leak information from test samples into the training set; heavy constraints on the task that remove challenging samples and substantially reduce the range of target values; calibration methods that are practically problematic; and unreasonable preprocessing that filters the data to an unrealistic extent, such that any noise is unacceptable. These pitfalls simplify the machine learning task, creating a deceptive perception of ease in model training and resulting in inflated performance. Ultimately, this yields models that overfit the training data, hindering their ability to generalize and handle real-world data variations.

While research on non-invasive approaches to estimating health vitals such as heart rate and blood oxygen saturation has made tremendous progress, enabling these technologies to become ubiquitous in the last decade, progress in non-invasive cuffless BP estimation has been slow despite similar research interest. This prompted us to question whether the problem itself is ill-conditioned and whether the PPG signal contains enough information to predict BP in the first place. To answer these questions, we have proposed a set of tools based on multi-valued mappings and mutual information to check whether an input signal is a good predictor of the desired output. The multi-valued mapping checker finds samples that are close in input space but far apart in output space. We found many such samples in both the MIMIC and Aurora-BP datasets. Searching for multi-valued mappings was straightforward once an appropriate distance metric and thresholds were defined; qualitative and quantitative results show that nearly identical PPG waveforms can have very different BP waveforms. Next, we examined the entropy of the features by computing mutual information. MI was extremely low for both the hand-crafted and the learned auto-encoder features. In comparison, the heart rate and RWAT prediction tasks from PPG PWA have much lower multi-valued mapping factors and much higher mutual information, indicating that they are relatively well-conditioned compared to predicting BP from PPG PWA. We believe these tools are relevant for feasibility analysis in similar tasks involving wearable data, such as predicting stress levels from PPG65,66,67 and estimating blood glucose levels from PPG68,69,70.

Our study does not aim to prove that blood pressure estimation from PPG PWA is impossible; however, it indicates that the task is very challenging, and evaluating performance fairly is non-trivial. To navigate this complexity, we present a set of tools that future research can leverage to avoid the pitfalls identified here. We hope our work can serve as a milestone and stimulate further discussion and exploration in the following areas: (1) Data Diversity: Collecting comprehensive datasets that represent subjects from diverse demographics and cardiovascular physiologies. (2) Multiple modalities: Exploring the integration of PPG with other physiological signals holds immense potential for enhancing prediction accuracy and providing a more holistic view of cardiovascular health. (3) Improved Sensors: Advancements in sensor technology are crucial to capture higher-fidelity PPG data with minimal external noise and environmental variables. We believe that focusing on these critical areas will lead to generalizable and scalable solutions, empowering a future where everyone can benefit from the accessibility and convenience of non-invasive cuffless BP estimation.