Learning fine-grained estimation of physiological states from coarse-grained labels by distribution restoration

Due to its importance in clinical science, the estimation of physiological states (e.g., the severity of pathological tremor) has attracted growing interest in the machine learning community. While a physiological state is a continuous variable, its continuity is lost when the state is quantized into a few discrete classes during recording and labeling. This discreteness introduces a misalignment between the true value and its label, meaning that these labels are unfortunately imprecise and coarse-grained. Most previous work did not account for this imprecision and directly used the coarse labels to train machine learning algorithms, whose predictions are consequently also coarse-grained. In this work, we propose to learn a precise, fine-grained estimation of physiological states from these coarse-grained ground truths. Built on a mathematically rigorous proof, we use the imprecise labels to restore the probability distribution of the precise labels in an approximately order-preserving fashion; a deep neural network then learns from this distribution and offers fine-grained estimates. We demonstrate the effectiveness of our approach in assessing pathological tremor in Parkinson's Disease and estimating systolic blood pressure from bioelectrical signals.

by real-valued continuous variables instead of discrete classes. One may argue that we can define more and finer classes to solve the problem, so that regression can be used instead of classification. However, obtaining precise ground truth is extremely difficult: in most cases, expert annotators label physiological states based on subjective clinical observation rather than on a ruler with precise scales.
In this work, we are interested in developing a general machine learning method to learn fine-grained estimation of physiological states from coarse labels, which has notable benefits in exploring the rich details of physiological rhythms. Figure 1a illustrates the high-level framework, where we aim to build an estimator that takes a bioelectrical signal as input and predicts the fine-grained physiological state as a real-valued continuous variable. The smooth color gradients indicate that physiological states should change continuously. Figure 1b illustrates the classification labels that quantize the continuous physiological states (see Fig. 1c) into coarse-grained discrete classes. Developing machine learning algorithms that learn from coarse labels to make fine-grained predictions is difficult, because the information the algorithm aims to learn is not explicitly provided by the labels in the training set. In previous literature, the research most relevant to ours uses Gaussian-process-based approaches to learn fine-grained estimates from aggregate outputs 17,18 . These approaches assume that the aggregate output or group statistic (e.g., the average fine-grained label) of a bag of inputs is known. Nevertheless, in our scenario, what we have is only the coarse label of each input, rather than an aggregation (e.g., the average) of fine-grained labels. Therefore, these Gaussian-process-based methods target a task intrinsically different from ours and cannot be used to directly solve our task.
Our approach is based on a mathematically rigorous theorem that we propose, which proves that this task is achievable if the predictions of the machine learning algorithm satisfy two conditions. Based on this understanding, we propose distribution restoration and ordinal-scale learning to train the machine learning model so that the two conditions are approximately satisfied. The proposed method is easy to implement using simple loss functions, yet effective across diverse tasks. More details are given in the "Method" section. Extensive experiments have been conducted on tremor severity estimation, parameter estimation on synthetic signals and systolic blood pressure estimation. The quantitative results demonstrate the superior performance of our method in learning fine-grained estimates of physiological states from coarse-grained labels.

Remark 1
Physiological states are considered continuous variables in this article. Typical examples include the severity of a disease, blood pressure and the frequency of heartbeats. In some situations the states are discrete, but they still have continuous measurements. For instance, heartbeats are discrete, but the frequency of heartbeats is continuous. By choosing a suitable definition of the physiological states, we can ensure that the states are viewed as continuous in these scenarios.

Results
We conduct comprehensive experiments on three specific tasks: (1) tremor severity assessment for Parkinson's Disease (PD) using surface electromyography (sEMG), (2) parameter estimation for synthetic sEMG signals and (3) systolic blood pressure estimation using photoplethysmography (PPG) and electrocardiogram (ECG) signals. To the best of our knowledge, this work represents a pioneering attempt at fine-grained assessment of physiological states learned from coarse labels.

Tremor severity estimation.
Tremor is a typical movement disorder affecting the limbs of patients with Parkinson's Disease. According to the universally accepted MDS-UPDRS 19 , tremor severity is divided into 5 escalating levels (normal, slight, mild, moderate, severe), which we represent with the integers {1, 2, 3, 4, 5} respectively. The main evaluation is done on the PD-sEMG 9 dataset containing 10K sequences of single-channel sEMG collected from the upper limbs of 147 individuals at a sampling rate of 1 kHz. Figure 2 visualizes typical samples from the dataset. Each sample was annotated by multiple experts independently, and a sample was discarded if the annotations disagreed, leading to generally unbiased labeling. In this experiment, the physiological state refers to the tremor severity.
Comparison to state-of-the-art methods. We cannot directly compare our approach with existing methods [10][11][12][13][20][21][22][23] , which focus on multi-class classification instead of fine-grained regression. For comparison, we round the predicted real-valued tremor severity to the nearest integer. We present the recall and precision for each class and the average accuracy in Table 1. Results are reported on the test set corresponding to 9 . FE and SL are short for feature engineering and similarity learning, respectively. The proposed approach outperforms the others under various evaluation metrics.
Table 1. Comparison to state-of-the-art methods on the PD-sEMG dataset. The best performance in each column is marked in bold.

Fine-grained estimation of tremor severity. Our ultimate goal is not to build a new multi-class classifier. Despite the classification performance shown in Table 1, we cannot directly validate the correctness of the decimal part of our outputs because fine-grained ground truth is absent. However, it is still possible to estimate a lower bound on performance, which occurs when the model predicts an instance lying on the boundary between neighbouring classes. We implement this idea by assuming that the ground truth of class i is missing: we train the model using input signals of classes {. . . , i − 1, i + 1, . . .} and test it on those of class i, which lies on the boundary between classes i − 1 and i + 1 because of the ordinal nature of the classes. The evaluation results are shown in Table 2. Results are reported on the test set corresponding to 9 . The precision is calculated by rounding the network output to the nearest integer to obtain the classification result, then counting the classification precision for class i. The pseudo MAE refers to the average of |ĉ_i − c_i^c|, where ĉ_i is the prediction and c_i^c is the coarse label regarded as pseudo ground truth. This experiment indirectly verifies the fine-grained output ĉ_i. The slack L1 loss is always applied in training, and according to the table the two distribution losses both contribute to the performance. Even though the labels for class i are not provided, the model still manages to predict the unseen instances of this class. Likewise, if we train the model using all the data of classes {1, 2, 3, 4, 5}, the decimal part of the prediction ĉ is expected to match the inaccessible fine-grained ground truth.
Feature space interpretation. In Fig. 1d, we pointed out that a typical classification model tends to build boundaries separating the features of different classes. This intuition is supported by Fig. 3b, where the samples are grouped around class centers, leaving large gaps among those centers. Because intra-class separability is weakened in this setting, the model learns only a coarse estimate of the physiological state. In contrast, with the distribution loss, the distance in feature space corresponds well to the distance on the clinimetric scale, as shown in Fig. 3a. The boundaries are almost eliminated, reflecting the fact that physiological states generally vary continuously rather than being separated by sharp boundaries.
Parameter estimation on synthetic signals. A disadvantage of the tremor severity experiment is that fine-grained ground truth is unavailable, so we can only indirectly examine the effectiveness of our approach. In this subsection we perform a new experiment using synthesized data whose fine-grained ground truth is accessible, allowing us to directly compare the network outputs with the ground truth. Based on previous work 25 , we synthesized sEMG signals with controllable parameters, which are regarded as the fine-grained ground truth to be learned. A previous study 26 revealed that the wavelength of sEMG signals is correlated with physiological states such as the frequency of pathological tremor. We therefore choose the wavelength as the parameter to be estimated. Within a sequence of synthetic signal (i.e., a training or testing example), the wavelength is set to a constant floating point number used to generate the signal. 90K training sequences and 10K testing sequences are generated with wavelengths uniformly distributed in [150, 250]. During training, we evenly divide this range into 5 intervals; the coarse label of the sequences in an interval is set to the midpoint of the interval. In testing, the network is expected to predict the fine-grained label of each testing sequence.

Table 2. Fine-grained estimation on the PD-sEMG dataset. The best performance in each column is marked in bold.

Evaluation results. We consider four baseline methods: the commonly used L1 regression (L1) and L2 regression (L2), as well as immediate-threshold regression (IT) 27 and all-threshold ordinal regression (AT) 27 . We evaluate the mean absolute error (MAE), root mean square error (RMSE) and the normalized inversion. Denote the fine-grained labels of two sequences as c_i and c_j, and the predicted labels as ĉ_i and ĉ_j.
An inversion refers to the case where c_i < c_j and ĉ_i > ĉ_j, or c_i > c_j and ĉ_i < ĉ_j. The normalized inversion is the number of inversions divided by the maximum possible number of inversions, C(n, 2) = n(n − 1)/2, where n is the number of examples in the testing set. As shown in the first row of Fig. 4, our method has the smallest fine-grained estimation error, exhibiting superior performance over the compared methods.
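The normalized inversion metric defined above can be sketched as follows (a straightforward O(n²) implementation; the function name is illustrative):

```python
import numpy as np

def normalized_inversion(c_true, c_pred):
    """Fraction of discordant pairs: pairs ordered one way by the
    fine-grained labels and the opposite way by the predictions,
    divided by the maximum possible number of pairs n(n-1)/2."""
    c_true = np.asarray(c_true, dtype=float)
    c_pred = np.asarray(c_pred, dtype=float)
    n = len(c_true)
    inversions = 0
    for i in range(n):
        for j in range(i + 1, n):
            # A pair is an inversion when the two orderings disagree.
            if (c_true[i] - c_true[j]) * (c_pred[i] - c_pred[j]) < 0:
                inversions += 1
    return inversions / (n * (n - 1) / 2)
```

A perfectly order-preserving prediction scores 0, and a fully reversed one scores 1.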

Systolic blood pressure estimation. Here we perform an experiment on real data where the fine-grained ground truth is known, instead of synthetic data. The task is to regress blood pressure from bioelectrical signals. We adopt the cuff-less blood pressure estimation dataset 28 , in which photoplethysmography (PPG), electrocardiography (ECG) and the corresponding arterial blood pressure (ABP) were collected from at least 441 patients. Sampled data are visualized in Fig. 5. The network predicts the maximum systolic blood pressure (SBP) within a segment from the PPG and ECG signals. The fine-grained SBP ground truth is divided into 5 equally spaced coarse classes for training the deep neural network using the proposed framework. The baseline methods and evaluation criteria are the same as in the experiment on synthetic signals.
Evaluation results. As shown in the second row of Fig. 4, predictions made by the proposed method are the closest to the fine-grained ground truth and have the smallest estimation error. Our predictions also have the lowest normalized inversion. While a typical systolic blood pressure value lies between 80 and 120 mmHg, 92.97% of our prediction errors are less than 5 mmHg; with the all-threshold ordinal regression approach this number decreases to 76.56%.

Conclusions
In this article we propose a machine learning approach for predicting fine-grained physiological states when only coarse-grained labels are given in the training data. Different from previous methods that aim to classify the physiological states into discrete classes, our method offers continuous and fine-grained estimates that are informative of even the slightest changes. Learning fine-grained predictions from coarse labels is intrinsically challenging due to the lack of supervision. Starting from a mathematically rigorous proof, we show that this challenge can be solved by (1) restoring the continuous probability distribution of the fine-grained labels and (2) preserving the order of the fine-grained predictions. We then propose a set of simple yet effective loss functions that enable the network outputs to approximately satisfy both conditions. The fine-grained estimation of physiological states is potentially useful in a wide range of applications such as monitoring the physical condition of patients. Take Parkinson's Disease tremor as an example. The tremor is divided into 5 discrete levels by the MDS-UPDRS 19 , but this does not provide sufficient resolution to monitor tremor severity on a fine-grained scale. After taking a medication, the tremor might become less severe, but the difference could be too small to move the severity from one level to another. Our method makes it possible to automatically monitor such slight changes and provide more information, for example, on the effectiveness of medication.
To evaluate the proposed method, we conduct comprehensive experiments on tremor severity estimation using sEMG signals, systolic blood pressure estimation using PPG and ECG signals, as well as parameter estimation from synthetic sEMG signals. Results show that the proposed approach significantly reduces the regression errors. The effectiveness of each loss function we propose is also examined. Our algorithm has demonstrated potential for automatically and precisely diagnosing diseases and monitoring the physical condition of individuals in a more sophisticated way.

Method
Problem formulation. Given a sequence of stochastic bioelectrical signal V with L sampling time steps, our objective is to learn a mapping function f : V → c , where c ∈ [1, C] is the assessment of the physiological pattern that V reflects. For example, c can represent the severity of tremor and C equals the maximum severity level. In a typical classification setting, c is an integer belonging to the set {1, 2, . . . , C} ; as such, the classification model is not designed to offer sufficient resolution to account for intra-class variations, meaning that such classification is coarse-grained. Previous work discretized c into countable classes (e.g., 1, 2, . . . , 5) and modeled f : V → c as a classification function. This formulation cannot give a fine-grained estimate of physiological states. For instance, when two patients are both of severity level 2, their actual severities can be slightly different (e.g., 2.1 vs 2.3), but the classification method cannot distinguish them. Instead, our aim is to provide a fine-grained estimate of the physiological state that can differentiate such slight differences. This is potentially useful in various scenarios. Suppose that a patient of severity level 2 took a medication and the severity decreased to 1 after 10 h. Using our fine-grained estimation, we can measure the small changes of severity over time, which could help physicians understand the effect of the medication. Given the importance of fine-grained estimation, we first replace the discrete classification with continuous regression so that c is a floating point number, which is expected to correspond with the naturally continuous physiological states. Nevertheless, this causes difficulty in training due to the lack of ground truth labels; after all, it is overly demanding for doctors, the annotators, to score their observations as accurate floating point numbers. The available ground truth data are only discrete integers indicating the categories.
Hence we consider the problem as learning an estimator function f that maps V to a continuous real-valued c using typical classification labels, which we refer to as coarse-grained labels.
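The information loss caused by coarse labeling can be made concrete with a toy quantizer (illustrative code; the function name is ours, not the paper's):

```python
def coarse_label(fine_state, num_classes=5):
    """Quantize a continuous physiological state in [1, C] to the
    nearest integer class, discarding the intra-class detail."""
    return min(max(round(fine_state), 1), num_classes)

# Two patients with slightly different severities receive the same label,
# so a classifier trained on these labels cannot tell them apart:
print(coarse_label(2.1), coarse_label(2.3))
```

Both calls return class 2, which is exactly the intra-class variation our fine-grained estimator aims to recover.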
Approximating the estimator function. We propose a convolutional neural network, BioeNet (see Fig. 6a for details), to approximate the function f : V → c . BioeNet takes as input a batch of bioelectrical signals of length L with D channels, and outputs one floating point number for each sequence in the batch, representing the estimated physiological state. The batch size is denoted as H. By using global max-pooling in the head layer, BioeNet is translation invariant to the input signals. In the following, we show how the neural network can approximate the function f by learning from coarse labels. We first propose a theorem that mathematically proves the feasibility of learning fine-grained estimation from coarse labels. We then propose two learning strategies, distribution restoration and ordinal scale learning, to implement the theory.

Theorem 1 Two continuous probability density functions p_A(x) and p_B(x) are defined on x ∈ [θ_1, θ_2], with p_A(x) ≥ δ > 0 and p_B(x) ≥ δ > 0. S_A = {a_1, a_2, . . . , a_n} and S_B = {b_1, b_2, . . . , b_n} are independent random samples from p_A(x) and p_B(x) respectively. If (I) p_A(x) ≡ p_B(x) and (II) ∀i, j ∈ {1, 2, . . . , n}, a_i ≤ a_j ⇔ b_i ≤ b_j, then ∀i ∈ {1, 2, . . . , n} we have:

E[(a_i − b_i)^2] ≤ 1 / (2δ^2(n + 2)).

Here we propose Theorem 1 as the mathematical basis of our method and prove it at the end of this section. Note that Theorem 1 does not rely on a specific network architecture (e.g., BioeNet). Based on Theorem 1 we conclude that when n is sufficiently large, we have

E[(a_i − b_i)^2] ≈ 0,

which means S_B almost equals S_A, i.e., their element-wise error has approximately zero expectation, if both conditions stated in Theorem 1 are satisfied. In our specific task, we consider S_B as the network predictions and S_A as the fine-grained ground truth given as floating point numbers. S_A is inaccessible in training but is assumed to exist in reality.
Our objective is to enable the network prediction S B to approximately comply with both of the conditions so that it is close to S A .
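Theorem 1 can be illustrated numerically (an illustrative simulation, not the paper's code): draw two independent samples from the same distribution (condition I) and pair them by rank, which enforces condition II; the element-wise error then shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_error(n):
    # Condition (I): both samples come from the same distribution on [1, 5].
    a = np.sort(rng.uniform(1.0, 5.0, n))
    b = np.sort(rng.uniform(1.0, 5.0, n))
    # Sorting pairs elements by rank, enforcing condition (II):
    # a_i <= a_j  <=>  b_i <= b_j.
    return np.mean(np.abs(a - b))

print(mean_abs_error(100), mean_abs_error(100000))
# the element-wise error shrinks as n grows
```

This mirrors the role of the two conditions in training: matching the output distribution to the label distribution and preserving the prediction order together pull the predictions toward the fine-grained ground truth.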

Distribution restoration.
Meeting the first condition of Theorem 1 requires that the probability distribution of the network predictions accord with that of the intrinsic physiological state c. As a continuous variable, c obeys a continuous distribution that can be restored from the discrete distribution of the training labels (Fig. 1e) by interpolation. Mathematically, an infinite number of curves can interpolate the discrete points. To obtain a smooth curve and reduce redundancy, we use cubic interpolation. The justification for interpolating between neighboring stages comes from the rationale of clinical practice: clinical experts define these stages to thoroughly describe a physiological process, and if the experts had observed a phenomenon that seriously undermined the smoothness of the stage change, they would have already defined a new stage based on it. Therefore, the changes between neighboring stages should generally be gradual and smooth, and the physiological states are continuous because of the natural continuity of most physiological processes. Figure 1f shows the continuous probability density obtained from the discrete distribution of coarse classification labels in Fig. 1e. The essence of this distribution restoration is making use of the ordinal information in the class labels. In physiological state classification, neighboring classes can be assumed to be in a natural order; note that this may not hold in other classification tasks such as object recognition in computer vision or sentiment analysis in natural language processing. Let p_C(x) denote the restored distribution. Although p_C(x) is only an approximation, it provides richer information than the discrete classification labels.
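A minimal sketch of the restoration step, assuming the empirical class frequencies are interpolated with a cubic spline and renormalized into a density (SciPy is used for the spline; all function names here are illustrative, not the paper's implementation):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def restore_distribution(coarse_labels, num_classes=5, grid_size=1000):
    """Interpolate the discrete label frequencies with a cubic spline
    and renormalize into an approximate continuous density on [1, C]."""
    classes = np.arange(1, num_classes + 1)
    freq = np.array([(coarse_labels == c).mean() for c in classes])
    spline = CubicSpline(classes, freq)
    grid = np.linspace(1, num_classes, grid_size)
    density = np.clip(spline(grid), 0.0, None)   # no negative mass
    density /= density.sum()
    return grid, density

def sample_restored(grid, density, batch_size, rng):
    """Draw floating point pseudo-labels from the restored density."""
    return rng.choice(grid, size=batch_size, p=density)

rng = np.random.default_rng(0)
labels = rng.integers(1, 6, 10000)      # toy coarse labels in {1, ..., 5}
grid, density = restore_distribution(labels)
samples = sample_restored(grid, density, 512, rng)
```

The sampled floating point numbers stand in for the inaccessible fine-grained labels when computing the distribution loss.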
During training, we sample a batch of floating point numbers to represent p_C(x) statistically, then compute the maximum mean discrepancy loss L_m between these samples and the network outputs:

L_m = (1/H^2) Σ_{i=1}^{H} Σ_{j=1}^{H} [ k(ĉ_i, ĉ_j) + k(c_i^s, c_j^s) − 2 k(ĉ_i, c_j^s) ],

where H is the batch size, ĉ_i (ĉ_j) denotes the network output, c_i^s (c_j^s) indicates the floating point numbers sampled from the distribution p_C(x), and k(x, y) is the Gaussian kernel. This loss term minimizes the discrepancy between the distribution of the network outputs and the restored distribution p_C(x).

Ordinal scale learning. To meet the second condition of Theorem 1, we introduce the slack L1 loss L_l defined in Eq. (5), where c_i^c indicates the coarse classification label as an integer, α is the tolerance range and β is used to smooth the gradients. When the difference |ĉ − c^c| is less than the tolerance threshold α, L_l equals zero so that the network is not penalized. The basic concept underlying Eq. (5) is that the coarse-grained label c^c only represents the integer part of the real ground truth, so the difference |ĉ − c^c| need not equal zero when ĉ is a floating point number. We therefore slacken the objective with a tolerance range α to give the network more flexibility: it learns a generalized ordinal regression without suffering from the discontinuity of the training labels. Note that the order of instances within the same class is inaccessible due to the absence of fine-grained ground truth. Even so, the network still manages to distinguish the intra-class order, as demonstrated in the experiment section.

During training, a minibatch contains H sequences of bioelectrical signals, each with L time steps and D channels (Fig. 6b). BioeNet predicts a score indicating the physiological state for each sequence, yielding an H × 1 vector, whose slack L1 loss with the coarse-grained labels is computed. Meanwhile, we sample H numbers from the restored continuous distribution and compute their maximum mean discrepancy loss with the network outputs. The parameters of the convolution kernels in BioeNet are optimized via gradient descent.
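The two losses can be sketched as follows, in NumPy for clarity. The MMD estimator is the standard biased form; for the slack L1 loss we show a simple hinge variant, since the exact β-smoothing of Eq. (5) is not reproduced here (an assumption of this sketch):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel matrix k(x_i, y_j) between two 1-D arrays."""
    return np.exp(-(x[:, None] - y[None, :]) ** 2 / (2 * sigma ** 2))

def mmd_loss(pred, sampled, sigma=1.0):
    """Squared maximum mean discrepancy between the network outputs and
    samples drawn from the restored distribution p_C(x)."""
    k_pp = gaussian_kernel(pred, pred, sigma).mean()
    k_ss = gaussian_kernel(sampled, sampled, sigma).mean()
    k_ps = gaussian_kernel(pred, sampled, sigma).mean()
    return k_pp + k_ss - 2.0 * k_ps

def slack_l1_loss(pred, coarse, alpha=0.5):
    """Hinge variant of the slack L1 loss: zero whenever the prediction
    stays within the tolerance alpha of its coarse integer label."""
    return np.maximum(np.abs(pred - coarse) - alpha, 0.0).mean()
```

When the output batch follows the restored distribution exactly, the MMD term vanishes; when every prediction lies within α of its coarse label, the slack L1 term vanishes, leaving the decimal part unconstrained by the label.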
During testing (inference), BioeNet takes a sequence of bioelectrical signal and directly predicts the physiological state in an end-to-end fashion.

Implementation details. The proposed BioeNet is implemented in TensorFlow and Python. All convolutional layers are followed by a batch normalization layer and a ReLU non-linearity, except the last layer, which is purely linear. We choose H = 512 and L = 2048 in training, while the number of channels D depends on the acquisition instruments of the different bioelectrical signal datasets. We use the Adam optimizer to train for 60 epochs at a learning rate of 10^−4 and then for 20 epochs at 10^−5.
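The translation invariance conferred by the global max-pooling head can be illustrated with a toy one-dimensional convolution (illustrative code, not BioeNet itself):

```python
import numpy as np

def conv1d_valid(x, w):
    """Minimal 1-D convolution (valid padding) for illustration."""
    L, K = len(x), len(w)
    return np.array([np.dot(x[i:i + K], w) for i in range(L - K + 1)])

def head(x, w):
    """Convolution followed by a global max pool, as in BioeNet's head."""
    return conv1d_valid(x, w).max()

rng = np.random.default_rng(0)
w = rng.normal(size=7)
x = np.zeros(2048)
x[100:130] = rng.normal(size=30)        # a localized waveform
x_shifted = np.roll(x, 500)             # the same waveform, translated

# The pooled response is unchanged by the shift (up to boundary effects):
print(np.isclose(head(x, w), head(x_shifted, w)))
```

Because the max is taken over all positions, the response depends on the waveform's shape rather than on where it occurs in the recording window.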

Pipeline overview. A systematic view of the learning pipeline is shown in Fig. 6b.
Proof of Theorem 1 The proof consists of two parts. For the first part, let y ∼ U(0, 1) and z ∼ U(0, 1) be independent stochastic variables. {y_1, y_2, . . . , y_n} and {z_1, z_2, . . . , z_n} are two sets of independent samples of y and z with elements sorted in ascending order, i.e., ∀i ≤ j, y_i ≤ y_j and z_i ≤ z_j. It is well known that y_k and z_k are the k-th order statistics 29 of the standard uniform distribution, with expectation

E[y_k] = E[z_k] = k/(n + 1)

and variance

Var(y_k) = Var(z_k) = k(n − k + 1) / ((n + 1)^2 (n + 2)) ≤ 1 / (4(n + 2)).
Since y_k and z_k are independent and share the same expectation, we have

E[(y_k − z_k)^2] = Var(y_k) + Var(z_k) ≤ 1 / (2(n + 2)). (7)

For the second part, we consider two continuous probability density functions p_A(x) and p_B(x) defined on x ∈ [θ_1, θ_2], with p_A(x) ≥ δ > 0 and p_B(x) ≥ δ > 0. S_A = {a_1, a_2, . . . , a_n} and S_B = {b_1, b_2, . . . , b_n} are independent random samples from p_A(x) and p_B(x) respectively, satisfying (I) p_A(x) ≡ p_B(x) and (II) ∀i, j ∈ {1, 2, . . . , n}, a_i ≤ a_j ⇔ b_i ≤ b_j. Let a_k and b_k be the k-th smallest elements in S_A and S_B respectively, and let F(x) be the cumulative distribution function of p_A(x), which is the same as that of p_B(x). Since any continuous distribution can be mapped to the standard uniform distribution by its cumulative distribution function, S_A^F = {F(a_1), F(a_2), . . . , F(a_n)} and S_B^F = {F(b_1), F(b_2), . . . , F(b_n)} both follow U(0, 1). Then, based on Eq. (7), we have

E[(F(a_k) − F(b_k))^2] ≤ 1 / (2(n + 2)).

Because p_A(x) = p_B(x) ≥ δ, we have |F(a_k) − F(b_k)| ≥ δ|a_k − b_k|, and thus:

E[(a_k − b_k)^2] ≤ E[(F(a_k) − F(b_k))^2] / δ^2 ≤ 1 / (2δ^2(n + 2)).

Since ∀i, j ∈ {1, 2, . . . , n}, a_i ≤ a_j ⇔ b_i ≤ b_j, the rank of a_i in S_A equals that of b_i in S_B; hence for every i there exists a k such that a_i and b_i are the k-th smallest elements in S_A and S_B respectively, and the bound applies to every pair (a_i, b_i). The theorem is proved.
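The order-statistic moments used in the first part of the proof can be verified by simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, trials = 20, 5, 100_000

# Draw many samples of size n from U(0, 1) and record the k-th smallest.
samples = np.sort(rng.uniform(size=(trials, n)), axis=1)
y_k = samples[:, k - 1]

print(y_k.mean())   # close to k/(n+1) = 5/21
print(y_k.var())    # close to k(n-k+1)/((n+1)^2 (n+2)) = 80/9702
```

The empirical mean and variance match the closed-form expressions, and the variance bound 1/(4(n+2)) follows since k(n − k + 1) ≤ (n + 1)²/4.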