Few-shot pulse wave contour classification based on multi-scale feature extraction

The annotation procedure of pulse wave contour (PWC) is expensive and time-consuming, thereby hindering the formation of large-scale datasets to match the requirements of deep learning. To obtain better results under the condition of few-shot PWC, a small-parameter unit structure and a multi-scale feature-extraction model are proposed. In the small-parameter unit structure, information of adjacent cells is transmitted through state variables. Simultaneously, a forgetting gate is used to update the information and retain long-term dependence of PWC in the form of unit series. The multi-scale feature-extraction model is an integrated model containing three parts. Convolution neural networks are used to extract spatial features of single-period PWC and rhythm features of multi-period PWC. Recursive neural networks are used to retain the long-term dependence features of PWC. Finally, an inference layer is used for classification through extracted features. Classification experiments of cardiovascular diseases are performed on photoplethysmography dataset and continuous non-invasive blood pressure dataset. Results show that the classification accuracy of the multi-scale feature-extraction model on the two datasets respectively can reach 80% and 96%, respectively.


Materials and methods
Data description. At present, the most commonly used methods of collecting pulse wave signals information primarily include PPG and pressure detection. PPG traces the pulsation state of blood vessels and measures pulse wave signals by measuring the attenuated light reflected and absorbed by human blood vessels and tissues 27 . Pressure-detection method obtains pulse wave signals by directly detecting changes in the pressure of human arteries over time. The cuff is tied directly to the patient's wrist, and the patient should maintain the same posture during measurement 28 . Patients feel that the wrist is sore and uncomfortable during long-term measurement. Compared with the pressure-detection method, the operation of PPG is simpler and suitable for long-term measurement, but it is susceptible to environmental interference during the measurement process, and its measurement accuracy and stability are lower than those of the pressure-detection method. For pulse wave signals obtained using different acquisition methods, to verify the generalization ability of the model, this study conducts experiments based on PPG dataset and CNBP dataset.
Considering that the clinically collected pulse wave signals contain substantial noise, pre-processing steps such as data filtering, beat division, etc. are necessary. The specific pre-processing details are described in the 'Results' section.
• Dataset I: The PPG dataset for the non-invasive detection of cardiovascular diseases contains 657 pulse wave records from 219 subjects 29 . The data covers the age range of 20-89 years old, and the sampling rate is 1000 Hz. Various diseases including hypertension and diabetes are recorded. The dataset provides four labels for normal blood pressure, pre-hypertension, and stage I/II hypertension. Table 1 shows the details of dataset I. • Dataset II: The CNBP dataset based on cardiovascular diseases is collected from the Fifth Affiliated Hospital of Zhengzhou University. The subjects who are 18-83 years old fill-up an informed-consent form before data collection. ZM-300 intelligent pulse wave signal collector is adopted for data acquisition, with a sampling frequency of 200 Hz. A pressure sensor based on a semiconductor strain gauge, with a sensitivity of 0.5 mV/g (bridge voltage of 6 V) and a pressure range of 0-1000 g, is used for the collection of pulse wave signals. The A/D converter is a 4-channel 12-bit converter. Dataset II contains 1326 pulse wave records from 221 subjects under six kinds of pulse pressures. It records various cardiovascular diseases, including hypertension and arrhythmia. Moreover, the dataset is collected under the control of standard experimental conditions and specifications. All methods are carried out in accordance with relevant guidelines and regulations, and all experimental protocols are approved by the Ethics Committee of Drug Clinical Trials of the Fifth Affiliated Hospital of Zhengzhou University. Table 2 shows the details of dataset II.  At the same time, the feature division of PWC is unclear, and no uniform marking standard exists. The machine needs to learn PWC features automatically. In view of the above problems, a solution is to establish a featureextraction model based on few-shot PWC. In this paper, a multi-scale feature-extraction model is proposed. The multi-scale feature-extraction model for few-shot PWC is shown in Fig. 1. N continuous PWC cycles are used as the network input, and then three network layers are connected in parallel to extract the multi-scale features of PWC. The recursive network layer is used to extract the length features of PWC; the periodic featureextraction layer is used to extract the features of PWC within a single cycle; and the rhythm feature-extraction layer is used to extract PWC features between multiple cycles. Finally, features extracted from the three network layers are combined in the inference layer to classify PWC.
Recursive network layer. PWC is composed of multiple cycles of pulse fluctuations and belongs to a one-dimensional time series. RNN is a network which is good at dealing with time series. By modeling PWC through RNN, the correlation between several cycles of PWC can be obtained. PWC is segmented in chronological order and sequentially input into RNN. The output state of the hidden layer represents the long-term memory feature extracted as the recursive network layer.
Considering the number of sample points and the complexity of network training, a recursive-network unit structure for few-shot PWC is designed. The forgetting and saving of information at each moment are controlled www.nature.com/scientificreports/ by the forgetting gate. Compared with the LSTM unit structure, the parameters are fewer. The specific structure is shown in Fig. 2.
In the recursive-network unit for few-shot PWC, the unit state vector c is controlled by the forgetting gate. The number of unit states of the previous-cycle PWC retained in the current-cycle PWC is determined by the forgetting gate. The forward calculation formula of the forgetting gate is as follows: where the weighted input of the forgetting gate is represented by net f ,t , the weight matrix of the forgetting gate is represented by W f , the operation of splicing two vectors into one vector is represented by [h t−1 , x t ] , the offset term of the forgetting gate is represented by b f , the forgetting gate is represented by f t , and σ refers to sigmoid function.
The weight matrix W f is formed by splicing W fh and W fx , corresponding to the previous-cycle PWC h t−1 and the current-cycle PWC h t : The input state c t of the current-cycle PWC is calculated from the unit output of the previous cycle and the input of the current cycle: where the weighted input of the PWC unit state is represented by netc ,t ; the weight matrix of the PWC unit state is represented by W c , which is composed of two matrices W ch and W cx ; and the bias term of the forgetting gate is represented by b c .
To update the cell state of the current cycle, the current memory c t needs to be combined with the long-term memory c t−1 , and new information needs to be added whilst forgetting some information. The unit state c t−1 of the previous-cycle PWC is multiplied with the corresponding position element of the forgetting gate f t , and the unit state c t of the current-cycle PWC is multiplied with the corresponding position element of 1 − f t to obtain the state output of the current-cycle PWC c t : The output is determined by the state of the PWC unit: As shown in Fig. 3, the recursive network layer is designed as a two-layer structure. The time of the main wave peak of the front PWC is used as the reference, and the single beat PWC of two adjacent troughs is used as the input of the unit. To mine the information that is opposite to the first-layer network-information transmission direction, a layer of reverse network is built on top of each layer of forward network, and the output of the where Bernoulli(p) represents the Bernoulli distribution according to probability p and has a value of 0 or 1; r t = [r 0 , r 1 , . . . , r n ] , n are the dimensions of h t , and x t is the input of the unit node at time t at the reverse network layer.
Periodic feature-extraction layer. The shape and change of PWC in a single cycle contain its main features.
Extracting the spatial features of PWC in a single cycle is necessary. However, PWC is sparse and contains little information at a single moment. If all points of PWC are used as input sessions, the number of network parameters becomes too large. At the same time, the training efficiency and accuracy of the model are greatly reduced by too many features.
To effectively extract the features of PWC in a single cycle, one-dimensional CNN is used. The convolution module has the functions of local connection and weight sharing and can extract the spatial position relationship at different time points of a single-cycle PWC. At the same time, it can reduce the amount of network parameters and reduce the model complexity.
The single-cycle PWC is a signal segment with a length of 235, and it is used as the input. The specific structure of the single-period feature of PWC is obtained through two convolution modules, which each module containing a one-dimensional convolution layer and a maximum pooling layer. To reduce the amount of computation in the process of model training and improve the computation speed, ReLU function is used as the activation function. The specific structure is shown in Fig. 4. .
Rhythm feature-extraction layer. PWC is formed by the periodic and regular pulse of pressure in arteries. However, slight differences exist between PWC cycles, which contain rhythm information and are difficult to  To extract the rhythm features of PWC, a two-dimensional CNN is used. Compared with single-cycle signals, multi-cycle signals have larger data dimensions, and a suitable convolution module scale is more important. Whilst ensuring that features are not discarded, reducing the dimension of data as much as possible is necessary. N adjacent periodic signals are spliced into a feature map, which is used as the input. The specific structure of the single-period features of PWC is obtained through two convolution modules, with each module containing one two-dimensional convolution layer and one max pooling layer. Same as the periodic feature-extraction layer, ReLU function is used as the activation function. The specific structure is shown in Figure 5.
Reasoning layer. The reasoning layer contains two fully connected layers, each neuron is fully connected to all neurons in the previous layer, and the local information with category discrimination in the network layer is integrated. The activation function of the fully connected layer adopts the ReLU function, and the training algorithm adopts the error back-propagation algorithm. The output of the last layer is passed through the SoftMax classifier to obtain PWC classification result.
Given that the gate-unit activation function in the recursive network layer is selected as the Sigmoid function, when the training result is close to the true value, the gradient operator is extremely small, and the convergence speed of the model is slow. The cross-entropy loss function is a logarithmic function. When it is close to the upper boundary, it can still be maintained at a high gradient state without affecting the convergence speed of the model. To improve the convergence speed of the model, cross-entropy is used as the loss function: where N is the number of samples, M is the number of PWC categories, y ij is the label of PWC, and p ij is the prediction result corresponding to each label.

Results
Evaluation index. We use Accuracy(Acc), Sensitivity (Sen), Specificity(Spe), Precision(Pre), and F1-score for model evaluation. The calculation formula for the five evaluation indicators is as follows 30 : where TP stands for true positive and is the number of samples predicting abnormal PWC; TN stands for true negative and is the number of samples predicting normal PWC as normal; FP stands for false positive and is the number of samples predicting normal PWC as abnormal; FN stands for false negative and is the number of samples predicting abnormal PWC as normal; Acc represents the overall classification accuracy of the overall  Preprocessing. Pulse wave signals collected in clinical settings is easily affected by noise, including motion artifacts, inherent noise of collecting instruments, and power-supply noise. Using end-to-end training directly reduces the classification accuracy. Therefore, the original signals need to be pre-processed before the experiment.
The frequency of pulse wave signals is primarily distributed between 0.5-2 Hz. Motion artifact is primarily caused by breathing, and the respiratory frequency of normal adults is about 0.2-0.3 Hz. The inherent noise of collecting instruments is above 90 Hz. The power supply noise is 50 Hz/60 Hz. Empirical mode decomposition (EMD) is a signal-processing method based on the time-scale features of the data itself without pre-setting any basis functions 31 . EMD has obvious advantages in dealing with non-stationary and non-linear data. According to the frequency-distribution features of the pulse wave signals, the EMD method is used to remove the noise.
In the process of EMD, cubic spline function is used to interpolate the maximum value sequence and the minimum value sequence to obtain the upper and lower envelopes. The cubic spline interpolation needs to use two adjacent points before and after. Therefore, there will be divergence at the two ends of the data, that is, the derivative of the intrinsic mode-function at the boundary increases, which makes the filtered signal has obvious distortion at the beginning and end. To avoid the transient phenomenon at the beginning and end of the filtered signal, the end points of the curve are added to the spline. The Pearson correlation degree is used to measure the degree of information loss before and after denoising, and the calculation method is shown in equation (16): where X represents the set of elements arranged in time in the original PWC, Y represents the set of elements arranged in time after denoising, and N represents the number of sample points.
The value interval of Person is [− 1,1]. When a person is close to 1, X and Y have a strong linear correlation, showing that the information loss is relatively small after denoising. In the experiment, only data with a Person coefficient greater than 0.93 after filtering is retained to ensure that the filtered data retains the original information to the greatest extent. The calculation result of the signal Person coefficient in Fig. 6 is 0.99.
Signals collected clinically are actually the slice data of pulse wave signals. Few records of pulse wave database itself exist; not every record has all categories, and even some categories appear only on a few records. In this case, to ensure that categories are not omitted as far as possible, the PWC should be divided separately into beats 32 . In the experiment, the PWC is divided into cycles according to the trough position. The signal segment of the adjacent valleys is regarded as one PWC cycle. The specific pre-treatment process is shown in Fig. 6. Setting of model parameters. The pre-processed PWC is randomly divided into training set and test set at a ratio of 4:1. Under the condition of a certain amount of data, the classification accuracy of the model no longer increases before the number of hidden nodes of the neural network increases to a certain extent 33 . Therefore, the number of hidden nodes in the recursive network layer is set to 32, the number of kernels in the periodic feature-extraction layer is set to 32 and 64, the number of kernels in the rhythm feature-extraction layer is set to 32 and 64. After verification through multiple experiments, the network training parameters are set, as shown in Table 3.
Experiment. The number of PWC cycles input of the model is expressed as a unit cycle. An increase in amount of cycles of model input leads to an increase in depth of the recursive network. Furthermore, the proportion of features extracted by the periodic feature-extraction layer and the rhythm feature-extraction layer increase. At the same time, the network dimension increases leads to decreased training speed of the model. To determine the optimal number of unit cycles for a suitable PWC multi-feature scale model, experiments are conducted for different unit cycles. Training set II is used in the experiment, and each group is trained 50 times. The final results are shown in Table 4. According to the table, when the number of unit cycles is 5, the model has the optimum effect.
On the premise of retaining the structure of the pre-processing layer and the reasoning layer, as well as the network parameters, the network layer is changed for comparison experiments. Zhan 34 Figure 7.
Performance comparison with other neural network methods under different datasets is shown in Table 5. According to the table, the multi-scale model has good classification performance on dataset I and dataset II.    36 were introduced for comparison. Comparative studies show that the multi-scale feature-extraction model outperforms the other classification methods in terms of identification accuracy, stability, and sensitivity, and the multi-scale feature-extraction model consumes less time for training. For the proposed novel PWC classification approach, the model is notably sensitive to the number of unit cycles, and we find that the best unit cycle is five. Also, the multi-scale feature-extraction model depends on the division result of the PWC cycle. Moreover, (a) the classification problem of few-shot PWC is well handled; (b) owing to the limitation of data volume, only binary classification is carried out; (c) PWC must be pre-processed (including EMD filtering and beat segmentation) before feature learning, which may lead to missing partial information. Finally, given the excellent performance of the multi-scale feature-extraction model in the few-shot PWC experiments, the classification of few-shot PWC is interesting and meaningful to study, and it will be further investigated in our future research.