A new, short-recorded photoplethysmogram dataset for blood pressure monitoring in China

Open clinical trial data provide a valuable opportunity for researchers worldwide to assess new hypotheses, validate published results, and collaborate for scientific advances in medical research. Here, we present a health dataset for the non-invasive detection of cardiovascular disease (CVD), containing 657 data segments from 219 subjects. The dataset covers an age range of 20–89 years and records of diseases including hypertension and diabetes. Data acquisition was carried out under standard experimental conditions and specifications. This dataset can be used to study photoplethysmograph (PPG) signal quality evaluation and to explore the intrinsic relationship between the PPG waveform and cardiovascular disease, in order to discover and evaluate latent characteristic information contained in PPG signals. These data can also be used to study early and noninvasive screening of common CVDs such as hypertension and of related diseases such as diabetes.


Background and Summary
The incidence of cardiovascular disease (CVD) has risen around the world in recent years, and its mortality rate has overtaken that of cancer, making CVD the number one killer of humans. Many studies have been conducted using noninvasive early diagnosis and screening techniques for CVDs such as hypertension and coronary artery sclerosis in order to discover more convenient and effective methods for the early identification of CVDs. Of these methods, photoplethysmography (PPG) has become widely recognized as a low-cost non-invasive detection technology for CVDs. The cardiovascular parameters detected using PPG technology include heart rate, blood oxygen saturation, blood pressure, assessment of arterial stiffness 1 , and pulse wave velocity, among others 2 . The PPG signal includes information on the hemodynamic process 3,4 , hemorheology 5 , and tissue status of the peripheral microcirculation system in the human body 6 . That is, the PPG signal is an aggregated expression of many physiological processes in the cardiovascular circulation system 7 . A physiological information database with high precision and a high sampling rate is urgently needed in PPG technology research in order to extract more cardiovascular parameters for the early screening and diagnosis of CVDs. We provide here a database containing physiological information and PPG waveform data collected over a year that can be used to research arterial blood vessel aging, arterial blood pressure detection 8 , and screening of hypertensive and diabetic patients based on PPG signals.
Electrocardiogram (ECG) and PPG signals can be combined to estimate blood pressure effectively, an approach that researchers have recently studied and improved 9 . However, there are some limitations to collecting ECG and PPG simultaneously using a mobile phone 10 . For these reasons, and for simplicity, a few researchers have attempted to estimate blood pressure using only PPG signals 11 . The concept of estimating blood pressure (BP) using only PPG signals seems promising and works best when the PPG signal is of high quality. Therefore, providing a database that supports BP estimation using only PPG will help further research in this area.
This PPG and BP (PPG-BP) database integrates the deidentified, comprehensive clinical data of patients admitted to the Guilin People's Hospital in Guilin, China. The openness of the data allows clinical studies to explore and improve the understanding of relationships between cardiovascular health and PPG signals, with the final goal of creating a simple, effective non-invasive detection technology that is easy to use and wearable. This dataset has been collected from 219 subjects, aged 21-86 years, with a median age of 58 years. Males accounted for 48%. The dataset covers several diseases including hypertension, diabetes, cerebral infarction, and insufficient brain blood supply.
In summary, this unique non-invasive detection dataset for cardiovascular disease can be used in a wide range of in-depth research. In the following sections, we describe the database records in detail, provide usage notes, and explain how to make full use of the dataset. We also provide an example of the basic properties of the database that researchers can build on.

Experimental design and data acquisition
The dataset collection program involved acquiring information on the basic physiology of individuals, extracting information on cardiovascular diseases from hospital electronic medical records, collecting PPG waveform signals, and detecting instant arterial blood pressure at the same time. The data acquisition was conducted at the Guilin People's Hospital.
A customized portable hardware platform was designed, consisting of a PPG sensor probe, a microcontroller, and a matching app. Data were transmitted via Bluetooth. The PPG sensor model was SEP9AF-2 (SMPLUS Company, Korea), which contains dual LEDs with 660 nm (red light) and 905 nm (infrared) wavelengths, with a sampling rate of 1 kHz and a 12-bit ADC; the hardware filter is a 0.5-12 Hz bandpass. The microcontroller, an MSP430FG4618 (Texas Instruments, USA), was embedded on the probe's board to configure the ADC, fetch the data, and send the data to the matching app via Bluetooth.
Waveform data were collected using a set of customized probes and a matching app that was developed in Android Studio. The PPG detection probe used infrared light and the transmission method to collect fingertip PPG waveform data. These real-time data were transmitted to the matching app via Bluetooth. The app could control the detection probe, display the real-time waveform, and conduct a signal quality assessment of the PPG waveform in order to save the high-quality PPG wave segments. The arterial blood pressure was measured using the Omron HEM-7201 (Omron Company, Kyoto, Japan) upper arm blood pressure monitor, which was validated in ref. 12 .
The study was approved by the ethics committees of the Guilin People's Hospital and the Guilin University of Electronic Technology in China. All participants gave written informed consent before the study. They were compensated monetarily at 10 Yuan/h. Participants answered questions about age, gender, height, and weight, and all initial data acquisition was conducted in a private and comfortable clinical room.
As shown in Figure 1, before signal collection began, each individual was asked to sit in an office chair in the most comfortable posture and to relax their arms on an empty desk. Each individual had 10 min to adapt to the environment and adjust their breathing after entering the data collection room. The specific collection settings were as follows: the PPG signal was collected at the fingertip of the left index finger, and the arterial blood pressure was collected from the right forearm, all of which was completed within three minutes. The arterial blood pressure measurement was performed by the hospital nurse.
www.nature.com/sdata/ SCIENTIFIC DATA | 5:180020 | DOI: 10.1038/sdata.2018.20
During signal acquisition, waveform data were sampled at 1 kHz with 12-bit AD conversion precision. Three segments were recorded and saved per subject; each segment included 2100 sampling points, which corresponds to a length of 2.1 seconds. During the 3 min data collection phase, a Skewness SQI value was computed for every PPG segment of a subject; segments with values greater than zero 13 were saved, and if a value was less than zero the app prompted the user to recollect the PPG signal. This step was developed to reduce the inclusion of PPG segments with high noise and motion artifacts.
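The acceptance rule described above can be sketched as a small loop. This is only an illustration and not the app's actual code: `read_segment` and `sqi` are hypothetical callables standing in for the probe interface and the Skewness SQI computation, and `max_attempts` is an assumed safeguard that the text does not mention.

```python
FS_HZ = 1000             # sampling rate stated in the text
SEGMENT_SAMPLES = 2100   # samples per saved segment
SEGMENTS_PER_SUBJECT = 3


def segment_duration_s(n_samples=SEGMENT_SAMPLES, fs_hz=FS_HZ):
    """Duration in seconds of a segment with n_samples at fs_hz."""
    return n_samples / fs_hz


def collect_segments(read_segment, sqi, max_attempts=20):
    """Keep reading segments until three pass the SQI > 0 rule.

    read_segment and sqi are hypothetical callables; max_attempts is an
    assumed safeguard that is not part of the described protocol.
    """
    saved = []
    for _ in range(max_attempts):
        segment = read_segment()
        if len(segment) != SEGMENT_SAMPLES:
            continue  # incomplete read, try again
        if sqi(segment) > 0:
            saved.append(segment)
        # Otherwise the segment is discarded, mirroring the re-collection prompt.
        if len(saved) == SEGMENTS_PER_SUBJECT:
            break
    return saved
```

With the default arguments, `segment_duration_s()` gives the 2.1 s segment length quoted in the text.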
The BP collection device (Omron HEM-7201) requires a waveform of at least 30 seconds to detect the systolic and diastolic period. The BP reading represents the blood pressure value for the 3 minute data recording for each subject, as shown in Figure 1. During the data collection process, we aimed to collect the BP and PPG data immediately one after the other. Three PPG segments were saved during the data collection period in addition to the BP recording. Every participant was asked to breathe as they normally would on a day-to-day basis, to reflect practical applications. Note that we did not investigate the baroreflex response to stress. The dataset includes BP and PPG information from subjects who were diagnosed with normotension, prehypertension, and stage I/stage II hypertension, which can be helpful and valuable for researchers.
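For readers who want to reproduce the four blood pressure categories from the recorded systolic and diastolic values, a minimal sketch follows. The thresholds are an assumption taken from the JNC 7 classification; the text above does not state which cut-offs were used, and `bp_category` is a hypothetical helper name.

```python
def bp_category(systolic_mmhg, diastolic_mmhg):
    """Assign a blood pressure category; the more severe category wins.

    Thresholds are assumed from JNC 7; the text does not state them.
    """
    if systolic_mmhg >= 160 or diastolic_mmhg >= 100:
        return "stage II hypertension"
    if systolic_mmhg >= 140 or diastolic_mmhg >= 90:
        return "stage I hypertension"
    if systolic_mmhg >= 120 or diastolic_mmhg >= 80:
        return "prehypertension"
    return "normotension"
```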

Patient characteristics
The dataset was collected from 219 adult subjects and currently contains 657 PPG waveform segments. The dataset covers individuals aged 21-86 years, and males account for 48% of participants. The dataset also covers several different CVDs, including hypertension, cerebral infarction, and insufficient brain blood supply, as well as other related diseases such as diabetes. The statistical results are shown in Figure 2.

De-identification
In the process of creating the dataset, the first task was to de-identify each participant and remove personal information such as name, telephone number, address, date, and so on.

Data Records
The dataset has been fully uploaded, and users can download it through the Figshare repository under the title PPG-BP Database (Data Citation 1).
The dataset comprises 1 table file and 219 waveform data folders, each of which includes three 2.1-second-length infrared PPG signal text files and physiological information recording files. The PPG signal data are the raw 12-bit AD values of the 2.1-second segments. The ID_1, ID_2, and ID_3 text files represent three separate segments of waveform data (Table 1, available online only).
The "PPG-BP database.xlsx" table file contains aggregated physiological and disease information for the subjects. Information records include ID, sex, age, height, weight, systolic pressure, diastolic pressure, heart rate, and disease records.
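As a rough illustration of how the three segment files of one subject might be read, the sketch below assumes that each text file holds numeric values separated by whitespace and that "ID" in the file names stands for the subject's ID. Both points, as well as the function names and folder layout, are assumptions to check against the downloaded files.

```python
from pathlib import Path


def parse_ppg_text(text):
    """Parse raw AD values from the text of one segment file.

    Assumes values are separated by whitespace (tabs, spaces, or newlines).
    """
    return [float(token) for token in text.split()]


def load_subject_segments(folder, subject_id):
    """Load <ID>_1.txt, <ID>_2.txt, and <ID>_3.txt for one subject.

    The naming follows the ID_1, ID_2, ID_3 pattern in the text; the folder
    layout is a hypothetical example.
    """
    folder = Path(folder)
    return [
        parse_ppg_text((folder / f"{subject_id}_{k}.txt").read_text())
        for k in (1, 2, 3)
    ]
```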
Before a participant record was archived, it underwent data integrity screening, data availability screening, and a waveform signal quality evaluation (to remove inconsistent, abnormal, and high-noise data) in order to form a high-quality dataset. The detailed process of inclusion and exclusion, as shown in Figure 3, is described as follows:
1. Data integrity screening: This process included screening for missing and abnormal values in basic physiological information, disease information, blood pressure, heart rate, and the 3 segments of waveform data. If one or more items were missing or if there was an abnormal value, the participant record was removed.
2. Data availability screening: This dataset is designed to focus on the clinical information for CVDs and other closely related diseases such as diabetes. Data from CVD patients who were also diagnosed with non-CVD diseases (except diabetes) were excluded during the screening process to ensure that the dataset only contains data from participants who were diagnosed with the diseases of interest.
3. Waveform signal quality evaluation: All 3 segments for each participant went through a signal quality evaluation, and a robust signal quality index (SQI) method was applied for this step. If the SQIs of the 3 segments of one subject were lower than the mean SQI calculated from the segments of all subjects, the subject's data were removed.
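The three screening steps can be sketched in code as follows. This is an interpretation, not the authors' pipeline: the record layout (a dict with a list of diagnoses under "disease" and three SQI values under "sqi") is hypothetical, the check for abnormal values is omitted because the text does not define it, step 3 is read as "remove a subject when all three segments fall below the mean", and the mean is taken over the subjects that survive steps 1 and 2.

```python
REQUIRED_FIELDS = ("sex", "age", "height", "weight", "systolic", "diastolic",
                   "heart_rate", "disease")


def passes_integrity(record):
    """Step 1: every required field is present and three SQI values exist."""
    fields_ok = all(record.get(name) is not None for name in REQUIRED_FIELDS)
    return fields_ok and len(record.get("sqi", [])) == 3


def passes_availability(record, diseases_of_interest):
    """Step 2: every recorded diagnosis is a disease of interest."""
    return all(d in diseases_of_interest for d in record["disease"])


def screen(records, diseases_of_interest):
    """Apply the three screening steps in order and return surviving records."""
    kept = [r for r in records if passes_integrity(r)]
    kept = [r for r in kept if passes_availability(r, diseases_of_interest)]
    if not kept:
        return []
    all_sqi = [s for r in kept for s in r["sqi"]]
    mean_sqi = sum(all_sqi) / len(all_sqi)
    # Step 3: drop a subject only when all three segments fall below the mean SQI.
    return [r for r in kept if any(s >= mean_sqi for s in r["sqi"])]
```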

Technical Validation
In order to thoroughly analyze the abundant information on cardiovascular physiology contained in the PPG signals, it was necessary to obtain the most authentic, high-precision, and high-quality PPG signal. In addition to controlling the process of data collection as accurately as possible, the PPG signal quality evaluation process was important in acquiring excellent quality signals 14,15 . This process can effectively prevent unfit waveform data from being saved and ensure that the saved data have complete heartbeat cycles, less noise, and lower drift and motion artifacts. Although the potential value of the PPG signal is well known 16,17 , it is difficult to acquire authentic and rich PPG signals and to extract subtle characteristics because of problems such as differences in body tissue 18,19 , motion disturbance 20-23 , and control of the acquisition process 24,25 . These issues merit further research and exploration 26 . In order to avoid or reduce the above-mentioned issues, the data collection process was conducted in a controlled, quiet environment with minimal interference. Figure 4 shows an overview of the data collection process. Figure 5 shows a statistics table of the physiological records for all subjects in the database.
The data collection experiment consisted of five stages. Stages I and II covered preparation of the customized hardware and software, training of operators, and recruitment of participants. Stage III was the phase of data collection in hospitals, including the acquisition of basic physiological information of participants, hospital electronic medical records, PPG signals, and blood pressure data. Stage IV was the data archiving part, including de-identification, format conversion, data matching, and data inclusion and exclusion for all the collected data. Stage V is the public release of the PPG-BP dataset; researchers can download the dataset and validate their algorithms.
At present, Perfusion Indices (P_SQI) are regarded as the gold standard of PPG Signal Quality Indices (SQI). Various other signal quality evaluation methods have been proposed and studied in order to identify simpler and more accurate evaluation methods for signal quality assessment. Elgendi 13 compared eight different signal quality indices: P_SQI, Skewness (S_SQI), Kurtosis (K_SQI), Entropy (E_SQI), Signal-to-noise ratio (N_SQI), Zero-crossing (Z_SQI), Matching of multiple systolic wave detection algorithms (M_SQI), and Relative power (R_SQI). For varying lengths of PPG waveform recordings (i.e., from 2 s to 30 s), the S_SQI method demonstrated better performance than the other methods (P_SQI, K_SQI, E_SQI, N_SQI, M_SQI, Z_SQI, and R_SQI) 13 . Moreover, PPG waveform classification (excellent vs. unfit) is possible with a 2 s recording using the S_SQI index 13 . These results motivated collecting PPG signals with 2 s length. Krishnan et al. 22 introduced and tested the Skewness statistic and found that S_SQI had a certain connection with the quality of the PPG waveform. Other researchers validated this observation, and Elgendi 13 found it to be the optimal method for assessing SQI in PPG signals. Skewness measures the symmetry, or lack of symmetry, of a probability distribution, which helps distinguish periodic, symmetrical, stationary signals from signals with sudden jumps and from irregular signals. The specific definition is as follows:
$$ S_{SQI} = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{x_i - \hat{\mu}_x}{\sigma} \right]^3 $$
where $N$ is the number of samples in the PPG signal, and $\hat{\mu}_x$ and $\sigma$ are the empirical estimates of the mean and standard deviation of $x_i$, respectively.
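A minimal sketch of the skewness index follows, taken here as the mean of the cubed standardized samples of a segment. The function name `skewness_sqi` is hypothetical, and the use of the divide-by-N standard deviation is an assumption, since the text only refers to an empirical estimate.

```python
import math


def skewness_sqi(x):
    """Skewness SQI: mean of the cubed standardized samples."""
    n = len(x)
    mean = sum(x) / n
    # Population (divide-by-N) standard deviation is assumed here.
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if std == 0:
        raise ValueError("constant segment has undefined skewness")
    return sum(((v - mean) / std) ** 3 for v in x) / n
```

A segment whose samples are symmetric around their mean scores close to zero, while a segment with a long tail toward high values scores above zero.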
In the process of data collection, the signal quality of each PPG segment was evaluated before it was saved, using the S_SQI index. Each segment of PPG signal was evaluated against the classification threshold of excellent, acceptable, or unfit PPG waveform in order to determine whether it should be saved.
During the evaluation of signal quality for each participant, the S_SQI values of the three segments were compared. Among the three segments, the segment with the highest S_SQI was deemed "high quality", the segment with the lowest S_SQI was deemed "low quality", and the remaining segment was deemed "medium quality". Note that we provide the PPG segments and their corresponding S_SQI values to make it easier for investigators to select the segment with the highest quality. Additionally, with the three S_SQI values available, researchers will be able to analyze each segment, if needed, for validation, etc.
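The ranking of a subject's three segments can be written in a few lines. The helper name `label_segments` is hypothetical, and breaking ties by segment order is an assumption, since the text does not say how equal values were handled.

```python
def label_segments(sqi_values):
    """Map three skewness SQI values to quality labels by rank.

    Labels are returned in the same order as the input. Ties are broken by
    segment order, which is an assumption; the text does not cover ties.
    """
    if len(sqi_values) != 3:
        raise ValueError("expected exactly three SQI values")
    names = ("high quality", "medium quality", "low quality")
    order = sorted(range(3), key=lambda i: sqi_values[i], reverse=True)
    labels = [None] * 3
    for rank, idx in enumerate(order):
        labels[idx] = names[rank]
    return labels
```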

Usage Notes
The dataset is distributed in standard file formats (text, xlsx) and can be read and processed by a variety of packages, including Matlab, Python, and R. In addition, when using this dataset, users should first perform a waveform quality evaluation to select the high-quality waveform segment from the three waveform segments of each participant.
There are multiple potential uses for this dataset, the most obvious of which is to build and validate methods for detecting CVDs and diabetes through pattern recognition and machine learning. Hypertensive or diabetic patients account for about one-third of the dataset. A series of PPG characteristics can be defined and mined, the intrinsic association between these characteristics and physiological processes can be studied, and the prediction of blood pressure or classification of hypertension can be explored. It is important to analyze the physiological information contained in the PPG waveform comprehensively.

Example usage
This dataset can be used to analyze the definition and extraction of morphological information of the PPG waveform. The PPG wave mainly consists of the systole period and the diastole period. Depending on the individual's cardiovascular health status, the tidal and diastolic waves may also be visible. These characteristics 27 can be defined and extracted by means of digital signal processing such as time and frequency domain processing or wavelet transform. The time span, amplitude, area, ratio, frequency, and energy parameters are the main features that can be extracted from the PPG signal 28 . The accurate identification and extraction of these characteristics has potential value in analyzing vascular aging, blood pressure, and arteriosclerosis, among others.
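As a starting point for time-domain feature extraction, the sketch below computes three simple quantities from one segment: the time of the highest sample, its amplitude above the segment minimum, and the area under the waveform. It is a deliberately simplified illustration. Treating the global maximum as the systolic peak and the segment minimum as the baseline are assumptions, and `basic_ppg_features` is a hypothetical name.

```python
def basic_ppg_features(segment, fs_hz=1000):
    """Return a few simple morphological features of one PPG segment.

    Uses the global maximum as the systolic peak, which is a simplification;
    a 2.1 s segment may contain more than one beat.
    """
    baseline = min(segment)
    peak_index = max(range(len(segment)), key=lambda i: segment[i])
    # Trapezoidal area between the waveform and its minimum (AD units x seconds).
    area = sum(
        ((segment[i] - baseline) + (segment[i + 1] - baseline)) / 2
        for i in range(len(segment) - 1)
    ) / fs_hz
    return {
        "peak_time_s": peak_index / fs_hz,
        "amplitude": segment[peak_index] - baseline,
        "area": area,
    }
```

A fuller analysis would first split the segment into individual beats and then locate the tidal and diastolic waves within each beat.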