OSASUD: A dataset of stroke unit recordings for the detection of Obstructive Sleep Apnea Syndrome

Polysomnography (PSG) is a fundamental diagnostic method for the detection of Obstructive Sleep Apnea Syndrome (OSAS). Historically, trained physicians have manually identified OSAS episodes in individuals based on PSG recordings. Such a task is highly important for stroke patients, since in such cases OSAS is linked to higher mortality and worse neurological deficits. Unfortunately, the number of strokes per day vastly outnumbers the availability of polysomnographs and dedicated healthcare professionals. The data in this work pertain to 30 patients who were admitted to the stroke unit of the Udine University Hospital, Italy. Unlike previous studies, exclusion criteria are minimal. As a result, data are strongly affected by noise, and individuals may suffer from several comorbidities. Each patient instance is composed of overnight vital signs data deriving from multi-channel ECG, photoplethysmography and polysomnography, and related domain expert’s OSAS annotations. The dataset aims to support the development of automated methods for the detection of OSAS events based on just routinely monitored vital signs, and capable of working in a real-world scenario.

(2022) 9:177 | https://doi.org/10.1038/s41597-022-01272-y www.nature.com/scientificdata
Unfortunately, performing a PSG in an electrically hostile environment, such as a stroke unit, on neurologically impaired patients is a difficult task 13,14, with the result that signals are often affected by noise; in addition, the number of strokes per day vastly outnumbers the availability of polysomnographs and dedicated healthcare professionals. Therefore, a simple and automated recognition system to identify OSAS cases among acute stroke patients is highly desirable. The continuous multiparametric recording of vital signs that is routinely performed in stroke units represents a relevant data source for a comprehensive assessment of a patient's health status. However, such data represents an insufficient amount of information for traditional, manual sleep scoring 15.
The dataset presented in this work (named OSASUD, Obstructive Sleep Apnea Stroke Unit Dataset) is aimed at supporting the development of automated methods for the identification of OSAS episodes based on simplified monitoring system data. It is composed of overnight recordings of 30 patients that were admitted to the stroke unit of the Udine University Hospital, Italy. For each patient, recordings of multi-channel ECG and photoplethysmography (PPG) are reported, together with derived data including heart rate, oxygen saturation, pulsatility index, respiratory rate, and premature ventricular contractions.
In the literature, several OSAS datasets have already been published with a similar goal, the most important ones being Physionet's Apnea-ECG Database (35 training + 35 test patients) 16, SVUH/UCD St. Vincent's University Hospital/University College Dublin Sleep Apnea Database (25 patients) 17, HuGCDN2014 Database (77 patients) 18, and MIT-BIH Polysomnographic Database (18 patients) 19. Nevertheless, they are all far from representing a real-world situation: their data are recorded in ideal conditions and on highly selected patients, with stringent exclusion criteria concerning the presence of cardiac, respiratory, and other comorbidities. As a result, models developed according to them are hardly generalizable to real-life scenarios, where they would be of actual use.
Another source of sleep-related data, although not focused on apnea detection tasks, is the National Sleep Research Resource, which provides a repository 20 of sleep study datasets, including the Sleep Heart Health Study (5804 subjects) and the MrOS Sleep Study (2911 subjects).
Our setting is quite different. The patients we consider show a considerably complex clinical situation, and the presence of comorbidities is the rule rather than the exception. In addition, recordings are affected by noise and missing data, as is typical in real-world monitoring systems. For these reasons, we believe that publicly sharing our dataset can provide valuable support for advancing research into OSAS detection.
Figure 1 depicts the overall workflow of the study. Each patient underwent simultaneous overnight PSG and vital signs (ECG and PPG) recording. The collected PSG data were then annotated by a trained sleep physician against the presence of apnea and hypopnea events, at one-second granularity. The PSG data and annotations were then temporally aligned with and matched against the recorded vital signs. The final dataset was assembled considering the physician's annotations and a relevant subset of the collected data. In the following, the different phases of the workflow are thoroughly described.

Methods
Participants. The study includes 30 patients who were admitted to the stroke unit of the Clinical Neurology Unit of the Udine University Hospital for a suspected cerebrovascular event (ischemic stroke, transient ischemic attack, or hemorrhagic stroke) from August 2019 to July 2020. Exclusion criteria were the following: age <18 years, insufficient compliance with standard monitoring and/or PSG, aphasia of sufficient severity to limit comprehension of the study protocol and/or expression of informed consent, high risk of alcohol/drug withdrawal syndrome. Diabetes mellitus, atrial fibrillation, cardiac disease, obesity, and other medical conditions not listed above were not considered as exclusion criteria. Table 1 reports detailed information regarding each patient. As can be seen, the data are quite heterogeneous considering age, gender, AHI, and quality of the recordings.
Ethics declaration. All participants gave written informed consent prior to their participation in the study. The regional Ethics Committee (Comitato Etico Unico Regionale) of Friuli-Venezia Giulia, Italy approved the anonymous publication of data recordings.
Data collection. Each patient underwent simultaneous overnight PSG and vital signs recording. Recordings were performed during the first days after clinical onset (average 1.3 ± 1.1 days, range 0-5), while patients were still monitored in the Stroke Unit. Table 2 summarizes all collected signals.
A level 3 PSG without video recording was performed using an Embletta MPR polysomnograph (Natus Medical Inc., Pleasanton, CA, USA), keeping track of the following channels: nasal airflow, blood oxygen saturation, snoring, body position, thoracic and abdominal movements, and ECG. Nasal airflow was derived from a dedicated pressure transducer connected to a nasal cannula; sampling rate was 20 Hz, whereas high-pass and low-pass filters were set at 0.1 and 15 Hz respectively. Blood oxygen saturation was measured by means of transmission PPG with red-infrared light-emitting diode and sensor positioned on the opposite sides of a finger. The sampling rate for the PPG curve was 75 Hz; arterial oxygen saturation and heart rate were measured from the PPG signal with a 3 Hz sampling rate. Snoring intensity was estimated by means of nasal airflow waveform analysis, with a 10 Hz sampling rate. Body position was recorded with an internal three-axis accelerometer with a 10 Hz sampling rate; position data were also used to detect major body movements. Thoracic and abdominal movements were recorded by means of two independent respiratory inductance plethysmography single-use sensor bands; the thoracic band was positioned midway between the manubrium of the sternum and the xiphoid process, whereas the abdominal band was placed midway between the xiphoid process and the umbilicus. The sampling rate was 10 Hz for both channels, with high-pass and low-pass filters set at 0.1 and 15 Hz respectively. ECG was recorded with a single-use Ag-AgCl electrode on the acromial head of each clavicle, akin to a lead I with proximal electrode positioning 21. The ECG signal was recorded with a 500 Hz sampling rate and high-pass and low-pass filters set at 0.3 and 70 Hz respectively. Files were analyzed with Embla RemLogic software, version 3.4.1.2371 (Natus Medical Inc., Pleasanton, CA, USA).
Recordings were exported as EDF files 22 with no gain adjustment or additional filtering; annotations were exported as separate TXT files with timestamps for each event.
Vital signs were collected by means of a Mindray iMec15 monitor connected to a Mindray Benevision CMS II central monitoring system (Mindray Bio-Medical Electronics Co., Ltd., Shenzhen). The following parameters were recorded: 12-lead ECG waveform (standard leads: I, II, III, aVR, aVL, aVF, V1, V2, V3, V4, V5, V6 21) with single-use Ag-AgCl electrodes, ECG-derived heart rate, ECG-derived premature ventricular contraction (PVC) rate, thoracic impedance waveform measured from lead II electrodes, thoracic impedance-derived respiratory rate, PPG waveform recorded with red-infrared light-emitting diode and sensor positioned on the opposite sides of a finger, PPG-derived pulse rate, PPG-derived blood oxygen saturation, PPG-derived perfusion index, oscillometric arm cuff blood pressure (systolic, diastolic, and mean). Sampling frequencies were 80 Hz for ECG, PPG, and thoracic impedance waveforms, 1 Hz for heart rate, PVC rate, pulse rate, blood oxygen saturation, perfusion index, and respiratory rate, and 1/hour for blood pressure. Recording bandwidths (−3 dB) were 0.5-40 Hz for the ECG channels and 0.2-2 Hz for thoracic impedance, both with a 60 Hz notch filter. All data were exported from the central monitoring system storage disk as comma-separated value (CSV) files.
Data annotation. All PSG data were reviewed with Embla RemLogic software, version 3.4.1.2371 (Natus Medical Inc., Pleasanton, CA, USA), which allows for signal processing, inspection and annotation. The dataset was annotated by a trained sleep medicine physician in accordance with the American Academy of Sleep Medicine sleep scoring rules 15, and tagged against the presence of central/obstructive/mixed apnea and hypopnea events (which we refer to as anomalies), each identified by its specific time interval. Figure 2 shows a partial recording with its annotations, opened in Embla RemLogic.
Data transformation. Since patients' data were simultaneously recorded by means of two different devices (the Embletta polysomnograph and the Mindray monitoring system), they needed to be temporally aligned. This need is quite natural, as different devices may have slightly different clocks which they use to timestamp the data (i.e., the same timestamp, on different devices, might refer to slightly different real-world time instants). Given a patient, to determine the time shift between his or her two sets of recordings, we proceeded as follows. We considered Embletta's and Mindray's oxygen saturation and heart rate recordings, downsampling Embletta's data from 3 Hz to 1 Hz by means of averaging. We calculated the correlation between the two heart rate time series at different time offsets. We then repeated the same process for the two oxygen saturation time series. As a result, we obtained two different shift estimates. Starting from the smallest estimate, we ultimately fine-tuned the shift value by hand, looking at different parts of the signals, to obtain the final alignment. Figure 3 shows the situation for the heart rate signal of one of the considered patients, before and after the alignment process.
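The automatic part of the alignment procedure described above can be sketched as follows. This is only an illustrative reconstruction, not the code used in the study: the function names, the use of the Pearson correlation coefficient, and the `max_lag` search range are assumptions, and the manual fine-tuning step is not reproduced.

```python
import numpy as np


def downsample_mean(x, factor):
    """Average consecutive blocks of `factor` samples (e.g. 3 Hz -> 1 Hz with factor=3)."""
    x = np.asarray(x, dtype=float)
    n = (len(x) // factor) * factor  # drop the incomplete trailing block, if any
    return x[:n].reshape(-1, factor).mean(axis=1)


def estimate_shift(ref, other, max_lag):
    """Return (lag, correlation) for the lag at which `other` best matches `ref`.

    Both series are assumed to be sampled at 1 Hz. A positive lag means that
    `other` is delayed by `lag` samples with respect to `ref`.
    """
    ref = np.asarray(ref, dtype=float)
    other = np.asarray(other, dtype=float)
    best_lag, best_corr = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = ref, other[lag:]
        else:
            a, b = ref[-lag:], other
        n = min(len(a), len(b))
        a, b = a[:n], b[:n]
        valid = ~(np.isnan(a) | np.isnan(b))  # ignore missing samples
        if valid.sum() < 2 or np.std(a[valid]) == 0 or np.std(b[valid]) == 0:
            continue  # correlation is undefined for this offset
        corr = np.corrcoef(a[valid], b[valid])[0, 1]
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag, best_corr
```

Under these assumptions, one would call `estimate_shift` once on the two heart rate series and once on the two oxygen saturation series, and then refine the smaller of the two estimates by visual inspection, as described in the text.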
Thanks to the alignment process, we were then able to correctly associate the PSG data and related apnea annotations performed by the physician to the Mindray data. Next, values of oxygen saturation below 50 or above 100 were considered to be artifacts, and set to null. The same approach was taken for values of heart rate below 20 or above 200, and respiratory rate below 5 or above 40. Whenever oxygen saturation was set to null, we also set perfusion index to null. We capped the value of premature ventricular contractions at the corresponding heart rate. As for PSG data related to airflow, snoring, body position, thoracic and abdominal movements, we standardized each signal individually for each patient. This improves their comparability, since different calibrations are expected to be used for different recording sessions. Finally, data were de-identified in order to preserve the privacy of the participants. Observe that no signal filtering was applied in this phase.
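The range-based artifact rules and the per-patient standardization can be illustrated with a short Pandas sketch. The column names (`spo2`, `hr`, `rr`, `pi`, `pvc`, `patient`) are hypothetical placeholders rather than the actual dataset columns, the z-score is an assumed form of standardization, and leaving the PVC value untouched when the heart rate is null is also an assumption, since the text does not specify that case.

```python
import numpy as np
import pandas as pd

# Hypothetical column names mapped to the plausibility ranges given in the text.
VALID_RANGES = {"spo2": (50, 100), "hr": (20, 200), "rr": (5, 40)}


def clean_vitals(df):
    """Set out-of-range vital signs to NaN and cap the PVC rate at the heart rate."""
    out = df.copy()
    for col, (low, high) in VALID_RANGES.items():
        # Keep values inside [low, high]; everything else becomes NaN.
        out[col] = out[col].where(out[col].between(low, high))
    # Perfusion index is discarded whenever oxygen saturation is null.
    out.loc[out["spo2"].isna(), "pi"] = np.nan
    # Cap PVC at the heart rate (comparison is False where heart rate is NaN,
    # so those PVC values are left unchanged: an assumption of this sketch).
    too_high = out["pvc"] > out["hr"]
    out["pvc"] = out["pvc"].mask(too_high, out["hr"])
    return out


def standardize_per_patient(df, cols, patient_col="patient"):
    """Z-score each column in `cols` separately within each patient."""
    out = df.copy()
    grouped = out.groupby(patient_col)[cols]
    out[cols] = (out[cols] - grouped.transform("mean")) / grouped.transform("std")
    return out
```

Since the published dataset has already undergone these steps, such code is mainly useful for applying the same conventions to new recordings.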

Data Records
The dataset OSASUD consists of a Pandas 23 DataFrame with 18 columns and 961357 rows, saved in Pickle format (file dataset_OSAS.pickle 24). Table 3 provides an overview of the columns. Observe that we consider only a subset of the originally recorded data. The reason is two-fold: (i) some signals are redundant, for instance the PPG and ECG waveforms; in such cases, given the aim of our dataset, we favour Mindray data; and (ii) some signals are only used for auxiliary tasks, as is the case for thoracic impedance, from which respiratory rate is derived.
As a result, each row is characterized by an anonymous identifier of the patient and a timestamp that keeps track of the time instant at which the data was recorded, at one-second granularity. As for the other columns, they report:
• the ECG-derived heart rate, respiratory rate, and premature ventricular contractions per minute;
• the PPG-derived oxygen saturation (in %) and perfusion index (in %);
Note that, for each patient, data are contiguous from the start to the end of his or her overnight recording. Whenever values were missing for a time instant, they were replaced by null (constant numpy.NaN 25) in order to maintain timestamp contiguity.
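The contiguity convention can be illustrated with a minimal sketch that places one patient's rows on a gap-free one-second grid, so that missing instants appear as rows of NaN values. The column names `timestamp` and `patient` and the function name are hypothetical; the published DataFrame already follows this convention, so the code is only meant to show how it can be reproduced or checked.

```python
import pandas as pd


def make_contiguous(patient_df, time_col="timestamp", id_col="patient"):
    """Reindex a single patient's rows onto a contiguous 1-second time grid."""
    patient_id = patient_df[id_col].iloc[0]
    indexed = patient_df.set_index(time_col).sort_index()
    grid = pd.date_range(
        indexed.index.min(),
        indexed.index.max(),
        freq=pd.Timedelta(seconds=1),
        name=time_col,
    )
    out = indexed.reindex(grid)  # instants with no data become NaN rows
    out[id_col] = patient_id     # keep the identifier on the inserted rows
    return out.reset_index()
```

A quick contiguity check on the result is that all consecutive timestamp differences equal one second.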

Technical Validation
For each patient, we determined the amount of null values in PPG and ECG waveforms, and their derived attributes. Results are presented in Table 4. Annotations are always present (with 'NONE' as the default value), as well as PSG-recorded signals.
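A per-patient summary of null values of this kind can be computed along the following lines. This is a hedged sketch: the column names are hypothetical, and expressing the result as a percentage is an assumption about how such a summary might be presented.

```python
import numpy as np
import pandas as pd


def null_percentage(df, cols, patient_col="patient"):
    """Percentage of null values in each of `cols`, computed separately per patient."""
    # The mean of a boolean mask is the fraction of True (i.e. null) entries.
    return df[cols].isna().groupby(df[patient_col]).mean() * 100
```

The result is a DataFrame indexed by patient identifier, with one column per inspected attribute.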
As for the non-null values, the acquired PPG and ECG waveforms, their derived data, and the PSG-recorded signals were carefully inspected by a trained physician jointly with the PSG-based apnea events annotation phase. The inspection revealed that several recordings were affected by artifacts, either caused by the presence of comorbidities (e.g., atrial fibrillation) or by sudden movements performed by the patients. Such artifacts are unavoidable and common during recordings in a clinical setting, especially in an electrically hostile environment such as an intensive care unit. Given our dataset's purpose of modelling a real-world scenario, we chose to keep all the data, without removing noisy or null segments. Figure 4 presents the value distribution of the ECG- and PPG-derived attributes for each patient. Null values have been ignored. Each box extends from the first to the third quartile values of the data, with a line at the median. Whiskers extend to the smallest and largest observations which are not outliers (considering 1.5 times the interquartile range).
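For readers who wish to reproduce the box plot statistics numerically, the convention described above can be sketched as follows. The function name is hypothetical, and quartiles are computed with NumPy's default linear interpolation, which is an assumption about the exact quartile definition used for Figure 4.

```python
import numpy as np


def box_stats(values):
    """Quartiles and whisker ends of a box plot, ignoring null values."""
    v = np.asarray(values, dtype=float)
    v = v[~np.isnan(v)]
    q1, median, q3 = np.percentile(v, [25, 50, 75])
    iqr = q3 - q1
    # Whiskers reach the most extreme observations within 1.5 * IQR of the box.
    lower = v[v >= q1 - 1.5 * iqr].min()
    upper = v[v <= q3 + 1.5 * iqr].max()
    return {"q1": q1, "median": median, "q3": q3, "lower": lower, "upper": upper}
```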
A final validation of the dataset comes from the successful development of a deep learning model for OSAS event prediction based on the considered data, recently presented in the literature 26 .

Usage Notes
We have successfully read the dataset by loading it with Python's Pickle and Pandas packages. As each row of the Pandas DataFrame contains one second's worth of data pertaining to a given patient, in order to produce a machine learning training dataset to support the detection of OSAS based on data recorded by ECG and PPG, we suggest the following processing.
For each patient, concatenate all of his or her rows column-wise, so as to obtain the full time series pertaining to each of the data columns. Since no filtering was performed on the waveform signals, we also suggest applying a Butterworth bandpass filter of order 2, with 5 Hz highpass frequency and 35 Hz lowpass frequency, to the ECG and PPG waveform time series. At this point, divide each time series based on an s-second windowing approach, possibly with a certain degree of overlap between the windows. As a result, each instance is characterized by five s-second windows related to the ECG- and PPG-derived data (s samples each, at 1 Hz), three s-second windows concerning the waveform data (s·80 samples each, at 80 Hz), and two s-second windows containing the apnea event labels, respectively coded as a string or a boolean value. Finally, possibly remove those windows in which all predictors exhibit more than a specific degree (e.g., 50%) of null values. Note that, since raw waveforms have been included in the dataset, derived data other than those already provided can be easily calculated. This is the case for instance of features pertaining to the QRS complex, or to the pulse transit time.
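The suggested processing can be sketched with NumPy and SciPy as shown below. This is a minimal illustration and not the notebook distributed with the dataset: the function names are hypothetical, zero-phase filtering via `sosfiltfilt` is an assumption (the text only specifies the filter type, order and cut-off frequencies), and null samples must be handled before filtering, since NaN values propagate through the filter.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def bandpass(x, fs=80.0, low=5.0, high=35.0, order=2):
    """Butterworth bandpass filter, applied forward and backward (zero phase)."""
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, np.asarray(x, dtype=float))


def sliding_windows(x, window, step):
    """Split a 1-D series into windows of `window` samples taken every `step` samples."""
    x = np.asarray(x)
    starts = range(0, len(x) - window + 1, step)
    if len(starts) == 0:
        return np.empty((0, window), dtype=x.dtype)
    return np.stack([x[s:s + window] for s in starts])


def mostly_valid(windows, max_null_fraction=0.5):
    """Boolean mask of windows whose fraction of NaN samples is at most the threshold."""
    return np.isnan(windows).mean(axis=1) <= max_null_fraction
```

For an s-second window, the 1 Hz derived series would use `window=s`, while the 80 Hz waveforms would use `window=s * 80`, with the step scaled in the same way so that the windows stay aligned. The masks returned by `mostly_valid` for the individual predictors can then be combined to decide which instances to discard.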
We encourage using the dataset for the development of automated (either statistical or machine learning-based) solutions, for instance in the following scenarios:
• a model can be trained to predict the presence or absence of breathing anomalies based on the attribute anomaly, or to derive a more detailed classification by means of the attribute event. Note that, given the nature of the dataset, contrary to most previously published data sources, predictions at one-second granularity are possible, i.e., a model can be trained to determine the exact start and end time of each OSAS event;
• observe that sleep-disordered breathing occurring in the acute setting of cerebrovascular disease may present different features from sleep-disordered breathing in the general population. This database offers researchers a tool to train or test models for the identification of respiratory events in this specific subset of patients. In addition, given the real-world connotation of our dataset, it could be used to develop and embed models in current monitoring systems, with the aim of identifying sleep-disordered breathing in acute stroke patients, without resorting to mass PSG screening;
• an unsupervised model trained to detect unexpected signal variations emerging from the background variability may be considered, with the idea that such variations may act as a biomarker of clinical instability;
• thanks to the inclusion of detailed polysomnographic data, the dataset may also support studies aimed at uncovering qualitative and quantitative relationships between PSG-, PPG-, and ECG-derived information.
It should be noted that our work still presents some limitations: first of all, the sample size is relatively small. Second, all recordings have been performed at a single center. Moreover, all included patients share some homogeneous characteristics regarding ethnicity, region of origin and reason for admission. Additionally, we performed PSG with a class III device, which does not include EEG channels. Therefore, we could not obtain a proper sleep staging nor identify arousals: periods of wake after sleep onset and hypopneas associated with arousals but without significant desaturation may have been missed. Finally, all recordings have been performed once within the first days after disease onset, with no follow-up recordings or later acquisitions for comparison.

Code availability
To allow for easier usage of our data, a Python Jupyter Notebook is also included with the dataset (file preprocess_dataset.ipynb 24). The notebook has been tested with the following package versions: pandas = 1.3.3, numpy = 1.20.3, pickle = 4.0. The code performs a series of data pre-processing operations, which include:
• loading the Pickle file that encodes the dataset as a Pandas DataFrame;
• printing some validation results, including the values presented in Table 1;
• generating a sample machine learning-ready dataset, in the form of a set of Numpy arrays.
The code provided helps relieve the burden of data pre-processing, which typically absorbs a major part of the time spent developing and testing machine learning solutions.