VitalDB, a high-fidelity multi-parameter vital signs database in surgical patients

In modern anesthesia, multiple medical devices are used simultaneously to comprehensively monitor real-time vital signs to optimize patient care and improve surgical outcomes. However, interpreting the dynamic changes of time-series biosignals and their correlations is a difficult task even for experienced anesthesiologists. Recent advanced machine learning technologies have shown promising results in biosignal analysis; however, research and development in this area is relatively slow due to the lack of biosignal datasets for machine learning. The VitalDB (Vital Signs DataBase) is an open dataset created specifically to facilitate machine learning studies related to monitoring vital signs in surgical patients. This dataset contains high-resolution multi-parameter data from 6,388 cases, including 486,451 waveform and numeric data tracks of 196 intraoperative monitoring parameters, 73 perioperative clinical parameters, and 34 time-series laboratory result parameters. All data is stored in the public cloud after anonymization. The dataset can be freely accessed and analysed using application programming interfaces and a Python library. The VitalDB public dataset is expected to be a valuable resource for biosignal research and development.

multiple monitoring devices applied simultaneously to one patient are recorded as time-synchronized data tracks and stored as a single case file. The automatic recording function of this program has enabled massive collection of intraoperative biosignals in our tertiary university hospital. The Vital Signs DataBase (VitalDB) was constructed using (1) de-identified case files that were automatically recorded by the Vital Recorder program during daily surgery and anesthesia, and (2) perioperative patient information retrieved from our electronic medical record (EMR) system.
Unlike the previously reported public multi-parameter biosignal datasets [10][11][12], the VitalDB is the first public biosignal dataset specifically focused on perioperative patient care and is characterized by multi-parameter high-resolution waveform and numeric data [13]. Since the VitalDB dataset was first released in 2017, it has been used for various big data studies, including a deep learning-based cardiac output algorithm using arterial pressure waveforms, a deep learning-based pharmacokinetic-pharmacodynamic study of intravenous anesthetics, a machine learning-based bispectral index algorithm, a statistical analysis of the relationship between intraoperative bispectral index and postoperative mortality, and a deep learning algorithm to predict intraoperative hypotension from arterial waveforms [14][15][16][17][18].
Perioperative clinical information, laboratory results and surgical outcomes in this dataset may facilitate a variety of clinical outcome or clinical decision support studies. Studies that elucidate the relationship between biosignal parameters and clinical variables will also be feasible. For instance, the effects of intraoperative variables such as hypotension, hypothermia, and low cardiac output on clinical outcomes such as acute kidney injury, the length of hospital stay, or in-hospital mortality can be examined. The physiologic effects of various interventions such as vasoactive drugs, fluids, anesthetics, and anesthesia machine settings may also be investigated using the dataset. This dataset may simply be used as data samples for developing signal processing algorithms. However, we argue that this big data is better suited as a training dataset for machine learning of biosignals or for external validation of biosignal algorithms created using other datasets.
A final point to mention is the limitation that our data are from a single institution and a single race (Asian). Researchers should be careful as this can lead to overfitting of algorithms. As multicenter data can be a solution to this problem, we have released the Vital Recorder program and the VitalDB dataset for free. We hope that multicenter biosignal research for the development of general algorithms will be widely implemented in the future.

Methods
The database includes vital signs data and related clinical information that were prospectively recorded during surgery. The patient information was retrospectively obtained from our hospital's EMR system after surgery.
Approval for data collection. The acquisition and free disclosure of the data was approved by the Institutional Review Board of Seoul National University Hospital (H-1408-101-605). The study was also registered at clinicaltrials.gov (NCT02914444). Written informed consent was waived due to anonymity of the data. Data collection was performed in accordance with relevant guidelines and regulations of the institutional Ethics Committee.
Study population. Data were obtained from non-cardiac (general, thoracic, urological, and gynecological) surgery patients who underwent routine or emergency operations at Seoul National University Hospital, Seoul, Korea from Aug 2016 to Jun 2017. Of the 7,051 eligible cases, those with local anesthesia (239), incomplete recording (279), or loss of essential data tracks (145) were excluded. Finally, 6,388 cases (91%) performed under general anesthesia, spinal anesthesia, or sedation/analgesia were included in the dataset (Table 1).
Dataset development. These methods are expanded versions of descriptions in our related work [9]. All case files in this dataset were recorded using the Vital Recorder program (v 1.7.4). The laptop computer executing the Vital Recorder program was connected to multiple patient monitoring devices via serial cables (Fig. 1). Monitoring data from multiple anesthesia devices applied to one patient were recorded in one case file in a time-synchronized manner.
The same recording systems were installed in 10 out of 31 operating rooms to collect data over a year. The recording system operated for 24 hours every day, and case files of individual patients were automatically recorded separately. The case-by-case automatic recording was performed with the following method:
• When both heart rate and percutaneous oxygen saturation signals are detected, patient monitoring is considered to have started and case recording begins immediately.
• If the input of heart rate and percutaneous oxygen saturation signals disappears for more than 10 minutes, patient monitoring is considered to have ended and the recording is automatically stopped.
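The case-by-case recording rule above can be sketched as a small state machine. The following is an illustrative reconstruction under stated assumptions, not the Vital Recorder source code: the function name `segment_cases`, the `(time_s, heart_rate, spo2)` sample layout with `None` for an absent signal, and the reading that either signal is enough to keep a case open are all hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical sample layout: (time in seconds, heart rate, SpO2); None = no signal.
Sample = Tuple[float, Optional[float], Optional[float]]

TIMEOUT_S = 10 * 60  # signals absent for more than 10 minutes end the case


def segment_cases(samples: Iterable[Sample]) -> List[Tuple[float, float]]:
    """Return (start, last_signal) times of cases from a time-ordered stream."""
    cases = []
    start = None        # start time of the case being recorded, if any
    last_signal = None  # last time at which HR or SpO2 was present
    for t, hr, spo2 in samples:
        # Close the open case once the signals have been gone for too long.
        if start is not None and t - last_signal > TIMEOUT_S:
            cases.append((start, last_signal))
            start = None
        if start is None:
            # A new case begins only when both signals are detected together.
            if hr is not None and spo2 is not None:
                start = last_signal = t
        elif hr is not None or spo2 is not None:
            # Assumption: either signal counts as ongoing monitoring.
            last_signal = t
    if start is not None:
        cases.append((start, last_signal))
    return cases


stream = [
    (0, None, None), (5, 70, None), (10, 70, 98), (20, 72, 97),
    (30, None, None), (700, None, None), (800, 80, 99), (900, 80, None),
]
cases = segment_cases(stream)
```

In this sketch a case is reported as ending at the last sample that carried a signal, whereas a live recorder would only close the file after the 10-minute timeout has elapsed.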
The data collection process was remotely monitored every day in real-time through web monitoring, and the integrity of the case-matched vital files was reviewed on a weekly basis. After verification of the case-matched vital files (detailed in the Technical Validation section), track processing of the verified vital files was performed using code, as follows:
• Tracks with all 0 values or fewer than 10 data samples were deleted.
• Waveform tracks without corresponding numeric tracks were deleted.
• Tracks were renamed to improve the usability of the dataset.
• If a femoral arterial catheter was confirmed on the anesthesia records, the related arterial waveform and numeric tracks were renamed from ART, ART_SBP, ART_DBP, and ART_MBP to FEM, FEM_SBP, FEM_DBP, and FEM_MBP, respectively.
• "PUMP" in the PUMP_RATE and PUMP_VOL tracks was changed to the specific drug name obtained from infusion pump data or anesthesia records (e.g., EPI_RATE, PPF20_VOL).
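The track processing rules listed above can be sketched as follows. This is a minimal illustration, not the code actually used to build the dataset: the function `process_tracks`, the in-memory representation of tracks as lists of values, and the caller-supplied `waveform_numeric` mapping from a waveform track to its numeric companions are all assumptions, as are the `ECG` and `HR` track names in the example.

```python
from typing import Dict, List, Optional

# Renaming applied when a femoral arterial catheter is confirmed (names from the text).
ART_TO_FEM = {"ART": "FEM", "ART_SBP": "FEM_SBP",
              "ART_DBP": "FEM_DBP", "ART_MBP": "FEM_MBP"}


def process_tracks(
    tracks: Dict[str, List[float]],
    waveform_numeric: Dict[str, List[str]],
    femoral: bool = False,
    pump_drug: Optional[str] = None,
) -> Dict[str, List[float]]:
    """Apply the deletion and renaming rules to one case (hypothetical helper)."""
    # Rule 1: drop tracks that are all zeros or have fewer than 10 samples.
    kept = {name: vals for name, vals in tracks.items()
            if len(vals) >= 10 and any(v != 0 for v in vals)}
    # Rule 2: drop waveform tracks whose numeric companions are all missing.
    for wave, numerics in waveform_numeric.items():
        if wave in kept and not any(n in kept for n in numerics):
            del kept[wave]
    # Rule 3: rename arterial tracks (femoral line) and generic pump tracks.
    renamed = {}
    for name, vals in kept.items():
        if femoral:
            name = ART_TO_FEM.get(name, name)
        if pump_drug and name.startswith("PUMP_"):
            # Assumption: a single drug per case; PUMP_RATE -> e.g. PPF20_RATE.
            name = pump_drug + name[len("PUMP"):]
        renamed[name] = vals
    return renamed


example = {
    "ART": [100.0] * 20, "ART_MBP": [80.0] * 20,
    "ECG": [0.1] * 50, "HR": [0.0] * 30,      # HR is all zeros, so it is dropped
    "PUMP_RATE": [5.0] * 12, "SHORT": [1.0] * 3,
}
mapping = {"ART": ["ART_SBP", "ART_DBP", "ART_MBP"], "ECG": ["HR"]}
result = process_tracks(example, mapping, femoral=True, pump_drug="PPF20")
```

Here the `ECG` waveform is removed because its only numeric companion was deleted by the first rule, which shows why the order of the rules matters.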
The demographic, surgical, anesthetic, preoperative, intraoperative and outcomes data of the patients were obtained from the EMR system and included in the dataset. The laboratory test results within 90 days before and after the anesthesia start time were extracted from the EMR, and all non-numeric characters were removed from the results. This information is organized in separate csv files in the dataset.
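The laboratory result cleaning described above might look like the sketch below. The exact rule used for the dataset is not given here, so this is only one plausible reading: the helpers `clean_lab_result` and `within_window` are hypothetical, and keeping digits and the decimal point while discarding everything else is an assumption.

```python
import re
from typing import Optional

WINDOW_S = 90 * 24 * 3600  # 90 days before or after the anesthesia start time


def clean_lab_result(raw: str) -> Optional[float]:
    """Strip non-numeric characters from a raw result and parse what is left."""
    # Assumption: only digits and the decimal point are treated as numeric.
    digits = re.sub(r"[^0-9.]", "", raw)
    try:
        return float(digits)
    except ValueError:
        # Nothing numeric remained (e.g. a purely textual result).
        return None


def within_window(dt_s: float) -> bool:
    """True if a test time (seconds relative to anesthesia start) is within 90 days."""
    return abs(dt_s) <= WINDOW_S


cleaned = [clean_lab_result(s) for s in ["<0.5", "12.3 H", "pending"]]
```

Note that stripping a comparison sign such as "<" turns a censored result into a plain number, which users of the laboratory values may want to keep in mind.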
Finally, de-identification of the dataset was performed before the release of the dataset.
• Instead of the actual patient number, random surgery case identifiers (caseid) were assigned to the cases (1-6,388). A de-identified individual identifier corresponding to the hospital ID (subjectid) was also added to identify reoperation cases (1-6,090).
• Since case-matched-and-renamed vital files no longer contain any patient identification information, only de-identification of the recording time was performed.
• The surgery start and end times, and the anesthesia start and end times were extracted from the EMR and integrated into the event track of the vital files.
• The starting time point of the recording was set to "0" and the other times were converted to times relative to the start point.
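The time de-identification step above amounts to subtracting the recording start from every timestamp. A minimal sketch follows, assuming timestamps are available as `datetime` objects; the function `to_relative_seconds` and the example dates are made up for illustration.

```python
from datetime import datetime
from typing import Dict


def to_relative_seconds(events: Dict[str, datetime],
                        record_start: datetime) -> Dict[str, float]:
    """Express each event time in seconds relative to the recording start."""
    return {name: (t - record_start).total_seconds()
            for name, t in events.items()}


# Made-up example: the recording starts at 08:10:05 and anesthesia at 08:15:00.
record_start = datetime(2017, 1, 1, 8, 10, 5)
events = {
    "casestart": record_start,
    "anestart": datetime(2017, 1, 1, 8, 15, 0),
    "preop_lab": datetime(2016, 12, 31, 8, 10, 5),  # one day before recording
}
relative = to_relative_seconds(events, record_start)
```

Events before the recording start come out negative, which is why preoperative laboratory tests carry negative times in the released files.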

Data Records
The dataset consists of intraoperative vital signs data (6,388 vital files in vital format), perioperative clinical information (clinical information.csv) and laboratory results (lab results.csv) of 6,388 surgery cases. All data is accessible from an open data repository (VitalDB Open Dataset in Open Science Framework) [19].
In brief, the dataset has the following characteristics:
• The dataset consists of intraoperative vital signs data and perioperative clinical information of 6,388 cases.
• Vital signs data includes up to 12 waveform and 184 numeric data tracks acquired from multiple anesthesia devices applied to patients during surgery. The total number of data tracks is 486,451 (average 87, range 16-129).
• Vital signs data have various time intervals depending on the anesthesia device, with a time resolution of 1-7 seconds for numeric data and 62.5-500 Hz for waveform data (Table 2). Each case file contains an average of 2.8 million data points.
• Data is not pre-processed because the real-world noise in the vital signs data is essential to the development of practical monitoring algorithms.
• A total of 74 perioperative clinical information parameters and 34 time-series perioperative laboratory results are provided to help interpret the relationship with the intraoperative vital signs.
Since different anesthesia equipment was used for each patient, the data tracks are configured differently for each case file. Specifically, data from the patient monitor (Solar™ 8000M, GE Healthcare, Wauwatosa, WI, USA) was taken from all patients, and analog signal (TramRac-4A, GE Healthcare, Wauwatosa, WI, USA)
The clinical information file provides patient-related perioperative data to help interpret biosignal data (Table 3). This file consists of caseid and subjectid, and 72 clinical parameters including case file information, demographic data, outcomes, preoperative laboratory data, and surgery and anesthesia related data. Among the parameters, "casestart" is the time at which the patient's case file recording started, and its value is always "0". All time-series data in the VitalDB dataset is anonymized by expressing times in seconds relative to the casestart time. Since the anesthesia start time (anestart) and anesthesia end time (aneend) are recorded at 5-minute intervals in the EMR, they may differ by several minutes from the start time (casestart) and end time (caseend) of the actual case recording.
Finally, the laboratory results file contains 928,448 time-series data points for 34 blood tests from 3 months before surgery to 3 months after surgery. Laboratory results are provided as a list of case identifier (caseid), blood test time (dt), test name (name), and value (result) for each test. Since the test time is a relative time expressed in seconds with the case start time as a reference point, preoperative tests have negative time values.
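Because preoperative tests have negative times, selecting, for example, the most recent preoperative value of a test reduces to a filter on `dt < 0`. The sketch below uses the column names given above (caseid, dt, name, result); the inline sample rows, the test names `hb` and `cr`, and the helper `last_preop_value` are made up for illustration.

```python
import csv
import io
from typing import Dict, List, Optional

# Made-up rows in the long format described in the text.
LAB_CSV = """caseid,dt,name,result
1,-86400,hb,13.1
1,-3600,hb,12.8
1,7200,hb,10.9
1,-3600,cr,0.9
2,-500,hb,14.0
"""


def last_preop_value(rows: List[Dict[str, str]], caseid: int,
                     name: str) -> Optional[float]:
    """Return the latest result of a test taken before the case start (dt < 0)."""
    preop = [r for r in rows
             if int(r["caseid"]) == caseid
             and r["name"] == name
             and float(r["dt"]) < 0]
    if not preop:
        return None
    # The largest negative dt is the test closest to the start of the case.
    latest = max(preop, key=lambda r: float(r["dt"]))
    return float(latest["result"])


rows = list(csv.DictReader(io.StringIO(LAB_CSV)))
hb_case1 = last_preop_value(rows, 1, "hb")
```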
Detailed descriptions and data availability of all vital signs tracks, clinical information, and laboratory results are provided in the open data repository (Suppl 1. VitalDB Parameters and Data Availavility.xlsx) [19].

Technical Validation
The case-matching and verification of the vital files was conducted as follows:
• During recording, the connection status of anesthesia equipment was checked frequently by real-time remote monitoring.
• After surgery, the automatically recorded case files were retrospectively matched with the operation schedule retrieved from the EMR on a weekly basis.
• Since the vital file name is automatically generated in the format of 'operating room name_recording date_time' (e.g., OR1_170101_081005.vital), it is possible to identify the corresponding patient from the operation schedule.
The integrity of the vital files was validated as follows:
• All case-matched files were separately loaded into the Vital Recorder program and visually verified by four anesthesiologists (authors YP, SBY, SMY and CWJ).
• If data tracks were found to be invalid during the data check, they were intentionally removed.
• Inhalation anesthesia-related parameters during total intravenous anesthesia were deleted.
• Waveform data tracks without corresponding numeric values were removed.
• However, our dataset from real anesthesia partly contains the following signal noises. These noises have not been removed as they are essential elements for developing practical algorithms including data preprocessing.
• Data loss of bispectral index, cerebral oximeter, electrocardiography, and plethysmography due to temporary sensor detachment.
• Abnormal values of arterial pressures during blood sampling.
• Electrocardiography and electroencephalography noises during electrocautery and electrophysiologic monitoring.
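The file name convention used for case matching can be parsed into an operating room and a recording start time, which are the two keys needed to look a case up in the operation schedule. The sketch below is hypothetical: `parse_vital_filename` is not part of any released tool, and reading the date as YYMMDD and the time as HHMMSS is an assumption based only on the single example OR1_170101_081005.vital.

```python
from datetime import datetime
from pathlib import PurePath
from typing import Tuple


def parse_vital_filename(filename: str) -> Tuple[str, datetime]:
    """Split '<room>_<date>_<time>.vital' into the room and the start time."""
    stem = PurePath(filename).stem
    # Split from the right so that a room name containing '_' stays intact.
    room, date_part, time_part = stem.rsplit("_", 2)
    # Assumption: date is YYMMDD and time is HHMMSS.
    started = datetime.strptime(date_part + time_part, "%y%m%d%H%M%S")
    return room, started


room, started = parse_vital_filename("OR1_170101_081005.vital")
```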

Usage Notes
The use of the dataset for research and development begins with downloading the data from the OSF repository (a total data volume of 103.4 GB) [19]. Research on the downloaded files can then be conducted using the Python package.

Python package (vitaldb).
The vital file is a binary file recorded with the Vital Recorder program and contains time-series records of vital signs. The specification of the vital file format is detailed in a document in the open data repository (Suppl 2. Vital File Format.pdf) [19]. A Python package "vitaldb" that helps with reading and writing vital files is freely available on the Python Package Index.
There is a function named "load_case" to load track data from a single case file. The "load_case" function is described as follows:
• Description: Load multiple track data from a single case.
• Usage: load_case(caseid, tnames, interval=1)
• Arguments
caseid: caseid to load.
After reading the vital file with the "VitalFile" object, researchers can use the "to_vital" method to save the data in the vital file format again, or use the "to_numpy" or "to_pandas" methods to get the samples of specific tracks as a numpy array or a pandas DataFrame.
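As a rough illustration of working with the output of "load_case", the sketch below computes how much of each requested track is actually present. It assumes, without confirmation from the text above, that the function returns a two-dimensional table with one row per time step and one column per requested track, with NaN marking missing samples. The helper `track_coverage` is hypothetical, the `HR` track name is illustrative, and a hand-made table stands in for the real call so that the example runs without the package or network access.

```python
import math
from typing import Dict, Iterable, List, Sequence


def track_coverage(rows: Iterable[Sequence[float]],
                   tnames: List[str]) -> Dict[str, float]:
    """Fraction of non-missing samples per track in a (time x track) table."""
    counts = dict.fromkeys(tnames, 0)
    n_rows = 0
    for row in rows:
        n_rows += 1
        for tname, value in zip(tnames, row):
            if not math.isnan(value):
                counts[tname] += 1
    if n_rows == 0:
        return {}
    return {tname: counts[tname] / n_rows for tname in tnames}


tnames = ["ART_MBP", "HR"]
# With the package installed, the table would presumably come from:
#   import vitaldb
#   rows = vitaldb.load_case(1, tnames, interval=1)
# Here a small hand-made table stands in for that result.
rows = [
    [80.0, 71.0],
    [math.nan, 70.0],
    [82.0, math.nan],
    [81.0, math.nan],
]
coverage = track_coverage(rows, tnames)
```

A check of this kind is a cheap way to exclude cases in which a track needed for the analysis is mostly missing.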
Web-based API. The use of the Web-based API and cloud data may facilitate research. The Web-based API is provided for downloading track data and track lists from the endpoint URLs. Data can be accessed by entering the address into a web browser. All data track files are provided in csv format compressed with GZip (a total data volume of 113.2 GB); they contain the time-series data tracks extracted from the original vital files. Each data track represents numeric or waveform data and consists of two columns: Time and Value (Fig. 2). The Time column contains the acquisition times of the measurements, and the Value column contains the measured values. In the numeric data track, rows with missing values have been removed, so the data collection time interval may be inconsistent. In the waveform data track, the Time column has only three values: start time (0), time interval, and end time. The times of the waveform data track can be calculated in monotonic increments using the time interval value. Unlike the numeric data track, missing values are not removed but left blank in the waveform data track.
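Reconstructing the time axis of a waveform track can be sketched as follows. The exact placement of the three Time values is not specified above, so this example assumes that the first row holds the start time (0), the second row holds the time interval, the last row holds the end time, and all other Time cells are blank. The sample data and the function `read_waveform` are made up for illustration.

```python
import csv
import io
from typing import List, Optional, Tuple

# Made-up 500 Hz waveform excerpt (interval 0.002 s) with one missing value.
WAVE_CSV = """Time,Value
0,0.10
0.002,0.12
,
,0.15
0.008,0.11
"""


def read_waveform(text: str) -> Tuple[List[float], List[Optional[float]]]:
    """Rebuild per-sample times and parse values, keeping blanks as None."""
    rows = list(csv.DictReader(io.StringIO(text)))
    # Assumption: the second row's Time cell holds the sampling interval.
    interval = float(rows[1]["Time"])
    times = [i * interval for i in range(len(rows))]
    values = [float(r["Value"]) if r["Value"] else None for r in rows]
    return times, values


times, values = read_waveform(WAVE_CSV)
```

Keeping missing samples as `None` preserves the regular sampling grid, which is what allows the times to be computed from the interval alone.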
The general sequence of research using the Web-based API is as follows.
• Download the clinical information file and select caseids that satisfy the inclusion/exclusion criteria of research and development.
• Check the VitalDB API page for the names of the biosignal parameters (tname) needed for the research topic.
• Download the track list file, check the track identifiers (tids) of the data tracks that match the caseids and tnames, and download the actual track data using the API.
• Since the data tracks are time-synchronized with the casestart time (0) as the reference point, conduct research by converting multiple track data into an array on the same time axis.
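The lookup and alignment steps of the sequence above can be sketched as follows. The download itself is omitted because the endpoint URLs are not given here. The track list rows, the tid values, the `HR` track name, and the helpers `find_tids` and `align` are all made up for illustration; only the caseid, tname, and tid concepts come from the text.

```python
from typing import Dict, List, Optional, Set, Tuple

# Made-up track list rows linking a case and a track name to a track identifier.
TRACK_LIST = [
    {"caseid": 1, "tname": "ART_MBP", "tid": "a1"},
    {"caseid": 1, "tname": "HR", "tid": "a2"},
    {"caseid": 2, "tname": "HR", "tid": "b1"},
]


def find_tids(track_list: List[dict], caseids: Set[int],
              tnames: Set[str]) -> Dict[Tuple[int, str], str]:
    """Map (caseid, tname) to tid for the selected cases and parameters."""
    return {(r["caseid"], r["tname"]): r["tid"]
            for r in track_list
            if r["caseid"] in caseids and r["tname"] in tnames}


def align(tracks: Dict[str, List[Tuple[float, float]]], interval: float,
          end: float) -> Dict[str, List[Optional[float]]]:
    """Place numeric tracks on a shared grid 0, interval, 2*interval, ...

    Each slot keeps the last sample falling in [t, t + interval);
    slots without a sample stay None.
    """
    n = int(end // interval) + 1
    grid = {name: [None] * n for name in tracks}
    for name, samples in tracks.items():
        for t, value in samples:
            i = int(t // interval)
            if 0 <= i < n:
                grid[name][i] = value
    return grid


tids = find_tids(TRACK_LIST, {1}, {"ART_MBP", "HR"})
# Made-up (Time, Value) samples, both relative to casestart = 0.
tracks = {
    "ART_MBP": [(0.0, 80.0), (2.0, 82.0), (4.0, 81.0)],
    "HR": [(1.0, 70.0), (3.0, 72.0), (5.0, 71.0)],
}
aligned = align(tracks, interval=1.0, end=5.0)
```

Because every track shares casestart as time zero, no per-device clock correction is needed in this sketch; the only choice is the grid interval and how to fill empty slots.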

Code availability
An open-source project facilitating the use of the VitalDB dataset has been launched, and a large amount of code written in C/C++, Python, Javascript, and R is currently available from Zenodo (https://doi.org/10.5281/zenodo.6321507) [20]. Python code that can serve as a reference for algorithm research is also available from Zenodo (https://doi.org/10.5281/zenodo.6321522) [21].
Examples of sample code for statistical analysis are as follows:
- General characteristics: vitaldb_tableone.ipynb
- Mortality: asa_mortality.ipynb
- Acute kidney injury: mbp_aki.ipynb
Examples of sample code for artificial intelligence algorithms are as follows:
- Drug effect estimation using Long Short-Term Memory: ppf_bis.ipynb
- Hypotension prediction using Long Short-Term Memory: hypotension.ipynb
- Mortality prediction using Gradient Boosting Machine: predict_mortality.ipynb