Transforming electronic health record polysomnographic data into the Observational Medical Outcome Partnership's Common Data Model: a pilot feasibility study

Well-defined large-volume polysomnographic (PSG) data can identify subgroups and predict outcomes of obstructive sleep apnea (OSA). However, current PSG data are scattered across numerous sleep laboratories and have different formats in the electronic health record (EHR). Hence, this study aimed to convert EHR PSG data into a standardized data format, the Observational Medical Outcomes Partnership (OMOP) common data model (CDM). We extracted the PSG data of a university hospital for the period from 2004 to 2019. We designed and implemented an extract–transform–load (ETL) process to transform PSG data into the OMOP CDM format and verified the data quality through expert evaluation. We converted the data of 11,797 sleep studies into CDM and added 632,841 measurements and 9,535 observations to the existing CDM database. Among 86 PSG parameters, 20 were mapped to CDM standard vocabulary and 66 could not be mapped; thus, new custom standard concepts were created. We validated the conversion and usefulness of PSG data through patient-level prediction analyses for the CDM data. We believe that this study represents the first CDM conversion of PSG. In the future, CDM transformation will enable network research in sleep medicine and will contribute to presenting more relevant clinical evidence.

Data source. OMOP CDM data obtained from SNUBH were used in this study. The data comprised de-identified EHR data based on OMOP CDM version 5.3.1, accumulated over a period of 16 years from the opening of SNUBH with a full EHR system in May 2003 until June 2019. The EHR data of more than 2 million patients, including patient demographics, diagnoses, chief complaints, drug exposures, test orders/results, vital signs, surgeries, family histories, and past medical histories, were converted to CDM.
This study was performed in accordance with the relevant guidelines and regulations of the SNUBH Institutional Review Board (IRB) and was approved by the SNUBH IRB. As it is an observational study and the data source was de-identified, this study was approved based on waivers of informed consent or exemptions by the SNUBH IRB (IRB No: X-2002-592-904).
Polysomnographic parameters. We considered all PSGs performed at the Sleep Center of SNUBH as target data to be converted into OMOP CDM, including full-night PSGs, split-night PSGs, PSGs for continuous positive airway pressure (CPAP) titration, and multiple sleep latency tests (MSLTs). In the case of split-night PSGs, the values of the parameters represented only the diagnostic portions in this study. No home sleep apnea tests were included because they are not popular in South Korea. The PSG parameters to be transformed into OMOP CDM included information related to sleep architecture, respiratory activity, positions during sleep, blood oxygen saturation, and limb movement.
We conducted PSGs using an Embla N 7000 (Embla, Reykjavik, Iceland) recording system equipped with standard electrodes and sensors, in the presence of a sleep technician. The full PSG recording consisted of electroencephalography, electrooculography, electrocardiography, submental and limb electromyography, chest and abdominal plethysmography, nasal pressure manometry, oronasal thermistor, pulse oximetry, and a snoring sensor. Apnea was defined as a pause in the respiratory airflow lasting at least 10 s, and hypopnea was defined as a reduction in the airflow by 50% or more lasting at least 10 s, or the accompaniment of airflow reduction by arousal or an oxygen desaturation by 4% or more 13. The PSG data were reviewed and scored by sleep experts using the Embla RemLogic PSG Software (Embla, ON, Canada). The study report from the Embla RemLogic PSG Software has the following parameter (variable) categories: patient information; sleep summary; summary graph; sleep information; arousal statistics; autonomic arousal (plethysmogram) statistics; apnea/hypopnea statistics; apnea-desaturation relation; Cheyne-Stokes breathing statistics; breath statistics; snoring statistics; flattening statistics; respiratory mechanic instability statistics; SpO2 statistics; desaturation statistics; heart rate statistics; cardiac events; bruxism; rapid eye movement sleep behavior disorder information; rhythmic movement disorder information; periodic limb movement statistics; and position statistics. From these categories, the sleep experts at our sleep center selected the PSG parameters that are commonly employed in the literature and made them available in the PSG summary report of our EHR. The selected parameters were automatically exported and imported into our EHR in a structured format.
Strategy to convert PSG data into OMOP CDM. We designed and implemented the following extract–transform–load (ETL) process to transform the PSG data into the OMOP CDM format.
Despite being reported in a structured form, the EHR PSG results considered in this study had been revised approximately 11 times. Hence, we extracted the data corresponding to each revised form and integrated them within the CDM format via standardization. The procedural information for PSG order itself had already been converted into the CDM format. Thus, in this study, we linked the extracted PSG results and the corresponding existing orders in the CDM to connect the PSG procedures with their corresponding results.
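The integration of several revisions of a structured report form can be sketched as a per-revision renaming step. The following is a minimal illustration only: the revision numbers, source field names, and canonical names are hypothetical, since the actual EHR forms are not described here.

```python
from typing import Any, Dict

# Hypothetical mapping: form revision -> {source field name: canonical name}.
# The real revisions and field names of the EHR forms are not given in the text.
FIELD_ALIASES: Dict[int, Dict[str, str]] = {
    1: {"ahi_total": "ahi", "tst_min": "total_sleep_time_min"},
    2: {"AHI": "ahi", "TST": "total_sleep_time_min"},
}


def harmonize(form_version: int, record: Dict[str, Any]) -> Dict[str, Any]:
    """Rename revision-specific fields to canonical names and drop unknown fields."""
    aliases = FIELD_ALIASES[form_version]
    return {aliases[key]: value for key, value in record.items() if key in aliases}


old_form = harmonize(1, {"ahi_total": 12.4, "tst_min": 380, "comment": "n/a"})
new_form = harmonize(2, {"AHI": 31.0, "TST": 402})
print(old_form)
print(new_form)
```

Keeping one alias table per revision means a further revision of the form only requires a new table entry, not a change to the transformation logic.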
The PSG parameters were manually mapped by sleep domain experts (J.-W. Kim  The extracted PSG data were transformed and loaded into measurement and observation tables with standard concepts. Observation data were linked to the corresponding PSG procedures via the observation_event_id and obs_event_field_concept_id fields. To link measurements with the corresponding procedures, we used the new modifier_of_event_id and modifier_of_field_concept_id fields that have been proposed by the OHDSI Oncology Working Group 14. The procedure_occurrence, measurement, and observation tables were linked to the person and visit_occurrence tables based on their foreign keys. The CDM tables associated with the PSG data are depicted in Fig. 1. After completing the ETL, we assessed the PSG data quality via exploratory data analysis and developed data quality check rules for data cleaning (please see Supplementary Table S3 for the detailed cleaning rules and the number of records filtered by the rules). Finally, the cleaned PSG data integrated into the existing CDM were utilized for a feasibility test.
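The transform step for the measurement table can be sketched as follows. This is a simplified illustration under stated assumptions: the concept ids and the field-concept placeholder are hypothetical, only a subset of the measurement columns is shown, and the function name to_measurements is ours rather than part of the actual ETL.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional

# Hypothetical concept ids for two PSG parameters, for illustration only.
PARAMETER_CONCEPTS: Dict[str, int] = {
    "ahi": 2_000_000_001,
    "total_sleep_time_min": 2_000_000_002,
}
# Placeholder for the concept that identifies
# procedure_occurrence.procedure_occurrence_id as the linked field.
PROCEDURE_ID_FIELD_CONCEPT = 0


@dataclass(frozen=True)
class MeasurementRow:
    measurement_id: int
    person_id: int
    visit_occurrence_id: int
    measurement_concept_id: int
    measurement_date: date
    value_as_number: float
    modifier_of_event_id: int          # links back to the PSG procedure
    modifier_of_field_concept_id: int  # says which table/field the link targets


def to_measurements(
    person_id: int,
    visit_occurrence_id: int,
    procedure_occurrence_id: int,
    study_date: date,
    values: Dict[str, Optional[float]],
    first_measurement_id: int,
) -> List[MeasurementRow]:
    """Turn one PSG report into measurement rows linked to its procedure."""
    rows: List[MeasurementRow] = []
    for name, value in values.items():
        concept_id = PARAMETER_CONCEPTS.get(name)
        if concept_id is None or value is None:
            continue  # skip unmapped parameters and missing values
        rows.append(
            MeasurementRow(
                measurement_id=first_measurement_id + len(rows),
                person_id=person_id,
                visit_occurrence_id=visit_occurrence_id,
                measurement_concept_id=concept_id,
                measurement_date=study_date,
                value_as_number=float(value),
                modifier_of_event_id=procedure_occurrence_id,
                modifier_of_field_concept_id=PROCEDURE_ID_FIELD_CONCEPT,
            )
        )
    return rows


rows = to_measurements(
    person_id=1,
    visit_occurrence_id=10,
    procedure_occurrence_id=555,
    study_date=date(2019, 3, 1),
    values={"ahi": 12.4, "total_sleep_time_min": 380.0, "unmapped_field": 1.0},
    first_measurement_id=1000,
)
for row in rows:
    print(row)
```

Every row carries the same modifier_of_event_id, so all results of one sleep study can be retrieved from the procedure record with a single join.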
Pilot feasibility test using open-source OHDSI analytic tools. We conducted a pilot feasibility test using only full-night PSG tests of patients 18 years or older. The feasibility test was designed to develop and validate a model to predict cardio-neuro-metabolic disease within a target population during a time-at-risk window of 1 to 1095 days after the target cohort start date, defined as the date of the PSG test. A cardio-neuro-metabolic disease was defined as any condition involving International Classification of Diseases, Tenth Revision (ICD-10) codes corresponding to the comorbidities listed in Supplementary Table S4. We included any occurrence of the defined ICD-10 codes without constraints on the frequency.
In the population setting for the patient-level prediction, minimum lookback periods of 30, 90, and 180 days were tested for the prior observation periods of patients in the target population. Subjects with a time-at-risk of less than 1094 days were removed, as were patients who had experienced the outcome before the cohort start date.
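The population settings above can be expressed as a simple eligibility check. The sketch below is not the OHDSI implementation; the Subject fields and the is_eligible function are hypothetical names, and treating an outcome on the cohort start date itself as a prior outcome is our assumption.

```python
from datetime import date, timedelta
from typing import NamedTuple, Optional


class Subject(NamedTuple):
    person_id: int
    observation_start: date
    observation_end: date
    cohort_start: date             # date of the PSG test
    first_outcome: Optional[date]  # earliest outcome date, if any


def is_eligible(
    s: Subject,
    min_lookback_days: int,
    risk_start: int = 1,
    risk_end: int = 1095,
    min_tar_days: int = 1094,
) -> bool:
    """Apply the lookback, prior-outcome, and time-at-risk criteria."""
    # Criterion 1: enough observation time before the PSG test.
    if (s.cohort_start - s.observation_start).days < min_lookback_days:
        return False
    # Criterion 2: no outcome on or before the cohort start (assumption).
    if s.first_outcome is not None and s.first_outcome <= s.cohort_start:
        return False
    # Criterion 3: the observed part of the risk window is long enough.
    tar_start = s.cohort_start + timedelta(days=risk_start)
    tar_end = min(s.observation_end, s.cohort_start + timedelta(days=risk_end))
    return (tar_end - tar_start).days >= min_tar_days


psg_day = date(2012, 1, 1)
full = Subject(1, date(2011, 1, 1), date(2016, 1, 1), psg_day, None)
short_lookback = Subject(2, date(2011, 11, 2), date(2016, 1, 1), psg_day, None)
prior_outcome = Subject(3, date(2011, 1, 1), date(2016, 1, 1), psg_day, date(2011, 6, 1))
short_followup = Subject(4, date(2011, 1, 1), psg_day + timedelta(days=500), psg_day, None)

for subject in (full, short_lookback, prior_outcome, short_followup):
    print(subject.person_id, [is_eligible(subject, d) for d in (30, 90, 180)])
```

With a risk window from day 1 to day 1095, a fully observed window spans 1094 days, which is why the minimum time-at-risk is one day shorter than the window end.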
Among the preexisting CDM data, we utilized multiple covariates, such as gender, 5-year age group, Anatomical Therapeutic Chemical (ATC) drug group, SNOMED CT condition group, procedure, measurement value, observation, visit concept count, the CHA2DS2-VASc (congestive heart failure, arterial hypertension, age ≥ 75 years, diabetes mellitus, stroke/transient ischemic attack, vascular disease, age 65-74 years, sex category) score, diabetes complications severity index (DCSI), and the Charlson comorbidity score. Two different covariate settings were tested to determine which PSG parameters would be selected during the cardio-neuro-metabolic disease prediction. One setting (PSG-only covariates) used only gender, age group, and PSG parameters, and the other (all covariates) used all CDM covariates described above together with the PSG parameters. The short-, medium-, and long-term observation windows of the covariates were set to 7, 30, and 180 days before the cohort start date, respectively.
Three different machine learning models, namely Lasso Logistic Regression (Lasso), Gradient Boosting Machine (GBM), and Random Forest (RF), were developed using 25% of the total data for training and 75% for testing. To evaluate the models, model discrimination was assessed using the area under the receiver operating characteristic curve (AUC).
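For readers unfamiliar with the evaluation, the sketch below shows a random train/test split and a pairwise AUC computation using only the Python standard library. It is an illustration, not the OHDSI tooling used in the study; the function names are ours, the split fraction is left as a parameter to be set to match the split described above, and the labels and scores are toy values.

```python
import random
from typing import List, Sequence, Tuple


def split_indices(n: int, test_fraction: float, seed: int) -> Tuple[List[int], List[int]]:
    """Shuffle row indices reproducibly and return (train, test) index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


train_idx, test_idx = split_indices(8, test_fraction=0.5, seed=0)
print(sorted(train_idx + test_idx))

toy_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

In the toy example, three of the four positive-negative pairs are ranked correctly, which gives an AUC of 0.75. The pairwise form is quadratic in the number of subjects and is meant for clarity rather than speed.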

Results
Conversion results of PSG parameters into OMOP CDM concepts. We converted data from a total of 11,392 tests corresponding to 11,797 sleep studies into the OMOP CDM format. These included 7,191 full-night PSGs, 2,725 split-night PSGs, 1,474 CPAP titration PSGs, and 407 MSLTs. The PSG parameters stored in the EHR that were converted into CDM are presented in Table 1. These included 7 pertaining to body measurements, 7 to sleep summaries, 6 to sleep stages, 16 to respiratory events, 4 to apnea or hypopnea duration, 8 to sleep position, 5 to arousals, 2 to limb movement, 5 to snoring, 8 to oxygen statistics, 1 to continuous positive airway pressure, 2 to questionnaires, 11 to MSLT, 1 to apnea level manometry test, and 3 to Friedman staging. A total of 85 PSG parameter concepts were converted to the measurement domain and one to the observation domain (Waist/hip ratio). Moreover, 20 (23.3%) PSG codes were mapped to the standard OHDSI vocabulary including LOINC and SNOMED CT, but the remaining 66 (76.7%) could not be mapped and were added as new custom standard concepts.
Characteristics of PSG data. The overall characteristics of the total sleep studies that were converted into OMOP CDM are presented in Table 2. Out of an aggregate of 11,392 sleep tests, 8,363 (73.4%) tests were conducted on male patients and 3,029 (26.6%) on female patients. There was an average of 1.2 tests per person. Tests of patients aged 40-49 years, 50-59 years, and 60-69 years accounted for approximately 65% of the total number of tests. The number of sleep studies conducted each year increased progressively. The prevalence of AHI < 5, mild OSA (5 ≤ AHI < 15), moderate OSA (15 ≤ AHI < 30) and severe OSA (30 ≤ AHI) was 28.5%, 23.8%, 19.3% and 28.4%, respectively. The basic statistics of the associated PSG parameters are provided in Supplementary Table S5.
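The AHI severity bands used above translate directly into a small classification function. The sketch below uses the stated cut-offs; the function name and the sample AHI values are illustrative and are not study data.

```python
from collections import Counter
from typing import Dict, Iterable


def osa_severity(ahi: float) -> str:
    """Classify an apnea-hypopnea index (events/h) into the bands used in the text."""
    if ahi < 0:
        raise ValueError("AHI cannot be negative")
    if ahi < 5:
        return "AHI < 5"
    if ahi < 15:
        return "mild"
    if ahi < 30:
        return "moderate"
    return "severe"


def prevalence(ahis: Iterable[float]) -> Dict[str, float]:
    """Share of tests per severity band, as percentages."""
    counts = Counter(osa_severity(a) for a in ahis)
    total = sum(counts.values())
    return {band: 100.0 * n / total for band, n in counts.items()}


sample_ahis = [0.0, 4.9, 5.0, 14.9, 15.0, 29.9, 30.0, 80.0]  # toy values
bands = [osa_severity(a) for a in sample_ahis]
print(bands)
print(prevalence(sample_ahis))
```

Each lower bound is inclusive and each upper bound exclusive, so a test with an AHI of exactly 15 is counted as moderate, matching the inequalities given above.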
Performance of the prediction models. For the best-performing setting of each prediction model, the number of people eligible for inclusion in the target population, the outcome count, and the model performance are presented in Table 3. All three models (RF, GBM, and Lasso) performed better when all parameters, such as condition, drug, measurement, and comorbidity score, were utilized as CDM data along with PSG, rather than only the PSG parameters.
The top 20 covariates selected from the RF are presented in Table 4. Among them, 11 were PSG parameters, for example, AHI during right lateral (/h), central apnea index (/h), waking oxygen saturation (%), and snoring time (min). The top 20 covariates selected from the other models are included in Supplementary Table S6.

Discussion
To the best of our knowledge, this study represents the first attempt to convert EHR PSG data into OHDSI OMOP CDM, a standard format for health and medical data. Through this study, we successfully converted more than 11,000 PSGs stored in a tertiary hospital EHR into the OMOP CDM version 5.3.1 format. However, we were able
The most significant advantage of the standardization of EHR data into the CDM format is the speed and efficiency of large-scale analysis afforded to researchers and clinicians using the open-source analysis tools provided by OHDSI 10,12. Furthermore, because OMOP CDM has not been applied to PSG parameters to date, CDM studies using PSG and MSLT test results, which are the most important tests in sleep medicine, are yet to be conducted. In this context, conversion of PSG results into the CDM format also enables utilization of OHDSI's open-source analytical solutions in clinical studies involving PSG results. In addition, the OMOP CDM format has already been used to standardize a comprehensive collection of EHR data, including diagnostic information, specimen test results, imaging test information, procedure and intervention information, drug exposures, past medical histories, and family histories. Therefore, the standardization procedure attempted in this study enables researchers to conduct robust and scalable analyses involving PSG results in conjunction with large-scale EHR data that have already been converted to CDM. Collaborative research across a growing number of sites participating in the standardized CDM network is expected to lead to higher performance in population-level estimation and patient-level prediction models that leverage sleep study parameters.
In this study, the performance of the pilot feasibility test in terms of patient-level prediction for cardio-neuro-metabolic disease improved when the entire EHR data were used along with PSG, rather than the PSG data alone. This suggests the feasibility of utilizing all EHR data in the OMOP CDM format via CDM conversion of PSG data.
OSA is a broad-spectrum disease with several different subgroups or phenotypes, and each OSA phenotype is likely to be manifested with different levels of severity, both clinically and objectively 16. Previous one-size-fits-all approaches based on the apnea-hypopnea index did not sufficiently consider these diverse phenotypic subtypes of OSA, because the apnea-hypopnea index is an imperfect diagnostic metric with respect to OSA-related symptoms and outcomes 17. Several studies have demonstrated that each OSA phenotype exhibits different characteristics and varying risks of disease outcomes 16,18. The most important data included in these studies were various metrics of PSG, including all the PSG results, which enabled the classification of OSA into various phenotypes via the phenotyping technique. One study that attempted a structured, data-driven approach based on multiple PSG features of approximately 2,000 OSA patients was able to identify seven subgroups (phenotypes). The aforementioned study also revealed that a unique phenotype that may have been missed during conventional OSA severity classification based on a single metric, the apnea-hypopnea index, could account for the risk of cardiovascular outcome more effectively 19. In our previous study, we also identified four clusters based on various PSG features; disease outcomes differed significantly among the clusters, and such a difference could not be found in the standard classification of OSA based only on AHI severity 20. Moreover, these characteristic phenotypes may exhibit different patterns depending on race, country, or individual. Therefore, to improve the ability to predict adverse OSA outcomes for a population or an individual, simply having a large number of PSGs is not sufficient; it is necessary to acquire PSGs across various data sources.
Therefore, it is advantageous to use standardized data such as OMOP CDM to increase the reproducibility and statistical significance of the analyses. The conversion of data into the OMOP CDM format enables ATLAS, OHDSI's open-source analytic solution, to generate queries that can set the aforementioned OSA phenotypes as target cohorts and queries that can set OSA complications to be predicted as the outcome cohort. This enables verification of the reproducibility of outcome predictions of OSA phenotyping through analysis of the dataset including PSG with the same queries in multiple sleep centers where PSG-CDM standardization has been completed. In addition to the analysis of large-scale PSG data, the clinical relevance of the OSA phenotypes across various populations by region and race can also be verified.
With the increase in CDM conversion of EHR data across medical institutions, research based on CDM-format datasets is expected to be pursued in various fields. However, unlike the CDM conversion of data such as clinical diagnosis results, laboratory sample test results, and drug exposure data, the CDM conversion of medical data based on patient-generated signals, including PSG, is still insufficient. Therefore, to date, CDM-based research has been actively conducted in fields where conversion to the pre-existing standard vocabulary is feasible. Domains where CDM research is most active include pharmacovigilance 21–23 and pharmacoepidemiology 24. For example, a study assessing anti-seizure drug-related adverse reactions in 1344 target epilepsy cohorts determined that the detection rate of the adverse drug reaction based on CDM-format data was comparable to previously published results obtained using traditional data analysis techniques 21. In addition, it is possible to implement various research designs by constructing a target cohort corresponding to a study entry population and an outcome cohort corresponding to a disease outcome population 25,26. Examples include a prognostic model validation study predicting hemorrhagic transformation of acute ischemic stroke within a CDM dataset of more than 600,000 patients via the OHDSI international network 25, and a survival analysis study using 115 variables in 346 patients diagnosed with intrahepatic cholangiocarcinoma 26.
Despite the significant implications, the present study has certain limitations. First, the rate of correspondence between OHDSI's standard OMOP CDM concepts and PSG parameters was as low as approximately 20%. This can be attributed to the fact that the pre-existing OMOP CDM standard vocabulary does not reflect all of the approximately 80 PSG variables considered in this study. The custom standard vocabulary developed to address this limitation in this study is expected to contribute to future studies that utilize PSG parameters in CDM-based EHR studies. When creating the custom concepts, we made it easy to find all PSG parameters by defining the relationship to the PSG order. For concepts that may have varying definitions, the definition of the concept is provided as metadata. For concepts in which multiple criteria can exist (e.g., %Time of saturation < 60%, %Time of saturation < 70%), a separate concept with its own concept_id was created for each criterion. Because the MEASUREMENT table does not have a modifier attribute, creating individual concepts is the better practice, and it clarifies the meaning of the new concepts. As the basic PSG parameters of the PSG recording systems of the various vendors are similar, we think other institutions will also be able to apply the new concepts proposed in this study. In addition, we look forward to adding the new concepts to OHDSI's standard vocabulary. Second, in South Korea, insurance coverage for CPAP began in July 2018; before then, CPAP had been recorded in a different form of EHR rather than as an order. Thus, in this study, only CPAP orders after July 2018 were converted to CDM and can be used as predictors for the pilot prediction models. Consequently, information on orders for CPAP, which may be an important variable in predicting cardio-neuro-metabolic disease, may be incomplete.
However, as the purpose of this study was only to demonstrate the pilot feasibility of the prediction model using CDM including PSG data, predictors should be considered more elaborately when developing a prediction model in the future. Third, different sleep centers represent PSG databases in EHRs in different ways. Many centers store PSG results in EHR as an image file, or simply record OSA severity in a report format. Therefore, significant implementation effort and time are required to extract, transform, and load the PSG results into the CDM format. Furthermore, different levels of digitization of PSG data in different hospitals may lead to different levels of CDM conversion of PSG parameters. However, with the increase in CDM studies including PSG parameters, the electronic representation of PSG data in the EHR system is expected to be facilitated across hospitals. Finally, conversion of data into the CDM format is time-consuming, requiring a substantial amount of resources, in addition to the fundamental requirement of collecting native source data. The need to code subsets of data manually may limit conversion efforts. However, once the native data are converted to the CDM format, EHR systems in the network will be able to use the same queries to identify cohorts. Thus, conversion to CDM is expected to minimize the effort required to develop cohorts and analyze results across multiple sites.
The harmonization across different sites requires collaborative efforts from multidisciplinary experts, including clinical domain experts, terminology experts, and engineers from various sites. When other sites try to map their own PSG data, efforts should be made to use and propose the same vocabulary and the same concept as much as possible by using the mapping result proposed in this study or by participating in the OHDSI community. As the standard terminology for PSG data has not yet been established internationally, if a specific ontology for sleep study can be proposed as OHDSI vocabulary by reviewing previous efforts, such as the Sleep Domain Ontology and the National Sleep Research Resource, it is expected to be helpful in the conversion and expansion of CDM by other sites.
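The practice of giving each saturation threshold its own custom concept, described in the limitations above, can be sketched as follows. This is an illustration only: the vocabulary_id, concept_class_id, and concept_code values are hypothetical, and the starting id merely follows the OMOP convention of reserving concept ids above 2,000,000,000 for local concepts.

```python
from typing import Dict, List, Sequence, Union

ConceptRow = Dict[str, Union[int, str]]

# OMOP convention: locally defined concepts use ids above 2,000,000,000.
CUSTOM_ID_START = 2_000_000_001


def threshold_concepts(
    thresholds: Sequence[int],
    first_id: int = CUSTOM_ID_START,
    vocabulary_id: str = "PSG_CUSTOM",  # hypothetical vocabulary name
) -> List[ConceptRow]:
    """Create one concept row per saturation threshold instead of one concept plus a modifier."""
    rows: List[ConceptRow] = []
    for offset, threshold in enumerate(thresholds):
        rows.append(
            {
                "concept_id": first_id + offset,
                "concept_name": f"%Time of saturation < {threshold}%",
                "domain_id": "Measurement",
                "vocabulary_id": vocabulary_id,
                "concept_class_id": "PSG Parameter",        # hypothetical class
                "standard_concept": "S",
                "concept_code": f"PSG_SAT_LT_{threshold}",  # hypothetical code
            }
        )
    return rows


concepts = threshold_concepts([60, 70])
for concept in concepts:
    print(concept["concept_id"], concept["concept_name"])
```

Because each threshold has its own concept_id, a query for one threshold never needs to inspect a second column to disambiguate the measurement, which is the motivation given above for avoiding a modifier.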

Conclusions
The Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) is a standard data format and has been applied to various EHR databases. However, its application to PSG data has not been attempted to date. To the best of our knowledge, this study represents the first attempt to transform PSG data into the OMOP CDM format. Well-defined large-volume OMOP CDM databases of PSG data can potentially enable the identification of clinically relevant OSA phenotypes, estimation of disease outcomes at the population level, and prediction of outcomes at the patient level. We expect the CDM mapping and CDM custom vocabulary of the PSG proposed in this study to contribute to the CDM conversion of PSG databases and future studies leveraging such databases.

Data availability
CDM data are designed to support a distributed research network. Thus, access to the data is restricted to internal private networks, and the data are not publicly available.