Abstract
Transparent and FAIR disclosure of meta-information about healthcare data and infrastructure is essential but has not been well publicized. In this paper, we transparently describe the process of standardizing national claims data to a common data model and developing a national data infrastructure. We established an Observational Medical Outcomes Partnership (OMOP) common data model database for the national claims data of the Health Insurance Review and Assessment Service of South Korea. To introduce a data openness policy, we built a distributed data analysis environment and released metadata based on the FAIR principle. Data from a total of 10,098,730,241 claims and 56,579,726 patients were converted to the OMOP common data model. We also built an analytics environment for distributed research and made the metadata publicly available. Disclosure of this infrastructure to researchers will help to eliminate information inequality and contribute to the generation of high-quality medical evidence.
Introduction
Numerous studies using routinely collected large healthcare data have provided invaluable evidence representing routine clinical practice1,2. Administrative data representing the nationwide population have been used for secondary analysis in healthcare research for various purposes, including continuous monitoring of disease and medical expenditure, comparative effectiveness of medical interventions, and even machine learning3,4,5,6. The Korean National Health Insurance system is a single public insurance system for all citizens, and all medical institutions are subject to mandatory designation under this system. The Health Insurance Review and Assessment Service (HIRA) establishes health insurance reimbursement criteria and reviews all medical claims for reimbursement. Therefore, the HIRA has accumulated a vast amount of claims data at the national level, which can be used as a secondary data source for high-quality real-world evidence7. For example, statistics from the HIRA database are used in OECD statistics as representative statistics for Korea.
Administrative data, despite being a commonly used source for research, have drawn significant criticism, predominantly due to concerns over the validity of the coded information. For instance, coding practices such as “upcoding”, in which providers code for a more severe illness than the patient actually has in order to receive higher reimbursement, can lead to inaccuracies8,9. While the debate on the validity of coded information continues, less attention is being directed towards the stewardship of this extensive healthcare data. Chief among the stewardship issues are (1) non-scalability and non-interoperability, (2) neglected reproducibility, and (3) protection of the privacy of the national population. These areas might have even more profound implications for the utility and reliability of large healthcare datasets.
A distributed research system based on a common data model has emerged as a promising alternative to address the concerns surrounding the use of large healthcare datasets10. The Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM) is a standardized data model maintained by Observational Health Data Sciences and Informatics (OHDSI), which is a global, multi-stakeholder, interdisciplinary community. The OMOP-CDM was designed to enable the systematic analysis of large observational datasets from multiple data sources by providing a common structure and vocabulary for observational data. In response to the urgent requirement for coronavirus disease-2019 (COVID-19) research, the HIRA was the first institution in the world to standardize the data of patients with COVID-19 into OMOP-CDM, providing access to international researchers without compromising patient privacy11. This approach inspired other database owners, enabling researchers to conduct multiple high-impact studies using the multi-national database in a timely manner12. However, thus far, the HIRA database has been standardized to OMOP-CDM only for individual studies, and the standardized data have not been maintained13.
We aimed to standardize HIRA data into OMOP-CDM, build infrastructure providing scalable accessibility and a flexible data analysis environment with privacy-by-design protection, and verify whether the infrastructure guarantees the reproducibility of research. The overall aim was to enhance the FAIRness of the national healthcare database, that is, the extent to which it is Findable, Accessible, Interoperable, and Reusable (FAIR)14. Specifically, this study describes the process of converting national claims data into research data to establish a research infrastructure, mapping local codes to a standard vocabulary system, verifying the data through type 2 diabetes mellitus (T2DM) cases, and replicating a previously published COVID-19 prediction study. In addition, external disclosure of the infrastructure in accordance with the FAIR principle was reviewed.
Results
Basic statistics of HIRA CDM
We extracted, transformed, and loaded (ETL) the HIRA database into the OMOP-CDM version 5.3.1. All tables specified by the OMOP-CDM conversion specifications were created. The numbers of converted claim specifications and of patients included were 10,098,730,241 and 56,579,726, respectively (Table 1). Among the converted data, the numbers of males and females were 28,439,311 (50.3%) and 28,140,325 (49.7%), respectively. All records of the source database were converted into CDM format without errors in classification by year, type of visit, and type of claiming medical institution (Table S1 in the Supplements). Among the CDM tables, the death table contained information on 3,804,948 people who had died over 11 years, accounting for 6.7% of the total population (Table 1). The condition, drug, and procedure tables, which hold the main clinical information of the OMOP-CDM, included more than 99.0% of patients, and the device and measurement tables included more than 90.0% of patients (Table 1).
The results of vocabulary mapping from the Electronic Data Interchange (EDI) codes of Korea to the OMOP standardized vocabulary are shown in Table 2. Table 2 lists the number of EDI codes according to the OMOP domain, ratio of codes mapped to standard terminologies, and number of mapped records per source record. Regarding the ratio of mapped codes to source codes, condition (99.1%), drug (100.0%), observation (99.97%), and procedure (84.5%) were high, whereas device (10.8%) and measurement (31.0%) were relatively low. Nevertheless, the ratio of mapped records (mapped records per source record) was over 85.0% in all domains, including device (87.6%) and measurement (91.5%).
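The difference between the code-level and record-level mapping ratios can be illustrated with a minimal Python sketch. The code identifiers, counts, and the function name `mapping_ratios` are hypothetical and are not taken from the HIRA data; the sketch only shows how a domain in which few distinct codes are mapped can still have a high share of mapped records when the mapped codes are the frequently used ones.

```python
from collections import Counter


def mapping_ratios(record_codes, mapped_codes):
    """Return (code-level ratio, record-level ratio) of mapped source codes.

    record_codes: iterable with one source code per source record.
    mapped_codes: set of source codes that have a standard concept.
    """
    counts = Counter(record_codes)
    if not counts:
        raise ValueError("no source records supplied")
    n_codes = len(counts)
    n_records = sum(counts.values())
    n_mapped_codes = sum(1 for code in counts if code in mapped_codes)
    n_mapped_records = sum(n for code, n in counts.items() if code in mapped_codes)
    return n_mapped_codes / n_codes, n_mapped_records / n_records


# Hypothetical device domain: one frequently used code is mapped,
# nine rarely used codes are not.
records = ["DEV001"] * 91 + [f"DEV{i:03d}" for i in range(2, 11)]
code_ratio, record_ratio = mapping_ratios(records, mapped_codes={"DEV001"})
print(f"codes mapped: {code_ratio:.1%}, records mapped: {record_ratio:.1%}")
```

In this toy example only one of ten codes is mapped, yet most records are covered, which mirrors the pattern reported for the device and measurement domains.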
Data quality and reliability
We compared the amount of original (source) and converted data for the condition/drug/procedure/device codes. The numbers of records in the source and converted data, and their differences, for the top 10 codes in each domain are presented in Tables S2–S6 in the Supplements. The differences were due to (1) the multiple mapping of the source code, (2) the assignment to a different domain table from the source table, and (3) the absence of mapping to OMOP standardized vocabulary.
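A comparison of this kind can be sketched in Python as follows. This is an illustration only, not the procedure used to produce Tables S2–S6: the function name `count_differences` and the example codes are hypothetical, and the sketch assumes that the converted records can be traced back to their source code (for example, through the `*_source_value` columns of the OMOP-CDM).

```python
from collections import Counter


def count_differences(source_codes, converted_codes, top_n=10):
    """Compare record counts per source code for the most frequent source codes.

    Returns a list of (code, n_source, n_converted, difference) tuples,
    ordered by descending frequency in the source data.
    """
    source = Counter(source_codes)
    converted = Counter(converted_codes)
    rows = []
    for code, n_source in source.most_common(top_n):
        n_converted = converted.get(code, 0)
        rows.append((code, n_source, n_converted, n_converted - n_source))
    return rows


# Hypothetical domain table: code "B" maps to two standard concepts (records
# are duplicated), and code "C" is absent from this domain table after
# conversion (moved to another domain table or left unmapped).
source_records = ["A"] * 5 + ["B"] * 3 + ["C"] * 2
converted_records = ["A"] * 5 + ["B"] * 6
for row in count_differences(source_records, converted_records):
    print(row)
```

A positive difference corresponds to reason (1) above, and a negative difference to reasons (2) or (3).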
The number of patients with T2DM was extracted according to the same definition from the source and converted data, and the numbers of patients were 3,031,462 (21.3%) and 3,030,183 (21.3%), respectively (Fig. 1). The incidence of T2DM per 100,000 patients ranged from 550.1 to 650.9 and 549.9 to 649.7 in the source and converted HIRA CDM database, respectively (Table 3). In 2012, the difference in the number of patients with T2DM between the source and converted data was 590, and the difference in the incidence rate was the largest at 1.2 per 100,000 patients. The difference in the number of patients was 14, and the difference in the incidence rate was 0.0 in 2020. In addition, there were no differences in T2DM incidence by year-gender and year-age groups (Tables S7, S8 in the Supplements).
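The incidence comparison can be expressed as a short Python sketch. The case counts and population below are made-up round numbers, not HIRA figures, and the use of a single mid-year population as the denominator for both databases is an assumption made for illustration.

```python
def incidence_per_100k(new_cases, population):
    """Incidence rate per 100,000 persons for one calendar year."""
    if population <= 0:
        raise ValueError("population must be positive")
    return new_cases / population * 100_000


# Hypothetical year: the converted data contain 50 fewer incident cases
# than the source data for the same (assumed) mid-year population.
population = 5_000_000
source_rate = incidence_per_100k(30_000, population)
cdm_rate = incidence_per_100k(29_950, population)
difference = source_rate - cdm_rate
print(round(source_rate, 1), round(cdm_rate, 1), round(difference, 1))
```

Because the denominator is large, a difference of a few hundred patients changes the rate by only about one case per 100,000, which is consistent with the small differences reported in Table 3.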
In the HIRA CDM database, by 2020, 32,633 outpatients were diagnosed with COVID-19. We were able to validate a previously published COVID-19 prediction model (COVER model), which was developed based on the OMOP-CDM15. The performance of the COVER models in predicting hospitalization for pneumonia, admission to the ICU or death from pneumonia, and all-cause death was 0.816, 0.891, and 0.892, respectively (Table S9 in the Supplements). We also validated the models using a newly updated sample database. In the HIRA 20% sample database, which covers the period until April 2022, 1,530,350 outpatients were diagnosed with COVID-19, and the performance of the models was 0.748 (hospitalization for pneumonia), 0.879 (admission to the ICU with pneumonia or death due to pneumonia), and 0.891 (all-cause mortality). Through version control of the database, we confirmed that predictive models developed earlier could be easily applied to databases of different versions with different periods.
Data analytic environment and open policy
We built a Docker-based analytic environment so that open-source tools can be used even in the HIRA intranet environment (which has no Internet access) and so that statistical tools and frequently updated packages can be installed (Fig. 2)16. For data security, the data officer of the HIRA is responsible for managing access sessions and logs from the database and analytic servers.
Under the open policy of the HIRA CDM, researchers can apply for research requests through the healthcare distributed research network (HDRN) platform operated by the Korean government (https://hcdl.mohw.go.kr/). The specific application method is as follows: (1) The researcher must request a review of their research hypotheses and plan for ethical feasibility through an institutional or public review board. (2) The researcher must submit an approval letter from the review board and the research protocol to the HDRN platform. (3) The HIRA reviews the appropriateness of research/data provision and decides whether to provide it. (4) The researcher writes an analysis query, code, or package based on the open sample data and environment and sends it to the HIRA. (5) The HIRA reviews queries and expected results and derives results by running queries/codes/packages. (6) After the results are reviewed and checked for infringement of protected health information, the results are exported to the researcher.
We followed all FAIR principles, and the results of applying each principle to the HIRA CDM are shown in Table S10 in the Supplement. Metadata, disclosure policy, and sample data of HIRA CDM have been made available to the public online (https://opendata.hira.or.kr/op/opb/selectNotice.do?sno=13906&ntfcIteDivCd=&searchCnd=&searchWrd=cdm&pageIndex=1).
Discussion
The HIRA CDM database is a useful national resource that encompasses abundant medical information of virtually all citizens and institutions in the Republic of Korea. An open research analysis system that complies with the FAIR principle was established to transparently utilize it for biomedical and healthcare research. While increasing researchers’ access to data resources, a distributed research system with privacy by design was established, such that national claims data across the country can be safely disclosed to external researchers without access to patient-level data. The established database and environment demonstrated the reproducibility and scalability of the research through comparative verification with source data and previously developed predictive models.
The Data Quality Dashboard17, the official quality assessment tool of the OMOP CDM, was not run because of limited hardware resources: given the size of the HIRA CDM, it was expected to take several months to complete. In the comparison of T2DM incidence performed for quality assessment, there were differences between the original data and the CDM; these were attributed to changes in disease coverage during code conversion (source code to OMOP standardized vocabulary). This is an issue of mapping between different vocabulary systems rather than a data quality issue; nevertheless, researchers should be aware of such cases.
Facilitating transparent and reproducible research
The retraction of COVID-19 research from high-profile journals underscores the necessity for open and reproducible science in healthcare18, which is particularly important for promoting confidence in science during the global health crisis. The current scientific landscape relies heavily on researchers’ reliability and trustworthiness. However, significantly high rates of data fabrication, falsification, and false-positive findings occur in healthcare research using big data for secondary use, further highlighting the necessity for more transparent research practices19,20. The usual policy of ‘sharing data upon request’ may not be optimal as it may limit the accessibility and usability of the data21. The common challenge against open science in healthcare is that patient-level data are inherently highly sensitive, making it difficult to share such data while preserving privacy. This challenge must be addressed by developing innovative approaches and technologies that can ensure safe and secure sharing of patient data while promoting open and reproducible science. Distributed research based on standardized data and vocabulary may guarantee reproducibility of research while preventing p-hacking.
Scalable accessibility with privacy-by-design protection
Distributed research systems aimed toward data standardization guarantee scalable accessibility without privacy concerns because they enable privacy-by-design protection. Researchers cannot access patient-level data, and only anonymized data can be exported from the system to researchers. Although the analytic environment is internal and has no Internet access, we built it using several open-source tools (most of which are Internet-dependent). In this approach, the analysis is performed by running the analytic code or programs inside the environment instead of providing data to external researchers. Analytic queries, codes, and even a Docker-based analytic environment can be applied, enabling researchers to conduct reproducible analyses in the same local environment.
Interoperability across countries
HIRA data standardized to a common data model such as the OMOP-CDM can be used in various ways. Because claims data include life-cycle information on the entire population, they can be extended to various fields, such as the calculation of national statistics, research on clinical effectiveness, health care policy, and AI algorithms. The Republic of Korea is in the process of introducing the OMOP-CDM to 57 medical institutions through previous large-scale national funding, suggesting that HIRA data can be utilized in association with the EHR-based databases of medical institutions using various methods. In addition, internationally, it is possible to cooperate with large-scale projects such as OHDSI, N3C22, EHDEN23, and DARWIN-EU24 based on the OMOP-CDM. Furthermore, as a national data infrastructure, it is possible to promote data harmonization with other data standards such as Fast Healthcare Interoperability Resources (e.g., http://omoponfhir.org/).
FAIR research stewardship
As a custodian of nationwide healthcare data, the HIRA builds infrastructure for better research and data stewardship. Although data disclosure is important, the FAIR principle has rarely been applied to large-scale healthcare databases, owing to the sensitivity of personal data. In addition, the nature of the healthcare data provision process, in which researchers must rigorously vet data providers, often means that they do not provide sufficient information about the data. Providing metadata in accordance with FAIR can be part of a culture that improves access to information and thus addresses information inequalities. Such metadata include, for example, the structure of the database, the original source of the data, the time period of the data, the vocabulary, and the application process for data access.
Methods
Data source
HIRA claims data include complete information about medical services, such as patients’ visits to medical institutions, demographic information, medical service use, cost, disease conditions, and treatments including medications and procedures. The Republic of Korea introduced a mandatory national health insurance service to manage eligible citizens for health insurance from birth to death. In addition, a computerized system that enables the real-time linkage of medical records generated by medical institutions with the HIRA has been established. This study used the national claims data of the HIRA, which cover approximately 97% of the total population of the Republic of Korea (https://www.mohw.go.kr/eng/hs/hs0110.jsp?PAR_MENU_ID=1006&MENU_ID=100610). Furthermore, the HIRA data were linked to the national death registry of Statistics Korea; therefore, they were also included in this study. Data conversion and analyses were performed according to local laws and regulations and with approval from the respective scientific and ethics committees (Health Insurance Review and Assessment Institutional Review Board: 2022-014-001).
Mapping to standardized vocabulary
Health insurance details (for diagnoses, medical fees, medication, and therapeutic materials) are reimbursed using the EDI code system in Korea; therefore, all details in the HIRA database are stored as EDI codes. We established a standard dictionary for the EDI code to construct the OMOP-CDM and integrated the EDI system into the OMOP standardized vocabulary through previous research25. Vocabulary mapping was conducted from terms in the reimbursement/non-reimbursement list of the EDI to the standard concepts for each domain according to the OHDSI standardized vocabulary; e.g., diagnostic codes were mapped to SNOMED-CT, and medication codes were mapped to the RxNorm system (https://github.com/OHDSI/Vocabulary-v5.0/wiki/General-Structure-and-Use). Two or more healthcare experts independently conducted vocabulary mapping, and in cases where their results differed, a third-party reviewer made the final decision. The final mapping list has been transparently disclosed online (Basic medical examination and diagnosis fee: https://opendata.hira.or.kr/op/opb/selectRfrm.do?rfrmTpCd=&searchCnd=&searchWrd=%EC%9A%A9%EC%96%B4&sno=13305&pageIndex=1 and Operation and Procedure fee: https://opendata.hira.or.kr/op/opb/selectNotice.do?searchCnd=&searchWrd=%EC%9A%A9%EC%96%B4&sno=13603&pageIndex=1).
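The independent double-review step with third-party adjudication can be sketched as follows. This is a minimal illustration under stated assumptions rather than the tooling actually used: the function name `reconcile_mappings`, the source codes, and the numeric concept identifiers are all hypothetical placeholders and do not correspond to real EDI codes or OMOP concept IDs.

```python
def reconcile_mappings(reviewer_a, reviewer_b, adjudicate):
    """Merge two independent code-to-concept mappings.

    reviewer_a, reviewer_b: dicts of source code -> proposed standard concept.
    adjudicate: callable (code, proposal_a, proposal_b) -> final concept,
        standing in for the third-party review of disagreements.
    """
    final = {}
    for code in sorted(set(reviewer_a) | set(reviewer_b)):
        proposal_a = reviewer_a.get(code)
        proposal_b = reviewer_b.get(code)
        if proposal_a == proposal_b:
            final[code] = proposal_a  # both reviewers agree
        else:
            final[code] = adjudicate(code, proposal_a, proposal_b)
    return final


# Hypothetical proposals: the reviewers agree on EDI_A and disagree on EDI_B.
reviewer_a = {"EDI_A": 1001, "EDI_B": 2001}
reviewer_b = {"EDI_A": 1001, "EDI_B": 2002}
third_party_decisions = {"EDI_B": 2002}

final_mapping = reconcile_mappings(
    reviewer_a,
    reviewer_b,
    adjudicate=lambda code, a, b: third_party_decisions[code],
)
print(final_mapping)
```

Keeping the adjudication step as a separate callable makes it explicit that the third-party decision is consulted only when the two independent proposals differ.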
Because standardized analysis using the OMOP-CDM is based on a standard vocabulary, a high ratio of unmapped records may lead to information loss, as unmapped records cannot be used in the analysis. The ratios of mapped codes and mapped records were checked to evaluate the possible information loss attributable to the vocabulary dictionary.
Data conversion and quality assessment
In this study, approximately 10 billion claim specifications for 56 million patients from 2010 to 2020 were converted into the OMOP-CDM. The data included information on healthcare institutions and death registry data, as well as general information, diagnosis, care, and prescription details of billing specifications. The source data of the HIRA were converted by referring to the specification of OMOP-CDM version 5.3.1 (https://ohdsi.github.io/CommonDataModel/cdm53.html). Six types of source data were converted into 25 data tables of five table domains (clinical data, health system data, health economics data, standardized derived elements, metadata), and the data were loaded together with the OMOP standardized vocabulary tables (Fig. 3). HIRA data were linked to the national death registry of Statistics Korea by national identification number. Under the current OMOP-CDM 5.3 convention, the death table was populated with the date of death and only one representative cause of death (underlying antecedent cause of death) for deceased patients. The pseudonymized patient identifiers and visit identifiers in the source data are maintained for consistency with future conversions.
After the ETL process, we evaluated the quality of the HIRA CDM by assessing the concordance of descriptive statistics between the source and converted data. We compared the size of the data (by year and by type of medical service), number of medical institutions, and number of records with frequent codes within each domain. In addition, the number of patients with T2DM and its incidence in the middle of the year were calculated using the source and CDM databases. Patients with T2DM were defined as those who had codes corresponding to E11-E14 of the International Classification of Diseases (ICD-10) and A10 (‘Drugs used in diabetes’) of the Anatomical Therapeutic Chemical (ATC) Classification system26.
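The T2DM definition can be written as a simple rule, sketched here in Python. The sketch assumes that the "and" in the definition means a patient needs both a qualifying diagnosis code and a qualifying drug code; the function name `has_t2dm_phenotype` and the example patients are hypothetical, and any time-window or visit-count requirements that the full definition may contain are not modeled.

```python
T2DM_ICD10_PREFIXES = ("E11", "E12", "E13", "E14")
DIABETES_DRUG_ATC_PREFIX = "A10"


def has_t2dm_phenotype(icd10_codes, atc_codes):
    """Return True if a patient has an E11-E14 diagnosis and an A10 drug code."""
    has_diagnosis = any(
        code.upper().startswith(T2DM_ICD10_PREFIXES) for code in icd10_codes
    )
    has_drug = any(
        code.upper().startswith(DIABETES_DRUG_ATC_PREFIX) for code in atc_codes
    )
    return has_diagnosis and has_drug


# Hypothetical patients (not real records).
patients = {
    "p1": (["E11.9", "I10"], ["A10BA02"]),  # diagnosis and diabetes drug
    "p2": (["E11.9"], ["C09AA02"]),         # diagnosis without diabetes drug
    "p3": (["E10.9"], ["A10AB01"]),         # E10 is outside the E11-E14 range
}
t2dm_patients = [
    pid for pid, (dx, rx) in patients.items() if has_t2dm_phenotype(dx, rx)
]
print(t2dm_patients)
```

Applying the same rule to the source data and to the converted data is what allows the two patient counts to be compared directly.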
A previously published clinical prediction model was applied to corroborate the usability of the database and infrastructure established in this study. The COVER model was developed in the 2020 OHDSI COVID-19 study-a-thon, and a subset of the HIRA database has already been used for the model validation study15. In the previous study, the HIRA data included information on patients with COVID-19 from 1 January to 4 April, 2020; in this study, however, we re-validated the model using data from two different databases: (1) the HIRA CDM database, covering 1 January, 2020, to 31 December, 2020, and (2) the 20% sample database, which contains newly updated information on patients with COVID-19 until 30 April, 2022.
The target population was patients with COVID-19 infection, defined as those with a COVID-19 diagnosis or a positive reverse transcription polymerase chain reaction (RT-PCR) test for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The population was limited to adults (age ≥ 18) without flu symptoms or a pneumonia diagnosis within the previous 60 days. The outcomes to be predicted were as follows: (1) hospitalization for pneumonia within 30 days, (2) hospitalization for pneumonia requiring intensive care service or death after hospitalization for pneumonia from the index date up to 30 days after the index date, and (3) death within 30 days. The detailed model development process and evaluation method were the same as those described in the previous publication.
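The eligibility criteria for the target population can be sketched as a small filter. This is an illustrative simplification, not the cohort definition used by the COVER study package: the function name `is_eligible` and the example dates are hypothetical, and the exact handling of the window boundaries (the 60th day before the index date is included, the index date itself is excluded) is an assumption.

```python
from datetime import date, timedelta


def is_eligible(age, index_date, prior_event_dates, lookback_days=60):
    """Check age >= 18 and no flu symptom or pneumonia record in the lookback.

    prior_event_dates: dates of flu-symptom or pneumonia records for the patient.
    """
    if age < 18:
        return False
    window_start = index_date - timedelta(days=lookback_days)
    return not any(window_start <= d < index_date for d in prior_event_dates)


index_date = date(2020, 3, 1)  # hypothetical COVID-19 index date
print(is_eligible(45, index_date, [date(2019, 11, 1)]))  # record outside window
print(is_eligible(45, index_date, [date(2020, 2, 1)]))   # record inside window
print(is_eligible(17, index_date, []))                   # under 18
```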
Infrastructure and data open policy
To utilize the HIRA CDM as a national data infrastructure, we established an open analytic environment and a data access process for external researchers. In establishing the analytic environment, our aim was to ensure that an analytic package developed by an external researcher using open-source tools (e.g., R) could be run successfully, even in the closed intranet of the HIRA. We established a data acquisition process for external researchers, and the HIRA CDM data were disclosed according to the principle of distributed research using metadata and sample data. In all processes, we followed the FAIR principle, published the metadata online, and performed version control of the database.
Data availability
The authors declare that the data supporting the findings of this study are available within the paper and its supplementary information files. Data on vocabulary mapping were disclosed on the HIRA website (Basic medical examination and diagnosis fee: https://opendata.hira.or.kr/op/opb/selectRfrm.do?rfrmTpCd=&searchCnd=&searchWrd=%EC%9A%A9%EC%96%B4&sno=13305&pageIndex=1 and Operation and Procedure fee: https://opendata.hira.or.kr/op/opb/selectNotice.do?searchCnd=&searchWrd=%EC%9A%A9%EC%96%B4&sno=13603&pageIndex=1). According to the Personal Information Protection Act of the Republic of Korea, the HIRA does not permit us to share patient-level source data or data derivatives with individuals or institutions.
The CDM data converted in this study are available through a distributed research network upon application via an online web portal (https://hcdl.mohw.go.kr). The HIRA CDM is updated on an annual basis. Researchers can apply for research requests through the healthcare distributed research network (HDRN) platform operated by the Korean government. The specific application method is as follows: (1) The researcher must request a review of their research hypotheses and plan for ethical feasibility through an institutional or public review board. (2) The researcher must submit an approval letter from the review board and the research protocol to the HDRN platform. (3) The HIRA reviews the appropriateness of research/data provision and decides whether to provide it. (4) The researcher writes an analysis query, code, or package based on the open sample data and environment and sends it to the HIRA. (5) The HIRA reviews queries and expected results and derives results by running queries/codes/packages. (6) After the results are reviewed and checked for infringement of protected health information, the results are exported to the researcher. The detailed application process for data use is described at https://hcdl.mohw.go.kr/static/data/dataApplyStep.
Code availability
We stored the CDM data using the open-source code of OHDSI to conform to the database structure of the OMOP CDM (https://github.com/OHDSI/CommonDataModel).
References
Schneeweiss, S. Learning from Big Health Care Data. New England Journal of Medicine 370, 2161–2163, https://doi.org/10.1056/NEJMp1401111 (2014).
You, S. C. & Krumholz, H. M. The Evolution of Evidence-Based Medicine: When the Magic of the Randomized Clinical Trial Meets Real-World Data. Circulation 145, 107–109, https://doi.org/10.1161/CIRCULATIONAHA.121.057931 (2022).
Lo-Ciganic, W.-H. et al. Developing and validating a machine-learning algorithm to predict opioid overdose in Medicaid beneficiaries in two US states: a prognostic modelling study. The Lancet Digital Health 4, e455–e465, https://doi.org/10.1016/S2589-7500(22)00062-0 (2022).
Nosrati, E. Harnessing administrative data to study health inequality. The Lancet Public Health 7, e726–e727, https://doi.org/10.1016/S2468-2667(22)00172-4 (2022).
Portuondo, J. I., Harris, A. H. S. & Massarweh, N. N. Using Administrative Codes to Measure Health Care Quality. JAMA 328, 825–826, https://doi.org/10.1001/jama.2022.12823 (2022).
Sarrazin, M. S. V. & Rosenthal, G. E. Finding Pure and Simple Truths With Administrative Data. JAMA 307, 1433–1435, https://doi.org/10.1001/jama.2012.404 (2012).
Kim, J.-A., Yoon, S., Kim, L.-Y. & Kim, D.-S. Towards Actualizing the Value Potential of Korea Health Insurance Review and Assessment (HIRA) Data as a Resource for Health Research: Strengths, Limitations, Applications, and Strategies for Optimal Use of HIRA Data. J Korean Med Sci 32, 718–728, https://doi.org/10.3346/jkms.2017.32.5.718 (2017).
Suissa, S. & Garbe, E. Primer: administrative health databases in observational studies of drug effects—advantages and disadvantages. Nature Clinical Practice Rheumatology 3, 725–732, https://doi.org/10.1038/ncprheum0652 (2007).
Steinbusch, P. J. M., Oostenbrink, J. B., Zuurbier, J. J. & Schaepkens, F. J. M. The risk of upcoding in casemix systems: A comparative study. Health Policy 81, 289–299, https://doi.org/10.1016/j.healthpol.2006.06.002 (2007).
You, S. C., Lee, S., Choi, B. & Park, R. W. Establishment of an International Evidence Sharing Network Through Common Data Model for Cardiovascular Research. Korean Circ J 52, 853–864, https://doi.org/10.4070/kcj.2022.0294 (2022).
Rho, Y. et al. COVID-19 International Collaborative Research by the Health Insurance Review and Assessment Service Using Its Nationwide Real-world Data: Database, Outcomes, and Implications. J Prev Med Public Health 54, 8–16, https://doi.org/10.3961/jpmph.20.616 (2021).
Burn, E. et al. Deep phenotyping of 34,128 adult patients hospitalised with COVID-19 in an international network study. Nature Communications 11, 5009, https://doi.org/10.1038/s41467-020-18849-z (2020).
You, S. C. et al. Association of Ticagrelor vs Clopidogrel With Net Adverse Clinical Events in Patients With Acute Coronary Syndrome Undergoing Percutaneous Coronary Intervention. JAMA 324, 1640–1650, https://doi.org/10.1001/jama.2020.16167 (2020).
Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3, 160018, https://doi.org/10.1038/sdata.2016.18 (2016).
Williams, R. D. et al. Seek COVER: using a disease proxy to rapidly develop and validate a personalized risk calculator for COVID-19 outcomes in an international network. BMC Medical Research Methodology 22, 35, https://doi.org/10.1186/s12874-022-01505-z (2022).
Merkel, D. Docker: lightweight Linux containers for consistent development and deployment. Linux J. 2014, Article 2, https://doi.org/10.5555/2600239.2600241 (2014).
Blacketer, C., Defalco, F. J., Ryan, P. B. & Rijnbeek, P. R. Increasing trust in real-world evidence through evaluation of observational data quality. Journal of the American Medical Informatics Association 28, 2251–2257, https://doi.org/10.1093/jamia/ocab132 (2021).
Ledford, H. & Noorden, R. V. in Nature (Nature, 2020).
Ioannidis, J. P. A. Why Most Published Research Findings Are False. PLOS Medicine 2, e124, https://doi.org/10.1371/journal.pmed.0020124 (2005).
Fanelli, D. How Many Scientists Fabricate and Falsify Research? A Systematic Review and Meta-Analysis of Survey Data. PLOS ONE 4, e5738, https://doi.org/10.1371/journal.pone.0005738 (2009).
Miyakawa, T. No raw data, no science: another possible source of the reproducibility crisis. Molecular Brain 13, 24, https://doi.org/10.1186/s13041-020-0552-2 (2020).
Haendel, M. A. et al. The National COVID Cohort Collaborative (N3C): Rationale, design, infrastructure, and deployment. Journal of the American Medical Informatics Association 28, 427–443, https://doi.org/10.1093/jamia/ocaa196 (2021).
Puttmann, D. et al. Assessing the FAIRness of databases on the EHDEN portal: A case study on two Dutch ICU databases. International Journal of Medical Informatics 176, 105104, https://doi.org/10.1016/j.ijmedinf.2023.105104 (2023).
Arlett, P., Kjaer, J., Broich, K. & Cooke, E. Real-World Evidence in EU Medicines Regulation: Enabling Use and Establishing Value. Clinical pharmacology and therapeutics 111, 21–23, https://doi.org/10.1002/cpt.2479 (2022).
Seong, Y. et al. Incorporation of Korean Electronic Data Interchange Vocabulary into Observational Medical Outcomes Partnership Vocabulary. Healthc Inform Res 27, 29–38, https://doi.org/10.4258/hir.2021.27.1.29 (2021).
Ko, S. H. et al. Past and Current Status of Adult Type 2 Diabetes Mellitus Management in Korea: A National Health Insurance Service Database Analysis. Diabetes Metab J 42, 93–100, https://doi.org/10.4093/dmj.2018.42.2.93 (2018).
Acknowledgements
This work was supported by the Health Insurance Review and Assessment Service (HIRA).
Author information
Contributions
J.W.K. and Y.L. collected data. J.W.K., C.K., R.W.P. and S.C.Y. conceptualized and designed the study. J.W.K., C.K., D.H.Y. and H.B. conducted data analyses including data conversion, developing analytic infrastructures, and proof-of-concept analyses. All authors interpreted the results. J.W.K. and C.K. drafted the manuscript and K.H.K., J.Y., R.W.P. and S.C.Y. made critical revisions to the manuscript. R.W.P. and S.C.Y. finalized the manuscript.
Ethics declarations
Competing interests
RWP reports grants from the Ministry of Trade, Industry & Energy (MOTIE, Korea) and the Ministry of Health & Welfare (Korea). SCY reports being CTO of PHI Digital Healthcare.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Kim, JW., Kim, C., Kim, KH. et al. Scalable Infrastructure Supporting Reproducible Nationwide Healthcare Data Analysis toward FAIR Stewardship. Sci Data 10, 674 (2023). https://doi.org/10.1038/s41597-023-02580-7
DOI: https://doi.org/10.1038/s41597-023-02580-7