Introduction

Electronic health record (EHR) adoption, which is nearly universal within the US healthcare system1,2, has increased adherence to evidence-based clinical guidelines3 and facilitated greater patient communication4 resulting in significant improvements in care5. EHRs contain a myriad of systematically collected, longitudinal, patient-level information and are a valuable resource for population-level research6. The cornerstone of medicine, diagnosis or clinical phenotyping, aims to identify empirically observable traits exhibited by patients (i.e., signs and symptoms) known to be characteristic of a specific disease7. Computational phenotyping is the process of converting clinical phenotypes into computer-executable algorithms in order to identify relevant patients from large sources of clinical data, usually EHRs8. One promise of EHR-based computational phenotyping is the ability to perform population-level investigations of mechanistic drivers of disease in diverse patient populations9,10. Despite significant progress, this objective remains largely aspirational6,11,12,13,14.

Traditionally, computational phenotypes have been imprecise due to their exclusive reliance on EHR data, which has been shown to be insufficient at capturing the phenotypic heterogeneity present in most complex diseases15,16,17,18. Deep phenotyping, or “the precise and comprehensive analysis of phenotypic abnormalities in which the individual components of the phenotype are observed and described”7, is a fundamental component of precision medicine that requires timely synthesis of multiple types of patient data19,20. Deep phenotyping has been successfully applied to rare disease and genetic disorders21,22,23,24,25,26,27,28,29,30,31,32,33, cancer34,35,36,37,38,39,40, and pregnancy41,42,43 using a variety of clinical and -omic data. Despite large-scale biobanking efforts and resources like the UK Biobank (https://www.ukbiobank.ac.uk) and the All of Us (AoU) Research Program (https://www.researchallofus.org), most EHRs do not systematically integrate nor have the infrastructure to integrate patient-level genomic data or other forms of external knowledge (e.g., scientific literature) with clinical data44,45,46.

Within an EHR, most data used for research (i.e., structured data) are stored using clinical terminologies or vocabularies. A clinical vocabulary is a standard representation of preferred terms which may or may not be hierarchical or have formally defined relationships and is designed to facilitate meaningful and unambiguous information exchange within the medical domain47,48,49. Hundreds of clinical vocabularies have been developed and their use differs by hospital and country. Examples include the International Classification of Diseases (ICD)50, the Logical Observation Identifiers, Names and Codes (LOINC)51, the Systematized Nomenclature of Medicine—Clinical Terms (SNOMED-CT; https://www.snomed.org), and RxNorm52. Most clinical vocabularies were not designed to be integrated or interoperable with other vocabularies, which is one of the long standing barriers preventing the secondary use of EHR data for research46. Common data models (CDMs) like the Observational Medical Outcomes Partnership (OMOP)53 have solved many of the challenges of standardizing, representing, and utilizing clinical EHR data. Unfortunately, most CDMs and associated terminology management systems are not yet able to integrate and interpret genomic data or other sources of external knowledge or publicly available data54.

Similar to clinical vocabularies, ontologies are classification systems that provide detailed representations of our knowledge of a specific domain49. Ontologies, like those in the Open Biological and Biomedical Ontology (OBO) Foundry, exist for nearly all scales of biological organization and when combined, can provide a semantically rich and biologically accurate representation of molecular entities and mechanisms55,56,57. Unlike clinical vocabularies, ontologies are semantically computable and interoperable with formally defined relationships, which means they can be logically verified and integrated with data from basic science and clinical research49. Mapping clinical vocabularies to ontologies has been recognized as a fundamental requirement for use in deep phenotyping20,46,49,58. An example of how aligning these resources improves deep phenotyping was recently demonstrated by Zhang et al. who mapped LOINC to the Human Phenotype Ontology (HPO)59, which enabled the harmonization of laboratory tests with different clinical codes to common HPO concepts60.

Due to the time-consuming manual effort required to map clinical vocabularies to OBO Foundry ontologies, no comprehensive mapping across commonly used ontologies currently exists. While automated mapping approaches exist, they largely remain unable to correctly capture the complex semantics underlying clinical data and the knowledge encoded by clinical vocabulary concepts61. For example, when mapping the concept “Peptic Ulcer without Hemorrhage AND without Perforation but with Obstruction” (SNOMED-CT:54157007) to the HPO, most automated approaches would return a single best mapping, most likely “Peptic Ulcer” (HP:0004398). This HPO concept is much broader in meaning than the clinical concept. A more precise mapping would explicitly capture the presence and absence of all relevant phenotypic features: “Peptic Ulcer” (HP:0004398) and “Gastrointestinal Obstruction” (HP:0004796) and NOT “Gastrointestinal Hemorrhage” (HP:0002239) or “Intestinal Perforation” (HP:0031368). To the best of our knowledge no existing mappings or mapping algorithms are capable of capturing this type of complex semantics.

Building on LOINC2HPO, the goal of this work is to develop OMOP2OBO, an algorithm that enables semantically interoperable mappings between clinical vocabularies in the OMOP CDM to OBO Foundry ontologies (Fig. 1). The resulting mappings will enhance the semantic interoperability of the data represented by the OMOP concepts and have the potential to advance deep EHR-based phenotyping by enabling the identification of relevant patients using existing knowledge of the molecular mechanisms underlying disease rather than billing codes which are prone to error and subject to bias. Using OMOP2OBO, we created healthcare system-scale mappings between clinical vocabularies in the OMOP CDM and eight of the most widely used OBO Foundry ontologies56 spanning diseases, phenotypes, anatomical entities, organisms, chemicals, vaccines, and proteins. The mappings were evaluated on: (1) accuracy, examined by a team of domain experts; (2) generalizability, examined through comparison to a large set of mapped concepts used at least once in clinical practice from 24 hospital systems; and (3) clinical utility, examined through the identification of patients with an undiagnosed rare disease. OMOP2OBO is open source (https://github.com/callahantiff/OMOP2OBO) and includes a custom built interactive dashboard (http://tiffanycallahan.com/OMOP2OBO_Dashboard).

Fig. 1: Overview of the OMOP2OBO algorithm.
figure 1

The OMOP2OBO algorithm consists of three components: (1) Process input data. The algorithm takes as input a table of Observational Medical Outcomes Partnership (OMOP) concepts and a list of one or more OBO (Open Biological and Biomedical Ontology) Foundry ontologies. For both data types, the algorithm expects concept or class identifiers, source codes or database cross-references, labels, synonyms, and ancestor concepts or classes. (2) Map OMOP vocabulary concepts to OBO Foundry Ontology concepts. OMOP concepts are automatically mapped to OBO Foundry ontology concepts. The algorithm includes several different approaches (e.g., concept alignment and concept embedding), prioritizing those that result in high-confidence mappings. (3) Synthesize and output mapping results. The mapping results from the prior component are post-processed to include a mapping category and human-readable evidence. Post-processed mappings are serialized and able output to a variety of file types.

Results

Supplementary Table 1 lists the acronyms and definitions used in the paper. The resources used to build and evaluate the OMOP2OBO algorithm and mappings are described in Supplementary Tables 2 and 3.

OMOP2OBO mapping data: OMOP data

The OMOP2OBO mappings were created using a de-identified pediatric dataset from the Children’s Hospital of Colorado (CHCO) normalized to the OMOP CDM (referred throughout the manuscript as “CHCO OMOP Database” and described in detail in Supplementary Table 3)53,62. Standardized vocabularies are a fundamental component of the OMOP CDM, which serve as primary vocabularies within each OMOP domain; all other vocabularies within a specific domain are aligned to a standard vocabulary using mappings provided by the CDM63. The standard vocabularies used in this work included: SNOMED-CT (the OMOP Condition domain for diseases and clinical findings), RxNorm (the OMOP Drug domain for drug products and vaccines), and LOINC (the OMOP Measurement domain for laboratory tests and assessment scales). Concepts from these three vocabularies, including labels, synonyms, source codes (i.e., standard vocabulary codes), and ancestor concepts obtained from the OMOP CDM, were extracted and used as input to the OMOP2OBO mapping algorithm. Using the CHCO OMOP Database, concepts were organized into two data waves according to whether or not they had been used at least once in clinical practice (i.e., “Concepts Used in Practice”) or not (i.e., “Concepts Not Used in Practice”). Only “Concepts Used in Practice” were manually mapped.

The counts of concepts eligible for mapping by OMOP domain and data wave are shown in Table 1. There were 109,719 condition concepts (Concepts Used in Practice: n = 29,129; Concepts Not Used in Practice: n = 80,590) and 11,803 drug ingredient concepts (Concepts Used in Practice: n = 1693; Concepts Not Used in Practice: n = 10,110) available to map. For measurements, there were 4083 concepts, representing 11,269 measurement results (Concepts Used in Practice: n = 1606 concepts [4425 results]; Concepts Not Used in Practice: n = 2477 concepts [6844 results]) available to map. With respect to the Concepts Used in Practice, the 29,129 conditions had a median frequency of 25 (range 1–544,618), the 1693 drug ingredients had a median frequency of 251 (range 1–2,267,866), and the 1606 measurement concepts had a median frequency of 313.5 (range 1–56,823,139).

Table 1 Clinical data used for input to OMOP2OBO mapping algorithm.

OMOP2OBO mapping data: OBO Foundry ontologies

Under the guidance of domain experts, eight OBO Foundry ontologies were selected to represent the following domains: diseases (Mondo), phenotypes (HPO), anatomical entities (Uber Anatomy Ontology [Uberon]64; Cell Ontology [CL]65), organisms (National Center for Biotechnology Information Taxon Ontology [NCBITaxon]66), chemicals (Chemical Entities of Biological Interest [ChEBI]67), vaccines (the Vaccine Ontology [VO]68), and proteins (the Protein Ontology [PRO]69). Each set of ontology concepts also included metadata, which was obtained by querying each ontology for labels, definitions, synonyms, and database cross-references (i.e., codes from other vocabularies and ontologies). The amount of metadata available for mapping is shown in Table 2 and varied across the OBO Foundry ontologies, with NCBITaxon containing the most metadata and Uberon containing the least (visualized in Supplementary Fig. 1). A chi-square test of independence with Yate’s correction revealed a significant association between the ontology and the amount of available metadata (\({\chi }^{2}\)(14) = 2,664,853.8, p < 0.0001). Post hoc tests with Bonferroni adjustment confirmed the ontologies provided significantly different amounts of metadata (p < 0.0001 for all significant comparisons).

Table 2 Open Biological and Biomedical Ontology Foundry ontologies used for input to OMOP2OBO mapping algorithm.

OMOP2OBO mappings

Figure 2 includes example mappings and illustrates how the OBO Foundry ontologies were used to map concepts from each OMOP domain. As illustrated by this figure, OMOP conditions were mapped to HPO and Mondo, OMOP drug ingredients were mapped to ChEBI, NCBITaxon, PRO, and VO, and OMOP measurements results were mapped to HPO, Uberon, NCBITaxon, PRO, CheBI, and CL. As illustrated in the bottom panel of Fig. 1, each mapping consists of four elements: (1) the approach used to create it (i.e., “automatic”, “manual”, or “cosine similarity”); (2) cardinality (i.e., one-to-one or one-to-many); (3) level (i.e., concept or ancestor); and (4) evidence, which consists of pipe-delimited free-text phrases that explain what fields were used to construct the mapping. Supplementary Table 4 provides additional details on and examples of the OMOP2OBO mapping categories. The mapping procedures and resources are described in the “OMOP2OBO Algorithm” sections of the Methods.

Fig. 2: OMOP2OBO mapping examples by OMOP domain.
figure 2

This figure illustrates which OBO (Open Biological and Biomedical Ontology) Foundry ontologies were used for each OMOP (Observational Medical Outcomes Partnership) domain and provides example mappings. OMOP conditions were mapped to HPO and Mondo. OMOP drug ingredients were mapped to ChEBI, NCBITaxon, PRO, and VO. OMOP measurements were mapped to ChEBI, CL, HPO, NCBITaxon, PRO, and Uberon. UMLS Unified Medical Language System, HP Human Phenotype Ontology, MONDO Monarch Disease Ontology, CHEBI Chemical Entities of Biological Interest, NCBITaxon National Center for Biotechnology Information Taxon Ontology, PR Protein Ontology, VO Vaccine Ontology, UBERON Uber-Anatomy Ontology, CL Cell Ontology.

OMOP2OBO mappings: conditions

Unified Medical Language System (UMLS)70 concept unique identifiers (CUIs) were found for 96.6% of condition concepts (n = 105,976) representing 69 unique Semantic Types71. The mapping results for each OBO Foundry ontology are displayed in Fig. 3 and detailed in Supplementary Table 5. Of the 109,719 available concepts, 66.9% (n = 73,417) mapped to 5654 unique HPO concepts (Concepts Used in Practice: 83.9%; Concepts Not Used in Practice: 60.8%) and 57.8% (n = 63,374) mapped to 9637 unique Mondo concepts (Concepts Used in Practice: 68.9%; Concepts Not Used in Practice: 53.8%). Only 50 concepts we attempted to map (excluding purposefully unmapped concepts) were unable to be mapped to at least one OBO Foundry ontology concept.

Fig. 3: OMOP concept mapping results by domain, concept type, mapping category, and ontology.
figure 3

This figure features a Sankey Diagram illustrating the mapping flow implemented by the OMOP2OBO algorithm beginning with OMOP (Observational Medical Outcomes Partnership) concepts from the Conditions, Drugs, and Measurements domains, which were grouped by Data Wave (i.e., whether or not the concept has been used at least once in clinical practice), and organized by mapping category. The flow lines in the diagram are weighted by the count of OMOP concepts from the Children’s Hospital Colorado pediatric OMOP database. OBO Open Biological and Biomedical Ontology, HPO Human Phenotype Ontology, Mondo Monarch Disease Ontology, ChEBI Chemical Entities of Biological Interest, NCBITaxon National Center for Biotechnology Information Taxon Ontology, PRO Protein Ontology, VO Vaccine Ontology, Uberon Uber-Anatomy Ontology, CL Cell Ontology.

The frequency distributions of the Concepts Used in Practice by mapping category and ontology are visualized in Fig. 4. The majority of automatic mappings were one-to-one at the concept-level for Concepts Used in Practice (HPO: n = 3601; Mondo: n = 4836) and Concepts Not Used in Practice (HPO: n = 1166; Mondo: n = 4261). The majority of the manual mappings were one-to-many (HPO: n = 10,328; Mondo: n = 2835). Cosine similarity-scored concept embeddings enabled 1374 HPO (Concepts Used in Practice: median 0.5, range 0.2–1; Concepts Not Used in Practice: median 0.4, range 0.2–1) and 667 Mondo (Concepts Used in Practice: median 0.8, range 0.2–1; Concepts Not Used in Practice: median 1, range 0.2–1) mappings (Supplementary Fig. 2a). On average, more evidence was found for mappings to Concepts Not Used in Practice than Concepts Used in Practice for HPO (8.9 vs. 3.9) and Mondo (12.4 vs. 10.6).

Fig. 4: Condition concept frequency of use in clinical practice by mapping category and ontology.
figure 4

This figure presents the frequency distributions of OMOP (Observational Medical Outcomes Partnership) condition concepts used at least once in clinical practice (log 10 scale) in the Children’s Hospital Colorado pediatric OMOP database by mapping category and OBO (Open Biological and Biomedical Ontology) Foundry ontology. In each boxplot, the box extends from the first to third quartile of the data with a center line used to indicate the median. Whiskers extend from each box by 1.5x the interquartile range and outliers that extend past the whiskers shown as dots. The x-axis labels are numbers which correspond to the OMOP2OBO mapping categories: (1) Automatic One-to-One Concept; (2) Automatic One-to-One Ancestor; (3) Automatic One-to-Many Concept; (4) Automatic One-to-Many Ancestor; (5) Cosine Similarity One-to-One Concept; (6) Manual One-to-One Concept; (7) Manual One-to-Many Concept; and (8) Unmapped. HPO Human Phenotype Ontology, Mondo Monarch Disease Ontology.

OMOP2OBO mappings: drug ingredients

UMLS CUIs were found for 99.3% of drug ingredient concepts (n = 11,716) representing 23 unique Semantic Types. The mapping results for each OBO Foundry ontology are displayed in Fig. 3 and detailed in Supplementary Table 6. Of the 11,803 available concepts, 37.4% (n = 411) mapped to 4072 unique ChEBI concepts (Concepts Used in Practice: 100%; Concepts Not Used in Practice: 26.9%), 21.5% (n = 4661) mapped to 2535 unique NCBITaxon concepts (Concepts Used in Practice: 23.9%; Concepts Not Used in Practice: 42.1%), 2.1% (n = 4249) mapped to 142 unique PRO concepts (Concepts Used in Practice: 10.5%; Concepts Not Used in Practice: 0.7%), and 1.3% (n = 154) mapped to 132 unique VO concepts (Concepts Use in Clinical Practice: 6.9%; Concepts Not Used in Practice: 0.4%). All of the OMOP concepts were mapped to at least one ChEBI concept.

The frequency distributions of the Concepts Used in Practice by mapping category and OBO Foundry ontology are visualized in Fig. 5. The majority of automated mappings were one-to-one at the concept-level for Concepts Used in Practice (ChEBI: n = 959; NCBITaxon: n = 20; PRO: n = 1; VO: n = 90) and Concepts Not Used in Practice (ChEBI: n = 2192; NCBITaxon: n = 135; PRO: n = 42; VO: n = 18). The majority of the manual mappings were one-to-one (ChEBI: n = 321; NCBITaxon: n = 230; PRO: n = 157; VO: n = 21). Cosine similarity-scored concept embeddings enabled 109 ChEBI (Concepts Used in Practice: median 1, range 0.3–1; Concepts Not Used in Practice: median 1, range 0.3–1), 4241 NCBITaxon (Concepts Used in Practice: median 0.6, range 0.3–1; Concepts Not Used in Practice: median 0.6, range 0.3–1), 18 PRO (Concepts Used in Practice: median 0.8, range 0.4–1; Concepts Not Used in Practice: median 1, range 0.6–1), and 17 VO (Concepts Used in Practice: median 1, range 0.4–1; Concepts Not Used in Practice: median 0.8, range 0.4–1) mappings (Supplementary Fig. 2b). On average, more evidence was found for mappings to Concepts Not Used in Practice than Concepts Used in Practice for ChEBI and PRO, excluding NCBITaxon and VO (ChEBI: 7.6 vs. 7.6; PRO: 3.9 vs. 1; NCBITaxon: 1.2 vs. 1.1; VO: 3 vs. 4.1).

Fig. 5: Drug ingredient concept frequency of use in clinical practice by mapping category and ontology.
figure 5

This figure presents the frequency distributions of OMOP (Observational Medical Outcomes Partnership) drug exposure ingredient concepts used at least once in clinical practice (log 10 scale) in the Children’s Hospital Colorado pediatric OMOP database by mapping category and OBO (Open Biological and Biomedical Ontology) Foundry ontology. In each boxplot, the box extends from the first to third quartile of the data with a center line used to indicate the median. Whiskers extend from each box by 1.5x the interquartile range and outliers that extend past the whiskers shown as dots. The x-axis labels are numbers which correspond to the OMOP2OBO mapping categories: (1) Automatic One-to-One Concept; (2) Automatic One-to-One Ancestor (3) Automatic One-to-Many Concept; (4) Automatic One-to-Many Ancestor; (5) Cosine Similarity One-to-One Concept; (6) Manual One-to-One Concept; (7) Manual One-to-Many Concept; and (8) Unmapped. ChEBI Chemical Entities of Biological Interest, NCBITaxon National Center for Biotechnology Information Taxon Ontology, PRO Protein Ontology, VO Vaccine Ontology.

OMOP2OBO mappings: measurements

UMLS CUIs were found for 94.8% of measurement concepts (n = 3869) representing a single Semantic Type. The mapping results for each OBO Foundry ontology are displayed in Fig. 3 and detailed in Supplementary Table 7. Of the 11,269 measurement results, 96.6% (n = 10,888) mapped to 1115 unique HPO concepts (Concepts Used in Practice: 92.4%; Concepts Not Used in Practice: 99.4%) and 45 unique Uberon concepts (Concepts Used in Practice: 92.4%; Concepts Not Used in Practice: 99.4%), 76.8% (n = 8657) mapped to 425 unique NCBITaxon concepts (Concepts Used in Practice: 64.4%; Concepts Not Used in Practice: 84.9%), 42.6% (n = 4804) mapped to 172 unique PRO concepts (Concepts Used in Practice: 35.5%; Concepts Not Used in Practice: 47.2%), 87.9% (n = 9904) mapped to 443 unique ChEBI concepts (Concepts Used in Practice: 78.9%; Concepts Not Used in Practice: 93.7%), and 9.9% (n = 1114) mapped to 38 unique CL concepts (Concepts Used in Practice: 15.3%; Concepts Not Used in Practice: 6.4%). Only 13 concepts we attempted to map (excluding purposefully unmapped concepts) were unable to be mapped to at least one OBO Foundry ontology concept.

The frequency distributions of the Concepts Used in Practice by mapping category and OBO Foundry ontology are visualized in Fig. 6. The majority of the automated mappings were one-to-one at the concept-level for Concepts Used in Practice (HPO: n = 17; Uberon: n = 1793; NCBITaxon: n = 444; PRO: n = 44; ChEBI: n = 264; CL: n = 182) and Concepts Not Used in Practice (HPO: n = 3; Uberon: n = 3589; NCBITaxon: n = 444; PRO: n = 12; ChEBI: n = 515; CL: n = 186). The majority of the manual mappings were one-to-one (HPO: n = 3902; Uberon: n = 406; NCBITaxon: n = 2300; PRO: n = 1267; ChEBI: n = 1377; CL: n = 319). Cosine similarity-scored concept embeddings enabled 113 HPO (Concepts Used in Practice: median 0.4, range 0.3–0.8; Concepts Not Used in Practice: median 0.4, range 0.3–0.9), 142 Uberon (Concepts Used in Practice: median 0.3, range 0.3–0.8; Concepts Not Used in Practice: median 0.4, range 0.3–0.7), 150 NCBITaxon (Concepts Used in Practice: median 0.4, range 0.3–0.7; Concepts Not Used in Practice: median 0.4, range 0.3–0.7), 132 PRO (Concepts Used in Practice: median 0.4, range 0.3–0.7; Concepts Not Used in Practice: median 0.4, range 0.3–0.6), 476 ChEBI (Concepts Used in Practice: median 0.4, range 0.3–1; Concepts Not Used in Practice: median 0.3, range 0.3–0.6), and 102 CL (Concepts Used in Practice: median 0.4, range 0.3–1; Concepts Not Used in Practice: median 0.4, range 0.3–1) mappings (Supplementary Fig. 2c). On average, more evidence was found for mappings to Concepts Used in Practice than Concepts Not Used in Practice for HPO, Uberon, and PRO (HPO: 1.03 vs. 1.02; Uberon: 2.3 vs. 1.9; PRO: 1.1 vs. 1; NCBITaxon: 1.3 vs. 1.4; ChEBI: 2.7 vs. 2.9; CL: 2.5 vs. 2.8).

Fig. 6: Measurement concept frequency of use in clinical practice by mapping category and ontology.
figure 6

This figure presents the frequency distributions of OMOP (Observational Medical Outcomes Partnership) measurement concepts used at least once in clinical practice (log 10 scale) in the Children’s Hospital Colorado pediatric OMOP database by mapping category and OBO (Open Biological and Biomedical Ontology) Foundry ontology. In each boxplot, the box extends from the first to third quartile of the data with a center line used to indicate the median. Whiskers extend from each box by 1.5x the interquartile range and outliers that extend past the whiskers shown as dots. The x-axis labels are numbers which correspond to the OMOP2OBO mapping categories: (1) Automatic One-to-One Concept; (2) Automatic One-to-One Ancestor; (3) Automatic One-to-Many Concept; (4) Automatic One-to-Many Ancestor; (5) Cosine Similarity One-to-One Concept; (6) Manual One-to-One Concept; (7) Manual One-to-Many Concept; and (8) Unmapped. HPO Human Phenotype Ontology, Uberon Uber-Anatomy Ontology, NCBITaxon National Center for Biotechnology Information Taxon Ontology, PRO Protein Ontology, ChEBI Chemical Entities of Biological Interest, CL Cell Ontology.

Validation: accuracy

The goal of this task was to verify the accuracy of randomly selected sets of manual one-to-one and one-to-many OMOP2OBO mappings from each OMOP domain through domain expert review. Of the 2000 condition mappings, 73.9% were correct (n = 1477). Of the 116 reviewed drug ingredient mappings, 70.7% (n = 82) were correct. Upon review, it was found that 165 (31.6%) of the incorrect condition and 14 (41.2%) of the incorrect drug ingredient mappings could be improved by creating more specific mappings through adding new concepts to the OBO Foundry ontologies or by replacing multiple mappings to broad ancestor concepts with a single best representative ancestor concept. Measurement concepts were reviewed at the result-level using a survey and manual domain expert review. On the survey, 92.9% (n = 251) of the mappings were found to be correct. Of the 1350 measurement results, 97.3% (n = 1314) were correct.

In addition to expert review, each mapping was inspected at least twice by a member of the research team (T.J.C.). If we assume that the automatic one-to-one mappings created using resources provided by the UMLS, OMOP CDM, and OBO Foundry ontologies are correct and exclude mappings that occur at the ancestor level (assuming those are too broad) and unmapped concepts, then the following concepts received at least one form of review: (1) Conditions: 18.4% of Mondo and 9.9% of HPO; (2) Drug Ingredients: 95.3% of NCBITaxon, 90.3% of VO, 85.3% of ChEBI, and 33.3% of PRO; and (3) Measurements: 79.2% of HPO, 50.8% of Uberon, 48.5% of CL, 12.7% of ChEBI, 10.6% of NCBITaxon, and 3.9% of PRO.

Validation: generalization

The goal of this evaluation was to characterize the generalizability or coverage of concepts in the OMOP2OBO mapping set to a set of OMOP standard concepts that are commonly utilized in clinical practice. The Observational Health Data Sciences and Informatics (OHDSI) Concept Prevalence Study contains OMOP standard concepts that are commonly utilized in practice from several independent study sites across the OHDSI network (see Supplementary Table 3 for more information)72,73,74,75. For this evaluation, we leveraged data (referred throughout the remainder of the manuscript as the “OHDSI Concept Prevalence Data”) from 24 independent study sites, which included hospitals, academic medical centers, and claims databases. For this analysis, the OMOP2OBO mappings were filtered to identify all concepts with at least one valid mapping (i.e., excluding unmapped and not yet mapped concepts) across all of the ontologies mapped within each OMOP domain.

The OHDSI Concept Prevalence Data contained 62,335 condition concepts from 24 independent sites. The filtered OMOP2OBO mapping set contained 92,367 eligible condition concepts, which covered 92.5% (99.5% weighted coverage) of the OHDSI Concept Prevalence Data condition concepts (n = 57,663 concepts; median 689, range 100–874,824,195). Of the remaining condition concepts, 34,704 were only found in OMOP2OBO (median 100, range 100–39,975) and 4672 were only found in the OHDSI Concept Prevalence Data (median 100, range 100–52,739,431). These findings are visualized in Fig. 7. OMOP2OBO concept coverage ranged from 93–99.7% across the 24 OHDSI Concept Prevalence Data sites. Supplementary Fig. 3a presents the counts of OMOP condition concepts in the OHDSI Concept Prevalence Data by site. A chi-square test of independence with Yate’s correction revealed a significant association between the OHDSI Concept Prevalence Data sites and OMOP2OBO coverage (\({\chi }^{2}\)(23) = 7559.1, p < 0.0001). Post hoc tests using Bonferroni adjustment confirmed that 32% of the pairwise OHDSI Concept Prevalence Data site comparisons had significantly different OMOP2OBO coverage (p < 0.001 for all significant comparisons). The results of this analysis are visualized as a heatmap in Supplementary Fig. 3b. The OMOP2OBO concept count by OBO Foundry ontology, data wave, and coverage type are shown in Supplementary Fig. 4.

Fig. 7: OMOP2OBO—concept prevalence coverage.
figure 7

This figure visualizes the coverage distributions of Observational Medical Outcomes Partnership (OMOP) concepts over their frequency of use in clinical practice (log 10 scale) within the Concept Prevalence Study data by domain (Conditions [left]; Drugs [middle]; and Measurements [right]). The three modeled distributions include: concepts only found in the Concept Prevalence Study data (magenta), concepts only found in the OMOP2OBO mapping set (blue), and concepts found in both the Concept Prevalence Study data and the OMOP2OBO mapping set (yellow).

Results for the 4672 (7.5%) OHDSI Concept Prevalence Data condition concepts missing from OMOP2OBO are visualized in Fig. 8. Roughly 7.9% (n = 367) of condition concepts were accounted for using a newer version of the OMOP CDM and occurred in an average of 2.6 sites with a mean frequency of 27,412.3 (range 100–3,539,698.5). In total, 90.6% (n = 4231) of condition concepts purposefully excluded from the OMOP2OBO mapping set (i.e., no clear pathological or biological origin, not yet mapped, or were unable to be mapped) occurred in an average of 1.7 sites with a mean frequency of 6139.3 (range 100–8,254,186.5). The remaining condition concepts (1.6%; n = 74) were truly missing and occurred in an average of 2.7 sites with a mean frequency of 5320.1 (range 100–100,483). The frequency of distributions of the covered condition concepts from OMOP2OBO and Concept Prevalence condition concepts missing from OMOP2OBO in the OHDSI Concept Prevalence Data by site are visualized in Supplementary Fig. 3c, d. The five most frequently occurring missing condition concepts are shown in Table 3. Domain expert review determined these condition concepts were likely missing due to differences in patient populations and coding practices. The domain experts identified comparable condition concepts in the OMOP2OBO mapping set.

Fig. 8: OMOP2OBO—concept prevalence coverage error analysis.
figure 8

This figure visualizes the distributions of Observational Medical Outcomes Partnership (OMOP) concepts missing from the OMOP2OBO mapping set over their frequency of use in clinical practice (log 10 scale) within the Concept Prevalence Study data by domain (Conditions [left]; Drugs [middle]; and Measurements [right]). The three modeled error analysis distributions include: concepts recovered in a newer version of the OMOP common data model (CDM; magenta), concepts that were purposefully excluded, not yet mapped, or unable to be mapped by OMOP2OBO (blue), and concepts that were truly missing from the OMOP2OBO mapping set (yellow).

Table 3 Concept prevalence concepts missing from the OMOP2OBO mapping set.

The OHDSI Concept Prevalence Data contained 4588 drug ingredient concepts from 18 independent sites. The OMOP2OBO mapping set contained 8611 eligible drug ingredient concepts, which covered 87.9% (99.9% weighted coverage) of the OHDSI Concept Prevalence Data concepts (n = 4037 concepts; median 7299, range 100–1,308,580,305). Of the remaining drug ingredient concepts, 4574 were only found in OMOP2OBO (median 100, range 100–69,311) and 551 were only found in the OHDSI Concept Prevalence Data (median 300, range 100–10,748,492). These findings are visualized in Fig. 7. OMOP2OBO drug ingredient concept coverage ranged from 91.2–98.4% across the 18 Concept Prevalence Study sites. Supplementary Fig. 5a presents the counts of OMOP drug ingredient concepts in the OHDSI Concept Prevalence Data by site. A chi-square test of independence with Yate’s correction revealed a significant association between the OHDSI Concept Prevalence Data sites and OMOP2OBO coverage (\({\chi }^{2}\)(17) = 195.6, p < 0.0001). Post hoc tests using Bonferroni adjustment confirmed that 22% of the pairwise OHDSI Concept Prevalence Data site comparisons had significantly different OMOP2OBO coverage (p < 0.001 for all significant comparisons). The results of this analysis are visualized as a heatmap in Supplementary Fig. 5b. The OMOP2OBO drug ingredient concept count by OBO Foundry ontology, data wave, and coverage type are shown in Supplementary Fig. 6.

Results for the 551 (12%) OHDSI Concept Prevalence Data drug ingredient concepts missing from OMOP2OBO are visualized in Fig. 8. Roughly 0.9% (n = 5) of drug ingredient concepts were accounted for using a newer version of the OMOP CDM and occurred in an average of 8.4 sites with a mean frequency of 51,732 (range 100–221,229.7). In total, 82.8% (n = 456) of drug ingredient concepts purposefully excluded from the OMOP2OBO mapping set (i.e., not yet mapped) occurred in an average of 3.9 sites with a mean frequency of 18,847.3 (range 100–1,077,258.9). The remaining drug ingredient concepts (16.3%; n = 90) were truly missing and occurred in an average of 2.7 sites with a mean frequency of 3361.2 (range 100–175,551.3). The frequency of distributions of the drug ingredient concepts covered by OMOP2OBO and Concept Prevalence drug ingredient concepts missing from OMOP2OBO in the OHDSI Concept Prevalence Data by site are visualized in Supplementary Fig. 5c, d. The five most frequently occurring missing drug ingredient concepts are shown in Table 3. Domain expert review of these drug ingredient concepts found that they were likely missing as a result of hospital vendor differences or because they were a new high-risk biologic whose safety and efficacy had not yet been tested or confirmed for use in pediatric populations. The domain experts identified comparable drug ingredient concepts in the OMOP2OBO mapping set.

The OHDSI Concept Prevalence Data contained 25,513 measurement concepts from 18 independent sites. The resulting OMOP2OBO mapping set contained 3828 eligible measurement concepts (n = 10,676 results), which covered 11.1% (67.7% weighted coverage) of the OHDSI Concept Prevalence Data measurement concepts (n = 2260 concepts; median 1355, range 100–1,465,815,430). Of the remaining measurement concepts, 1208 were only found in OMOP2OBO (median 100, range 100–1,842,485) and 20,893 were only found in the OHDSI Concept Prevalence Data (median 109, range 100–1,219,846,862). These findings are visualized in Fig. 7. OMOP2OBO measurement concept coverage ranged from 4.2–75% across the 18 OHDSI Concept Prevalence Data sites. Supplementary Fig. 7a presents the counts of OMOP measurement concepts in the OHDSI Concept Prevalence Data by site. A chi-square test of independence with Yate’s correction revealed a significant association between the OHDSI Concept Prevalence Data sites and OMOP2OBO coverage (\({\chi }^{2}\)(17) = 3872.3, p < 0.0001). Post hoc tests using Bonferroni adjustment confirmed that 56% of the pairwise OHDSI Concept Prevalence Data site comparisons had significantly different OMOP2OBO coverage (p < 0.001 for all significant comparisons). The results of this analysis are visualized as a heatmap in Supplementary Fig. 7b. The OMOP2OBO measurement concept count by OBO Foundry ontology, data wave, and coverage type are shown in Supplementary Fig. 8.

Results for the 20,893 (81.9%) OHDSI Concept Prevalence Data measurement concepts missing from OMOP2OBO are visualized in Fig. 8. Roughly 0.1% (n = 13) of measurement concepts were accounted for using a newer version of the OMOP CDM and occurred in an average of 3.2 sites with a mean frequency of 9836.3 (range 100–29,098.2). In total, 0.8% (n = 158) of measurement concepts purposefully excluded from the OMOP2OBO mapping set (i.e., not mapped test type, unspecified sample, or were unable to be mapped) occurred in an average of 5.2 sites with a mean frequency of 282,115.3 (range 100–14,317,951.9). The remaining measurement concepts (99.2%; n = 20,722) were truly missing and occurred in an average of 2.8 sites with a mean frequency of 218,874 (range 100–1,219,846,862). The frequency of distributions of the measurement concepts covered by OMOP2OBO and Concept Prevalence measurement concepts missing from OMOP2OBO in the OHDSI Concept Prevalence Data by site are visualized in Supplementary Fig. 7c, d. The five most frequently occurring missing measurement concepts (reported as the average frequency across the 18 sites and number of sites with that concept) are shown in Table 3. Domain expert review of these measurement concepts confirmed that they were likely missing due to inconsistencies in hospital use of LOINC, a finding that’s been observed in literature76. The domain experts identified comparable measurement concepts in the OMOP2OBO mapping set.

Validation: clinical utility

Many patients with a genetic disease never receive a specific diagnosis, even after genetic sequencing77,78,79,80. Longitudinal EHR data has been used to identify patients with genetic disorders81,82,83,84. Inspired by the fact that most genetic diseases manifest as a recurring pattern of multiple symptoms or phenotypes affecting multiple organ systems82, the phenotype risk score (PheRS), which measures the similarity between an individual’s diagnosis codes and phenotypic features of known genetic disorders, was developed81. While the PheRS has shown great promise for identifying patients with undiagnosed Mendelian disease from EHR data58, it requires mappings that link ICD codes to HPO concepts, which most EHRs do not contain. The existing mappings58 developed to support PheRS were manually constructed, which may limit scalability when applied to new data.

The goal of this evaluation was to determine if the OMOP2OBO mappings could be used to facilitate the application of the PheRS to EHR data and to compare their performance to an existing set of validated manual mappings. For this analysis, the OMOP2OBO HPO mappings were compared to the ICD-HPO mappings58 using data from the AoU Research Program. The AoU Data were selected for this task because it provides access to a large sample of EHR data and genetic testing results (see Supplementary Table 3 for additional details on this data source). Five genetic diseases (and their associated genes) for which diagnosis codes have been found to be of high positive predictive value in EHRs58, were examined: Marfan syndrome (FBN1 and TGFBR1), multiple endocrine neoplasia (MEN1 and RET), neurofibromatosis (NF2), paragangliomas (SDHAF2, SDHB, SDHC), and tuberous sclerosis (TSC1, TSC2). These diseases were associated with 2257 unique phenotypic features (HPO codes). When querying AoU data to identify patients who had at least one of these phenotypic features, the ICD-HPO mappings (n = 7815 ICD codes) took ~30 min to complete and returned 210,718 patients and the OMOP2OBO mappings (n = 3783 OMOP concepts) took ~10 min to complete and returned 209,342 patients. Of the 208,831 patients found in common, 1887 were only identified by the ICD-HPO mappings, and 601 patients were only identified by the OMOP2OBO mappings. When the PheRS was applied to patients from both mappings they were found to be highly correlated (r2 > 0.6 across all diseases). This suggests that the patients returned by both mappings were similar.

For additional validation, case-control studies using only the OMOP2OBO mappings were performed: Marfan syndrome (131 cases and 63,086 controls), multiple endocrine neoplasia (86 cases and 72,150 controls), neurofibromatosis (255 cases and 65,256 controls), paraganglioma (105 cases and 65,256 controls), and tuberous sclerosis (38 cases and 58,555 controls). The results of these studies are shown in Supplementary Table 8 and the distributions of PheRS scores for cases and controls for each of the five diseases are visualized in Supplementary Fig. 9. As shown in this figure, PheRS were higher for cases than controls across all examined diseases. These results are further supported by one-sided Wilcoxon rank sum tests, which indicated that the PheRS were significantly higher for cases than controls (p < 0.001 for all diseases). Collectively, these results support the use of OMOP2OBO mappings as a scalable alternative to an existing set of validated manual mappings for use with PheRS to aid in the systematic identification of patients who might benefit from genetic testing.

Discussion

In this paper we present OMOP2OBO, an algorithm that semantically aligns conditions, drug ingredients, and measurement results from standard vocabularies in the OMOP CDM to OBO Foundry ontologies. Using OMOP2OBO, we built mappings for 92,367 condition, 8615 drug ingredient, and 10,673 measurement result concepts to ontology concepts representing 9636 diseases, 6309 phenotypes, 83 anatomical entities, 2704 organisms, 4261 chemicals, 132 vaccines, and 272 proteins. The mappings were evaluated on accuracy, generalizability, and clinical utility. For the first task, a panel of 10 domain experts reviewed subsets of the manually-derived mappings from each of the OMOP domains and found that 73.9% of the condition, 70.7% of the drug ingredient, and 92.9% of the measurement result mappings were correct. For the second task, we examined the generalizability of the concepts and found that 99.5% of conditions, 99.9% of drug ingredients, and 68% of measurement results overlapped with concepts used in clinical practice from 24 independent hospitals and claims databases. For the final task, we compared OMOP2OBO HPO mappings to an existing set of validated manual mappings when used to identify patients with five rare genetic diseases using data from the AoU Research Program. Queries using the OMOP2OBO mappings identified 99.3% of the patients returned by the validated manual mappings using fewer codes and one-third of the time. To the best of our knowledge, the OMOP2OBO mappings are the largest set of publicly available mappings between clinical vocabularies and OBO Foundry ontologies. The OMOP2OBO algorithm can easily be incorporated into existing clinical workflows and presents new opportunities to advance EHR-based deep phenotyping (recently published examples are described below).

Existing work to develop mapping sets and mapping algorithms has largely focused on using ontologies to improve the phenotyping of specific diseases (e.g., infectious disease85, rare diseases86,87, and cancer88) and for the investigation of specific biological (e.g., glycobiology89) and clinical domains (e.g., laboratory test results60 and medical diagnoses61,90). Our work is most similar to LOINC2HPO60, which we have expanded in our current mapping set (with annotations to five additional OBO Foundry ontologies). OMOP2OBO complements existing phenotyping efforts like the Electronic Medical Records and Genomics or eMERGE Network (https://emerge-network.org) and the AoU Research Program, by providing access to resources not currently available in EHRs and opportunities to improve the semantic interoperability of definitions through alignment to the OBO Foundry ontologies.

A portion of the mappings that are automatically derived by OMOP2OBO overlap with existing mappings provided by the OMOP CDM, the UMLS, and the OBO Foundry ontologies. The UMLS and OMOP CDM each align more than 200 vocabularies. At the time of our analysis (and determined using the same data), only the UMLS provided mappings to an OBO Foundry ontology, which included: the Gene Ontology91 (67,807 CUIs covering 69 vocabularies and an average of 166.9 codes), HPO (16,154 CUIs covering 91 vocabularies and an average of 1668.7 codes), and NCBITaxon (1,776,212 CUIs covering 55 vocabularies and an average of 3236.1 codes). Of these mappings, only the HPO and NCBITaxon are relevant to our work. Of the 1,776,212 CUIs aligned to NCBITaxon, 1128 were mapped to LOINC and 138 were mapped to RxNorm covering 0% of the measurement and 1.1% of the drug ingredient concepts in the CHCO OMOP Database, respectively. Of the 16,154 CUIs aligned to HPO, 993 were mapped to LOINC and 18,212 were mapped to SNOMED-CT covering 0% of the measurement and 4.2% of the condition concepts in the CHCO OMOP Database, respectively. Similar to the OMOP CDM and the UMLS, some of the OBO Foundry ontologies provide mappings to vocabularies in these resources. Collectively, the eight OBO Foundry ontologies used in this work provided 489,794 unique database cross-references from 179 unique data sources. Of these, only the HPO (11,616 ontology concepts to 19,569 codes from 16 data sources), Mondo (22,110 ontology concepts to 159,918 codes from 45 data sources), CL (949 ontology concepts to 1376 codes from 29 data sources), and Uberon (10,865 ontology concepts to 51,322 codes from 91 data sources) mappings were relevant to our work. Of the 19,569 HPO and 159,918 Mondo database cross-references only 3.6% and 15.6% mapped to a condition concept in the CHCO OMOP Database, respectively. These findings highlight that while there are some existing mappings between the resources that OMOP2OBO aligns, at best, they covered only ~15% of the OMOP concepts that we aimed to map supporting the need for its development. Further, it should be noted that the vast majority of the mappings provided by the OMOP CDM, UMLS, and OBO Foundry ontologies are simple one-to-one mappings. While OMOP2OBO contributes one-to-one mappings, it also provides more complex one-to-many mappings.

The OMOP2OBO mappings have been used to characterize differences in definitions of long COVID92, generate long COVID phenotypes93,94, and improve the categorization and prediction of psychiatric diseases among patients with long COVID95. Additionally, our recent work in pediatric rare disease subphenotyping demonstrated that patient representations constructed from the OMOP2OBO mappings produced more clinically meaningful clusters than representations built using OMOP concepts alone96. We further demonstrated the value of the mappings by leveraging them to successfully integrate external gene expression data from an independent sample of pediatric patients resulting in more clinically-meaningful and biologically-actionable phenotypes than those generated using only clinical data. One potential use of OMOP2OBO is to aid in the alignment of patient data to ontologies in the Global Alliance for Genomics and Health’s Phenopacket schema97, which was designed to support the global exchange of computable patient-level phenotypic information.

OMOP2OBO has not been optimized for performance; all possible ancestors are mapped when unable to generate a mapping at the concept-level. A prioritization strategy would significantly improve performance. OMOP2OBO does not take advantage of all of the knowledge available in the UMLS. Leveraging information in the mapping and hierarchy tables could improve the automatically mapped concepts and would enable use of other UMLS-aligned resources like the SemMedDB98. We only evaluated the accuracy of a small subset of the manual mappings. It is important to evaluate the remaining manually-derived mappings as well as to provide citations from the resources from which they were derived. The Accuracy evaluation revealed limitations of our expert review procedures; some of the experts experienced challenges when trying to use the OBO ontologies, which may have negatively impacted the results. Providing better training and offering outcomes other than correct/incorrect should be considered. Finally, OMOP standard clinical vocabularies are also dependent upon a large set of CDM-specific mappings and may be subject to similar errors as our mappings.

There are two primary challenges that remain given the initial development of the OMOP2OBO algorithm and mapping set. The first challenge is to establish procedures and build infrastructure to enable community sharing, monitoring, and updating of the mappings. While the GitHub repository for the OMOP2OBO currently contains policies for contributing to the mapping algorithm, we have yet to establish an infrastructure or policies for the mappings. Future opportunities include the adoption of a system like the one utilized by the Bioregistry (https://bioregistry.io)99. The Bioregistry provides extensive governance policies and templates, which make it easy to incorporate new and modify existing identifiers. They also developed a robust, semi-automated infrastructure that facilitates review by the maintainers and triggers rebuilds of the registry anytime changes are made. To improve the shareability of the mappings, we would also like to extend the mapping output formats to include Semantic Web standards like RDF/XML and the Simple Standard for Sharing Ontological Mappings or SSSOM100. In addition to creating a system like the Bioregistry, future work may include adoption and adaptation of OBO Foundry protocols for ontology development and maintenance57,101.

The second challenge is to improve and expand the evaluation of the algorithm and the mapping set. The UMLS, OMOP CDM, and the OBO Foundry ontologies provide mappings between clinical vocabularies and ontologies, which are automatically- or manually-derived (e.g., mappings between source and standard vocabulary concepts, mappings between clinical vocabularies and ontologies, and/or database cross-references mapped to ontology concepts). While the OMOP2OBO algorithm leverages these mappings (i.e., leveraging source codes mapped to standard concepts), verifying the quality of existing mappings was not within the scope of the current work. Currently, no modules in the OMOP2OBO algorithm verify the quality of existing mappings used by OMOP2OBO or mappings generated by it. This should include resources to validate automatic mappings as their accuracy depends upon the quality of the resources from which they were built, and ontologies are subject to a variety of errors102,103,104. To do this, we might leverage pretrained language models and/or develop new machine learning models using trusted resources (e.g., the scientific literature) to verify the database cross-references provided by the OBO Foundry ontologies, UMLS, and OMOP CDM database prior to running OMOP2OBO.

Methods

OMOP2OBO is open source (https://github.com/callahantiff/OMOP2OBO), available on PyPI (https://pypi.org/project/omop2obo), and includes an interactive dashboard that summarizes the current mapping set (http://tiffanycallahan.com/OMOP2OBO_Dashboard). We also created a dedicated Zenodo Community, which provides access to data, mappings, and presentations (https://zenodo.org/communities/omop2obo). A list of the acronyms used in this paper are provided in Supplementary Table 1 and the resources used by the OMOP2OBO algorithm and mappings are described in Supplementary Table 2.

OMOP2OBO algorithm: resources

Although it is possible to apply the OMOP2OBO algorithm to any clinical vocabulary, the OMOP CDM was selected because of its rich data representation, standard vocabularies (and hierarchies) and the mappings it provides to more than 200 commonly used clinical vocabularies. To increase the coverage of the resources and the potential of an automatic mapping, OMOP2OBO leverages the National Library of Medicine’s UMLS (MRCONSO and MRSTY tables [2020AA version105])70. These data are used to annotate each OMOP concept with a UMLS CUI and a Semantic Type71. Additionally, the mappings provided by the MRCONSO table are used to enhance existing database cross-reference mappings provided by OMOP and the OBO Foundry ontologies (both described in detail in the “Input data used to create OMOP2OBO mappings” section).

OMOP2OBO algorithm: overview

The OMOP2OBO algorithm (Fig. 1) consists of the following three components: (1) Process Input Data; (2) Map OMOP Standard Vocabulary Concepts to OBO Foundry Ontology Concepts; and (3) Synthesize and Process Mapping Results and Output Mappings. Each component is described in detail below.

The first component is Processing Input Data. The algorithm takes as input a table of OMOP concepts and a list of one or more OBO Foundry ontologies. For both types of data, the algorithm expects concept or class identifiers, source codes or database cross-references, labels, synonyms, and ancestor concepts or classes. While the algorithm expects a table of input OMOP concepts (due to the private nature of clinical data, the algorithm does not assume a direct database connection is possible), it automatically downloads the OBO Foundry ontologies using OWLTools (April 06, 2020 release; https://github.com/owlcollab/owltools).

The second component is Mapping OMOP Vocabulary Concepts to OBO Foundry Ontology Concepts. This component is designed to automatically map or align OMOP concepts to OBO Foundry ontology concepts. The algorithm includes several different approaches, prioritizing those that result in high-confidence mappings. This component includes concept alignment and concept embedding.

  • Concept alignment: exact-string matches between OMOP and OBO Foundry ontology concept labels, definitions, and synonyms are obtained. Prior to alignment, the label and synonym fields are both made lowercase. This step also obtains exact matches between OMOP standard concepts and source codes to OBO Foundry ontology database cross-references. To increase the likelihood of finding a match, the OMOP standard concepts and source codes are first merged with terminologies in the UMLS using core functionality from OHDSI Ananke106, a program developed to align OMOP concepts to UMLS CUIs. Prior to performing this alignment, the OMOP standard concepts and source codes and the OBO Foundry ontology database cross-references are normalized using a custom dictionary (source_code_vocab_map.csv107). This resource ensures that concepts referenced by the same code using different prefixes or symbols can be aligned (e.g., SNOMED-CT:1234567 and sctid:1234567).

  • Concept embedding: using scikit-learn108, a bag-of-words (BoW)109 vector space model with term-frequency inverse-document frequency (TF-IDF)110 and L2 normalization is used to learn concept embeddings for all OMOP and OBO Foundry ontology concepts and concept ancestors label and synonym text strings. While the BoW model was used because it is easy to understand and has shown great success when applied to EHR data and when used to align biomedical ontologies111,112, any language or embedding model could be utilized. The BoW model is implemented as an N × M document-term matrix with one row per input string and one column for each tokenized word appearing in the universe of all input strings. The value of each cell in the matrix is the normalized frequency each word occurred in each input string (using TF-IDF normalization). Prior to building the model, all text fields are made lowercase, stop words are removed using the wordnet list from Python’s NLTK library113, white spaces are removed, and word-level tokenization and lemmatization are applied. After learning the model, a final embedding is constructed for each input string by aggregating the constituent concept embeddings. Cosine similarity is used to compute scores between all pairwise combinations of OMOP and OBO Foundry concept embeddings. Given that each OMOP and OBO concept can have a label and one or more synonyms, only the single highest-scoring pairwise comparison is selected for the final mapping. Cosine similarity scores range from 0–1, where a score of one indicates a greater match between the embedding pairs. To improve the efficiency of this process, only the top 75% of pairs with scores ≥0.25 are output, which was decided after visualizing the score distribution using a histogram. All thresholds and cut-offs are customizable. Concept embeddings are created for all OMOP concepts, regardless of whether or not they were automatically mapped by a prior Component. All remaining unmapped concepts require manual curation.

The third component is Synthesizing and Outputting Mapping Results. The mapping results from the prior component are post-processed to include a mapping category and human-readable evidence. The mapping category is constructed by combining the following elements: (1) one or more OBO Foundry ontology identifiers and labels; (2) mapping logic applied to specify semantics when there are multiple ontology concepts (i.e., “and”, “or”) or to denote negation (i.e., “not”); (3) a mapping category derived from the mapping approach (e.g., automatically determined using an algorithm or manually derived by a human annotator), cardinality (i.e., one-to-one aligning a single OMOP concept to a single OBO Foundry ontology concept or one-to-many aligning a single OMOP concept to one or more OBO Foundry ontology concepts), and level (i.e., mapping to the OMOP concept directly or to one of its ancestors); and (4) mapping evidence represented as a pipe-delimited string that denotes all resources that support the mapping (i.e., the exact string matches between labels and synonyms, source codes and database cross-reference alignments, and other sources supporting a mapping like scored heuristics and references from manual review). Supplementary Table 4 provides additional details on and examples of the mapping categories. Post-processed mappings are serialized and able to be output to a variety of file types, like flat file, database dump, or RDF/XML file.

Input data used to create OMOP2OBO mappings

The OMOP2OBO mappings were constructed from two data sources: (1) the CHCO OMOP Database and (2) OBO Foundry ontologies (both are described in detail below). Figure 2 includes example mappings and illustrates how the OBO Foundry ontologies were used to map OMOP concepts from each domain. Supplementary Table 4 provides additional details on and examples of the mapping categories. Supplementary Table 3 provides descriptions of the clinical data sources used to build and validate the OMOP2OBO mappings.

Input data used to create OMOP2OBO mappings: OMOP data

The OMOP2OBO mappings were constructed using data from the CHCO OMOP Database, a de-identified database that contained data from more than six million pediatric patients. The CHCO OMOP Database is stored within University of Colorado Anschutz Medical Campus Health Data Compass’ Health Insurance Portability and Accountability Act compliant Google Cloud-based infrastructure (created in October 2018; https://www.healthdatacompass.org). The data conformed to the structure defined by the National Pediatric Learning Health System (PEDSnet) OMOP CDM v3.0, which is an adaptation of the OMOP CDM version 5.053,62. Data were obtained from a de-identified database that was determined by the Colorado Multiple Institutional Review Board to be non-human subjects (#15-0445). Due to the broad scope of the projects approved to be performed on this database, a Waiver of consent was obtained as it was not practical to obtain consent from all patients.

Concept lists were derived from standard OMOP vocabularies (i.e., SNOMED-CT [https://www.snomed.org; v20180131], RxNorm52 [v20180507], and LOINC51 [v2.64]) from the Condition Occurrence, Drug Exposure (at the drug ingredient level), and Measurement tables, respectively. For each concept set, metadata were extracted from the OMOP CDM including concept codes (i.e., codes from each standard vocabulary), labels, synonyms, and ancestor concepts (codes, labels, and synonyms were also extracted for each concept ancestor). Concept lists were organized into two data waves according to whether or not they had been used in clinical practice (i.e., Concepts Used in Practice and Concepts Not Used in Practice). As manual annotation requires significant resources, only concepts from the first data wave (i.e., Concepts Used in Practice) were manually mapped. Prior to constructing the concept lists, Condition and Measurement data were preprocessed to ensure the mapping process was robust and reproducible.

For condition concepts, Concepts used in Practice, UMLS Semantic Types were used to identify all concepts that had a clear pathological or biological origin. All remaining concepts (e.g., accidents, injuries, external complications, and findings without clear interpretations) were marked as unmapped and the reason for exclusion was provided in the evidence field. The Semantic Types were also used to group OMOP concepts such that those typed as “Findings” or “Signs and Symptoms” were treated as phenotypes and only mapped to HPO and concepts typed as “Disease or Syndrome” were only mapped to Mondo. For Concepts Not Used in Clinical Practice, all possible automatic mappings were obtained and concepts which were unable to be mapped automatically were marked as unmapped and “NOT YET MAPPED” was provided as the mapping evidence. This same approach was applied to drug ingredients.

For all measurement concepts, a scale and result type were created. The scale (i.e., ordinal, nominal, quantitative, qualitative, narrative, doc, and panel) of each measurement was identified from the OMOP CDM or by parsing the concept synonym field. For all Concepts Used in Practice, reference ranges were used to determine the result type; concepts with numeric reference ranges were typed as “Normal/Low/High” and concepts with reference ranges that included “positive” or “negative” were typed as “Positive/Negative”. Concepts Not Used in Practice with an ordinal scale or with synonyms that contained the words “presence” or “screen” were typed as “Positive/Negative”. Concepts with a quantitative scale were typed as “Normal/Low/High”. All other scale types were typed as “Unknown Result Type”. While it is possible to infer the result type from the scale type (e.g., all concepts with a quantitative scale have result type “Normal/Low/High” and all concepts with an ordinal scale have result type “Positive/Negative”), our approach aimed to maximize the inclusion of concepts from all scale types. Mappings were created for each result type using the procedures defined by LOINC2HPO60; results were annotated with respect to their result type:

  • Concepts with result type “Normal/Low/High”: for example, “Corticotropin [Mass/volume] in Plasma—4th specimen post XXX challenge” (LOINC:12460-2). Results above the reference range are mapped to “Increased Circulating ACTH Level” (HP:0003154). Results below the reference range are mapped to “Decreased Circulating ACTH Level” (HP:0002920). Results within the reference are mapped to “Abnormality of Circulating Adrenocorticotropin Level” and logically negated (NOT HP:0011043).

  • Concepts with result type “Positive/Negative”: for example, “Amphetamine [Presence] in Urine by Screen Method” (LOINC:19343-3). Positive results are mapped to “Positive Urine Amphetamine Test” (HP:0500112). Negative results are mapped to “Positive Urine Amphetamine Test” and logically negated (NOT HP:0500112).

Also consistent with the procedures adopted by LOINC2HPO, all concepts lacking sufficient detail (i.e., non-specific body substances) were marked as unmapped and “Unspecified Sample” was provided as the mapping evidence.

The initial set of measurement concepts was supplemented with LOINC2HPO annotations60, which were downloaded on August 2, 2020 from the LOINC2HPO annotation Github repository114. OMOP2OBO expands the LOINC2HPO mappings by including the measurement substance (i.e., body fluids, tissues, and organs via Uberon), the entity being measured (i.e., chemicals, metabolites, or hormones via ChEBI; cell types via CL; and proteins via PRO), and the species of the measured entity (i.e., organism taxonomy via NCBITaxon). All modifications to the original LOINC2HPO annotations were recorded in the mapping evidence field, enabling users to easily identify when an original LOINC2HPO annotation had been updated. All LOINC concepts in the LOINC2HPO mappings that were not used at least once in clinical practice in the CHCO pediatric OMOP Database were categorized as a Concept Not Used in Practice.

Input data used to create OMOP2OBO mappings: OBO Foundry ontologies

OBO Foundry ontologies were selected under the advice of several clinicians, molecular biologists, and professional OBO Foundry biocurators to cover the following domains: diseases (Mondo115 [v2020-09-14]), phenotypes (HPO59 [v2020-08-11]), anatomical entities (CL65 [v2020-05-21], Uberon64 [v2020-06-30]), organisms (NCBITaxon66 [v2020-04-18]), chemicals (ChEBI67 [v191]), vaccines (VO68 [v1.1.102]), and proteins (PRO69 [v61.0]). Similar to the clinical concepts, each ontology was queried to obtain labels, definitions, synonyms (including synonym type), and database cross-references. All OBO Foundry ontologies were downloaded in September 2020 using OWLTools (April 06, 2020 release; https://github.com/owlcollab/owltools).

Mapping evaluation

The OMOP2OBO mappings were evaluated by assessing their accuracy, generalizability, and clinical utility.

Mapping evaluation: accuracy

Automatic mappings are created from exact alignments between resources available in the OMOP CDM and the OBO Foundry ontologies and thus are assumed to be accurate and high-confidence mappings. The goal of this evaluation was to examine the accuracy of a portion of the manually-derived mappings. For conditions and drug ingredients, of all manual mappings (including one-to-one and one-to-many), 20% were randomly selected for manual review (n = 2000 conditions; n = 116 drug ingredients) by a practicing resident physician (J.M.W.) and clinical pharmacist (J.S.), respectively.

Measurement mappings are significantly more complex as they require interpreting lab test results and annotating the source of the sample (e.g., bodily fluid, anatomical entity, or cell type), entity being measured (e.g., chemical or cell type), and organism of the measured entity. While annotating the samples and entities is straightforward, interpreting lab tests results and aligning them to HPO concepts can be challenging. As a result, only the HPO mappings were evaluated by domain experts. These mappings were evaluated in two ways: (1) Survey. A subset of the mappings (n = 270) were independently validated by five domain experts including three practicing pediatric clinicians (T.D.B., J.A.F., and B.M.), a PhD-level molecular biologist (A.L.S.), and a Computational Biology PhD candidate with Masters-level training in epidemiology and biostatistics (T.J.C.) using a Qualtrics Survey116. Any mapping that did not meet agreement by at least one clinician and both the biologist and the epidemiologist were re-evaluated by the most senior clinician. These mappings were also vetted on the LOINC2HPO GitHub tracker (https://github.com/TheJacksonLaboratory/loinc2hpoAnnotation/issues) by members of the biocuration team. (2) Biocurator validation. A random subset of 1350 measurement results were manually verified by an OBO Foundry biocurator.

All of the manual mappings were derived by a member of the research team who at the time of the analysis was a Computational Biology PhD candidate with Masters-level training in epidemiology and biostatistics (T.J.C.). As this individual does not have specialized medical or pharmacological training, it is assumed that these mappings may contain errors. Additional details are provided on GitHub (https://github.com/callahantiff/OMOP2OBO/wiki/Accuracy).

Mapping evaluation: generalizability

The generalizability of the OMOP2OBO mappings were examined using the OHDSI Concept Prevalence Study data72,73,74,75. The Concept Prevalence study provides data on the frequency of OMOP concept usage in clinical practice across several independent sites in the OHDSI network. In addition to the Concept Prevalence Study sites, data were obtained from two independent academic medical centers, bringing the total number of sites to 24. None of the 24 sites overlapped with the site that was used to generate the OMOP2OBO mappings. Consistent with the Concept Prevalence Study procedures, all concepts from the OMOP CHCO Database occurring fewer than 100 times were assigned a count of 100. For all other analyses, the true range of counts in the OMOP CHCO Database were utilized. The OMOP2OBO mappings were filtered to remove all concepts without at least one ontology mapping. Coverage of all standard OMOP concepts in the OMOP2OBO mapping set was assessed by identifying: (1) concepts that existed in the OMOP2OBO set and in at least one Concept Prevalence Study site (i.e., Overlap); (2) concepts only present in the OMOP2OBO set (i.e., OMOP2OBO Only); and (3) concepts only present in the Concept Prevalence Study set (i.e., Concept Prevalence Only). Institutional review board approval was not required to use these data as the dataset was completely de-identified and contained no patient-level information.

An error analysis was performed to examine the Concept Prevalence Only concept set. Three scenarios were examined: (1) “Recovered in Newer Version of CDM”: concepts that could be recovered using a newer version of the OMOP CDM (v5.3.1; 02/25/2022); (2) “Purposefully Excluded”: concepts without clear pathological or biological origin that were purposefully excluded from the OMOP2OBO mapping set; and (3) “Truly Missing”: concepts that could not be accounted for using the prior two scenarios. For all scenarios, concept frequency within the Concept Prevalence Study sites was used as a measure of concept importance. Findings from each scenario were reviewed by a practicing resident physician and a clinical pharmacist. See GitHub for additional details (https://github.com/callahantiff/OMOP2OBO/wiki/Generalizability).

Mapping evaluation: clinical utility

The clinical utility of the OMOP2OBO mappings was compared to an existing set of validated manual mappings (ICD-HPO mappings58) when used to identify undiagnosed rare disease patients. For this analysis, AoU Data (https://www.researchallofus.org) was selected because it provides access to a large sample of EHR data with genetic testing results. For this evaluation, the version 6 build was used, which contained data from ~630 sites on more than 528,000 patients. Five genetic diseases for which diagnosis codes have been found to be of high positive predictive value in EHRs58 were selected, which included: Marfan syndrome, multiple endocrine neoplasia, neurofibromatosis, paraganglioma, and tuberous sclerosis. These diseases are associated with 11 of the 73 American College of Medical Genetics and Genomics (ACMG) secondary finding genes (ACMG-73; v3.0), which have specific mutations known to cause disorders, have well-defined phenotypes, and are clinically actionable117. The diseases and associated genes included: FBN1 and TGFBR1 (Marfan syndrome); MEN1 and RET (multiple endocrine neoplasia); NF2 (neurofibromatosis); SDHAF2, SDHB, and SDHC (paragangliomas); and TSC1, TSC2 (tuberous sclerosis). Using the Online Mendelian Inheritance in Man (OMIM; https://www.omim.org) database and the HPO gene annotation table118, each gene and its corresponding set of phenotypic features were aligned to the HPO. To calculate the phenotypic burden of each genetic disease, HPO mappings to OMOP condition concepts from OMOP2OBO (v2.0.0 beta) and ICD concepts from a validated set of ICD-HPO mappings58 were queried against the AoU data. PheRS for each gene were then calculated for patients from each the OMOP2OBO and Phecode mapping sets. The PheRS81 is an algorithm used to identify patients with phenotypic features that are clinically similar to OMIM Mendelian profiles but who lack formal diagnosis and has demonstrated utility for identifying underdiagnosed rare disease patients using only EHR data58,81. The standardized version of the PheRS was used because it is easier to interpret and reduces noise when it is suspected that a large number of phenotypes will overlap between cases and controls81. The OMOP2OBO and ICD-HPO mappings were compared and evaluated on time to complete the query against the AoU Data and differences in the returned patient cohorts. As validation, case-control studies were performed for each of the five diseases using the patients returned from the OMOP2OBO mappings. Cases were defined as patients with at least two occurrences of a relevant diagnosis code and control patients had no instances of these codes. Cases and controls were matched on age, sex, and length of EHR record. For each disease, a one-sided Wilcoxon rank sum test was performed in order to determine if PheRS were significantly higher for cases than controls. Results were verified by a PhD-level Epidemiologist specializing in genetics (C.Z.).

All analyses were performed in the AoU Researcher Workbench (https://www.researchallofus.org/data-tools/workbench) by an authorized researcher (C.Z.). Informed consent is obtained from all participants who enroll in the AoU Research program119. Because the authors were not directly involved with the participants and all data were de-identified, the use of these data was exempt from institutional review. For additional details, see “Do I need my project reviewed by the AoU Institutional Review Board (IRB) in order to access this data using the Researcher Workbench?” for more information (https://www.researchallofus.org/frequently-asked-questions/#workbench-faqs).

Statistics and technical specifications

OMOP2OBO was developed using Python 3.6.2 on a single machine with 8 cores and 16GB of RAM. All code and project information are publicly available and detailed on GitHub (https://github.com/callahantiff/OMOP2OBO). The OMOP2OBO (v1.0) mappings are publicly available from Zenodo120,121,122. The OMOP2OBO Mapping Dashboard was built with R (v4.2.1) using Rmarkdown (v2.14) and flexdashboard (v0.5.2).

Descriptive and inferential statistics were performed to evaluate the data available for mapping and the OMOP2OBO mapping set. Chi-square tests of independence with Yate’s correction were used to: (1) assess differences in the proportions of metadata available from each OBO Foundry ontology; and (2) assess differences in the proportions of mapped concepts between OHDSI Concept Prevalence sites. Post hoc tests using Bonferroni adjustment to correct for multiple comparisons were performed for significant omnibus tests. Analyses were performed in Jupyter Notebooks (v6.1.6) using the scipy (v1.4.1), statsmodels (v0.12.1), statistics (v1.0.3.5), and numpy (v.1.18.1) libraries. Visualizations were created using matplotlib (v.3.3.2). The “Clinical Utility” evaluation was performed in the AoU Researcher Workbench (https://www.researchallofus.org/data-tools/workbench) using R (v4.1.2) and Python (v3.7). Analyses were performed on a machine with 16 CPUs and 60GB of memory.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.