Ontologizing health systems data at scale: making translational discovery a reality

Common data models solve many challenges of standardizing electronic health record (EHR) data but are unable to semantically integrate all of the resources needed for deep phenotyping. Open Biological and Biomedical Ontology (OBO) Foundry ontologies provide computable representations of biological knowledge and enable the integration of heterogeneous data. However, mapping EHR data to OBO ontologies requires significant manual curation and domain expertise. We introduce OMOP2OBO, an algorithm for mapping Observational Medical Outcomes Partnership (OMOP) vocabularies to OBO ontologies. Using OMOP2OBO, we produced mappings for 92,367 conditions, 8,611 drug ingredients, and 10,673 measurement results, which covered 68–99% of concepts used in clinical practice when examined across 24 hospitals. When used to phenotype rare disease patients, the mappings helped systematically identify undiagnosed patients who might benefit from genetic testing. By aligning OMOP vocabularies to OBO ontologies, our algorithm presents new opportunities to advance EHR-based deep phenotyping.


INTRODUCTION
Electronic health record (EHR) adoption, which is nearly universal within the US healthcare system, 1,2 has increased adherence to evidence-based clinical guidelines 3 and facilitated greater patient communication 4 resulting in significant improvements in care. 5 EHRs contain a myriad of systematically collected, longitudinal, patient-level information and are a valuable resource for population-level research. 6 The cornerstone of medicine, diagnosis or clinical phenotyping, aims to identify empirically observable traits exhibited by patients (i.e., signs and symptoms) known to be characteristic of a specific disease. 7 Computational phenotyping is the process of converting clinical phenotypes into computer-executable algorithms in order to identify relevant patients from large sources of clinical data, usually EHRs. 8 One promise of EHR-based computational phenotyping is the ability to perform population-level investigations of mechanistic drivers of disease in diverse patient populations. 9,10 Despite significant progress, this objective remains largely aspirational. 6,[11][12][13][14] Traditionally, computational phenotypes have been imprecise due to their exclusive reliance on EHR data, which has been shown to be insufficient at capturing the phenotypic heterogeneity present in most complex diseases. [15][16][17][18] Deep phenotyping, or "the precise and comprehensive analysis of phenotypic abnormalities in which the individual components of the phenotype are observed and described", 7 is a fundamental component of precision medicine that requires timely synthesis of multiple types of patient data. 19,20 Deep phenotyping has been successfully applied to rare disease and genetic disorders, [21][22][23][24][25][26][27][28][29][30][31][32][33] cancer, [34][35][36][37][38][39][40] and pregnancy [41][42][43] using a variety of clinical and -omic data. 
Despite large-scale biobanking efforts and resources like the UK Biobank 44 and the All of Us Research Program, 45 most EHRs do not systematically integrate, nor have the infrastructure to integrate, patient-level genomic data or other forms of external knowledge (e.g., scientific literature) with clinical data. [46][47][48] Within an EHR, most data used for research (i.e., structured data) are stored using clinical terminologies or vocabularies. Clinical vocabularies are defined as a standard representation of preferred terms, which may or may not be hierarchical or have formally defined relationships, and are designed to facilitate meaningful and unambiguous information exchange within the medical domain. [49][50][51] Hundreds of clinical vocabularies have been developed and their use differs by hospital and country. Examples include the International Classification of Diseases (ICD), 52 the Logical Observation Identifiers, Names and Codes (LOINC), 53 the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT), 54 and RxNorm. 55 Most clinical vocabularies were not designed to be integrated or interoperable with other vocabularies, which is one of the long-standing barriers preventing the secondary use of EHR data for research. 48 Common data models (CDMs) like the Observational Medical Outcomes Partnership (OMOP) 56 have solved many of the challenges of standardizing, representing, and utilizing clinical EHR data. Unfortunately, most CDMs and associated terminology management systems are not yet able to integrate and interpret genomic data or other sources of external knowledge or publicly available data. 57 Similar to clinical vocabularies, ontologies are classification systems that provide detailed representations of our knowledge of a specific domain. 51
Ontologies, like those in the Open Biological and Biomedical Ontology (OBO) Foundry, exist for nearly all scales of biological organization and, when combined, can provide a semantically rich and biologically accurate representation of molecular entities and mechanisms. [58][59][60] Unlike clinical vocabularies, ontologies are semantically computable and interoperable with formally defined relationships, which means they can be logically verified and integrated with data from basic science and clinical research. 51 Mapping clinical vocabularies to ontologies has been recognized as a fundamental requirement for use in deep phenotyping. 20,48,51,61 An example of how aligning these resources improves deep phenotyping was recently demonstrated by Zhang et al. (2019), 62 who mapped LOINC 53 to the Human Phenotype Ontology (HPO), 63 which enabled the harmonization of laboratory tests with different clinical codes to common HPO concepts.
Due to the time-consuming manual effort required to map clinical vocabularies to OBO Foundry ontologies, no comprehensive mapping across commonly used ontologies currently exists. While automated mapping approaches exist, they largely remain unable to correctly capture the complex semantics underlying clinical data and the knowledge encoded by clinical vocabulary concepts. 64 Building on LOINC2HPO, the goal of this work is to develop OMOP2OBO, an algorithm that enables semantically interoperable mappings between clinical vocabularies in the OMOP CDM and OBO Foundry ontologies (Figure 1). The resulting mappings will enhance the semantic interoperability of the data represented by the OMOP concepts and have the potential to advance deep EHR-based phenotyping by enabling the identification of relevant patients using our knowledge of the molecular mechanisms underlying disease rather than billing codes, which are prone to error and subject to bias. Using OMOP2OBO, we created the first healthcare system-scale mappings between clinical vocabularies in the OMOP CDM and eight of the most widely used OBO Foundry ontologies. 59 Table 1 lists the acronyms and definitions used in the paper. The resources used to build and evaluate the OMOP2OBO algorithm and mappings are described in Supplementary Table 2 and Supplementary Table 3.

OMOP Data
The OMOP2OBO mappings were created using a de-identified pediatric dataset from Children's Hospital Colorado (CHCO) normalized to the OMOP CDM (referred to throughout the manuscript as the "CHCO OMOP Database" and described in detail in Supplementary Table 3). 56

OBO Foundry Ontologies
Under the guidance of domain experts, eight OBO Foundry ontologies were selected to represent the following domains: diseases (Mondo 67 73 ). Each set of ontology concepts also included metadata, which was obtained by querying each ontology for labels, definitions, synonyms, and database cross-references (i.e., codes from other vocabularies and ontologies). The amount of metadata available for mapping is shown in Table 1 used to map concepts from each OMOP domain. As illustrated by this figure, OMOP conditions were mapped to HPO and Mondo, OMOP drug ingredients were mapped to ChEBI, NCBITaxon, PRO, and VO, and OMOP measurement results were mapped to HPO, Uberon, NCBITaxon, PRO, ChEBI, and CL. As illustrated in the bottom panel of Figure 1, each mapping consists of four elements: (I) the approach used to create it (i.e., "automatic", "manual", or "cosine similarity"); (II) cardinality (i.e., one-to-one or one-to-many); (III) level (i.e., concept or ancestor);

Conditions
Unified Medical Language System (UMLS) 74 concept unique identifiers (CUIs) were found for 96.6% of condition concepts (n=105,976) representing 69 unique Semantic Types. 75 The mapping results for each OBO Foundry ontology are displayed in Figure 3 and detailed in Supplementary

Measurements
UMLS CUIs were found for 94.8% of measurement concepts (n=3869) representing a single Semantic Type. The mapping results for each OBO Foundry ontology are displayed in Figure 3 and detailed in Supplementary Figure 6. The majority of the

Accuracy
The goal of this task was to verify the accuracy of randomly selected sets of manual one-to-one

Generalization
The goal of this evaluation was to characterize the generalizability, or coverage, of concepts in the OMOP2OBO mapping set relative to a set of OMOP standard concepts that are commonly utilized in clinical practice. The Observational Health Data Sciences and Informatics (OHDSI) Concept Prevalence Study contains OMOP standard concepts that are commonly utilized in practice from several independent study sites across the OHDSI network (see Supplementary Table 3 for more information). [76][77][78][79] For this evaluation, we leveraged data (referred to throughout the remainder of the manuscript as the "OHDSI Concept Prevalence Data") from 24 independent study sites, which included hospitals, academic medical centers, and claims databases. For this analysis, the OMOP2OBO mappings were filtered to identify all concepts with at least one valid mapping (i.e., excluding unmapped and not yet mapped concepts) across all of the ontologies mapped within each OMOP domain.
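The filtering and coverage calculation described above can be sketched as follows. This is a minimal illustration rather than the OMOP2OBO implementation: the function names, the concept identifiers, and the toy data are all hypothetical, and an empty target list is assumed to stand for an unmapped or not yet mapped concept.

```python
from typing import Dict, List, Set


def valid_mapped_concepts(mappings: Dict[int, List[str]]) -> Set[int]:
    """Return concept ids that have at least one valid ontology mapping.

    ASSUMPTION: `mappings` pairs a hypothetical OMOP concept id with its
    ontology identifiers; an empty list means unmapped or not yet mapped.
    """
    return {cid for cid, targets in mappings.items() if targets}


def coverage(prevalence_concepts: Set[int], mapped: Set[int]) -> float:
    """Fraction of concepts used in practice that have a valid mapping."""
    if not prevalence_concepts:
        return 0.0
    return len(prevalence_concepts & mapped) / len(prevalence_concepts)


# Toy data: ids and ontology targets are illustrative only.
toy_mappings = {1: ["HP:0000001"], 2: [], 3: ["MONDO:0000001", "HP:0000002"]}
site_concepts = {1, 2, 3, 4}
mapped = valid_mapped_concepts(toy_mappings)
print(coverage(site_concepts, mapped))
```

In this toy case two of the four site concepts have a valid mapping, so the coverage is one half. Computing the same ratio per OMOP domain and per site would give the kind of coverage range reported in the abstract.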

Conditions
The    Table 2. Domain expert review determined these concepts were likely missing due to differences in patient populations and coding practices. The domain experts identified comparable condition concepts in the OMOP2OBO mapping set.

Drug Ingredients
The   Table 2. Domain expert review of these concepts found that they were likely missing as a result of hospital vendor differences or because they were a new high-risk biologic whose safety and efficacy had not yet been tested or confirmed for use in pediatric populations. The domain experts identified comparable drug ingredient concepts in the OMOP2OBO mapping set.

Measurements
The   Table 2. Domain expert review of these concepts confirmed that they were likely missing due to inconsistencies in hospital use of LOINC, a finding that has been observed in the literature. 80 The domain experts identified comparable measurement concepts in the OMOP2OBO mapping set.

Clinical Utility
Many patients with a genetic disease never receive a specific diagnosis, even after genetic sequencing. [81][82][83][84] Longitudinal EHR data has been used to identify patients with genetic disorders. [85][86][87][88] Inspired by the fact that most genetic diseases manifest as a recurring pattern of multiple symptoms or phenotypes affecting multiple organ systems, 86 Figure 9. As shown in this figure, PheRS were higher for cases than controls across all examined diseases. These results are further supported by one-sided Wilcoxon rank sum tests, which indicated that the PheRS were significantly higher for cases than controls (p<0.001 for all diseases). Collectively, these results support the use of OMOP2OBO mappings as a scalable alternative to an existing set of validated manual mappings for use with PheRS to aid in the systematic identification of patients who might benefit from genetic testing.

DISCUSSION
In this paper we present OMOP2OBO, an algorithm that semantically aligns conditions, drug

Related Work
Existing work to develop mapping sets and mapping algorithms has largely focused on using ontologies to improve the phenotyping of specific diseases (e.g., infectious disease, 90 rare diseases, 91,92 and cancer 93 ) and for the investigation of specific biological (e.g., glycobiology 94 ) and clinical domains (e.g., laboratory test results 62 and medical diagnoses 64,95 ). These findings highlight that, while there are some existing mappings between the resources that OMOP2OBO aligns, at best they covered only ~15% of the OMOP concepts that we aimed to map, supporting the need for its development. Further, it should be noted that the vast majority of the mappings provided by the OMOP CDM, UMLS, and OBO Foundry ontologies are simple one-to-one mappings. While OMOP2OBO also contributes one-to-one mappings, it additionally provides more complex one-to-many mappings.

Applications
The OMOP2OBO mappings have been used to characterize differences in definitions of long COVID, 97 generate long COVID phenotypes, 98,99 and improve the categorization and prediction of psychiatric diseases among patients with long COVID. 100 Additionally, our recent work in pediatric rare disease subphenotyping demonstrated that patient representations constructed from the OMOP2OBO mappings produced more clinically meaningful clusters than representations built using OMOP concepts alone. 101 We further demonstrated the value of the mappings by leveraging them to successfully integrate external gene expression data from an independent sample of pediatric patients, resulting in more clinically meaningful and biologically actionable phenotypes than those generated using only clinical data.
One potential use of OMOP2OBO is to aid in the alignment of patient data to ontologies in the Global Alliance for Genomics and Health's Phenopacket schema, 102 which was designed to support the global exchange of computable patient-level phenotypic information. This work was discussed during the 2021 ELIXIR European BioHackathon. 103

Limitations
OMOP2OBO has not been optimized for performance; when a mapping cannot be generated at the concept level, all possible ancestors are mapped. A prioritization strategy would significantly improve performance. OMOP2OBO does not take advantage of all of the knowledge available in the UMLS. Leveraging information in the mapping and hierarchy tables could improve the automatically mapped concepts and would enable use of other UMLS-aligned resources like SemMedDB. 104 We only evaluated the accuracy of a small subset of the manual mappings. It is important to evaluate the remaining manually derived mappings as well as to provide citations from the resources from which they were derived. The Accuracy evaluation revealed limitations of our expert review procedures; some of the experts experienced challenges when trying to use the OBO ontologies, which may have negatively impacted the results. Providing better training and offering outcomes other than correct/incorrect should be considered. Finally, OMOP standard clinical vocabularies are also dependent upon a large set of CDM-specific mappings and may be subject to similar errors as our mappings.

Future Work
There are two primary challenges that remain given the initial development of the OMOP2OBO algorithm and mapping set. The first challenge is to establish procedures and build infrastructure to enable community sharing, monitoring, and updating of the mappings. While the GitHub repository for OMOP2OBO currently contains policies for contributing to the mapping algorithm, we have yet to establish an infrastructure or policies for the mappings.
Future opportunities include the adoption of a system like the one utilized by the Bioregistry (https://bioregistry.io/). 105 The Bioregistry provides extensive governance policies and templates, which make it easy to incorporate new and modify existing identifiers. They also developed a robust, semi-automated infrastructure that facilitates review by the maintainers and triggers rebuilds of the registry anytime changes are made. To improve the shareability of the mappings, we would also like to extend the mapping output formats to include Semantic Web standards like RDF/XML and the Simple Standard for Sharing Ontological Mappings (SSSOM). 106 In addition to creating a system like the Bioregistry, future work may include adoption and adaptation of OBO Foundry protocols for ontology development and maintenance. 60,107 The second challenge is to improve and expand the evaluation of the algorithm and the mapping set. The UMLS, OMOP CDM, and the OBO Foundry ontologies provide mappings between clinical vocabularies and ontologies, which are automatically or manually derived (e.g., mappings between source and standard vocabulary concepts, mappings between clinical vocabularies and ontologies, and/or database cross-references mapped to ontology concepts).
While the OMOP2OBO algorithm leverages these mappings (i.e., leveraging source codes mapped to standard concepts), verifying the quality of existing mappings was not within the scope of the current work. Currently, no modules in the OMOP2OBO algorithm verify the quality of existing mappings used by OMOP2OBO or mappings generated by it. Future verification modules should include resources to validate automatic mappings, as their accuracy depends upon the quality of the resources from which they were built, and ontologies are subject to a variety of errors. [108][109][110] To do this, we might leverage pretrained language models and/or develop new machine learning models using trusted resources (e.g., the scientific literature) to verify the database cross-references provided by the OBO Foundry ontologies, UMLS, and OMOP CDM database prior to running OMOP2OBO.

Algorithm Resources
Although it is possible to apply the OMOP2OBO algorithm to any clinical vocabulary, the OMOP CDM was selected because of its rich data representation, standard vocabularies (and  74 . These data are used to annotate each OMOP concept with a UMLS CUI and a Semantic Type. 75 Additionally, the mappings provided by the MRCONSO

Algorithm Overview
The OMOP2OBO algorithm (Figure 1)   semantics when there are multiple ontology concepts (i.e., "and", "or") or to denote negation (i.e., "not"); (III) a mapping category derived from the mapping approach (e.g., automatically determined using an algorithm or manually derived by a human annotator), cardinality (i.e., one-to-one aligning a single OMOP concept to a single OBO Foundry ontology concept or one-to-many aligning a single OMOP concept to one or more OBO Foundry ontology concepts), and level (i.e., mapping to the OMOP concept directly or to one of its ancestors); and (IV) mapping evidence represented as a pipe-delimited string that denotes all resources that support the mapping (i.e., the exact string matches between labels and synonyms, source codes and database cross-reference alignments, and other sources supporting a mapping like scored heuristics and references from manual review). Supplementary

Input Data Used to Create OMOP2OBO Mappings
The OMOP2OBO mappings were constructed from two data sources: (i) the CHCO OMOP Database and (ii) OBO Foundry ontologies (both are described in detail below). Figure 2 includes example mappings and illustrates how the OBO Foundry ontologies were used to map OMOP concepts from each domain. Supplementary

OMOP Data
The OMOP2OBO mappings were constructed using data from the CHCO OMOP Database, a de-identified database that contained data from more than 6 million pediatric patients. Measurement data were preprocessed to ensure the mapping process was robust and reproducible.

Conditions
For Concepts Used in Practice, UMLS Semantic Types were used to identify all concepts that had a clear pathological or biological origin. All remaining concepts (e.g., accidents, injuries, external complications, and findings without clear interpretations) were marked as unmapped and the reason for exclusion was provided in the evidence field. The Semantic Types were also used to group OMOP concepts such that those typed as "Findings" or "Signs and Symptoms" were treated as phenotypes and only mapped to HPO, and concepts typed as "Disease or Syndrome" were only mapped to Mondo. For Concepts Not Used in Practice, all possible automatic mappings were obtained; concepts that could not be mapped automatically were marked as unmapped and "NOT YET MAPPED" was provided as the mapping evidence. This same approach was applied to drug ingredients.
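The Semantic Type routing for Concepts Used in Practice can be illustrated with a short sketch. It is not the authors' code: the function name `route_condition`, the evidence string, and the contents of `EXCLUDED_TYPES` are hypothetical, and the fallback of sending all other retained types to both ontologies is an assumption that the text does not state.

```python
from typing import List, Tuple

# Semantic Type groups quoted in the text above.
PHENOTYPE_TYPES = {"Findings", "Signs and Symptoms"}
DISEASE_TYPES = {"Disease or Syndrome"}
# ASSUMPTION: an illustrative exclusion list; the actual list is not given.
EXCLUDED_TYPES = {"Injury or Poisoning"}


def route_condition(semantic_type: str) -> Tuple[List[str], str]:
    """Return (target ontologies, evidence note) for one condition concept."""
    if semantic_type in EXCLUDED_TYPES:
        # Excluded concepts are left unmapped and the reason is recorded.
        return [], "UNMAPPED: no clear pathological or biological origin"
    if semantic_type in PHENOTYPE_TYPES:
        return ["HPO"], ""
    if semantic_type in DISEASE_TYPES:
        return ["Mondo"], ""
    # ASSUMPTION: other retained types are eligible for both ontologies.
    return ["HPO", "Mondo"], ""


print(route_condition("Signs and Symptoms"))
print(route_condition("Disease or Syndrome"))
print(route_condition("Injury or Poisoning"))
```

Keeping the type groups in plain sets makes the routing rules easy to review with domain experts, which matters here because the exclusion decisions are recorded as mapping evidence.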

Measurements
For all measurement concepts, a scale and result type were created. The scale (i.e., ordinal, nominal, quantitative, qualitative, narrative, doc, and panel) of each measurement was identified from the OMOP CDM or by parsing the concept synonym field. For all Concepts Used in Practice, reference ranges were used to determine the result type; concepts with numeric reference ranges were typed as "Normal/Low/High" and concepts with reference ranges that included "positive" or "negative" were typed as "Positive/Negative".

Concepts Not Used in Practice with an ordinal scale or with synonyms that contained the words "presence" or "screen" were typed as "Positive/Negative". Concepts with a quantitative scale were typed as Also consistent with the procedures adopted by LOINC2HPO, all concepts lacking sufficient detail (i.e., non-specific body substances) were marked as unmapped and "Unspecified Sample" was provided as the mapping evidence.
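The result-type rules for measurements can be summarized in a small sketch. It covers only the rules that are fully stated in the text; the function name, its parameters, and the use of a digit check to detect a numeric reference range are assumptions made for illustration, and the rule for quantitative-scale Concepts Not Used in Practice is omitted because it is incomplete in the text.

```python
import re
from typing import List, Optional


def result_type(used_in_practice: bool,
                scale: str,
                reference_range: Optional[str],
                synonyms: List[str]) -> Optional[str]:
    """Assign a result type to one measurement concept, or None if no rule fits."""
    if used_in_practice:
        if reference_range is None:
            return None
        text = reference_range.lower()
        if "positive" in text or "negative" in text:
            return "Positive/Negative"
        # ASSUMPTION: any digit in the range marks it as numeric.
        if re.search(r"\d", text):
            return "Normal/Low/High"
        return None
    # Concepts Not Used in Practice: rely on the scale and the synonyms.
    keywords = ("presence", "screen")
    if scale == "ordinal" or any(k in s.lower() for s in synonyms for k in keywords):
        return "Positive/Negative"
    return None


print(result_type(True, "quantitative", "3.5-5.0", []))
print(result_type(False, "nominal", None, ["Drug screen panel"]))
```

Returning `None` for concepts that match no rule keeps the sketch honest about the cases the text does not specify, instead of guessing a default result type.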

Mapping Evaluation
The OMOP2OBO mappings were evaluated by assessing their accuracy, generalizability, and clinical utility.

Accuracy
Automatic mappings are created from exact alignments between resources available in the OMOP CDM and the OBO Foundry ontologies and thus are assumed to be accurate and high-confidence mappings. The goal of this evaluation was to evaluate the accuracy of a portion of the manually-derived mappings. For conditions and drug ingredients, of all manual mappings (including one-to-one and one-to-many), 20% were randomly selected for manual review (n=2,000 conditions; n=116 drug ingredients) by a practicing resident physician and clinical pharmacist, respectively.
Measurement mappings are significantly more complex as they require interpreting lab test results and annotating the source of the sample (e.g., bodily fluid, anatomical entity, or cell type), entity being measured (e.g., chemical or cell type), and organism of the measured entity.
While annotating the samples and entities is straightforward, interpreting lab test results and aligning them to HPO concepts can be challenging. As a result, only the HPO mappings were evaluated by domain experts. These mappings were evaluated in two ways: (i) Survey. A subset of the mappings (n=270) was independently validated by five domain experts, including three practicing pediatric clinicians, a PhD-level molecular biologist, and a master's-level epidemiologist, using a Qualtrics Survey. 123 Any mapping that did not meet agreement by at least one clinician and both the biologist and the epidemiologist was re-evaluated by the most senior clinician. These mappings were also vetted on the LOINC2HPO GitHub tracker 124

Generalizability
The generalizability of the OMOP2OBO mappings was examined using the OHDSI Concept Prevalence Study data. [76][77][78][79] The Concept Prevalence Study provides data on the frequency of OMOP concept usage in clinical practice across several independent sites in the OHDSI network. In addition to the Concept Prevalence Study sites, data were obtained from two independent academic medical centers, bringing the total number of sites to 24

Clinical Utility
The clinical utility of the OMOP2OBO mappings was compared to an existing set of validated manual mappings (ICD-HPO mappings 61 ) when used to identify undiagnosed rare disease patients. For this analysis, AoU Data 89 was selected because it provides access to a large sample of EHR data with genetic testing results. For this evaluation, the version 6 build was used, which contained data from ~630 sites on more than 528,000 patients. 89  demonstrated utility for identifying underdiagnosed rare disease patients using only EHR data. 61,85 For this evaluation, the standardized PheRS was used because it is easier to interpret and reduces noise when it is suspected that a large number of phenotypes will overlap between cases and controls. 85 The OMOP2OBO and ICD-HPO mappings were compared and evaluated on time to complete the query against the AoU Data and differences in the returned patient cohorts. As validation, case-control studies were performed for each of the five diseases using the patients returned from the OMOP2OBO mappings. Cases were defined as patients with at least two occurrences of a relevant diagnosis code and control patients had no instances of these codes. Cases and controls were matched on age, sex, and length of EHR record. For each disease, a one-sided Wilcoxon rank sum test was performed in order to determine if PheRS were significantly higher for cases than controls. Results were verified by a PhD-level Epidemiologist specializing in genetics (CZ).
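The standardization step applied to the PheRS, described in the table notes as subtracting the mean and dividing by the standard deviation, can be sketched as below. This is a minimal illustration under stated assumptions: the function name and toy scores are hypothetical, the sample standard deviation is assumed (the text does not say whether the sample or population form was used), and the computation of the raw PheRS itself is not shown.

```python
from statistics import mean, stdev
from typing import List


def standardize(raw_scores: List[float]) -> List[float]:
    """Convert raw scores to z-scores: (score - mean) / standard deviation."""
    mu = mean(raw_scores)
    sd = stdev(raw_scores)  # ASSUMPTION: sample standard deviation
    if sd == 0:
        raise ValueError("all scores are identical; z-scores are undefined")
    return [(score - mu) / sd for score in raw_scores]


# Toy raw scores for three hypothetical patients.
print(standardize([1.0, 2.0, 3.0]))
```

Standardized scores for cases and controls could then be compared with a one-sided rank sum test, as described above; that test is left out of the sketch to keep it free of third-party dependencies.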

Statistics and Technical Specifications
OMOP2OBO was developed using Python 3.6.2 on a single machine with 8 cores and 16GB of RAM. All code and project information are publicly available and detailed on GitHub

COMPETING INTERESTS
The authors declare no competing interests.

The three modeled distributions include: concepts only found in the Concept Prevalence Study data (magenta), concepts only found in the OMOP2OBO mapping set (blue), and concepts found in both the Concept Prevalence Study data and the OMOP2OBO mapping set (yellow).

concepts recovered in a newer version of the OMOP common data model (CDM; magenta), concepts that were purposefully excluded, not yet mapped, or unable to be mapped by OMOP2OBO (blue), and concepts that were truly missing from the OMOP2OBO mapping set (yellow).

The National Institutes of Health's All of Us Research Program is an initiative tasked with gathering data from at least one million United States citizens with the goal of creating a diverse health resource to support biomedical research and precision medicine. The All of Us Research Hub contains data from over 630 sites on more than 528,000 participants. Data include electronic health records, biological and genetics samples, physical measurements and wearable data, and survey data. The All of Us Research Program would not be possible without the partnership of its participants. The current work utilized data from the version 6 build.
See the All of Us Research Hub for more information: https://www.researchallofus.org

Mapping Validation
Clinical Utility
a This is a private repository; please contact the authors for access and to obtain additional information. Acronyms: AoU (All of Us); CDM (common data model); CHCO (Children's Hospital Colorado); HIPAA (Health Insurance Portability and Accountability Act); OHDSI (Observational Health Data Sciences and Informatics); OMOP (Observational Medical Outcomes Partnership); PEDSnet (National Pediatric Learning Health System).

Mapping Category Definition
Automatic One-to-One Concept
Definition: A one-to-one mapping that is automatically generated at the concept-level through exact string mappings to labels/synonyms or exact mappings between codes.
This mapping was created through an exact string mapping on "overjet", which is the HP concept label and an OMOP concept synonym. This mapping is also supported through exact mappings between database cross-references to SNOMED-CT 70305005 and UMLS C0596028.

Automatic One-to-One Ancestor
Definition: A one-to-one mapping that is automatically generated for a concept's ancestor through exact string mappings to labels/synonyms or exact mappings between codes.

Example:
- OMOP:22722 (Accessory salivary gland)
- HP:0010286 (abnormal salivary gland morphology)
This mapping was created through exact mappings to one of the OMOP concept's ancestors on the database cross-references to SNOMED-CT 10890000 and UMLS C0036093.

Automatic One-to-Many Concept
Definition: A one-to-many mapping that is automatically generated at the concept-level through exact string mappings to labels/synonyms or exact mappings between codes. For release 1.0, one-to-many mappings indicate that one OMOP concept was mapped to one or more OBO Foundry ontology concepts.
This mapping was created through 2 exact string mappings on "osteopoikilosis", which is a Mondo concept exact synonym and an OMOP concept label and synonym, and "buschke-ollendorff syndrome", which is a Mondo concept exact synonym and label and an OMOP concept synonym. This mapping is also supported through exact mappings between database cross-references to SNOMED-CT 9147009.

Automatic One-to-Many Ancestor
Definition: A one-to-many mapping that is automatically generated for a concept's ancestor through exact string mappings to labels or synonyms or exact mappings between codes. For release 1.0, one-to-many mappings indicate that one OMOP concept was mapped to one or more OBO Foundry ontology concepts.
This mapping was created through 3 exact string mappings on "fracture", "fracture of bone", and "disorder of foot", which are all Mondo exact synonyms and labels of the OMOP concept's ancestors. This mapping is also supported by exact mappings to one or more of the OMOP concept's ancestors on the database cross-references to SNOMED-CT 125605004 and 118932009.

Manual One-to-One Concept
Definition: A one-to-one mapping that is manually generated at the concept-level and usually requires the use of external resources.
This mapping was manually created through external evidence from a PubMed article, which stated "Mesiodens is a supernumerary tooth present in the midline between the two central incisors" (PMID:21998774).

Manual One-to-Many Concept
Definition: A one-to-many mapping that is manually generated at the concept-level and usually requires the use of external resources. For release 1.0, one-to-many mappings indicate that one OMOP concept was mapped to one or more OBO Foundry ontology concepts.
This mapping was created through an exact string mapping on "erythrocytosis", which is an HP concept exact synonym and an OMOP concept ancestor label. This mapping is also supported through exact mappings between database cross-references to SNOMED-CT 127062003 and UMLS C1527405 and C0032461.

Cosine Similarity One-to-One Concept
Definition: A one-to-one mapping that is automatically generated at the concept-level using cosine similarity scores. For release 1.0, the cosine similarity scores were applied to concept embeddings learned from a Bag-of-Words model with TF-IDF, which was applied to all available labels and synonyms at the concept- and ancestor-level.
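To make the cosine similarity category concrete, the sketch below scores one clinical label against candidate ontology labels using Bag-of-Words TF-IDF vectors. It is a self-contained illustration, not the OMOP2OBO implementation: the function names, the whitespace tokenizer, the smoothed inverse document frequency, and the toy labels are all assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors from lowercased, whitespace-split text."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # ASSUMPTION: smoothed IDF, log((1 + N) / (1 + df)) + 1.
        vectors.append({
            tok: (cnt / len(toks)) * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok, cnt in counts.items()
        })
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query: str, candidates: List[str]) -> List[Tuple[float, str]]:
    """Score each candidate label against the query, best match first."""
    vectors = tfidf_vectors([query] + candidates)
    scored = [(cosine(vectors[0], vec), cand)
              for vec, cand in zip(vectors[1:], candidates)]
    return sorted(scored, reverse=True)


# Toy labels; real inputs would combine all labels and synonyms per concept.
ranked = rank_candidates("abnormal salivary gland",
                         ["fracture of bone", "abnormal salivary gland morphology"])
print(ranked[0][1])
```

Candidates that share no tokens with the query score zero, so in practice a score threshold would likely be needed before a top-ranked candidate is accepted as a mapping.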

Unmapped
This category is used when no suitable mapping is possible, for concepts which have not yet been mapped, and for concepts which are purposefully not mapped.

The standardized PheRS is derived by subtracting the mean from the normalized raw scores and dividing by the standard deviation. Acronyms: PheRS (Phenotype Risk Score).