Analysis of the human diseasome using phenotype similarity between common, genetic, and infectious diseases

Phenotypes are the observable characteristics of an organism arising from its response to the environment. Phenotypes associated with engineered and natural genetic variation are widely recorded using phenotype ontologies in model organisms, as are signs and symptoms of human Mendelian diseases in databases such as OMIM and Orphanet. Exploiting these resources, several computational methods have been developed for integration and analysis of phenotype data to identify the genetic etiology of diseases or suggest plausible interventions. A similar resource would be highly useful not only for rare and Mendelian diseases, but also for common, complex and infectious diseases. We apply a semantic text-mining approach to identify the phenotypes (signs and symptoms) associated with over 6,000 diseases. We evaluate our text-mined phenotypes by demonstrating that they can correctly identify known disease-associated genes in mice and humans with high accuracy. Using a phenotypic similarity measure, we generate a human disease network in which diseases that have similar signs and symptoms cluster together, and we use this network to identify closely related diseases based on common etiological, anatomical as well as physiological underpinnings.


Introduction
Over the last decade, the rapid emergence of new technologies has redefined our understanding of the genetic and molecular mechanisms underlying disease. For example, we can now identify genetic predisposition to diseases, and responses to environmental factors, through a rapidly increasing number of genome-wide association studies. These studies utilize genetic variation in human populations to identify sequence variants that predispose some individuals to common or complex diseases. Such studies also reveal a variety of differences between disease manifestations. Application of sequencing technologies to disease studies has been particularly successful for genetically-based diseases. For example, full exome sequencing is an approach that has emerged to identify causative mutations underlying congenital diseases, and is successfully applied widely. 1,2 In contrast to genetically based diseases, the investigation of infectious diseases poses an additional challenge as it requires not only the understanding of the physiology and patho-physiology of a single organism, but the investigation of two or more organisms, their interactions, and the response of one organism to the other. Similarly, investigations of environmentally-based diseases require understanding the response of organisms to environmental influences such as chemicals, radiation or habitat.
For each type of disease (genetically-based, environmental, and infectious), the genetic architecture of an organism plays a vital role in the disease manifestation it exhibits, including severity of symptoms, complications, as well as its response to therapeutic agents. A key to gaining an in-depth understanding of the molecular basis of disease is the understanding of the complex relationship between the genotype of an organism and the phenotypic manifestations it exhibits in response to certain influences (genetic, environmental, or exposure to an infectious agent). To achieve such a goal, it is imperative that there is a consistent and thorough account of the various phenotypes (including signs and symptoms) exhibited by an organism in response to etiological influences.
To utilize phenotype data for disease studies, information about Mendelian diseases has been historically well documented in various formats and, more recently, in electronic resources such as the Online Mendelian Inheritance in Man (OMIM) 3 database and the Orphanet 4 resource. Both OMIM and Orphanet provide a catalog of human genes and genetic disorders, and contain a variety of textual information including patient symptoms and signs. Ontologies (i.e., structured, controlled vocabularies that formally describe the kinds of entities within a domain) such as the Human Phenotype Ontology (HPO) 5 have been created in an attempt to provide a comprehensive controlled vocabulary and knowledge base describing the manifestations of human diseases, and these ontologies have been applied to characterize diseases in the OMIM and Orphanet databases. 6,7 Additionally, ontology-based analysis of phenotype data has also been shown to significantly improve the accuracy of finding disease gene candidates from GWAS data 8 and assignation of phenotypes to genes in Copy Number Variation syndromes. 9 The remarkable conservation of phenotypic manifestations across vertebrates implies a high degree of functional conservation of the genes participating in the underlying physiological pathways. Our increasing ability to identify such functions as well as their role in human disease using a variety of organisms and approaches, such as forward and reverse genetics, renders animal models valuable tools for the investigation of gene function and the study of human disease. Phenotype information related to model organisms is also being described using ontologies such as the Mammalian Phenotype Ontology (MP), 10 and data annotated with these ontologies is being systematically collected and organized in model organism databases. 11
[18][19] Extension of these strategies and tools for the study of common and infectious diseases has been hampered by the lack of an infrastructure providing phenotypes associated with common and infectious diseases, and integrating this information with the large volumes of experimentally verified and manually curated data available from model organisms. We have now generated a resource of disease-associated phenotypes for over 8,000 Mendelian, rare, common and infectious diseases. The phenotypes and diseases are characterized using ontologies and interoperate with widely used ontologies used for describing human and model organism phenotypes. 20 We evaluate our phenotype data against its ability to prioritize genes for human diseases, and demonstrate that our method yields disease phenotypes that are comparable to those available from OMIM when applied to finding candidate genes. Following validation, we demonstrate the utility of our resource by revealing closely related disease modules based on common etiological, anatomical as well as physiological underpinnings. We make our results freely available at http://aber-owl.net/aber-owl/diseasephenotypes/ and provide a visualisation environment for them at http://aber-owl.net/aber-owl/diseasephenotypes/network/.

Results
We have created a resource of disease-associated phenotypes for diseases in the Human Disease Ontology (DO). For this purpose, we have identified co-occurrences between names of diseases (from DO) and names of phenotypes (from HPO and MP) in abstracts and titles of 5 million articles in Medline.
We employ several different scoring functions to rank the co-occurrences based on their significance within our corpus. In particular, we use the normalized pointwise mutual information (NPMI), T-Score, Z-Score and the Lexicographer's mutual information scores 21 to rank the co-occurrences. The phenotypes associated with diseases, scored by our scoring functions, can be viewed and downloaded at http://aber-owl.net/aber-owl/diseasephenotypes.
As our scoring functions associate a value with each identified co-occurrence between a term referring to a disease class and a term referring to a phenotype class, we use known gene-disease associations from the OMIM database to identify a cutoff that maximizes the potential to prioritize candidate genes of disease based on phenotypic similarity. For this purpose, we use the PhenomeNET system 14 to systematically compute the semantic similarity between disease phenotypes and mouse model phenotypes, and we compare the results against known mouse models of disease from the Mouse Genome Informatics (MGI) database, 11 as well as, using human-mouse orthology, to known gene-disease associations in the OMIM database. We quantify the predictive power of the phenotype by computing the area under the ROC curve for predicting gene-disease associations through phenotype similarity.
To standardize the number of phenotypes associated with a disease, we rank all phenotype-disease associations for each disease by their normalized pointwise mutual information score. We then perform our PhenomeNET analysis for an increasing number of phenotypes for each disease. Figure 1 shows the resulting ROCAUC for varying cutoff values. In particular, we find that using the top-ranking 0.4% (NPMI-based) of the disease-phenotype associations maximizes their potential for prioritizing candidate genes using PhenomeNET, and we use this value as the main cutoff in the remaining analysis. Using this cutoff, we have mined phenotypes for 8,672 disease classes in DO, using a total of 12,180 different phenotype classes from the HPO and MP (7,041 from HPO and 5,139 from MP).
Using only heritable diseases from OMIM, we can demonstrate that our text-mined phenotypes come close to the phenotypes associated with OMIM diseases in the HPO database when applied to prioritizing candidate genes of disease. Figures 2, 3 and 4 show the comparison of the performance of our text-mined phenotypes with the original OMIM phenotypes in PhenomeNET. To further test our phenotypes, we have merged the original OMIM phenotypes with our text-mined phenotypes. In each case, we could demonstrate an increase in ROCAUC over both our text-mined phenotypes and the original OMIM phenotypes. In particular, as can be seen in Figures 2, 3 and 4, sensitivity of PhenomeNET gene prioritization increases for the highest ranks when our text-mined phenotypes are merged with the original OMIM phenotypes.
We further evaluate the overlap with OMIM disease definitions, as characterized by the HPO database.
We use two measures to quantify the overlap. First, we directly compute the set overlap (Jaccard index) between the HPO phenotypes we have text-mined for each disease and the HPO phenotypes associated with the disease in the HPO database. The average Jaccard index between our disease definitions and the corresponding OMIM diseases is 0.053 (0.309 when considering the phenotypes together with all their superclasses). We also compute the percentage of coverage of the OMIM phenotypes in our disease definitions. Using our text-mining approach, we cover on average 17.6% of the phenotypes in OMIM (46.8% when considering the phenotypes together with all their superclasses). Finally, we compute the semantic similarity between our text-mined disease definitions and the phenotypes associated with the disease in HPO, and use ROC analysis to quantify the performance of directly identifying a matching disease. Figure 5 shows the resulting ROC curve.
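The two overlap measures described above can be illustrated with a minimal Python sketch. The class names, the toy `parents` hierarchy and the function names are hypothetical and are not taken from the HPO; the sketch only shows how adding superclasses changes the Jaccard index and the coverage.

```python
from typing import Dict, Set


def closure(classes: Set[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the given classes together with all of their superclasses."""
    seen: Set[str] = set()
    stack = list(classes)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coverage(mined: Set[str], reference: Set[str]) -> float:
    """Fraction of the reference phenotypes that were also text-mined."""
    return len(mined & reference) / len(reference) if reference else 0.0


# Hypothetical toy hierarchy: D is a subclass of B; B and C are subclasses of A.
parents = {"B": {"A"}, "C": {"A"}, "D": {"B"}}
mined = {"D"}            # text-mined phenotypes of one disease
reference = {"B", "C"}   # curated phenotypes of the same disease

direct_jaccard = jaccard(mined, reference)
closed_jaccard = jaccard(closure(mined, parents), closure(reference, parents))
closed_coverage = coverage(closure(mined, parents), closure(reference, parents))
```

In this toy case the direct sets do not overlap at all, while the superclass-closed sets share A and B, which mirrors why both reported values increase when superclasses are considered.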
Using the phenotypes associated with DO diseases, we compute a pairwise disease-disease similarity based on semantic similarity of their associated phenotypes. From the resulting similarity matrix, we generate a disease-disease network based on phenotypes from the top-ranking 5% of disease-disease similarity values. The generated network is shown in Figure 6. For each disease, we also identify top-level DO categories, and assign node colors in the network based on the DO categories in which a disease falls. The disease-disease similarity network can be accessed online at http://aber-owl.net/aber-owl/diseasephenotypes/network.
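As an illustration of the thresholding step, the following sketch keeps only the highest-scoring fraction of all disease pairs as network edges. It is a simplified sketch under stated assumptions: a plain Jaccard index stands in for the semantic similarity measure, and the disease and phenotype identifiers are hypothetical.

```python
import math
from itertools import combinations
from typing import Callable, Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Stand-in similarity; a semantic similarity measure would be used in practice."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_fraction_edges(
    phenotypes: Dict[str, Set[str]],
    similarity: Callable[[Set[str], Set[str]], float],
    fraction: float = 0.05,
) -> List[Tuple[str, str, float]]:
    """Score all disease pairs and keep the top-ranking fraction as edges."""
    scored = [
        (a, b, similarity(phenotypes[a], phenotypes[b]))
        for a, b in combinations(sorted(phenotypes), 2)
    ]
    # Stable sort: ties keep the alphabetical pair order produced above.
    scored.sort(key=lambda edge: edge[2], reverse=True)
    keep = math.ceil(len(scored) * fraction)
    return scored[:keep]


# Hypothetical toy input: four diseases with small phenotype sets.
toy = {
    "d1": {"p1", "p2"},
    "d2": {"p1", "p2"},
    "d3": {"p2", "p3"},
    "d4": {"p9"},
}
edges = top_fraction_edges(toy, jaccard, fraction=0.5)
```

The resulting edge list can then be loaded into a graph tool for layout; a fraction of 0.05 would correspond to the 5% threshold described in the text.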
We also use the disease-disease network to compute phenotypic homogeneity of diseases within their respective disease category. For this purpose, for each disease, we sort all other diseases based on their phenotypic similarity, and identify the ranks at which other diseases in the same category appear. The results (summarized in Table 1) are ROCAUC values for each of DO's top-level categories that quantify how phenotypically similar the diseases within the same category are.

Related work
Associations between phenotypes, signs and symptoms on one side and diseases on the other have been used to gain insights into the modular nature and network structure of human diseases and drug indications. 12,22,23 In prior work, text-mining based on labels of diseases and labels of phenotypes (signs and symptoms), 23 or the identifiers of the Medical Subject Headings (MeSH) Thesaurus 24 that are associated with article citations in Pubmed, have been used to identify associations between disease and phenotype. In general, the resulting disease-phenotype associations have been evaluated based on their ability to reveal or explain perceived clusters of diseases, 12 to group diseases with known common etiology together, 22,23 or based on gold standard comparison and clustering for common drug targets. 23 The fundamental question that has not been answered by any of the prior approaches is what kind of evidence or support would be required to consider a disease-phenotype association as "correct". This is a fundamental challenge in any kind of phenotype- or symptom-based characterization of disease.
Most diseases have cardinal signs and symptoms which will always be associated with a disease. However, a large number of signs and symptoms for a disease are not always present but rather occur with varying frequency, and even very rare manifestations may prove to be highly useful in the context of differential diagnosis. In our evaluation, we provide a quantifiable measure through comparison against experimental data which can be used to determine, and maximize, the utility of our text-mined disease-phenotype associations. We therefore provide an objective measure that can be used to determine how applicable a set of disease-phenotype associations is to a particular scientific question; in our case, identifying candidate genes for diseases of genetic origin.
One main limitation of our evaluation is that it is limited to genetically-based diseases, while the majority of diseases in the DO are not genetically-based. Other approaches, such as clustering diseases based on similarity and identifying meaningful, well-known clusters, 12,22 or comparison with known drug indications, 23 can evaluate the biological validity of generated associations, but often cannot quantify the results.

Novel candidate genes based on text-mined phenotypes
Through our approach, we do not only obtain phenotypic characterization of common and infectious diseases, but we have also obtained novel phenotype associations for genetically based diseases in OMIM for which currently no phenotypic characterization exists either in the HPO annotations or as a clinical synopsis in OMIM.
The HPO database contains phenotype annotations for 9,286 OMIM entries (genes and diseases).
Through the DO-OMIM mappings and our method, we obtain phenotypes for 1,683 OMIM entries, 115 of which have no phenotype annotations in HPO or an associated clinical synopsis in OMIM. For example, Halo Nevi (Leukoderma acquisitum centrifugum of Sutton, OMIM:234300), a dermatological condition in which melanocytes are destroyed by CD8+ cytotoxic T lymphocytes, 25 currently has no clinical synopsis in OMIM and consequently no associated phenotypes in the HPO database, while we identify several phenotypes, including Irregular hyperpigmentation (HP:0007400), abnormal dermal melanocyte morphology (MP:0009386) and Progressive vitiligo (HP:0005602), all of which are known to be associated with halo nevi. 26 For these 115 diseases, 167 disease models are known in the mouse. We can prioritize the correct model with ROCAUC of 0.940 ± 0.018 for this set of 115 diseases (Figure 7).

Exploring disease-disease similarities: revealing the modular nature of disease
In Figure 6, we show the relationships between common, genetic, infectious and environmental diseases.
Each node in the network represents a disease and is coloured according to its corresponding top-level disease class in DO. Using this similarity network, it is clear that diseases of different systems and pathological processes can be separated on the basis of phenotypic relatedness. DO classifies both by anatomical site or system, and by general pathology, and for each of the classifications, despite these different criteria, we find that diseases within one category cluster together on the basis of phenotypic relatedness alone. In Figure 6, we highlight different upper-level disease categories from DO, including neoplasias, immune diseases, respiratory diseases, mental health diseases, endocrine, and nervous system disease. While many diseases cluster tightly within their group, as expected, several diseases show significant phenotypic relations to different areas or systems, and we see many examples in which a disease clusters predominantly within one part of the DO-defined area but has more distant relationships with others. For example, phaeochromocytoma is associated with other adrenal tumors in the class of "neoplasia" but also with adrenal gland hyperfunction, adrenal cortex disease from the category of endocrine diseases, and hypertension in the cardiovascular disease category. Hyperprolactinaemia has relations to a cluster of pituitary tumours and, as expected, to prolactinomas (neoplasms), acromegaly (physical disorders) and hypogonadotropism (reproductive system), and, more distantly, psychological dyspareunia (mental).
We can also identify phenotypically-defined "footprints" for disease groups which show overlapping phenotypic similarity. For example, comparing the disease networks centered on rheumatoid arthritis (RA) and ankylosing spondylitis (AS), it is clear that the two are quite closely related to the same group of inflammatory diseases (Figure 8). However, a close phenotypic relationship to rheumatic fever and rheumatoid lung disease is missing from the ankylosing spondylitis-centered network, and uveitis is missing from that of rheumatoid arthritis. Uveitis forms one of the diagnostic features of ankylosing spondylitis, and some of the most common diseases that result in uveitis are ankylosing spondylitis and juvenile rheumatoid arthritis. 27 Acute anterior uveitis is the most common extra-articular feature of AS, occurring in 25%-40% of patients at some time in the course of their disease. AS and uveitis share an association with HLA-B haplotypes, indicating the possible existence of a modular phenotype linking these inflammatory diseases with a common genetic etiology. 28 Interestingly, Felty's syndrome with thrombocytopenia and vasculitis, present in the network, is also associated both with the spondylopathies and the HLA-B haplotypes, and there are suggestions of a relationship between ankylosing spondylitis and vascular inflammatory disease in addition to the common cardiac effects (Figure 8). 29 Another example of phenotype-defined disease groups are the lysosomal storage diseases. All cells contain lysosomes which contain soluble acid hydrolases whose role is to process a wide range of substrates.
Failure to perform this function results in lysosomal accumulation of unmetabolized proteins, lipids and carbohydrates, which is the primary cause of disease through its effects on cellular metabolism. The pathways by which these accumulations exert their pathological effects are only just becoming understood, but they display an extensive range of disease symptoms with central neurological involvement and a wide range of peripheral phenotypes with very variable individual manifestation. 30 Figure 9 shows the relationships between sphingolipidoses, mucopolysaccharidoses, and oligosaccharidoses, demonstrating a coherent phenotypic disease footprint for this wide range of lysosomal storage disorders. This striking clustering is similar in type to that seen in the ciliopathies 31 where a range of related phenotypes reflect lesions in a collection of molecules involved in different aspects of cilium assembly or function, which, along with other examples, led Oti and Brunner 32 to postulate the existence of common functional modules underlying the phenotypic profiles of diseases. Phenotypic annotation for these diseases benefits not only from organismal level description but in many cases molecular annotations such as Abnormality of proteoglycan metabolism (HP:0004355). While this alone does not account for the clustering, the increased depth available for these diseases greatly improves the quality of the network associations.
Within disease modules, we find the separation of diseases by both anatomical site and pathology.
For example, in Figure 10, the integumentary diseases form a distinct group and show clear clustering of inflammatory skin diseases, such as seborrheic dermatitis and granulomatous dermatitis along with neurotic excoriation which itself often involves inflammation as a consequence of compulsive "skin picking". An additional cluster is evident which includes benign proliferative disorders and those of keratinisation, together with bullous diseases, such as epidermolysis bullosa, themselves involving acantholysis. Finally, there is a group of diseases of the eyelid, ranging from mechanical lesions to parasitic disease. Alopecia telogen effluvium, alopecia areata, alopecia universalis and follicular mucinosis similarly cluster together, all diseases involving hair follicles and causing hair loss. 33 Unsurprisingly, we also find (see Table 1) that diseases classified by anatomical site or system (e.g., thoracic diseases, respiratory diseases) exhibit higher phenotypic homogeneity than diseases classified by their pathological mechanism (e.g., infectious diseases, genetic diseases). In particular, we observe that narrowly defined disease categories such as thoracic disease or respiratory disease exhibit a high phenotypic homogeneity; broad categories such as all the infectious diseases, on the other hand, are relatively heterogeneous. However, all of DO's top-level categories cluster significantly based on their phenotypic similarity, and diseases falling into more specific DO categories (such as lysosomal storage diseases) cluster closely as well, demonstrating that not only Mendelian diseases form disease modules, 12,32 but also common diseases.

Conclusions
Exploring diseases through their associated phenotypes has major applications for biomedical research, and several studies have primarily relied on disease phenotypes to reveal functional disease modules, 12,22,32 identify candidate genes of disease, 14,34 prioritize genes in GWAS studies, 35 and investigate drug targets and indications. 17,18,36,37 While the majority of these investigations have been focused on genetic diseases, application of similar methods may lead to novel insights into the patho-biology of common and infectious diseases as well.

Ontologies and vocabularies
We use the Human Phenotype Ontology (HPO) 5 and the Mammalian Phenotype Ontology (MP) 10 as vocabularies that provide terms referring to phenotypes, signs and symptoms associated with diseases.
Additionally, the MP is used to describe mouse model phenotypes, 38 and we rely on comparison to mouse model phenotypes for the evaluation of our approach.
We use the Human Disease Ontology (DO) 39 as an ontology of diseases. The DO contains a rich classification of rare and common diseases, and spans heritable, developmental, infectious and environmental diseases. All ontologies were downloaded from the OBO Foundry website 40 on 2 July 2013.

Semantic mining with Aber-OWL: Pubmed
We make use of the Aber-OWL: Pubmed infrastructure to semantically mine Medline abstracts. Aber-OWL: Pubmed (http://aber-owl.net/aber-owl/pubmed/) consists of an ontology repository, a reasoning infrastructure capable of performing OWL-EL reasoning over the ontologies in the repository, a full-text index of all Medline 2014 titles and abstracts as well as all Pubmed Central articles, and a search interface. Aber-OWL: Pubmed uses an Apache Lucene (http://lucene.apache.org) index to store the articles. Before indexing, every text is processed using Apache Lucene's English language standard analyzer which tokenizes the text, normalizes text to lower case, and applies a list of stop words.
To identify documents which contain references to a disease or phenotype term, we first limit our search to Medline abstracts and treat documents as consisting of a title and the abstract. We then limit our corpus to documents in which at least one term from a phenotype ontology (HPO or MP) or the DO occurs. As a result of this filtering step, we use a corpus consisting of 5,164,316 documents.
We use the information in ontologies together with the Aber-OWL reasoning infrastructure to identify the set of terms referring to a disease or phenotype. For this purpose, we first identify all labels and synonyms Lab(C) associated with a class C in an ontology. We then define the set of terms Terms(C) referring to a class C as:
$$\mathit{Terms}(C) := \{ x \mid x \in \mathit{Lab}(S) \wedge S \sqsubseteq C \}$$
According to this definition, the set Terms(C) refers to the set of labels and synonyms of C or any subclass of C, as inferred using the automated reasoner employed by the Aber-OWL infrastructure.
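The term set described above can be sketched in a few lines of Python. This is only an illustration: it assumes that the subclass hierarchy has already been inferred and is available as a plain mapping from each class to its direct subclasses, whereas the text relies on an OWL reasoner for this step. The identifiers and labels below are hypothetical.

```python
from typing import Dict, Set


def terms(
    cls: str,
    labels: Dict[str, Set[str]],
    subclasses: Dict[str, Set[str]],
) -> Set[str]:
    """Collect labels and synonyms of cls and of all its transitive subclasses."""
    result: Set[str] = set()
    seen: Set[str] = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        result |= labels.get(current, set())
        stack.extend(subclasses.get(current, ()))
    return result


# Hypothetical toy ontology: EX:3 is a subclass of EX:2, which is a subclass of EX:1.
labels = {
    "EX:1": {"disease"},
    "EX:2": {"diabetes mellitus", "diabetes"},
    "EX:3": {"type 2 diabetes mellitus", "T2DM"},
}
subclasses = {"EX:1": {"EX:2"}, "EX:2": {"EX:3"}}

diabetes_terms = terms("EX:2", labels, subclasses)
```

A query for the more general class therefore also matches documents that only mention one of its more specific subclasses.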
To identify the number of documents in which a disease or phenotype term occurs, we construct a Lucene query based on Terms(D) and Terms(P) in which we concatenate each member of Terms(D) or Terms(P) using the OR operator: $\bigvee_{x \in \mathit{Terms}(D)} x$ and $\bigvee_{x \in \mathit{Terms}(P)} x$. As a result, the Lucene query will match any document (title or abstract) that contains a label or synonym of a class D or P. To identify the number of documents in which D and P occur together, we concatenate both queries using the AND operator: $\left(\bigvee_{x \in \mathit{Terms}(D)} x\right) \wedge \left(\bigvee_{x \in \mathit{Terms}(P)} x\right)$.
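The query construction can be sketched as simple string building in the Lucene classic query syntax, where each term becomes a quoted phrase, phrases are joined with OR, and the disease and phenotype groups are joined with AND. The helper names and example terms are hypothetical, and the sketch does not address index field names or analyzer settings.

```python
from typing import Iterable


def phrase(term: str) -> str:
    """Quote a term as a phrase, escaping backslashes and double quotes."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def or_query(term_set: Iterable[str]) -> str:
    """Disjunction over all labels and synonyms of one class."""
    phrases = [phrase(t) for t in sorted(term_set)]
    if not phrases:
        raise ValueError("a query needs at least one term")
    return "(" + " OR ".join(phrases) + ")"


def and_query(disease_terms: Iterable[str], phenotype_terms: Iterable[str]) -> str:
    """Conjunction of the disease disjunction and the phenotype disjunction."""
    return f"{or_query(disease_terms)} AND {or_query(phenotype_terms)}"


# Hypothetical example terms.
disease_q = or_query({"asthma", "bronchial asthma"})
joint_q = and_query({"asthma", "bronchial asthma"}, {"wheezing"})
```

Sorting the terms only makes the generated query string deterministic; it does not change which documents match.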
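We use Docs(q) to refer to the set of documents satisfying the query q, $n_D$ to refer to the number of documents in which a term referring to disease D occurs, $n_P$ to refer to the number of documents in which a term referring to a phenotype P occurs, and $n_{DP}$ to refer to the number of documents in which both a term referring to D and a term referring to P occur:
$$n_D = \left|\mathit{Docs}\left(\bigvee_{x \in \mathit{Terms}(D)} x\right)\right|$$
$$n_P = \left|\mathit{Docs}\left(\bigvee_{x \in \mathit{Terms}(P)} x\right)\right|$$
$$n_{DP} = \left|\mathit{Docs}\left(\left(\bigvee_{x \in \mathit{Terms}(D)} x\right) \wedge \left(\bigvee_{x \in \mathit{Terms}(P)} x\right)\right)\right|$$
$n_{tot}$ is the total number of documents in our corpus (5,164,316).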
We compute several co-occurrence measures 21 to determine whether a co-occurrence between a term referring to a phenotype and a term referring to a disease is significant. In particular, we compute the Normalized Pointwise Mutual Information (NPMI), T-Score, Z-Score, and Lexicographer's Mutual Information (LMI) measures. We use NPMI as our primary scoring function for phenotype-disease associations; the other scoring functions are pre-computed and made available for further analysis on our website. Based on a score for a co-occurrence, we can sort phenotype associations for each disease based on decreasing score values.
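For orientation, the co-occurrence measures named above can be written in their commonly used textbook forms; these are assumed here and may differ in detail from the definitions used for the published scores. With the observed count O = n_DP and the count expected under independence E = n_D · n_P / n_tot, the usual forms are NPMI = log(O/E) / (−log(O/n_tot)), T-Score = (O − E)/√O, Z-Score = (O − E)/√E and LMI = O · log(O/E). A minimal Python sketch under these assumptions:

```python
import math


def expected(n_d: int, n_p: int, n_tot: int) -> float:
    """Co-occurrence count expected if D and P were independent."""
    return n_d * n_p / n_tot


def npmi(n_dp: int, n_d: int, n_p: int, n_tot: int) -> float:
    """Normalized pointwise mutual information in [-1, 1] (assumed standard form)."""
    if n_dp == 0:
        return -1.0  # the terms never co-occur
    if n_dp == n_tot:
        return 1.0   # degenerate case: both terms occur in every document
    pmi = math.log(n_dp / expected(n_d, n_p, n_tot))
    return pmi / -math.log(n_dp / n_tot)


def t_score(n_dp: int, n_d: int, n_p: int, n_tot: int) -> float:
    """(observed - expected) / sqrt(observed); requires n_dp > 0."""
    return (n_dp - expected(n_d, n_p, n_tot)) / math.sqrt(n_dp)


def z_score(n_dp: int, n_d: int, n_p: int, n_tot: int) -> float:
    """(observed - expected) / sqrt(expected); requires n_d, n_p > 0."""
    e = expected(n_d, n_p, n_tot)
    return (n_dp - e) / math.sqrt(e)


def lmi(n_dp: int, n_d: int, n_p: int, n_tot: int) -> float:
    """observed * log(observed / expected); defined as 0 when n_dp is 0."""
    if n_dp == 0:
        return 0.0
    return n_dp * math.log(n_dp / expected(n_d, n_p, n_tot))


# Hypothetical counts: two terms that always occur together in 10 of 1,000 documents.
perfect = npmi(10, 10, 10, 1000)
# Hypothetical counts: two terms that co-occur exactly as often as chance predicts.
independent = npmi(10, 100, 100, 1000)
```

NPMI is bounded between −1 and 1, which makes scores comparable across terms of very different frequency; the other three measures grow with the raw counts.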
Using this sorted list, we then compute a rank for an association such that the highest-scoring association for a disease is on rank 0. We use this ranking based on the NPMI score to determine a rank-based cut-off; in particular, we set a cut-off based on the highest-scoring p percent of the associations. Using the rank as cut-off instead of the raw score value allows comparison across multiple diseases, as each disease will have the same number of phenotypes associated, independent of the actual value of the score.
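The rank-based cut-off can be sketched as follows. The sketch assumes one possible reading of the procedure: the percentage p is converted into a single rank threshold relative to the number of candidate phenotype classes, so that every disease keeps the same number of top-ranked phenotypes (or fewer, if it has fewer candidates). All names and scores are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def rank_associations(scores: Dict[str, float]) -> List[Tuple[int, str, float]]:
    """Sort phenotypes by decreasing score; the best association gets rank 0."""
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, phenotype, score) for rank, (phenotype, score) in enumerate(ordered)]


def apply_cutoff(
    scores_by_disease: Dict[str, Dict[str, float]],
    n_phenotype_classes: int,
    percent: float,
) -> Dict[str, List[str]]:
    """Keep, for each disease, the phenotypes whose rank is below a shared threshold."""
    k = math.ceil(n_phenotype_classes * percent / 100)
    return {
        disease: [p for rank, p, _ in rank_associations(scores) if rank < k]
        for disease, scores in scores_by_disease.items()
    }


# Hypothetical NPMI scores for two diseases over four candidate phenotype classes.
scores_by_disease = {
    "d1": {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.3},
    "d2": {"a": 0.2, "b": 0.8},
}
selected = apply_cutoff(scores_by_disease, n_phenotype_classes=4, percent=50)
```

Because the threshold is a rank and not a score, a disease with uniformly low scores still receives the same number of phenotypes as a disease with uniformly high scores.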

Semantic similarity
We use the PhenomeNET system to compute the semantic similarity between disease phenotypes and mouse model phenotypes. PhenomeNET 14 integrates multiple species-specific phenotype ontologies into a single structure in which classes are related based on their formal definitions. 20 For example, the HPO class Tetralogy of Fallot (HP:0001636) will become a subclass of the MP classes ventricular septal defect (MP:0010402), overriding aortic valve (MP:0000273) and abnormal blood vessel morphology (MP:0000252), among others, based on the definitions that were developed for the classes in both ontologies. As a consequence of this cross-species integration, it becomes possible to directly compare phenotypes and sets of phenotypes across species using approaches based on semantic similarity. 41 To compare sets of phenotypes (either associated with a disease, or observed in a mouse model), we use the set-based simGIC measure. 42 The simGIC measure is based on the Jaccard index weighted by information content of a class within the corpus consisting of mouse models and diseases:
$$\mathit{simGIC}(P, Q) = \frac{\sum_{x \in Cl(P) \cap Cl(Q)} IC(x)}{\sum_{x \in Cl(P) \cup Cl(Q)} IC(x)}$$
where Cl(X) is the smallest set containing X and which is closed against the superclasses relation (i.e., $Cl(X) = \{a \mid a \in X \vee \exists y (y \in X \wedge a \sqsupseteq y)\}$). IC(x) is the information content of a class x within the corpus of mouse models and diseases (i.e., $IC(x) = -\log(p(x))$).
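The following Python sketch illustrates a simGIC-style computation on a toy hierarchy. It assumes that p(x) is estimated as the fraction of annotated entities (diseases or mouse models) whose superclass-closed annotation set contains x; the class names, entities and helper names are hypothetical and unrelated to the PhenomeNET implementation.

```python
import math
from typing import Dict, Set


def closure(classes: Set[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Cl(X): the classes in X together with all of their superclasses."""
    seen: Set[str] = set()
    stack = list(classes)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def information_content(
    annotations: Dict[str, Set[str]], parents: Dict[str, Set[str]]
) -> Dict[str, float]:
    """IC(x) = -log(p(x)), with p(x) the fraction of entities annotated with x."""
    counts: Dict[str, int] = {}
    for classes in annotations.values():
        for c in closure(classes, parents):
            counts[c] = counts.get(c, 0) + 1
    total = len(annotations)
    return {c: -math.log(n / total) for c, n in counts.items()}


def sim_gic(
    p: Set[str], q: Set[str], parents: Dict[str, Set[str]], ic: Dict[str, float]
) -> float:
    """Jaccard index over superclass-closed sets, weighted by information content."""
    cp, cq = closure(p, parents), closure(q, parents)
    union = sum(ic.get(x, 0.0) for x in cp | cq)
    shared = sum(ic.get(x, 0.0) for x in cp & cq)
    return shared / union if union > 0 else 0.0


# Hypothetical hierarchy: D is a subclass of B; B and C are subclasses of the root A.
parents = {"B": {"A"}, "C": {"A"}, "D": {"B"}}
annotations = {"e1": {"D"}, "e2": {"B"}, "e3": {"C"}, "e4": {"C"}}
ic = information_content(annotations, parents)

related = sim_gic({"D"}, {"B"}, parents, ic)
unrelated = sim_gic({"D"}, {"C"}, parents, ic)
```

In this toy corpus the root class A annotates every entity and therefore has an information content of zero, so two phenotype sets that share only the root receive a similarity of zero.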

Evaluation
We evaluate our text-mined disease phenotypes by comparing the semantic similarity between the disease phenotypes and mouse model phenotypes. We assume that semantic similarity over phenotype ontologies ("phenotypic similarity") is indicative of a causal relation between the mutation underlying the mouse phenotypes and the disease. For this purpose, we compare the results against three curated datasets of known gene-disease associations for heritable diseases. All other associations are treated as negative instances for the purpose of the evaluation. As evaluation datasets, we use two sets of gene-disease associations: the gene-disease associations from OMIM's MorbidMap 3 and the genotype-disease associations from the MGI database. 38 We generate the third evaluation dataset by taking the genotype-disease associations from the MGI database, filtering by single gene knockouts, and merging all phenotypes associated with one gene. As a result, our third dataset consists of gene-disease associations. We refer to the three evaluation datasets as "OMIM", "MGI" and "MGI (genes)", respectively.
We use receiver operating characteristic (ROC) analysis to evaluate and quantify the predictive power of the text-mined disease phenotypes. A ROC curve is a plot of the true positive rate of a classifier as a function of the false positive rate, and the area under the ROC curve (ROCAUC) is a quantitative measure of a classifier's quality. 43 We report the ROCAUC values together with an estimate of the 95% confidence interval: 44 we use $\sigma^2_{max} = \frac{AUC(1-AUC)}{\min\{m,n\}}$, with m and n being the number of positive and negative instances in the evaluation dataset, and then use $AUC \pm 2\sigma$ as an estimate of the 95% confidence interval.
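A small Python sketch can make the evaluation concrete. It computes the AUC through its rank interpretation (the probability that a randomly chosen positive instance is scored higher than a randomly chosen negative one, with ties counted as one half) and then applies the confidence interval estimate given above. The function names and toy scores are hypothetical, and the pairwise loop is quadratic, so it is meant for illustration and not for large datasets.

```python
import math
from typing import Sequence, Tuple


def roc_auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """AUC as the fraction of positive/negative pairs that are ranked correctly."""
    positives = [s for s, is_pos in zip(scores, labels) if is_pos]
    negatives = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative instance")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def auc_interval(auc: float, m: int, n: int) -> Tuple[float, float]:
    """AUC +/- 2 * sigma with sigma^2 = AUC * (1 - AUC) / min(m, n)."""
    sigma = math.sqrt(auc * (1.0 - auc) / min(m, n))
    return auc - 2.0 * sigma, auc + 2.0 * sigma


# Hypothetical similarity scores; True marks a known gene-disease association.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [True, True, False, True, False, False]

auc = roc_auc(scores, labels)
low, high = auc_interval(auc, m=3, n=3)
```

Because the interval width depends on the smaller of the two class sizes, evaluation sets with few positive instances yield wide intervals even when the AUC itself is high.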

Interface and visualization
The web interface was written in Groovy (backend) and Javascript (frontend). The disease network was visualized using the Gephi graph visualization tool, 45 and the disease network browser was generated with Gephi's Sigma.js export plugin (http://blogs.oii.ox.ac.uk/vis/). All graphs are visualized using a force-directed layout.

Figures

Figure 6. An overview over the disease-disease similarity network generated by our approach as well as six disease modules obtained by filtering for disease categories in DO. The graph is based on a force-directed layout using the similarity between diseases as attraction force.

Figure 7. The figure shows the ROC curve for ranked retrieval of MGI disease models by semantic similarity to text-mined phenotypes of diseases without clinical synopsis in OMIM (ROC AUC: 0.940 ± 0.018).

Figure 9. The sub-network around lysosomal storage diseases.
Figure 1. The figure shows the ROCAUC values obtained when using different cutoffs for the rank of the pointwise mutual information co-occurrence measure. Based on three different evaluation datasets, we find that the top 0.4% ranking co-occurrences (NPMI-based) maximize the ROCAUC across our datasets.

Figure 4. The figure shows the ROC curve for cross-species prioritization of disease models using

Figure 5. The figure shows the ROC curve for ranked retrieval of OMIM diseases by semantic similarity to our text-mined disease phenotypes (ROC AUC: 0.939 ± 0.011).

Table 1. Phenotypic homogeneity of disease categories. We compute ROCAUC values for top-level categories in DO. Diseases are ranked based on phenotypic similarity, true positive matches are diseases in the same top-level DO category, and negative matches are diseases in different DO categories.