A network medicine approach to quantify distance between hereditary disease modules on the interactome

We introduce a MeSH-based method that accurately quantifies similarity between heritable diseases at the molecular level. This method effectively brings together the existing information about diseases that is scattered across the vast corpus of biomedical literature. We show that sets of MeSH terms provide a highly descriptive representation of heritable diseases and that the structure of MeSH provides a natural way of combining individual MeSH vocabularies. We also show that our measure can be used effectively in the prediction of candidate disease genes. We developed a web application to query more than 28.5 million relationships between 7,574 hereditary diseases (96% of OMIM) based on our similarity measure.


A brief introduction to The Online Mendelian Inheritance in Man
OMIM records are prefixed with a character denoting their type (i.e. whether the entry describes a phenotype or a gene). Diseases are represented by four prefixes: "+", "#", "%" and "null", where "null" represents the lack of a prefix. OMIM comprises a total of 23,611 records, of which 7,812 correspond exclusively to diseases.

OMIM and the links to MEDLINE
Each OMIM record contains the hand-curated key references that describe the disease.
OMIM entries do not provide "new" information; they are compendiums of the knowledge available in the literature and, as such, the records are continually refined to reflect the latest knowledge about a particular disease. The vast majority of references are specified in the form of PubMed identifiers.
We retrieve the PubMed identifiers for the OMIM diseases by querying the OMIM API, which results in 7,609 records mapped to 71,083 references, of which 62,829 are unique. The 203 missing OMIM diseases correspond to entries for which no publication could be obtained through API queries to OMIM.
OMIM shows an imbalance in the study of inherited diseases
Figure 1 shows the number of publications referenced by the OMIM entries, reflecting the fact that highly prevalent and easily diagnosed Mendelian disorders were elucidated first, leaving a large number of rare diseases understudied [2].
In the first case, there are two possibilities. The first possibility is that the OMIM disease does not reference any publication, as in the case of Fragile Site 20p11 (MIM: 136590).

A brief introduction to The Medical Subject Headings thesauri
The second possibility, which is the most common one, is that the referenced publications are not indexed in PubMed. A total of 203 OMIM diseases fall into these two categories.
The second case relates to the lack of MeSH terms associated with a few publications in [...]. In some exceptional cases, the OMIM entry referenced a non-existent PubMed identifier, as in the case of MIM: 601419 referencing PubMed identifier 10553984. These cases were reported to the staff at OMIM.

A brief description of the semantic similarity measures analysed
Semantic similarity measures can be classified into term-based and graph-based [3]. Term-based measures determine similarities between pairs of terms in an ontology, while graph-based measures determine similarity between the annotated objects. We chose a representative set of each type, composed of some of the best-known semantic similarity measures: for the term-based measures we chose Resnik [4], Lin [5], and Jiang and Conrath [6], and for the graph-based measures simUI and simGIC [7].
Except for simUI, the aforementioned measures rely on the concept of information content of the ontology terms. The information content of a term $t$ is defined as $\mathrm{IC}(t) = -\log(p(t))$, where $p(t)$ is the probability of the term, defined as the quotient between the number of objects annotated with $t$ and the total number of objects annotated with any ontology term.
The true-path rule states that an object annotated with a term $t$ is also annotated with all ancestors of $t$. This results in $p(t)$ always being smaller than or equal to the probability of any of its ancestors. Therefore, the information content decreases as one moves up the ontology, with the root having no information at all. That is, $\mathrm{IC}(\mathrm{root}) = 0$, implying that the probability of a random object being annotated with the root is 1, in accordance with the true-path rule.
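The definition of information content under the true-path rule can be illustrated with a minimal Python sketch. The toy ontology, the term names and the disease annotations below are hypothetical and serve only as an illustration; the sketch also assumes that the denominator of $p(t)$ is the total number of annotated objects, so that the root has probability 1.

```python
import math

# Hypothetical toy ontology: term -> list of parents (a DAG with a single root).
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a"],
    "d": ["a", "b"],
}

# Hypothetical direct annotations: object -> set of terms.
ANNOTATIONS = {
    "disease1": {"c"},
    "disease2": {"d"},
    "disease3": {"b"},
    "disease4": {"c", "d"},
}


def ancestors(term, parents):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def information_content(annotations, parents):
    """IC(t) = -log p(t), counting objects after applying the true-path rule."""
    counts = {term: 0 for term in parents}
    for terms in annotations.values():
        # True-path rule: an object inherits every ancestor of its terms.
        closure = set()
        for term in terms:
            closure |= ancestors(term, parents)
        for term in closure:
            counts[term] += 1
    total = len(annotations)
    # Terms that annotate no object have undefined IC and are skipped.
    return {t: -math.log(c / total) for t, c in counts.items() if c > 0}


ic = information_content(ANNOTATIONS, PARENTS)
```

In this toy example every object is annotated with the root, so its information content is zero, and the information content never increases when moving from a term to one of its ancestors.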
For term-based measures, the similarity of each pair of objects is determined by a matrix composed of the pairwise similarities between the terms annotating each object. If we consider a simple example with two objects annotated as $A = \{t_1, t_2\}$ and $B = \{t_1\}$, the similarities of the term pairs $\{t_1, t_2\}$ and $\{t_1, t_1\}$ should fully determine the similarity of $A$ and $B$. To obtain a single similarity score for $A$ and $B$, a choice among the similarities of the individual pairs has to be made. A few options are available [3], such as choosing the similarity of the most similar pair, the similarity closest to the arithmetic mean or the median similarity. For this work we chose the similarity of the most similar pair.
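The aggregation of a pairwise term-similarity matrix into a single object score can be sketched as follows. This is only an illustration: the function names and the term-similarity values in `TOY_SIM` are hypothetical, and any term-based measure could be plugged in as `term_sim`.

```python
from statistics import mean, median


def pairwise_matrix(terms_a, terms_b, term_sim):
    """Matrix of term-to-term similarities (rows: terms of A, columns: terms of B)."""
    return [[term_sim(a, b) for b in terms_b] for a in terms_a]


def aggregate(matrix, strategy="max"):
    """Collapse the pairwise matrix into one score for the two objects."""
    values = [v for row in matrix for v in row]
    if strategy == "max":
        # Similarity of the most similar pair (the option chosen in this work).
        return max(values)
    if strategy == "mean":
        # Similarity closest to the arithmetic mean of all pairs.
        m = mean(values)
        return min(values, key=lambda v: abs(v - m))
    if strategy == "median":
        return median(values)
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical term similarities for the example A = {t1, t2}, B = {t1}.
TOY_SIM = {("t1", "t1"): 1.0, ("t2", "t1"): 0.4}
matrix = pairwise_matrix(["t1", "t2"], ["t1"], lambda a, b: TOY_SIM[(a, b)])
best_match = aggregate(matrix, "max")
```

With the best-match strategy, a single highly similar pair of terms is enough to produce a high score for the two objects, regardless of the remaining pairs.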

Resnik [4]
Resnik defines the semantic similarity between two terms $a$ and $b$ in an ontology as the information content of their most informative common ancestor, i.e. the lowest common ancestor $\mathrm{LCA}(a, b)$: $\mathrm{sim}_{Resnik}(a, b) = -\log(p(\mathrm{LCA}(a, b)))$
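As a minimal sketch of Resnik's measure, the following Python code looks for the most informative common ancestor of two terms. The toy ontology and the term probabilities are hypothetical values chosen to be consistent with the true-path rule.

```python
import math

# Hypothetical toy ontology: term -> list of parents.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a"],
    "d": ["a", "b"],
}

# Hypothetical term probabilities, consistent with the true-path rule.
P = {"root": 1.0, "a": 0.75, "b": 0.75, "c": 0.5, "d": 0.5}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def resnik(term_a, term_b):
    """Information content of the most informative common ancestor."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(-math.log(P[t]) for t in common)
```

For instance, `resnik("c", "d")` is the information content of `a`, their only common ancestor besides the root, while two terms that share only the root obtain a similarity of zero.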

Lin [5]
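Lin's similarity measure normalises Resnik's proposal to account for the divergence between the terms: $\mathrm{sim}_{Lin}(a, b) = \frac{2 \log(p(\mathrm{LCA}(a, b)))}{\log(p(a)) + \log(p(b))}$

Jiang and Conrath [6]
Jiang and Conrath propose a distance measure: $d_{JC}(a, b) = \mathrm{IC}(a) + \mathrm{IC}(b) - 2\,\mathrm{IC}(\mathrm{LCA}(a, b))$. Converting this distance into a similarity yields $\mathrm{sim}_{JC}(a, b) = 1 - d_{JC}(a, b)/M$, where $M$ is the maximum possible value of $d_{JC}$.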
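The measures of Lin and of Jiang and Conrath can be sketched in Python in terms of information content values. The sketch uses the standard formulations of the two measures; the function names are hypothetical, the value returned by `lin` when both terms are the root is a convention assumed here, and the maximum distance `M` must be supplied by the caller.

```python
import math


def lin(ic_a, ic_b, ic_lca):
    """Lin similarity: Resnik's score normalised by the IC of both terms."""
    if ic_a + ic_b == 0:
        # Both terms are the root; returning 0.0 is a convention assumed here.
        return 0.0
    return 2 * ic_lca / (ic_a + ic_b)


def jc_distance(ic_a, ic_b, ic_lca):
    """Jiang and Conrath distance between two terms."""
    return ic_a + ic_b - 2 * ic_lca


def jc_similarity(ic_a, ic_b, ic_lca, max_distance):
    """Distance rescaled into a similarity, with M = max_distance."""
    return 1 - jc_distance(ic_a, ic_b, ic_lca) / max_distance


# Hypothetical example: two terms with p = 0.5 whose LCA has p = 0.75.
ic_term = -math.log(0.5)
ic_lca = -math.log(0.75)
example_lin = lin(ic_term, ic_term, ic_lca)
example_jc = jc_similarity(ic_term, ic_term, ic_lca, max_distance=2 * ic_term)
```

Both measures reach their maximum when the two terms coincide, since in that case the lowest common ancestor is the term itself.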

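Graph-based measures
simUI is based on the overlap of the terms annotating two objects $A$ and $B$: $\mathrm{simUI}(A, B) = \frac{|T(A) \cap T(B)|}{|T(A) \cup T(B)|}$, where $T(X)$ denotes the set of terms annotating object $X$ (including their ancestors, by the true-path rule).
simGIC is an improvement on simUI and it is based on a weighted Jaccard index, where the weight of each element is its information content [3]. Similarity between any two objects is defined as $\mathrm{simGIC}(A, B) = \frac{\sum_{t \in T(A) \cap T(B)} \mathrm{IC}(t)}{\sum_{t \in T(A) \cup T(B)} \mathrm{IC}(t)}$

The disease similarity methods discussed in the paper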

Zhou et al.
Zhou et al. mine PubMed, extracting the MeSH terms associated with each publication, and analyse the co-occurrence of a symptom term (terms in the C23 ontology in MeSH) and a disease term (terms in the C01-C26 ontologies in MeSH, except for C22 and C23). This co-occurrence is compiled into a feature vector that characterises each disease based on the frequency of its symptoms across PubMed. The similarity between every pair of diseases is obtained by computing the cosine of the angle between their feature vectors and then retaining only the statistically significant scores.
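The cosine step of this approach can be illustrated with a short Python sketch. The symptom names and counts are hypothetical, the feature vectors are stored as sparse dictionaries of raw co-occurrence counts (any weighting applied by the original authors is not reproduced), and the statistical significance filter is omitted.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse feature vectors (dicts)."""
    shared = set(u) & set(v)
    dot = sum(u[k] * v[k] for k in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        # A disease without symptom co-occurrences is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical symptom co-occurrence counts for three diseases.
disease_x = {"fever": 12, "cough": 5}
disease_y = {"fever": 6, "cough": 2, "fatigue": 1}
disease_z = {"seizures": 9}

sim_xy = cosine(disease_x, disease_y)
sim_xz = cosine(disease_x, disease_z)
```

Diseases that share no symptom terms, such as `disease_x` and `disease_z` here, obtain a similarity of zero.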
Park et al.
The main premise of the work of Park et al. is that diseases whose proteins share a common subcellular localisation are phenotypically related. Similarity between two diseases is determined by an association score based on the cellular co-localisation of their disease proteins.

Robinson et al.
Robinson et al. propose the Human Phenotype Ontology (HPO), a purpose-specific ontology. This manually curated ontology contains terms for the description of all phenotypic abnormalities for the diseases in OMIM. A text-mining analysis of OMIM, followed by careful manual curation, produces over 10,000 terms in the HPO. The ontology is then used to compute the semantic similarity between the diseases with an information-content-based similarity measure.

Simple similarity measures
To keep this section self-contained we reproduce the definitions of the simple similarity measures presented in the Methods section of the main paper.
We constructed several simple disease similarity measures to explore the impact of the MeSH ontology structure on the accuracy of disease similarity measures.

8. Details on the construction of the evaluation datasets.
To ensure this section is self-contained, the following paragraph presents a brief synopsis of the evaluation criteria presented in the Online Methods of the paper.
For the evaluation of our disease similarity measure we follow the approach presented by van Driel et al. [8] and assess the accuracy of our scores with respect to three binary relationships defining molecular relatedness between pairs of diseases.

Pfam dataset
The first relationship proposed by van Driel et al. is based on the co-occurrence of Pfam-A signatures (i.e. families, domains, motifs or repeats), and it relates two diseases if any of their disease-proteins share at least one of these signatures.
We noticed that certain MeSH terms correspond to Pfam signatures, and this fact could introduce a bias in the evaluation. With an automated analysis of MeSH terms followed by manual curation, we found 113 descriptors that correspond exactly to a Pfam signature (shown in Table 2). We then excluded from our evaluation any disease pair in which a protein's Pfam signature matched any of the ones found in MeSH.
This results in 33,660 pairs relating 2,647 OMIM diseases.

First, the scores produced by the simple similarity measures do not depend on the specificity of the annotations. That is, the quality of the annotations of the individual diseases is not considered, and this results in a measure with a reduced capability to discriminate between well-annotated diseases and those annotated with very general but overlapping terms. As an example, 5 overlapping but very general terms are as good as 5 very specific overlapping terms. These measures are "coarse", in the sense that they are unable to determine similarity between any two pairs of slightly similar diseases. When the evidence is abundant, these measures produce accurate similarity scores. The ontological similarity measures are able to distinguish more nuanced similarity values, particularly [...]

10. Correct use of the MeSH ontological structure is essential for accurate disease similarity calculations
While we show that the use of the ontological structure improves accuracy significantly, it is important to note that, to fully take advantage of the quality of the annotations, the ontology must be used appropriately.
Several similarity measures are available (Section 4 of this Supplementary Material); however, not all perform equally well. Figure 4 shows a comparison of the semantic similarity measures evaluated, where we readily verify that Resnik's similarity measure outperforms all others.

Combining the MeSH ontologies. Performance plots.
It is important to note that there are 16 ontologies in MeSH; thus, the similarity method described so far will result in up to 16 similarity scores for each pair of diseases. Since our aim is to produce a single similarity score for each pair of diseases, we need to combine the ontologies.
Our analysis of MeSH revealed a large overlap between the various ontologies. Nevertheless, note that the most relevant factor in the combination is the existence of paths between the different ontologies, not only the overlap. As shown in Figure 23 (reproduced from the main paper), the shared terms connect most ontologies.

[...] [J] Technology, Industry, Agriculture, [K] Humanities, [L] Information Science, [M] Named Groups, [N] Health Care, [V] Publication Characteristics, [Z] Geographical.
This interconnectedness allows us to combine the ontologies in a simple way. By adding a fictitious node at the top level, connected to all first-level nodes of the MeSH ontologies to be combined, we are able to maintain the ontological structure and obtain a single comprehensive ontology. When used, this combined ontology results in a single score for each pair of diseases. Figure 24 shows the paths and the fictitious root node in a toy example. Terms labelled t1 and t2 illustrate two terms used to annotate diseases.
At this point, having identified the need and the procedure to combine the ontologies, we chose to combine the A, C, D, E and G ontologies. Table 3 shows the number of times the terms in each ontology are used to annotate an OMIM disease. Based on this table, we can safely discard the V (Publication Characteristics) ontology, which is not used to annotate any disease. Analysing the usage and performance of each ontology, we combine the 5 ontologies whose AUC in PPI is above 60% while maintaining a good coverage. We chose the PPI AUC considering that this dataset is the most stringent one: it positively relates far fewer diseases than the Pfam (46% smaller) and the Sequence similarity (41% smaller) datasets. We also tried other combinations and found the results to be equivalent as long as we included ontologies with high coverage.
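The combination step described above, in which a fictitious root is placed above the first-level nodes of the ontologies to be merged, can be sketched in Python as follows. This is a minimal illustration and not the released pipeline: the function names, the root label `ROOT` and the two toy ontologies are hypothetical.

```python
def combine_ontologies(ontologies, root="ROOT"):
    """Merge several ontologies under a single fictitious root.

    `ontologies` maps an ontology name to a dict term -> list of parents,
    where first-level terms have an empty list of parents.
    """
    combined = {root: set()}
    for parents in ontologies.values():
        for term, term_parents in parents.items():
            merged = combined.setdefault(term, set())
            # First-level nodes are attached to the fictitious root;
            # all other edges are kept, so shared terms link the ontologies.
            merged.update(term_parents if term_parents else [root])
    return combined


def ancestors(combined, term):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(combined[current])
    return seen


# Hypothetical toy ontologies sharing the term "s".
toy = {
    "A": {"A1": [], "x": ["A1"], "s": ["x"]},
    "C": {"C1": [], "s": ["C1"], "y": ["s"]},
}
combined = combine_ontologies(toy)
```

In this toy example the shared term `s` keeps its parents from both ontologies, so a term such as `y` has ancestors in both, and any two terms have at least the fictitious root as a common ancestor.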
14. Performance comparison of existing methods of disease similarity. ROC plots and Bar charts.
In this section we present the performance on the Pfam, PPI and Sequence similarity datasets for the disease similarity methods presented in the paper, namely our method, [...]
Note that the ROC curves do not consider coverage; however, to evaluate the practical importance of any measure, coverage has to be considered. Figure 28 replicates Figure 1a of the main paper and shows the composite score combining coverage with AUC for each method.

Small variability in scores for highly similar diseases
When analysing highly similar diseases we noted that, in some cases, there is very little variability in the similarity scores. That is, in a relatively large set of disease pairs, very few different scores are present. As an example, the similarity between Breast Cancer (MIM: 114480) and the 10 diseases most similar to it is shown in Table 4. This is due to the fact that the score depends on the number of diseases annotated with the lowest common ancestor, and it can happen that this number is the same for different pairs of diseases, even if the common ancestor is different. Looking at the lowest common ancestor (LCA) for a few examples, we notice that, while the variability of scores is low, the LCAs are different.

[...] just recently associated to a gene (after July 21, 2014). We then computed the similarity scores using the older version of OMIM and extracted high-similarity pairs to verify the capability of our method at predicting molecular relatedness. Unfortunately, this process had similar issues to our previous method, as it returned only 63 OMIM diseases classified as Complex, which is less than 1% of the total, as well as many diseases (261) classified as Mendelian even though many disease genes have already been associated with them.

Extracting the multigenic disorders from OMIM
We classified all multigenic diseases (with more than one gene) in OMIM as Complex and all monogenic diseases as Mendelian. Here we assume that multiple disease genes complicate the elucidation of the gene-disease relationship and that, therefore, multigenic diseases correspond to the set of inherently Complex diseases.
In this way we obtained a set of 287 Complex diseases and a set of 3,743 Mendelian diseases. There is a statistically significant difference (t-test p-value is 10^-350) between the mean number of disease genes associated to the Complex diseases (3.61) and to the Mendelian diseases (1).
The results of the evaluation on the three datasets, Pfam, PPI and Sequence Similarity, are shown in Figure 32. The composite performance of our method is slightly inferior for the set of Complex diseases with respect to the Mendelian diseases (the overall composite score is 3.08 for Complex and 3.13 for Mendelian). Interestingly, the method by Park, which uses molecular-level information, is the only other method that shows the same behaviour; the methods of Robinson and van Driel obtain a better performance on each of the 3 datasets for Complex rather than Mendelian diseases. Overall, our method is the most stable, as it varies the least in performance between the 2 sets of diseases.
Finally, note that in Figure 32 [...]

21. The impact of the number of genes in a disease pair on its disease similarity score
Diseases with many genes have, on average, higher similarity scores. The higher similarity scores are expected, as diseases with many genes will be more likely to be close to the other diseases in the interactome: informally, we can think of their disease modules as being "larger", and therefore closer.
To show this, we compared the mean similarity between each of two sets of diseases and all other diseases in OMIM. The first set consists exclusively of multigenic diseases (strictly more than one gene) and the second set exclusively of monogenic diseases (exactly one gene). The monogenic set consists of 3,743 diseases and the multigenic set of 287 diseases. The mean similarity between all diseases in the interactome and the diseases in the monogenic set is 1.19, compared to 1.27 between all diseases and the multigenic diseases (p-value: 1e-350). We also verify that the multigenic diseases are closer to all other diseases in the interactome than the monogenic diseases (mean shortest path: multigenic 4.08, monogenic 4.12; p-value: 1e-350).
The number of shared genes between diseases is also reflected in the similarity scores. In [...] As we can see, the similarity scores grow the more genes a pair shares. To verify the significance of the difference in the similarity distributions represented in Figure 34, we performed a pairwise t-test between the diseases sharing no genes (labelled 0 on the X-axis) and all other diseases, and between the diseases sharing 1 gene and all other pairs. In the [...]

The disease similarity resource.
We have produced the Disease Similarity Resource (DSR), a database that provides a starting point for transferring knowledge between diseases and possibly for the discovery of new disease genes using our measure. The DSR consists of the disease pairs whose similarity score was in the top 5% (1,552,356 pairs) and their associated disease genes. Each pair constitutes an entry and contains 5 columns: [...]
As we have shown in the paper, high similarity scores between disease pairs indicate that they are likely to be close on the interactome. These highly similar disease pairs are, therefore, suitable candidates for transferring knowledge between them. Thus the DSR provides a starting point for an in-depth analysis of the relationships and aetiology of the diseases, providing the basis for a statistical gene-discovery process. The DSR is freely available in the Downloads section at http://www.paccanarolab.org/disimweb.
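The selection of the highest-scoring disease pairs can be illustrated with a minimal Python sketch. The function name and the input layout are hypothetical, and rounding the cut-off up and breaking ties by sort order are assumptions made here, not documented properties of the DSR.

```python
import math


def top_fraction(scores, fraction=0.05):
    """Return the highest-scoring (pair, score) entries.

    `scores` maps a disease pair, e.g. a tuple of two MIM numbers,
    to its similarity score.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Assumption: the cut-off is rounded up; ties at the boundary are
    # broken by the (stable) sort order.
    keep = math.ceil(len(ranked) * fraction)
    return ranked[:keep]
```

A call such as `top_fraction(all_scores)` on the full score table would return the entries from which a resource like the DSR could be built, with the disease genes of each pair attached afterwards.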

23. Evaluating performance of our method on multigenic diseases using ROC curves.
It is important to note that for multigenic diseases the prior probability of having an interacting disease pair (a positive) according to one of the relationships (Pfam, PPI and Sequence Similarity) is higher. In fact, when comparing the proportion of positives in multigenic and monogenic diseases we have: for Pfam 5% vs 1%; for PPI 3% vs 1%; for Sequence Similarity 5% vs 1%. However, the performance of our method does not improve for complex diseases, even if the number of positives in the test set is higher; in fact, results are slightly worse (see section 19 of this supplementary material). This is because the area under the ROC curve does not measure only the accuracy at predicting positives. By framing the evaluation of our method as a binary classification problem we evaluate, through ROC curves, the capability of the method to predict positives as well as the capability to predict negatives in the 3 datasets. Thus, when the number of positives in the dataset increases, a method would improve its performance if and only if it were able to produce higher scores only for those positives; an overall mean increase in the scores would not necessarily improve the AUC.
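The last point can be checked with a small Python sketch that computes the AUC as the probability that a randomly chosen positive pair scores higher than a randomly chosen negative pair. The scores and labels are hypothetical toy values.

```python
def auc(scores, labels):
    """Area under the ROC curve, computed from pairwise comparisons.

    Equals the probability that a random positive scores above a random
    negative, with ties counted as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical scores for five disease pairs; 1 marks a positive pair.
scores = [0.9, 0.4, 0.6, 0.3, 0.2]
labels = [1, 1, 0, 0, 0]

base = auc(scores, labels)
# Raising every score by the same amount leaves the ranking unchanged.
shifted = auc([s + 1.0 for s in scores], labels)
# Raising only the scores of the positives improves the ranking.
boosted = auc([s + 1.0 if y == 1 else s for s, y in zip(scores, labels)], labels)
```

Here `shifted` equals `base`, whereas `boosted` is higher: the AUC changes only when the relative ranking of positives and negatives changes.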
24. Appendix: how to run our disease similarity pipeline
All data is available from www.paccanarolab.org/disease_similarity. The code is released under GPLv3 and is available from https://github.com/pwac092/disim_calculator
Data:
- OMIM data was downloaded on July 21 2014.
- MeSH data corresponds to the 2014 release.
A full browser is available at www.paccanarolab.org/disimweb
Extracting the OMIM data: We have to manually download the data from omim.org (registration is required) and extract the desired MIM numbers from the catalogue.