Genetic and functional characterization of disease associations explains comorbidity

Understanding relationships between diseases, such as comorbidities, has important socio-economic implications, ranging from clinical study design to health care planning. Most studies characterize disease comorbidity using shared genetic origins, ignoring pathway-based commonalities between diseases. In this study, we define the disease pathways using an interactome-based extension of known disease-genes and introduce several measures of functional overlap. The analysis reveals 206 significant links among 94 diseases, giving rise to a highly clustered disease association network. We observe that around 95% of the links in the disease network, though not identified by genetic overlap, are discovered by functional overlap. This disease network portraits rheumatoid arthritis, asthma, atherosclerosis, pulmonary diseases and Crohn’s disease as hubs and thus pointing to common inflammatory processes underlying disease pathophysiology. We identify several described associations such as the inverse comorbidity relationship between Alzheimer’s disease and neoplasms. Furthermore, we investigate the disruptions in protein interactions by mapping mutations onto the domains involved in the interaction, suggesting hypotheses on the causal link between diseases. Finally, we provide several proof-of-principle examples in which we model the effect of the mutation and the change of the association strength, which could explain the observed comorbidity between diseases caused by the same genetic alterations.


Protein-Protein interaction network
The human protein interaction network used in this study, hereinafter referred to as PIN, was derived using BIANA 1 , by integration of interactomic data from: HPRD 2 , DIP 3 , MIPS 4 , BioGRID 5 , BIND 6 , IntAct 7 and MINT 8 databases using the protein sequence, UniProt Accession number 9 and NCBI Entrez gene identifiers 10 as unifying criteria. The resulting PIN was composed of 11,123 nodes (i.e. proteins) and 149,931 edges (i.e. interactions.) The mapping of proteins in the PIN and its corresponding disease-associated genes was done using using the NCBI Entrez gene identifiers (see next section).

Genetic disease data
The collection of genes associated with diseases was obtained from DisGeNET database 11 , Two different snapshot versions of DisGeNET were used in this study: DisGeNET v.1.0 12 , referred as DGN1 and DisGeNET v2.0 11 , referred as DGN2 throughout the text. The mapping between genes in DGN1 and DGN2 was done using the Unified Medical Language System (UMLS) 13 disease identifiers in the Medical Subject Headings (MeSH) 14 . For this mapping, whenever possible, the Online Mendelian Inheritance in Man (OMIM) identifiers 15 were matched to corresponding MeSH terms using UMLS.

BIANA approach to integrate several resources of protein-protein interaction data
BIANA (Biologic Interactions and Network Analysis) is a platform designed to compile and integrate interactomic data from multiple sources in a comprehensive and traceable manner 1 . BIANA uses a high-level abstraction schema to integrate and define external repositories or databases compiling interactomics and protein-ligand (small chemical) information. The two unique features of BIANA are: (i) its unification protocol (i.e. a set of rules defined by the users that determine how data from various sources is combined) offering also the possibility of cross-checking data across different databases; and (ii) its traceability: merged entities can always be traced back to its original source.

GUILD: Network-based disease gene-prioritization predictions
GUILD is a network-based tool to predict and rank genes linked to biological processes and disease phenotypes by combining experimental data and graph-based guilt-by-association algorithms 16 . The hypothesis of these methods is that, for a given phenotype and a set of core genes (e.g. known disease-genes or user-defined genes), the interactors of these genes can also be associated with the same phenotype. Based on this principle, GUILD predicts and ranks potential novel genes associated with the given phenotype based on the connectedness to a core set of genes through the underlying PPI network. GUILD uses BIANA to create an integrated knowledge base of protein-coding genes, their functional and disease annotations and the interactions between them. GUILD combines the results of three algorithms based on the message passing of information (NetScore and NetZcore) and the shortest paths between a gene and the set of core genes associated with a disease (NetShort).

"Guilt-by-association" strategy to tackle the incompleteness of disease-gene relationships
A common approach to establish relationships between diseases is finding disease-related genes that are common to both diseases 17,18 . To investigate disease-disease relationships originating from shared genes, we used information of 3084 diseases annotated in DisGeNET v1.0 database (DGN1), yielding 4,753,986 possible disease pairs.
We used GUILD 16,19 , a network-based disease-gene prioritization tool based on guilt-byassociation to extend the number of potential genes associated to a given disease. GUILD exploits the underlying PIN to rank all the genes in the network in respect to their topological closeness to the set of disease-associated genes (seeds). GUILD has previously been used to identify novel candidate genes in biological and pathological processes such as apoptosis 20 and brain and lung metastases 21 . In this project, we studied whether GUILD predictions could reduce our lack of completeness in two issues: (i) expand the knowledge on existing relationships, i.e. diseases with common known genes, by including new common genes; and (ii) unveil novel links between diseases which would have been otherwise hidden considering only known genes, i.e. by using only annotated data in DisGeNET database.
Accordingly, we compared potential comorbidities based on disease-gene annotation, i.e. those obtained through common genes from DGN1 only, with relationships found after expanding DGN1 disease-associated genes with GUILD. We performed two independent tests: 1) DGN1-GUILD test, that compares potential comorbidities based on disease-annotation with the relationships obtained after expanding original disease-disease associations (DDAs) with GUILD ( Figure S2A panel); and 2) Expanded-DGN1-GUILD test, where the overlap was obtained with the extended set of genes but discarding the shared seeds before expansion ( Figure S2B panel). The significance of the predictions of DDAs was calculated in terms of recovery of genes linked to the partner disease, using a p-value derived from a hypergeometric distribution (p-value < 0.05): where N is the sample size, K is the number of genes associated to the disease dis2 as per DGN1, n is the size of the set predicted for dis1 (seeds from DGN1 and extended set, i.e. genes with a predicted Z-score > 2), k is the number of correct predictions (i.e. number of genes of K set within n, i.e. intersection between K and n).
To address the first issue above (i), i.e. expanding our knowledge on existing DDAs, we studied GUILD predictions with DGN1-GUILD test. To address the second issue (ii), i.e. unveil novel links between diseases, we used the Expanded-DGN1-GUILD test. It must be noted that, in this test, predictions were performed using orthogonal seed genes, i.e. genes that are unique to either of the diseases. Besides, p-values of association between diseases were significant in both directions, e.g. in the test between two diseases, A and B, the enrichment of genes associated with disease B was obtained with orthogonal genes of disease A used as seeds, and vice-versa ( Figure S2).

Quantifying disease relationships using shared genes and gene functions
To quantify the DDAs, a composite score based on the number of common genes and gene- (i) For CG measure: The significance of the relationship between a given pair of diseases was considered by looking at the overlap between genes ( Figure 1C). In this case, the variables N, K, n and k in Eqn 1 represent: the size of PIN of both diseases (i.e. number of nodes associated to both diseases including seeds and expanded set); total number of genes annotated in DGN2 that are common to both diseases; number predicted genes shared between both diseases excluding the seed genes, i.e. GUILD expanded set; and the number of correct predictions, i.e. intersection between K and n ( Figure 1C) For FCG-RP, FCG-GObp, and FCG-GOmf measures: Links between diseases were computed as in CG but considering the functions of the genes instead of the genes themselves ( Figure 1D). Thus, genes were converted into function descriptors using three different sources of functional annotations1) Reactome database 22  Note that in (ii) and (iii), we apply the hypergeometric test twice, first to identify functions induced by common genes and second to identify the significance of the overlap between these functions.
During functional enrichment calculations, we only considered significantly enriched functions, that is, those with corrected p-value < 0.05.
We tried two multiple testing correction strategies, Bonferroni and FDR (Benjamini Hochberg), to identify significantly linked DDAs (corrected p-value < 0.05). Depending on the multiple testing correction considered, we identify disease-disease associations with strict (Bonferroni) or relaxed (FDR) criteria. A DDA will have a score between 0 and 7, named composite score, being 0 for noassociation and 1-7 for the number of significant measures.

Clustering diseases of the disease network.
Once the DDAs were identified, a disease network (DN) was derived using the DDAs identified according to strict criterion. Diseases (nodes) were connected based on the composite score (edges, see above). The idea behind deriving a DN was to facilitate the complex analysis between groups of diseases. With this analysis, we gain further understanding on the interplay between diseases and relationships among them, as the DN provides a more comprehensive representation than isolated pair-wise relationships. The next level of abstraction was to cluster the diseases by identifying highly connected modules within the network. Accordingly, the weighted adjacency matrix representing the DN was clustered using the network-based Markov Cluster Algorithm (MCL) 24 with default parameters.
The DN was also characterized in terms of betweenness centrality of the edges. Edge betweenness centrality was assigned to disease-disease links by using the Networkx python package 25 . This topological measure used in graph theory informs on the role of a given edge as a gatekeeper of the information flow in the network 26 . In the context of the DN, a high edge betweenness suggests a central role in the relationship between two diseases. To find which of the seven measures produced the higher betweenness-centrality values, we compared the betweenness centrality distribution of each single measure with the remaining six approaches.
Then, we used a Kolmogorov-Smirnov test of the distributions to confirm the significance of the measure.

Mapping mutations to protein interfaces
To further understand the molecular basis between two linked diseases, we characterized the Whenever possible, we characterized the mutations linked to protein interfaces by using the structure of protein complexes from 3DiD 31 . In the case of protein complexes with known structure, the interface was defined as the set of residues of each protein at less than 12 Angstroms Cb-Cb distance, as defined in 3DiD. For complexes of proteins highly similar to the known interactions (usually referred as interologs), we inferred the residues in the interface from the alignment with the corresponding template structure using Align 32 .

Assigning diseases to a unique disease MeSH category
We used a heatmap to show the pairs of significantly associated diseases. For the sake of the representation, we manually ordered the diseases represented according to MeSH disease classification tree. Since a disease can belong to more than one MeSH category, we manually    Cartoon representation of TGF-ß receptor type-2 (pale yellow) interacting with the transforming growth factor ß-3 (pale blue). Mutated residue, I50V in the TGF-ß receptor type-2 (i.e. I73 in the full sequence from UniProt), is shown in ball-and-stick representation with each atom coloured according to type: red, blue and white for oxygen, nitrogen and carbon respectively. Inset portraits a close-up on the interface region and mutated residue. Table S1. Number of genes associated to each disease. Number of genes associated to each disease (see filtering steps in Figure 1A and Methods): according to DGN1, according to DGN1 plus its GUILD expansion (exploratory set) and according to DGN2 but not in DGN1 (validation set).

Table S2. Details of the DDAs by means of strict and relaxed criteria.
Tables with the details of all disease-disease associations with k > 1 (see Methods Eqn 1). Each        List of all DDAs by means of strict (Table S3A) and relaxed criteria (Table S3B)   showing the number of direct, semi-direct and indirect associations between diseases. The first two columns show the name and numbering of the cluster and the last three columns the numbers of each association-type. and protein-protein interactions associated with them, along with the number of mutations located in the interface of the interaction (and/or the PFAM domain involved in the interaction). Columns "disease 1" and "disease 2" show the names of the two diseases with putative comorbidity. The first columns "cluster 1" and "cluster 2" show the name of the cluster to which diseases 1 and 2 belong, respectively. Columns "Ac1" and "Pfam1" show the accession numbers and PFAM domains of the proteins associated with disease 1, while "Ac2" and "Pfam2" are for disease 2. The classification of the interaction as indirect, direct or semi-direct is shown in column "type". Column "mut1" indicates the number of mutations located in the interface of the first protein (Ac1), and column "assoc1" shows how many of them are associated with a disease. Columns mut2 and assoc2 are defined as mut1 and assoc1 but for the second protein (Ac2). Clusters formed by less than three diseases are also included. In addition to the names given to the clusters, they are also numbered to include the additional clusters, up to a total of 8 clusters.
Table S5C: The same as S5B but only for pairs for which mutations can be assigned to the corresponding disease (i.e. assoc1 with disease 1 and assoc2 with disease 2). We have highlighted in orange the pairs involving asthma and arthritis rheumatoid when the mutation alters the TNF binding, as many works are recently studying the use of anti-TNF drugs for both diseases 34 . For the rest of pairs, the association is done by searching any of the mesh terms associated with the phenotype described in ClinVar or humsavar. Highlighted in yellow are shown the pairs with at least one exact match, otherwise the match can be with any of the parents or child terms in the MeSH tree. The list of mutations used to generate Table S5B and S5C is also given as a text file ("Dataset_S5").