Endophenotype Network Models: Common Core of Complex Diseases

Historically, human diseases have been differentiated and categorized based on the organ system in which they primarily manifest. Recently, an alternative view has emerged that emphasizes that different diseases often have common underlying mechanisms and shared intermediate pathophenotypes, or endo(pheno)types. Within this framework, a specific disease’s expression is a consequence of the interplay between the relevant endophenotypes and their local, organ-based environment. Important examples of such endophenotypes are inflammation, fibrosis, and thrombosis, which play essential roles in the development of many diseases. In this study, we construct endophenotype network models and explore their relation to different diseases in general and to cardiovascular diseases in particular. We identify the local neighborhoods (modules) within the interconnected map of molecular components, i.e., the subnetworks of the human interactome that represent the inflammasome, thrombosome, and fibrosome. We find that these neighborhoods are highly overlapping and significantly enriched with disease-associated genes. In particular, they are also enriched with differentially expressed genes linked to cardiovascular disease (risk). Finally, using proteomic data, we explore how macrophage activation contributes to our understanding of inflammatory processes and responses. The results of our analysis show that inflammatory responses initiate from within the cross-talk of the three identified endophenotypic modules.

share adjacent reactions in the KEGG and BIGG databases. In total, we used 5,325 such metabolic links between 921 enzymes.
(v) Protein complexes: Protein complexes are single molecular units that integrate multiple gene products. The CORUM database is a collection of mammalian complexes derived from a variety of experimental tools, from co-immunoprecipitation to co-sedimentation and ion exchange chromatography. In total, CORUM yields 2,837 complexes with 2,069 proteins connected by 31,276 links.
(vii) Signaling interactions: The dataset from provides 32,706 interactions between 6,339 proteins. It integrates several sources, derived from both high-throughput screens and literature curation, into a directed network in which cellular signals are transmitted by protein-protein interactions.
Note that in our analysis, we do not take into account the direction of these interactions.
(viii) Liver-specific interactions: We also include liver-specific protein-protein interaction data as many of the mediator proteins in inflammation, thrombosis, and fibrosis are synthesized in the liver.
The union of all interactions obtained from (i)-(vii) yields a network of 13,681 proteins that are interconnected by 144,414 physical interactions. The network has a power-law degree distribution with a few hubs and a substantial number of low-degree nodes, and shows other typical characteristics observed previously in biological networks, such as high clustering and short path lengths.
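As a rough illustration of how such a union network and its summary statistics can be computed, the following minimal sketch merges several edge lists into one undirected graph and reports node degrees, average clustering, and mean shortest path length. The three toy edge lists and all function names are hypothetical and are not the data sources or code used in this study; the sketch uses only the Python standard library.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_union(*edge_lists):
    """Merge several edge lists into one undirected graph (adjacency sets)."""
    adj = defaultdict(set)
    for edges in edge_lists:
        for u, v in edges:
            if u == v:
                continue  # ignore self-loops
            adj[u].add(v)  # edge direction is discarded
            adj[v].add(u)
    return dict(adj)


def num_edges(adj):
    """Number of distinct undirected edges."""
    return sum(len(nb) for nb in adj.values()) // 2


def local_clustering(adj, node):
    """Fraction of neighbor pairs of `node` that are themselves linked."""
    nb = adj[node]
    k = len(nb)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nb, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


def mean_shortest_path(adj):
    """Mean BFS distance over all connected ordered node pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical toy sources; the directed pair ("B", "A") duplicates A-B.
binary = [("A", "B"), ("B", "C")]
complexes = [("A", "C"), ("C", "D")]
signaling = [("B", "A"), ("D", "E")]

network = build_union(binary, complexes, signaling)
degrees = {n: len(nb) for n, nb in network.items()}
print(len(network), num_edges(network), degrees)
print(average_clustering(network), mean_shortest_path(network))
```

Storing neighbors in sets makes duplicate and reversed pairs collapse automatically, which matches the choice above to ignore interaction direction.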

The effect of biased studies of the human interactome on the topological properties of disease genes
Current maps of the human interactome are prone to investigative biases (43,44). Since disease genes are typically the particular focus of experimental research, it is often observed that they have more established interaction partners and, therefore, higher degree in the network.
Here, we aim to explore how biased studies of the human interactome may affect the observation of significant clustering of seed genes in the network. To quantify the extent to which the observed topological properties of disease proteins are due to these biased studies, we repeated our analysis on an unbiased, Y2H high-throughput subset of the human interactome.
The raw observation suggests that proteins show a much smaller clustering effect on the unbiased protein-protein interaction network. However, note that the interactions among a substantial number of seed genes have not yet been examined in current high-throughput maps (Y2H network), and, therefore, any conclusion drawn from this observation requires caution.
As mentioned above, due to the limited search space (number of proteins examined) and an interaction detection sensitivity of ~10%, the unbiased maps are much sparser than current LCI maps. Thus, observing a smaller clustering effect is, indeed, expected and can be explained by the incompleteness of the current maps of the unbiased interactome. Moreover, LCI is not limited to protein-protein interactions and includes interactions from several sources such as metabolic, regulatory, etc.
To show that the observed topological properties of disease proteins on unbiased maps are, indeed, due to the incompleteness of the network (and not solely due to the biased nature of studies), we proceed as follows (see Fig. S7a for the flowchart):
1. For a fair comparison, we first limit the nodes of our HI to those existing in Y2H. That is, we extract the subnetwork of the full network that contains the Y2H network nodes. This subnetwork contains a substantially larger number of edges than the Y2H network. The latter can be viewed as an incomplete but unbiased subset of this subnetwork.
2. Next, we check whether the differences are significant or expected by chance. We randomly remove ("prune") links from this subnetwork until we reach the same number of links as in the Y2H network. In parallel, we try to keep the degree of the nodes preserved as in the Y2H network.
Our analysis shows that the observed low clustering of seed genes in unbiased maps lies within the range expected from random pruning (Fig. S7b). Therefore, low clustering of disease proteins can, to a great extent, be explained by the incompleteness of the network.
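The pruning comparison can be sketched as follows, under several assumptions that are not taken from the original analysis: degree preservation is approximated by preferring to delete edges whose endpoints still exceed their degree in the sparse network, and seed clustering is summarized by the size of the largest connected component formed by seed genes. The toy edge lists, the seed set, and the function names `prune_to_target` and `seed_lcc_size` are hypothetical.

```python
import random


def prune_to_target(edges, target_degree, n_target_edges, rng):
    """Randomly delete edges until n_target_edges remain, preferring
    deletions that move node degrees toward target_degree.
    Assumes the edge list contains no self-loops."""
    remaining = {tuple(sorted(e)) for e in edges}
    degree = {}
    for u, v in remaining:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    def above_target(node):
        return degree[node] > target_degree.get(node, 0)

    while len(remaining) > n_target_edges:
        pool = sorted(remaining)  # fixed order keeps runs reproducible
        # Prefer edges with both endpoints above target, then one, then any.
        for need in (2, 1, 0):
            candidates = [e for e in pool
                          if above_target(e[0]) + above_target(e[1]) >= need]
            if candidates:
                break
        u, v = rng.choice(candidates)
        remaining.remove((u, v))
        degree[u] -= 1
        degree[v] -= 1
    return sorted(remaining)


def seed_lcc_size(edges, seeds):
    """Size of the largest connected component formed by seeds alone."""
    seeds = set(seeds)
    adj = {s: set() for s in seeds}
    for u, v in edges:
        if u in seeds and v in seeds:
            adj[u].add(v)
            adj[v].add(u)
    best, seen = 0, set()
    for start in sorted(seeds):
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        best = max(best, size)
    return best


# Hypothetical toy data: a dense subnetwork and a sparse "unbiased" map
# defined on the same nodes.
dense_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D"),
               ("C", "E"), ("D", "E"), ("D", "F"), ("E", "F")]
sparse_edges = [("A", "B"), ("C", "D"), ("D", "E"), ("E", "F")]
seeds = {"A", "B", "C"}

target_degree = {}
for u, v in sparse_edges:
    target_degree[u] = target_degree.get(u, 0) + 1
    target_degree[v] = target_degree.get(v, 0) + 1

rng = random.Random(0)
pruned_networks = [prune_to_target(dense_edges, target_degree,
                                   len(sparse_edges), rng)
                   for _ in range(200)]
null_lcc = [seed_lcc_size(p, seeds) for p in pruned_networks]
observed_lcc = seed_lcc_size(sparse_edges, seeds)
# Fraction of pruned networks in which seeds cluster as weakly as observed.
fraction_as_low = sum(x <= observed_lcc for x in null_lcc) / len(null_lcc)
print(observed_lcc, fraction_as_low)
```

If the observed value falls well inside the distribution obtained from pruning, sparsity alone is enough to explain it. The candidate list is rebuilt at every deletion, which is acceptable for a toy network but would need a more efficient data structure at interactome scale.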

DIAMOnD Robustness and topological validation of module size
To check the robustness of the module-finding methodology towards false positives and gene misannotations, we performed the so-called N-1 analysis, where N is the original number of seed genes. In this analysis, we remove one seed at a time and expand the neighborhood of the remaining N-1 seeds iteratively. At each iteration, we measure the overlap of the detected genes between the original (N seeds) and the trial (N-1 seeds) sets. This procedure allows us to define an upper limit for the size of the final module, above which the outcome of the methodology is sensitive to false positive annotations of seeds. Figure S2 shows the average overlap of the detected DIAMOnD genes as a function of the DIAMOnD iteration step. The overlap has been measured between the genes resulting from two different seed sets: the original seeds and N different configurations of trial seeds (each containing N-1 seeds). As shown in the figure, the methodology is robust towards small variations of the seed genes.
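The N-1 procedure can be sketched as below. The `expand` function is a deliberately simple stand-in that, at each step, adds the candidate with the most links to the current module; it is not the DIAMOnD ranking itself and is used here only so that the leave-one-out logic is runnable. The toy graph, the seed set, and all names are hypothetical.

```python
def expand(adj, seeds, n_steps):
    """Stand-in neighborhood expansion (NOT DIAMOnD): at each step add the
    non-module node with the most links to the module, ties broken by name."""
    module = set(seeds)
    added = []
    for _ in range(n_steps):
        candidates = {nb for n in module for nb in adj.get(n, ())} - module
        if not candidates:
            break
        best = min(candidates, key=lambda c: (-len(adj[c] & module), c))
        module.add(best)
        added.append(best)
    return added


def leave_one_out_overlap(adj, seeds, n_steps):
    """Average overlap, per iteration, between genes detected from all N
    seeds and genes detected from each of the N sets of N-1 seeds."""
    seeds = sorted(seeds)
    reference = expand(adj, seeds, n_steps)
    curves = []
    for left_out in seeds:
        trial_seeds = [s for s in seeds if s != left_out]
        trial = expand(adj, trial_seeds, n_steps)
        curve = []
        for i in range(1, n_steps + 1):
            shared = set(reference[:i]) & set(trial[:i])
            curve.append(len(shared) / i)  # overlap relative to step index
        curves.append(curve)
    return [sum(c[i] for c in curves) / len(curves) for i in range(n_steps)]


# Hypothetical toy network.
edges = [("S1", "S2"), ("S1", "S3"), ("S2", "S3"), ("S1", "X"), ("S2", "X"),
         ("S3", "X"), ("X", "Y"), ("S3", "Y"), ("Y", "Z")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

seed_genes = ["S1", "S2", "S3"]
detected = expand(adj, seed_genes, 2)
overlap_curve = leave_one_out_overlap(adj, seed_genes, 2)
print(detected, overlap_curve)
```

In this sketch a removed seed may be re-detected by the trial run, which lowers the overlap at early iterations. Whether such genes should count towards the overlap is a design choice that the text above does not specify.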

Fully embedded pathways
For pathway analysis, we used version 3.1 of the Molecular Signature Database (MSigDB) developed by the Broad Institute, which is an integration of several different databases. Here we use pathways from KEGG, Reactome, and Biocarta. Each pathway is associated with a list of genes, whose enrichment we calculate using Fisher's exact test. However, there are a few pathways whose associated genes are fully (100%) embedded within our detected modules and are suspected to have a direct role in inflammatory processes (Table S1 and Fig. 2d); note that, given the current coverage of the map of the human interactome, proteins belonging to the pathway "Reactome_activation_of_AP1_family_of_transcription_factor" do not directly interact with each other.

Fig. S1. Genetic association of seed genes as compared to other genes with respect to three cardiovascular biomarkers, CRP, fibrinogen, and sICAM, as well as the specific vascular disease phenotype, VTE, in the inflammatory (pink background, first column), thrombotic (light blue background, second column), and fibrotic (light orange background, third column) subnetworks. Seed genes contain more low p-value GWAS genes than other genes in the network [red circles, seed genes; green circles, endophenotype module (subnetwork); black circles, rest of network].
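The pathway enrichment test and the full-embedding check described in the Fully embedded pathways section can be sketched as follows. The sketch assumes a one-sided (over-representation) Fisher's exact test, computed as the upper tail of the hypergeometric distribution; the sidedness, the background gene set, the gene identifiers, and the function names are assumptions made for illustration and are not taken from the original analysis.

```python
from math import comb


def fisher_enrichment_p(module, pathway, background):
    """One-sided Fisher's exact test p-value for over-representation of
    pathway genes in the module, restricted to the background genes."""
    background = set(background)
    module = set(module) & background
    pathway = set(pathway) & background
    n_bg, n_path, n_mod = len(background), len(pathway), len(module)
    k = len(module & pathway)
    # P(X >= k) for X ~ Hypergeometric(n_bg, n_path, n_mod)
    tail = sum(comb(n_path, i) * comb(n_bg - n_path, n_mod - i)
               for i in range(k, min(n_path, n_mod) + 1))
    return tail / comb(n_bg, n_mod)


def embedded_fraction(module, pathway):
    """Fraction of pathway genes that lie inside the module."""
    pathway = set(pathway)
    return len(pathway & set(module)) / len(pathway)


# Hypothetical toy gene sets.
background_genes = {f"g{i}" for i in range(20)}
module_genes = {f"g{i}" for i in range(6)}
pathway_a = {"g0", "g1", "g2"}            # lies entirely inside the module
pathway_b = {"g4", "g10", "g11", "g12"}   # shares one gene with the module

p_a = fisher_enrichment_p(module_genes, pathway_a, background_genes)
p_b = fisher_enrichment_p(module_genes, pathway_b, background_genes)
frac_a = embedded_fraction(module_genes, pathway_a)
frac_b = embedded_fraction(module_genes, pathway_b)
fully_embedded = [name for name, frac in (("pathway_a", frac_a),
                                          ("pathway_b", frac_b))
                  if frac == 1.0]
print(p_a, p_b, fully_embedded)
```

The counts are kept as exact integers until the final division, which avoids rounding error in the tail sum. A pathway with an embedded fraction of exactly 1.0 corresponds to the "fully embedded" case discussed in that section.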

Additional files not embedded into this document:
Data files S1 and S2