Introduction

Cellular functions are mostly conducted in a highly modular manner1 in the context of a molecular interaction network2 whose underlying universal laws may potentially be elucidated by advanced approaches derived from network biology3. Investigation of the modular organization of interactome networks, such as protein-protein interactions (PPIs), may facilitate further explorations of the underlying molecular network mechanisms that drive human diseases4,5. This network medicine framework provides a global system-level view for discovering the potential causes of human diseases and obtaining a better understanding of the correlation between each disease and its molecular functional communities6,7. These interaction networks may be used to predict gene function8, new disease-associated genes9 and the overlapping relationships among disease phenotypes10,11. The tacit assumption of network medicine12 is that perturbations of a specific protein functional community in the PPI network will result in a disease phenotype13. Therefore, the disease module6,12, a particular neighborhood with tightly linked proteins associated with a specific phenotype, may be identified from the PPI network through topological network analysis. Kwang-Il Goh et al.6,11,14 have discovered that the corresponding protein products of the disease genes are more likely to participate in the same functional module and that proteins associated with the same disorder are more likely to share similar biological functions; these findings have been revalidated in several other related works4.

To date, most disease module detection algorithms have been built on the basis of the findings of topological modules as functional modules with respect to a specific disease. Ruan et al.15 have applied the well-known network partition approach (referred to as the GN algorithm) to a colon cancer microarray dataset and have obtained the functional modules that cause colon cancer. Spirin and Mirny16 have applied three methods for group identification in the PPI network and have subsequently shown that these topological clusters correspond to protein complexes and functional modules. A clique percolation approach has been used by Zhang et al.17 to identify protein communities, and most of their topological modules correspond to functional modules. A graph entropy approach for the identification of functional modules from the PPI network has been proposed by Kenley et al.18. These previously described methods have generated functional modules from topological modules; therefore, it is assumed that topological, functional and disease modules overlap. Thus, the functional modules correspond to topological modules12. As a result of the increased availability of PPI data and molecular functional information, it would be interesting to revisit this issue and investigate the extent to which the functional homogeneity of genes corresponds to their topological interactions.

The main contribution of this study is an investigation of the functional homogeneity of topological protein modules and of how widely it varies. First, we selected seven well-investigated community detection algorithms for detecting topological modules in the PPI network. We determined that most modules had fewer than 10 proteins and that the modules significantly overlapped. Second, we conducted a homogeneity analysis for each module with Gene Ontology (GO) and pathways and determined that homogeneity also exhibited a diverse distribution. Finally, we analyzed two causes of the functional diversity of the modules: disease-related genes and GO term levels.

Results

Topological modules of the human PPI network

We investigated the underlying modular structure in the human protein-protein interaction network derived from STRING9 by adopting seven well-studied community detection algorithms (BGLL, Incremental BGLL (IBGLL), Newman Spectral (NS), Label Propagation (RAK), Walktrap (WT), Link Community (LC) and ClusterONE (CO); see the Materials and Methods section). Different methods yielded different protein communities with different sizes and protein memberships, thus potentially influencing our evaluation results. To validate the consistency of the community detection results produced by the different algorithms, we calculated the overlap of the communities generated by these seven methods.

As a result, we initially recognized that the proportion of small modules was larger than that of big modules for each method (as indicated in Fig. 1A): small modules (with size < 10) made up a large share of the modules (41.1%, BGLL; 77.9%, IBGLL; 93%, NS; 73.6%, RAK; 83.4%, WT; 91.1%, LC; 36.5%, CO) in all methods. Moreover, the module size distribution of the overlapping module detection methods (LC and CO) approximately followed a power-law distribution, whereas the module size distributions of the five non-overlapping community detection algorithms had longer tails than those of LC and CO. However, the total number of modules produced by each method varied from 57 to 11,387. For example, the NS algorithm generated only 57 modules with a size greater than 2, and its largest module contained 12,527 of the 14,380 proteins in the String9 network. LC identified 11,387 communities with a size greater than 2, and the protein clusters overlapped. Table 1 presents the number of communities and the largest module size for all methods.

Figure 1

Distribution of module size and the overlap among modules. Figs A1–7 illustrate the distribution of the size of modules detected by seven different community partition methods (BGLL, IBGLL, NS, RAK, WT, LC and CO) in the PPI (String9) network. The x-axis represents the module size, and the y-axis represents the percentage of modules. Figs B1–7 indicate the consistency of the community results detected by all seven community detection methods. The x-axis represents the Jaccard similarity metric between two modules, and the y-axis represents the percentage of matched modules.

Table 1 The number of modules and the largest module size with respect to seven distinct approaches.

These modules have also been considered to be functional modules in past decades19. Second, an underlying modular structure naturally existed in the PPI network, as indicated by the fact that the modules detected by different algorithms shared most of their protein members. The consistency of the module families among all algorithms was measured through the Jaccard similarity metric, which quantifies the overlap between paired sets of modules. A high Jaccard value indicates that the modules of a specific algorithm are largely recovered in the module family produced by a different algorithm. The results regarding the relationship between the Jaccard similarity intervals and the percentage of protein modules for the different methods are presented in Fig. 1B; these results indicated that community structure/modularity was a fundamental property of the PPI network, as has been described by Zhang20 and Rives21. The modules generated by LC and CO (Fig. 1B6–B7) were easily contained by other modules that were detected by non-overlapping algorithms. Moreover, the proportion of modules with Jaccard similarity metrics less than 0.1 was quite small for IBGLL (Fig. 1B2), RAK (Fig. 1B4) and WT (Fig. 1B5); however, BGLL (Fig. 1B1) and NS (Fig. 1B3) resulted in a relatively higher proportion than IBGLL at this Jaccard interval, whereas the modules produced by NS and BGLL matched each other well. The reason for this finding is that the modules generated by IBGLL were based on BGLL, and modules with size smaller than 3 were discarded; thus, the absence of proteins contributed to the lower Jaccard metric. According to the above analysis, regardless of whether an overlapping or non-overlapping module detection algorithm was used, the most prominent consequence of these two findings was the presence of various densely linked modules that held the overall PPI network together.
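The overlap comparison described above can be illustrated with a minimal sketch. It assumes that each module is matched to its best-scoring counterpart in the other method's module family; the exact matching rule is not spelled out in this section, so `best_matches` and the toy protein identifiers are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two protein sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_matches(modules_a, modules_b):
    """For each module in modules_a, return its highest Jaccard score
    against any module in modules_b (ASSUMED matching rule)."""
    return [max((jaccard(m, n) for n in modules_b), default=0.0)
            for m in modules_a]


# Toy module families with hypothetical protein identifiers.
method_1 = [{"P1", "P2", "P3"}, {"P4", "P5"}]
method_2 = [{"P1", "P2"}, {"P4", "P5"}, {"P6", "P7"}]

scores = best_matches(method_1, method_2)
```

Binning `scores` into Jaccard intervals and counting the share of modules per interval would give a histogram of the kind shown in Fig. 1B.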

Evaluating the homogeneity of topological protein modules

The reliability of GO and pathway homogeneity

Proteins showing dense interaction with one another in one module should have the same or similar functions and share commonalities in their biological functional characteristics22. We investigated the functional homogeneity of the topological modules in the PPI network by calculating the GO homogeneity and pathway homogeneity for each module by using Equations (3) and (4) (see the Materials and Methods section). A larger value indicates relatively higher homogeneity. Furthermore, to investigate how well the discovered community structures reflected biological functions, the homogeneity results were compared with random expectations (refer to the Materials and Methods section). We determined that the topological modules exhibited markedly higher homogeneity than expected by chance. Fig. 2A and B depict the comparison for biological process (BP) and pathways, respectively, and the comparison results for cellular component (CC) and molecular function (MF) are shown in Supplementary Fig. 1. For example, consider the IBGLL method, for which a value of 0.6 or greater can be considered a relatively high homogeneity value. We determined that 21.3% of the modules had a BP homogeneity larger than 0.6, a significantly greater proportion than in the random control (p = 5.17E-30, chi-square test) (Fig. 2A2). This finding indicated that the proteins in densely connected sub-graphs exhibited a high tendency to share common biological functions16. However, we also found that the number of protein modules with lower homogeneity values was greater than the number of modules with higher homogeneity in terms of the GO or pathway associations. For example, only 67 of the 314 modules produced by IBGLL had homogeneity values greater than 0.6.
In summary, the topological modules may have a greater proportion of homogeneous modules than the random controls; however, a substantial proportion (78.7%, IBGLL) of heterogeneous modules also existed. Thus, the distribution of module homogeneity is varied, and the biological functions of the topological modules are diverse.
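A comparison against a random control can be sketched as follows. Equations (3) and (4) are not reproduced in this section, so the `homogeneity` function below is only an assumed stand-in (the largest fraction of module members that share a single annotation term), and the sampling scheme, function names and toy annotations are likewise hypothetical.

```python
import random
from collections import Counter


def homogeneity(module, annotations):
    """ASSUMED stand-in for Equations (3)/(4): the largest fraction of
    module members annotated with one common term."""
    module = list(module)
    if not module:
        return 0.0
    counts = Counter(t for p in module for t in set(annotations.get(p, ())))
    return max(counts.values(), default=0) / len(module)


def random_control(size, all_proteins, annotations, rng, trials=100):
    """Mean homogeneity of random protein sets of the given size."""
    pool = sorted(all_proteins)
    values = [homogeneity(rng.sample(pool, size), annotations)
              for _ in range(trials)]
    return sum(values) / trials


# Toy annotations with hypothetical protein and term identifiers.
annotations = {
    "P1": {"term_a"}, "P2": {"term_a", "term_b"}, "P3": {"term_a"},
    "P4": {"term_c"}, "P5": {"term_d"}, "P6": {"term_e"},
}
module = {"P1", "P2", "P3"}

observed = homogeneity(module, annotations)
expected = random_control(len(module), annotations, annotations,
                          random.Random(0))
```

Counting how many real and random modules exceed a threshold such as 0.6 would then supply the two proportions compared by the chi-square test.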

Figure 2

Homogeneity of BP and pathway associations compared with random control. Figs A1–7 illustrate the BP homogeneity comparisons between real and random control for all seven methods. Figs B1–7 show the pathway homogeneity comparisons between real and random control for all seven methods.

The relationship between the size and density of the modules and homogeneity

Homogeneity varied across the topological modules, and small modules (size < 10) represented the largest proportion of all modules; thus, the Pearson correlation coefficient (PCC) and its corresponding p-value (Table 2) were calculated to separately evaluate the underlying correlation between the size and density of the modules and homogeneity with respect to BP, CC and MF. As a result, we found that module size was negatively correlated with homogeneity, thus indicating that the topological modules may obtain relatively higher homogeneity if they possess fewer protein members, and vice versa. Given the substantial number of modules generated by each algorithm, the mean and variance of the homogeneity of modules of the same size were calculated. Figure 3A presents the distribution of homogeneity related to BP terms, and Supplementary Fig. 2 presents the distribution of homogeneity related to MF and CC terms. A diverse distribution of homogeneity existed across module sizes. The BGLL, NS, RAK and WT methods detected big modules (with size > 1000) and also had relatively lower PCCs between module size and homogeneity. To quantify how these super modules affected the correlation between module size and homogeneity, we recalculated the PCC and its corresponding p-value after removing the super modules (Supplementary Table 1). We found that the correlation between module size and GO homogeneity changed little except for NS, in which the biggest module had 12,527 proteins and most modules had fewer than 10 proteins. Thus, methods that detect large modules tend to yield relatively lower PCCs between module size and homogeneity. Furthermore, the same results were obtained for pathways with the LC and CO methods only; the results obtained from the five non-overlapping methods indicated that module size and pathway homogeneity had limited relevance.
The reason for the lack of correlation between module size and pathway homogeneity may be that super modules existed in the module sets produced by these non-overlapping algorithms. After recalculating the PCC and p-value with the super modules (size > 1000) removed, we found that module size and pathway homogeneity were positively correlated (Supplementary Table 1). This indicates that relatively larger modules tend to include more proteins from one pathway and therefore have relatively higher homogeneity. Furthermore, the number of pathways (1513) in the Pathway Interaction Database (PID) was relatively small, thus possibly providing another explanation.
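The recalculation with super modules removed can be sketched in a few lines. This is an illustration only: the function names and the toy size/homogeneity values are hypothetical, the size cut-off of 1000 is taken from the text, and the p-value is not computed here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def pcc_without_super_modules(sizes, homogeneities, max_size=1000):
    """PCC between size and homogeneity after dropping modules whose
    size exceeds max_size (the 'super modules')."""
    kept = [(s, h) for s, h in zip(sizes, homogeneities) if s <= max_size]
    xs, ys = zip(*kept)
    return pearson(xs, ys)


# Hypothetical module sizes and homogeneity values; the last entry
# plays the role of a super module.
sizes = [3, 5, 8, 20, 5000]
homogeneities = [0.9, 0.8, 0.6, 0.3, 0.85]

pcc_all = pearson(sizes, homogeneities)
pcc_filtered = pcc_without_super_modules(sizes, homogeneities)
```

In this toy data a single very large module is enough to mask the negative size/homogeneity trend among the remaining modules, which is the kind of effect the recalculation is meant to expose.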

Table 2 Correlation between module size and homogeneity. PCC is the Pearson Correlation Coefficient between the size and homogeneity and p-value is the significance level.
Figure 3

Homogeneity of BP and pathway associations at different module sizes. Figs. A1–7 illustrate the correlation between homogeneity and module size according to GO for all seven methods. Figs B1–7 denote the correlation between homogeneity and module size according to pathway for all seven methods.

Because proteins exert their functions through interactions with one another23,24,25, the PCC (Table 3) and its corresponding p-value were calculated to measure the relationship between edge density and homogeneity. We determined that edge density and GO homogeneity were positively correlated. However, we identified the inverse result for pathway homogeneity for nearly all methods (Table 3). This finding indicates that high-density modules may tend to participate in diverse pathways. Moreover, community detection methods may fail to detect the disease modules with high pathway homogeneity because a high edge density is one of the main criteria they pursue. This failure may be caused by the relatively longer average distance between protein pairs in the pathway, which is not considered in topological modules. The shortest path lengths in topological modules and pathways confirmed this observation: proteins in pathways tended to have substantially higher average shortest path lengths than those in topological modules (3.82 vs 2.50, respectively, p-value = 7.43E-118, t-test) according to the IBGLL method. As mentioned before, big modules (with size > 1000) were detected by BGLL, NS, RAK and WT; we therefore recalculated the PCC between edge density and homogeneity after removing these super modules. We found that the PCC values decreased slightly for all four methods (Supplementary Table 2). Thus, methods that detect large modules tend to yield relatively lower PCCs between edge density and homogeneity. Overall, we concluded that community detection methods based on topological features may be better suited for identifying functional modules with neighborhood structures (e.g., protein complexes), whereas these methods may not be suitable for the detection of functional modules as pathways.
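The two topological quantities compared here, edge density and average shortest path length, can be sketched as below. The sketch assumes an undirected, unweighted network and measures distances in the full network rather than in the induced subgraph; the text does not state which convention was used, and the toy edge list is hypothetical.

```python
from collections import deque


def edge_density(nodes, edges):
    """Fraction of possible node pairs inside `nodes` that are linked."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    internal = {frozenset((u, v)) for u, v in edges
                if u in nodes and v in nodes and u != v}
    return 2 * len(internal) / (n * (n - 1))


def average_shortest_path(nodes, edges):
    """Mean BFS distance over connected pairs of `nodes`, with paths
    allowed to run through the whole network (ASSUMPTION)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    nodes = set(nodes)
    total, pairs = 0, 0
    for src in nodes:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt in adj.get(cur, ()):
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        for dst in nodes:
            if dst != src and dst in dist:
                total += dist[dst]
                pairs += 1
    return total / pairs if pairs else float("inf")


# Toy network: a dense triangle (module-like) plus a chain (pathway-like).
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("D", "E")]
module_like = {"A", "B", "C"}
pathway_like = {"A", "D", "E"}
```

On this toy network the triangle is fully connected with unit distances, whereas the chain-like set has lower density and longer paths, mirroring the contrast between topological modules and pathways.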

Table 3 Correlation between edge density and homogeneity. PCC is the Pearson Correlation Coefficient between edge density and homogeneity and p-value is the significance level.

Module distance and phenotypic similarity

Phenotypic similarity is another metric used to measure the homogeneity of modules, as discussed by Ghiassian26. According to the disease module hypothesis10, the distance between disease modules should be negatively correlated with phenotypic similarity. In recent years, a substantial number of studies have indicated that proteins contributing to diseases with similar phenotypes tend to interact with one another more frequently27,28,29. Therefore, two modules with correspondingly similar phenotypes are assumed to have a relatively shorter topological distance in PPIs. Similarly, when two topological modules are cohesive in their common functional similarity principles, the previously described assumption should be true. Thus, the topological distance between a pair of modules and the phenotypic similarity between them were independently calculated to test this assumption (refer to the Materials and Methods section). However, interestingly, there were mostly positive correlations (e.g., PCC = 0.44, BGLL) between the distances and phenotypic similarities of topological modules (Table 4) with non-overlapping methods, whereas this correlation became weak with overlapping algorithms. Because the BGLL, NS, RAK and WT methods, which detect large modules (with size > 1000), had relatively higher PCCs, we recalculated the PCCs after removing the super modules. We found that the PCC values changed little for all four methods (Supplementary Table 3). Thus, methods that detect large modules tend to yield relatively higher PCCs between distance and phenotypic similarity. This finding indicated that the molecular interactions between modules have counterintuitive correlations with their shared phenotypes, thus suggesting that there will be gaps in determining the functional modules directly from topological modules.
Furthermore, this disagreement may in turn be a result of the following: (1) the incompleteness of the currently available PPI, the noise interplay between proteins30 and the biased protein-protein interactions present in the PPI network31 and (2) the potential for the proteins in one module to participate in more than one biological process, thus resulting in widely different phenotypes within one module. The results clearly indicated that the functional diversity distribution of topological modules existed for phenotypes, and further studies are necessary to investigate the complicated relationships between topological modules and functional modules.

Table 4 Correlation between module distance and phenotypic similarity. PCC is the Pearson Correlation Coefficient between module distance and phenotypic similarity and p-value is the significance level.

Disease-related modules have higher homogeneity

The detected protein communities provide insights into the methods for identifying the potential biological mechanisms of protein interactions32. Our work also revealed the diverse distribution of biological homogeneity within these modules. Furthermore, we determined that the denser edges of a module may contribute to greater homogeneity, whereas many studies have recognized that disease-associated proteins tend to exhibit more dense interactions with one another than with the other proteins in the PPI33. Thus, in this study, the proportion of disease-causing proteins located in one specific module was used to validate the potential associations between diseases and module homogeneity. For each module, we searched for the disease that accounted for the maximum fraction of proteins in the module and then calculated the correlation between this fraction and homogeneity. We discovered that functional homogeneity had a mildly positive correlation with the maximum portion of disease-related genes (PCC = 0.20, p-value = 4.58E-04, BP, IBGLL; Table 5), thus indicating that when more proteins contributed to a common disorder within a topological module, they were typically accompanied by greater functional homogeneity. However, this positive correlation was not significant (p-value >= 0.05) for the BGLL, RAK and NS methods in terms of BP, CC and MF. According to the module sizes reported above (Table 1), the non-significant correlation may be caused by super modules. The sizes of the largest modules were 3567 (BGLL), 8845 (RAK) and 12,527 (NS), whereas there were 14,380 proteins in the PPI network. The IBGLL method repartitioned the super modules (size >= 400) into multiple, relatively small modules, and significance emerged for all three branches in the GO analysis. Furthermore, we recalculated the PCCs between the percentage of disease-related proteins and homogeneity after removing the super modules generated by BGLL, RAK, NS and WT (Supplementary Table 4).
We found that the PCC values decreased, which means that methods that detect large modules tend to yield relatively lower PCCs between the percentage of disease-related genes and homogeneity. In conclusion, the modules that contain the most proteins related to a specific disease may exhibit greater homogeneity to some extent. This result was consistent with the disease module hypothesis and a recent investigation of disease module detection26, which has specified that disease modules are scattered across the entire PPI network rather than being located in only one uniform super module.
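The per-module search for the disease covering the largest fraction of proteins can be sketched as follows. The function name, the tie-breaking rule and the toy gene-disease mapping are hypothetical; the denominator is assumed to be the full module size.

```python
from collections import Counter


def max_disease_fraction(module, gene_to_diseases):
    """Return (disease, fraction) for the disease associated with the
    largest share of module members; ties are broken by disease name."""
    module = list(module)
    counts = Counter(d for g in module
                     for d in set(gene_to_diseases.get(g, ())))
    if not counts:
        return None, 0.0
    disease, hits = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
    return disease, hits / len(module)


# Hypothetical gene-disease associations.
gene_to_diseases = {
    "G1": {"asthma"},
    "G2": {"asthma", "eczema"},
    "G3": {"asthma"},
}
module = {"G1", "G2", "G3", "G4"}

top_disease, fraction = max_disease_fraction(module, gene_to_diseases)
```

Collecting `fraction` for every module and correlating it with the module homogeneity values would yield a PCC of the kind reported in Table 5.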

Table 5 Correlation between percentage of disease-related proteins and homogeneity. PCC is the Pearson Correlation Coefficient between percentage of disease-related proteins and homogeneity and p-value is the significance level.

GO term generality contributes to higher homogeneity

Each protein within the PPI network is typically annotated by multiple GO terms. We determined that the distribution of the number of GO annotations for genes had a fat-tail distribution (Fig. 4A1–3), thus indicating that most (44.6%, BP; 55.3%, CC; 71%, MF) proteins were annotated by 1-2 GO terms and that proteins (26.7%, BP; 10.5%, CC; 2.7%, MF) annotated with more than 5 GO terms indeed existed. If each protein in the modules of the PPI network were to have a substantial number of GO annotations, we would expect greater functional homogeneity in these modules. Therefore, we classified the proteins on the basis of their number of GO annotations (e.g., proteins with only 1 GO annotation) and calculated the fraction of proteins of each type in each module. In contrast to common expectations, the fraction of proteins with a low number (particularly one) of GO annotations in a given module had a strong positive correlation with the homogeneity of the module (e.g., BGLL with PCCs: 0.41, 0.31 and 0.30 for BP, CC and MF homogeneity, respectively, Table 6). The only exception is the LC method, which may be a result of its detection of overlapping communities at small scales (90% of modules had fewer than ten proteins). Because super modules were detected by BGLL, RAK, NS and WT, we recalculated the PCCs after removing the super modules and found that the PCCs changed little for all methods (Supplementary Table 5). We also found that the methods that generated relatively fewer modules gave larger PCCs (for example, 0.60 for NS). This finding may be due to general GO annotations, which are implicitly included in the parental categories of an annotated GO term for genes. We further evaluated the degree to which GO generality might contribute to the homogeneity of topological modules by examining the correlation between the tree-level of GO annotations of proteins in modules and their homogeneity.
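The classification of proteins by their number of GO annotations reduces to a simple per-module fraction. The sketch below is illustrative only; the function name and the toy annotation table are hypothetical.

```python
def fraction_with_k_annotations(module, annotations, k=1):
    """Fraction of module members annotated with exactly k distinct terms."""
    module = list(module)
    if not module:
        return 0.0
    hits = sum(1 for p in module
               if len(set(annotations.get(p, ()))) == k)
    return hits / len(module)


# Hypothetical annotation table: protein -> set of GO term identifiers.
annotations = {
    "P1": {"term_a"},
    "P2": {"term_a"},
    "P3": {"term_a", "term_b", "term_c"},
    "P4": {"term_b", "term_d"},
}
module = {"P1", "P2", "P3", "P4"}

single = fraction_with_k_annotations(module, annotations, k=1)
double = fraction_with_k_annotations(module, annotations, k=2)
```

Computing this fraction for each module and each value of k, and correlating it with homogeneity, would give PCCs of the kind listed in Table 6.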

Table 6 Correlation between the number of GO terms at different levels and homogeneity in terms of BP, CC, MF and pathway. PCC is the Pearson Correlation Coefficient between percentage of proteins annotated by r GO terms and homogeneity and p-value is the significance level.

When we considered the tree structure of GO, the percentage of modules at each level exhibited a diverse distribution (Fig. 4B1–3), and a larger fraction of modules (14.5%, IBGLL) obtained greater homogeneity in terms of high-level GO terms (level <= 4). We further confirmed these findings by classifying the modules and GO terms into two categories according to level 4 and determining the significance (chi-square test) between them (Fig. 4C1–3). The findings indicated that the general GO terms consistently contributed to greater homogeneity instead of indicating a specific biological meaning.
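A two-category comparison of this kind can be tested with a chi-square test on a 2x2 table. The sketch below assumes the Pearson statistic without continuity correction (the text does not say which variant was used) and uses hypothetical counts; with one degree of freedom the p-value has the closed form erfc(sqrt(stat / 2)).

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table
    [[a, b], [c, d]]; returns (statistic, p_value) for 1 degree of freedom."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs a non-zero total")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: rows = modules vs background GO terms,
# columns = general (level <= 4) vs specific (level > 4).
stat, p_value = chi_square_2x2(30, 10, 10, 30)
```

A small p-value would indicate that the share of general terms among the modules differs from the background share.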

In addition, we counted the proteins that participated in each pathway; the resulting distribution approximately followed a fat-tail distribution (Fig. 4A4). The same result was obtained as for the GO terms in this study, thus indicating that general pathways contributed to greater homogeneity (Table 6).

Figure 4

GO and pathway properties and GO level distribution. The underlying reasons for the diverse biological meaning of modules were examined from three aspects. Figures A1–4 indicate the distribution of GO terms and proteins. Figures B1–3 indicate the distribution of GO term levels, thus resulting in higher homogeneity in modules (for each method) in terms of BP, CC and MF. Figures C1–3 indicate the significance of general GO terms by module enrichment, and the pink bars indicate the background ratio of GO terms (level >= 4) with a ratio of the number of modules for each method; the p-value denotes the significance of the difference between the two ratios according to a chi-square test.

Discussion

Most biological functions arise from interactions among many molecular components, which typically form functionally related modules to exert their activities3,16,34. The identification of functional modules is a critical process for understanding the potential mechanism of molecular interactions within cells and the underlying mechanisms of complicated disease phenotypes4,35. Fortunately, the availability of various types of large-scale interactome networks36, such as PPI, signal transduction networks and metabolic networks, has paved the way for the prediction of biological functions using network-based approaches8,24.

It has been well established that the relevant genes of similar disease phenotypes have a significantly higher tendency to interact with each other and to have a higher degree of related functions than do random cases5. These related studies have developed several network medicine assumptions and/or principles, such as the disease module phenomenon, the consistency between diseases with shared phenotypes and their underlying molecular interactions12, and the overlap of topological, functional and disease modules. The overlap assumption indicates that functional modules correspond to topological modules, and a disease may be viewed as the breakdown of a functional module. Most previous studies have indicated that a disease module tends to be a functional and topological module. However, the converse does not necessarily hold. Thus, molecular interactions exert biological functions and may be used for functional predictions of proteins; however, topological modules detected solely through community discovery methods have a substantial gap that must be filled before they can be considered functional disease modules. In this manuscript, we attempted to address this issue by systematically investigating the functional homogeneity of topological modules extracted by seven widely used community detection methods from a large-scale human PPI network. We determined that the small modules comprised a substantial fraction of all modules, thus indicating a general shortcoming of community detection methods for topological module discovery. Moreover, we determined that the functional properties of topological modules are diverse and heterogeneous; thus, although most topological modules tend to be functionally homogeneous compared with random controls, there are several unavoidable factors, such as edge density, associated disease phenotypes and general GO terms, that contribute to the questionable tendency of functional homogeneity.
Furthermore, when we used a recently proposed measure of disease molecular relationships, which has been shown to be a robust measure of disease module overlap, we determined that the molecular distance between topological modules positively correlated with the phenotypic similarity between topological modules. This finding indicated that a greater molecular distance between topological modules is associated with greater phenotypic similarity. Although this result is clearly counterintuitive, it might represent another detectable gap distinguishing topological modules from functional modules.

To the best of our knowledge, this study is the first systematic analysis of the differences between topological modules and their corresponding biological functions and the contributing factors related to the questionably high tendency of functional homogeneities. In this manuscript, we used only two overlapping community detection methods (LC and CO); therefore, the biological functions that may correspond to the overlapping structures should be further investigated. The correlation between distance and phenotypic similarity across modules might change when additional overlapping methods, such as CFinder37 and the Potts model38, are applied. Lin et al.39 have found that a topological module usually contains core and ring components and that the major biological function is exerted through core components; thus, it is necessary to consider these core components when detecting functional modules. Furthermore, we also determined that the average shortest path in the modules (i.e., 2.25 in IBGLL) was shorter than that in the pathways (i.e., 3.82 in PID), because topological modules contain only proteins exhibiting dense interaction. Thus, a combination of other valuable biological and topological information may facilitate the effective clustering of non-adjacent proteins40 into one module as a new pathway.

Methods

In this study, we mainly utilized five databases, namely, String9 (Protein-Protein interaction database)41, GO42, PID (Pathway Interaction Database)43, Disease-Connect database44 and SemRep45. The PPI network was constructed with the String9 database, which indicates the interactions between pairs of proteins. GO and PID were independently used to conduct the enrichment and homogeneity analyses for the topological protein modules. The well-established Disease-Connect disease-related gene dataset was simultaneously used in this study to investigate the relationship between protein topological modules and the diseasome.

Data Set

Protein-Protein Interaction Data

The protein-protein interaction dataset was obtained from the STRING database46, and version 9 of the STRING database (String9) was downloaded from the website41. This PPI database contains curated known and predicted protein-protein interactions. Each protein-protein interaction has a score, and a higher score indicates greater confidence in the interaction. In our study, we obtained high-quality interactions within human cells by preprocessing the String9 dataset, retaining only the interactions with scores greater than 700 (ref. 47) and the proteins whose identifiers began with the string “9606”. Thus, 14,380 proteins and 218,163 protein-protein interactions were ultimately selected.
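The preprocessing step described above can be sketched as a simple line filter. The input layout (three whitespace-separated columns: two protein identifiers and an integer score, with an optional header row) is an assumption about the downloaded file, and the function name and sample identifiers are hypothetical.

```python
def filter_string_links(lines, taxon="9606", min_score=700):
    """Keep interactions scoring above min_score whose two identifiers
    both begin with the taxon prefix; returns (proteins, edges)."""
    proteins, edges = set(), set()
    for line in lines:
        parts = line.split()
        if len(parts) < 3 or not parts[2].isdigit():
            continue  # skip header or malformed rows
        a, b, score = parts[0], parts[1], int(parts[2])
        if score <= min_score or a == b:
            continue
        if not (a.startswith(taxon) and b.startswith(taxon)):
            continue
        edges.add(frozenset((a, b)))  # undirected, so A-B equals B-A
        proteins.update((a, b))
    return proteins, edges


# Illustrative rows in the ASSUMED layout (identifiers are made up).
sample = [
    "protein1 protein2 combined_score",
    "9606.ENSP0001 9606.ENSP0002 850",
    "9606.ENSP0002 9606.ENSP0001 850",
    "9606.ENSP0001 9606.ENSP0003 700",
    "10090.ENSMUSP0001 10090.ENSMUSP0002 900",
]
proteins, edges = filter_string_links(sample)
```

Storing each edge as a `frozenset` collapses the two directions in which a pair may be listed, so the edge count reflects distinct interactions.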

Gene Ontology

A set of controlled and structured vocabularies (referred to as ontologies) for describing gene products is provided by Gene Ontology42. Moreover, free text definitions and stable unique identifiers were assigned to each term in the GO database. The structure of the Gene Ontology terms was organized as a tree. Gene Ontology includes three non-overlapping categories, BP, CC and MF; the roots of the three categories were GO:0008150 (BP), GO:0005575 (CC) and GO:0003674 (MF), and the corresponding hierarchical heights were 17, 13 and 16, respectively, as described by the GO Consortium. The properties of a specific protein are denoted by these three domains, such that BP describes the biological goals, CC describes the locations and MF describes the activities. There are 40,848 GO terms in the database, including 26,598 biological process slims, 3653 cellular component slims, and 10,697 molecular function slims. Intuitively, the GO terms at a lower level are relatively farther from the root in the GO hierarchy48 and give rise to more specific functional annotations for proteins, whereas the higher-level terms indicate more abstract functional annotations.
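The notion of a term's level can be made concrete with a breadth-first traversal from the root. The sketch below assumes that the root has level 0 and that a term reachable along several paths takes the shortest distance; neither convention is stated in the text, and the child terms in the toy hierarchy are hypothetical.

```python
from collections import deque


def term_levels(children, root):
    """Breadth-first level of every term reachable from root
    (root = level 0, shortest distance wins; both are ASSUMPTIONS)."""
    levels = {root: 0}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, ()):
            if child not in levels:
                levels[child] = levels[term] + 1
                queue.append(child)
    return levels


# Toy hierarchy under the BP root; child identifiers are made up.
children = {
    "GO:0008150": ["term_a", "term_b"],
    "term_a": ["term_c"],
    "term_b": ["term_c"],
    "term_c": ["term_d"],
}
levels = term_levels(children, "GO:0008150")
```

Terms with small level values correspond to the general, high-level annotations discussed in the Results, and terms with large values to the specific ones.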

Pathway database

The pathway database utilized in this study to verify the homogeneity of the topological protein communities was PID (Pathway Interaction Database)43. PID is composed of three other well-established pathway databases, including NCI-Nature curated data, BioCarta data and Reactome data. There are various molecule types in all three databases; however, only molecules with a corresponding molecule type marked as “protein” or “protein complex” were considered to meet the requirements of our study. Thus, we extracted 1513 pathways from PID, of which 223 pathways were selected from the NCI-Nature curated database, 254 pathways were collected from the BioCarta database, and 838 pathways were obtained from the Reactome49 database.

Disease-Gene association data

DiseaseConnect (http://disease-connect.org/) is a public web server for the analysis and visualization of comprehensive knowledge regarding common molecular mechanism-based disease-disease connectivity44. The DiseaseConnect database includes the disease-gene relationships from GeneRIF, GeneWays and OMIM. We ultimately extracted 4551 disease-gene relationships.

Disease-Phenotype association data

We extracted the disease-phenotype relationships from SemRep50, which identifies semantic predications in free biomedical text. The semantic predications extracted by SemRep form a repository referred to as SemMedDB45, which contained approximately 82.2 million predications. We used the table referred to as Concept to extract the disease names and phenotype names, and the relationships among them were subsequently determined from the table PREDICATION ARGUMENT. Finally, we extracted 6438 items regarding the disease-related phenotype.

Topological module detection methods

Modularity

The community structure, which indicates the phenomenon of densely linked clusters of nodes with sparser edges between them, is a common property of many complex networks51. In the past decade, there have been numerous algorithms to detect communities on the basis of the optimization of a metric referred to as modularity, a prominent formulation introduced by Newman and Girvan52 that is expressed as follows:

$$Q=\frac{1}{L}\times \sum _{i,j\in V}[{M}_{ij}-\frac{{d}_{i}{d}_{j}}{L}]\times {\rm{\Delta }}({C}_{i},{C}_{j})$$
(1)

where M is the adjacency matrix that describes the protein interaction network as a graph, \(L={\sum }_{i,j\in V}{M}_{ij}\) is the sum of the weights of all edges in the graph, V denotes the set of nodes in the network, \({d}_{i}={\sum }_{j\in V}{M}_{ij}\) indicates the degree of node i, \({C}_{i}\) represents the community to which node i belongs, and the function Δ(u, v) is equal to 1 if u = v and is equal to 0 otherwise. The value of Q was used to measure the strength of the modules identified by the community detection algorithms53.
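As a worked illustration of Equation (1), the following minimal sketch evaluates Q for a given partition. The function name `modularity`, the list-of-lists matrix representation and the toy graph are assumptions made for the example only.

```python
def modularity(M, community):
    """Evaluate Equation (1).

    M: symmetric weight matrix as a list of lists.
    community: community[i] is the community label of node i.
    """
    n = len(M)
    d = [sum(row) for row in M]  # weighted degree d_i
    L = sum(d)                   # L = sum over i, j of M_ij
    q = 0.0
    for i in range(n):
        for j in range(n):
            if community[i] == community[j]:  # Delta(C_i, C_j) = 1
                q += M[i][j] - d[i] * d[j] / L
    return q / L


# Toy graph (not from the study): two triangles joined by one edge
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
M = [[0] * 6 for _ in range(6)]
for u, v in edges:
    M[u][v] = M[v][u] = 1
print(modularity(M, [0, 0, 0, 1, 1, 1]))  # 5/14, about 0.357
```

Placing all six nodes in a single community gives Q = 0, so the two-triangle partition is the stronger of the two.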

BGLL

We obtained the topological protein modules by applying the BGLL algorithm, proposed by Vincent D Blondel et al.54, to protein-protein interaction networks and precisely partitioned the protein-protein interaction network into modules with nodes that were densely inter-connected.

The best partition of the network is the one with the highest modularity value; the aim of the BGLL algorithm is therefore to maximize Q in Equation (1). The algorithm iterates over two phases. In the first phase, each node is initially assigned to its own community; whether node i is then moved into a neighbor's community depends on the gain in modularity, which is calculated by Equation (2):

$${\rm{\Delta }}Q=\left[\frac{{l}_{in}+2{d}_{i,in}}{L}-{\left(\frac{{l}_{all}+{d}_{i}}{L}\right)}^{2}\right]-\left[\frac{{l}_{in}}{L}-{\left(\frac{{l}_{all}}{L}\right)}^{2}-{\left(\frac{{d}_{i}}{L}\right)}^{2}\right]$$
(2)

where \({l}_{in}\) is the sum of the weights of the edges inside the community C that node i would join, \({l}_{all}\) is the sum of the weights of the edges incident to the nodes in C, \({d}_{i}\) is the sum of the weights of the edges incident to node i, \({d}_{i,in}\) is the sum of the weights of the edges from i to nodes in C and L is twice the sum of the weights of all edges in the network. If ΔQ > 0, node i is moved into the neighboring community that yields the largest gain; otherwise, it stays in its current community. This first phase stops when no movement of an individual node increases the value of the modularity.
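Equation (2) can be checked numerically with a small sketch. To stay consistent with Equation (1), the example assumes that the internal weight of the target community counts each internal edge in both directions (that is, it equals the sum of \({M}_{ij}\) over all ordered pairs inside the community); this convention and the function name `modularity_gain` are assumptions, not details confirmed by the text.

```python
def modularity_gain(l_in, l_all, d_i, d_i_in, L):
    """Gain from moving an isolated node i into a community, per Equation (2).

    Assumed convention: l_in counts every internal edge twice, matching
    L = sum over i, j of M_ij in Equation (1).
    """
    after = (l_in + 2 * d_i_in) / L - ((l_all + d_i) / L) ** 2
    before = l_in / L - (l_all / L) ** 2 - (d_i / L) ** 2
    return after - before


# Toy case: two triangles joined by one edge (7 edges, so L = 14).
# Node 2 (degree 3) joins the community {0, 1}: one internal edge (l_in = 2),
# total degree l_all = 4, and two edges from node 2 into it (d_i_in = 2).
gain = modularity_gain(l_in=2, l_all=4, d_i=3, d_i_in=2, L=14)
print(gain)  # 8/49, about 0.163
```

Under this convention the gain equals the difference in Q between the partitions after and before the move (70/196 minus 38/196 for the toy case).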

In the second phase, a new network is constructed in which the nodes are the communities obtained in the first phase. The weights of the edges between nodes in the new network are obtained by summing the weights of the edges between the corresponding communities from the first phase. The two phases are repeated iteratively until there is no further gain in Q.

IBGLL (Incremental BGLL)

Considering the number of genes associated with one disease, modules with more than 400 proteins should be repartitioned. Thus, we propose a novel approach based on BGLL to partition the PPI network into small modules that each contain no more than 400 proteins. There are two steps in this algorithm. First, on the basis of the modules detected by BGLL, the sub-networks induced by the communities with more than 400 proteins were extracted from the PPI network. Second, BGLL was iteratively applied to these sub-graphs to obtain smaller communities. The algorithm converged when no module contained more than 400 proteins.
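The repartitioning loop can be sketched as below. This is an illustrative outline only: `detect` is a hypothetical callable standing in for BGLL (any function that maps a sub-network to a list of node sets), and the guard that stops when a module cannot be split further is an added assumption to guarantee termination.

```python
def incremental_partition(adj, communities, detect, max_size=400):
    """Repartition oversized modules until none exceeds max_size.

    adj: dict mapping node -> set of neighbouring nodes.
    communities: list of node sets from a first community detection run.
    detect: callable(sub_adj) -> list of node sets (stand-in for BGLL).
    """
    final, queue = [], list(communities)
    while queue:
        module = queue.pop()
        if len(module) <= max_size:
            final.append(module)
            continue
        # Sub-network induced by the oversized module
        sub_adj = {u: adj[u] & module for u in module}
        parts = detect(sub_adj)
        if len(parts) <= 1:
            final.append(module)  # cannot be split further; keep as is
        else:
            queue.extend(parts)
    return final


def split_in_half(sub_adj):
    """Toy detector used only to exercise the loop; it is not BGLL."""
    nodes = sorted(sub_adj)
    mid = len(nodes) // 2
    return [set(nodes[:mid]), set(nodes[mid:])]


# Toy path graph with 8 nodes, repartitioned with a size limit of 2
adj = {i: {j for j in (i - 1, i + 1) if 0 <= j < 8} for i in range(8)}
modules = incremental_partition(adj, [set(range(8))], split_in_half, max_size=2)
```

Passing the detector as an argument keeps the loop independent of the particular community detection implementation.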

NS (Newman Spectral)

Newman drew inspiration from graph partitioning and proposed a modularity-based community detection algorithm that relies on the spectral properties of the network51. Two steps are involved in the method. First, the network is split into two sub-graphs according to the signs of the elements of the leading eigenvector of the modularity matrix. Second, each module identified in step 1 is further partitioned into two modules according to its generalized modularity matrix. The subdivision is repeated until the modularity matrix has no positive eigenvalue.
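A single bisection step of this kind can be sketched with the standard library alone. The sketch follows Newman's published formulation, in which nodes are divided by the sign of the leading eigenvector of the modularity matrix \({B}_{ij}={M}_{ij}-{d}_{i}{d}_{j}/L\); the use of shifted power iteration, the function name `spectral_bisection` and the toy graphs are assumptions chosen for brevity, and the repeated subdivision of the resulting modules is omitted.

```python
def spectral_bisection(M, iterations=500):
    """Split a network in two by the sign of the leading eigenvector of B.

    M: symmetric weight matrix (list of lists) with at least one edge.
    Returns one set if B has no positive eigenvalue, otherwise two sets.
    """
    n = len(M)
    d = [sum(row) for row in M]
    L = sum(d)
    B = [[M[i][j] - d[i] * d[j] / L for j in range(n)] for i in range(n)]
    # Shift by a Gershgorin bound so that power iteration converges to the
    # eigenvector of the largest (not largest-magnitude) eigenvalue of B.
    shift = max(sum(abs(x) for x in row) for row in B)
    v = [float(i + 1) for i in range(n)]  # arbitrary deterministic start
    for _ in range(iterations):
        w = [sum(B[i][j] * v[j] for j in range(n)) + shift * v[i]
             for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient approximates the leading eigenvalue of B
    Bv = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v[i] * Bv[i] for i in range(n))
    if eigenvalue <= 1e-9:
        return [set(range(n))]  # no positive eigenvalue: leave undivided
    positive = {i for i in range(n) if v[i] > 0}
    return [positive, set(range(n)) - positive]


# Toy graph: two triangles joined by one edge
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
M = [[0] * 6 for _ in range(6)]
for u, v in edges:
    M[u][v] = M[v][u] = 1
halves = spectral_bisection(M)
```

For a real PPI network a numerical linear algebra library would be the natural choice; the pure-Python loop here is only meant to make the steps visible.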

RAK (Label Propagation)

Raghavan et al.55 have proposed a localized community detection algorithm, referred to as RAK, that is mainly used for understanding information diffusion. Each vertex in the network is initially assigned a unique numeric label. The label of each node is then repeatedly replaced with the label carried by the majority of its neighboring nodes. The algorithm converges when no vertex label changes. Finally, the vertices that share the same label comprise a community.
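The label propagation procedure can be sketched as follows. The asynchronous update order, the random tie-breaking, the cap on the number of sweeps and the function name `label_propagation` are assumptions of this sketch rather than details given in the text.

```python
import random
from collections import Counter


def label_propagation(adj, seed=0, max_sweeps=100):
    """Group nodes by propagating the most frequent neighbouring label.

    adj: dict mapping node -> set of neighbouring nodes.
    """
    rng = random.Random(seed)
    labels = {u: u for u in adj}  # unique initial label per vertex
    nodes = list(adj)
    for _ in range(max_sweeps):
        rng.shuffle(nodes)  # asynchronous updates in random order
        changed = False
        for u in nodes:
            if not adj[u]:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[v] for v in adj[u])
            best = max(counts.values())
            top = [lab for lab, c in counts.items() if c == best]
            if labels[u] not in top:
                labels[u] = rng.choice(top)  # ties are broken at random
                changed = True
        if not changed:
            break  # converged: no vertex label changed in this sweep
    communities = {}
    for u, lab in labels.items():
        communities.setdefault(lab, set()).add(u)
    return list(communities.values())


# Toy graph: two disconnected triangles
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
groups = label_propagation(adj)
```

Because ties are broken at random, different seeds may yield different partitions on less clear-cut networks.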

WT (Walktrap)

Based on the idea of random walks, a hierarchical module detection algorithm referred to as WT was designed by Pascal Pons et al.56. A distance metric between two vertices, and between two communities, derived from the transition matrix of the random walk is used to capture the topological similarity between them. Each node is initially considered its own community, and two adjacent communities are subsequently merged into a new community according to Ward's method. The distances between communities are then updated according to the new partition. The method terminates when only one community remains. In this study, the random walk length was t = 4, and the best partition was selected according to the maximal modularity.
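The vertex distance that drives the merging can be illustrated with a short sketch. It assumes the commonly cited form of the Walktrap distance, \({r}_{ij}=\sqrt{{\sum }_{k}{({P}_{ik}^{t}-{P}_{jk}^{t})}^{2}/{d}_{k}}\), on an unweighted network without isolated nodes; the function names are hypothetical and the hierarchical merging itself is not shown.

```python
def walk_distribution(adj, start, t=4):
    """Probability of being at each node after t random-walk steps."""
    prob = {u: 0.0 for u in adj}
    prob[start] = 1.0
    for _ in range(t):
        nxt = {u: 0.0 for u in adj}
        for u, p in prob.items():
            if p:
                share = p / len(adj[u])  # uniform step to a neighbour
                for v in adj[u]:
                    nxt[v] += share
        prob = nxt
    return prob


def walk_distance(adj, i, j, t=4):
    """Degree-normalised distance between the t-step walks from i and j."""
    pi = walk_distribution(adj, i, t)
    pj = walk_distribution(adj, j, t)
    return sum((pi[k] - pj[k]) ** 2 / len(adj[k]) for k in adj) ** 0.5


# Toy graph: two triangles joined by the edge 2-3
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
same_side = walk_distance(adj, 0, 1)
opposite = walk_distance(adj, 0, 5)
```

Nodes in the same triangle see nearly the same walk distribution, so their distance is smaller than that of nodes on opposite sides of the bridge.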

LC (Link Community)

The previously described methods consider only node grouping, and the detected communities are non-overlapping. However, a protein may have multiple biological functions; thus, the identification of overlapping communities deserves attention. In contrast to the methods that consider nodes alone, a hierarchical overlapping clustering algorithm referred to as link community57 groups edges instead. In this method, the similarity between links is initially calculated, and a hierarchical clustering algorithm is subsequently used to build a dendrogram in which each leaf represents an edge from the PPI network. Finally, the tree is cut according to a partition density D (in contrast to the modularity, which suffers from a resolution limit) to obtain the level with the most relevant communities.

CO (ClusterONE)

Nepusz et al.58 have proposed an overlapping protein complex detection algorithm that discovers protein complexes more accurately than MCL, MCODE and CFinder. There are three main steps in CO. First, the protein with the highest degree is selected as a seed, and then, a cohesiveness measure is used to determine whether appending or removing proteins can identify densely connected communities of proteins. Second, if the degree of overlap between two communities is higher than a given threshold, then they are merged into a new community. In the third step, modules with fewer than three proteins or modules with a density below a given threshold are abandoned. After these three steps, the overlapping protein complexes are finally detected.

Functional homogeneity analysis

Homogeneity analysis

For each topological protein cluster, we calculated the homogeneity5 according to GO and pathway associations. For each module, the maximum fraction of proteins that share the same Gene Ontology annotation (or pathway) is referred to as the GO homogeneity (or pathway homogeneity). According to this definition, the GO homogeneity is calculated by Equation (3):

$${H}_{GO}=\mathop{\max }\limits_{i}\,\left[\frac{{N}_{{G}_{i}}}{{N}_{G}}\right]$$
(3)

where \({N}_{G}\) denotes the number of proteins within one protein module that are annotated by any GO term, and \({N}_{{G}_{i}}\) is the number of proteins within the module that share the ith GO term. The pathway homogeneity was calculated by Equation (4):

$${H}_{P}=\mathop{\max }\limits_{i}\,\left[\frac{{N}_{{P}_{i}}}{{N}_{P}}\right]$$
(4)

where \({N}_{P}\) is the number of proteins within one protein module that participate in any pathway, and \({N}_{{P}_{i}}\) is the number of proteins within the module that participate in the ith pathway.
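Equations (3) and (4) share the same form, so one small function can illustrate both. The function name `homogeneity`, the dictionary-based annotation table and the toy identifiers are assumptions for the example; returning 0 for a module without any annotated protein is also an assumption, because the text does not cover that case.

```python
from collections import Counter


def homogeneity(module, annotations):
    """Largest fraction of annotated proteins that share one term.

    module: iterable of protein identifiers.
    annotations: dict mapping protein -> set of GO terms (or pathways).
    """
    annotated = [p for p in module if annotations.get(p)]  # N_G or N_P
    if not annotated:
        return 0.0  # assumed convention for modules without annotations
    counts = Counter(term for p in annotated for term in annotations[p])
    return max(counts.values()) / len(annotated)


# Toy module: three annotated proteins, two of which share the term "GO:A"
annotations = {"p1": {"GO:A", "GO:B"}, "p2": {"GO:A"}, "p3": {"GO:C"}}
print(homogeneity(["p1", "p2", "p3", "p4"], annotations))  # 2/3
```

The unannotated protein `p4` is excluded from the denominator, in line with the definition of \({N}_{G}\).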

Homogeneity of random control

For a group of proteins, we randomly reassigned the GO (pathway) terms of each protein while preserving the number of terms that the protein originally held. The process was as follows: if a protein was annotated by m GO terms in the source database, we randomly assigned m GO terms to this protein as its annotated GO terms. In the same way, if a protein participated in n pathways, we randomly designated n pathways to this protein. In this study, we generated 100 random instances to assess statistical significance for all seven distinct community detection algorithms.
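The random control can be sketched as below. The sketch assumes that the random terms are drawn without replacement from the set of all terms present in the annotation table; the text does not specify the sampling pool, and the function name `randomize_annotations` is hypothetical.

```python
import random


def randomize_annotations(annotations, seed=None):
    """Give each protein as many random terms as it originally had.

    annotations: dict mapping protein -> set of GO terms (or pathways).
    Assumption: terms are sampled from all terms seen in `annotations`.
    """
    rng = random.Random(seed)
    all_terms = sorted({t for terms in annotations.values() for t in terms})
    return {
        protein: set(rng.sample(all_terms, len(terms)))
        for protein, terms in annotations.items()
    }


# 100 random instances, one per seed, as in the described control
annotations = {"p1": {"GO:A", "GO:B"}, "p2": {"GO:C"}, "p3": set()}
controls = [randomize_annotations(annotations, seed=s) for s in range(100)]
```

Each random instance can then be scored with the same homogeneity measure as the real annotations to obtain a null distribution.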

Molecular distance between topological modules

The distance between two communities was used to assess their topological proximity; the metric introduced by Jorg10 was used to measure the network-based separation of two disease modules. The separation of two modules A and B was calculated by comparing the mean shortest distances \(\langle {d}_{AA}\rangle \) and \(\langle {d}_{BB}\rangle \) between proteins within each module to the mean shortest distance \(\langle {d}_{AB}\rangle \) between the proteins of the two modules, as computed by Equation (5).

$${s}_{AB}=\langle {d}_{AB}\rangle -\frac{\langle {d}_{AA}\rangle +\langle {d}_{BB}\rangle }{2}$$
(5)
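Equation (5) can be illustrated with breadth-first search on an unweighted network. The sketch assumes that each mean is taken over all reachable pairs of distinct proteins; the cited metric may instead average nearest-neighbour distances, so this choice and the function names are assumptions.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest path lengths from source in an unweighted network."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def mean_distance(adj, A, B):
    """Mean shortest distance over reachable pairs (a, b) with a != b."""
    total, pairs = 0, 0
    for a in A:
        dist = bfs_distances(adj, a)
        for b in B:
            if a != b and b in dist:
                total += dist[b]
                pairs += 1
    return total / pairs if pairs else float("nan")


def separation(adj, A, B):
    """Network-based separation s_AB of two modules, per Equation (5)."""
    d_ab = mean_distance(adj, A, B)
    d_aa = mean_distance(adj, A, A)
    d_bb = mean_distance(adj, B, B)
    return d_ab - (d_aa + d_bb) / 2


# Toy graph: two triangles joined by the edge 2-3
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(separation(adj, {0, 1, 2}, {3, 4, 5}))  # 21/9 - 1 = 4/3
```

A positive value indicates that the two modules are topologically separated, whereas a value near or below zero indicates overlap.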

Symptom similarity

We investigated the phenotypic similarity between two topological protein modules by constructing a phenotype vector for each module and calculating the cosine similarity of every module pair, in order to test the hypothesis that a shorter distance between two modules is associated with more similar phenotypes. The phenotype vector of a module was built in three steps: 1) identifying the diseases associated with the proteins located in the module; 2) searching for all phenotypes induced by each disease obtained in step 1; and 3) constructing a vector with as many elements as there are phenotypes, initializing the values to zero and subsequently updating them according to the phenotypes found in step 2. The vector creation process is presented in Supplementary Fig. 3. Next, for the phenotype vectors \({V}_{A}\) and \({V}_{B}\) obtained for modules A and B, we used Equation (5), which was inspired by a previously published study8, to calculate the distance between the two modules, followed by Equation (6), which was used to obtain the biological similarity between the two phenotype vectors.

$$\cos ({V}_{A},{V}_{B})=\frac{{V}_{A}\cdot {V}_{B}}{\sqrt{{V}_{A}\cdot {V}_{A}}\,\sqrt{{V}_{B}\cdot {V}_{B}}}$$
(6)
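The phenotype vectors and Equation (6) can be sketched together. The sketch assumes that each vector element counts how often a phenotype is reached through the diseases of the module's proteins; the text does not state the exact update rule, and the function names and toy identifiers are hypothetical.

```python
import math
from collections import Counter


def phenotype_vector(module, protein_to_diseases, disease_to_phenotypes):
    """Sparse phenotype vector of a module (assumed to hold counts)."""
    vec = Counter()
    for protein in module:
        for disease in protein_to_diseases.get(protein, ()):
            for phenotype in disease_to_phenotypes.get(disease, ()):
                vec[phenotype] += 1
    return vec


def cosine(va, vb):
    """Cosine similarity of two sparse vectors, per Equation (6)."""
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    na = math.sqrt(sum(x * x for x in va.values()))
    nb = math.sqrt(sum(x * x for x in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


# Toy data with made-up proteins, diseases and phenotypes
protein_to_diseases = {"p1": ["d1"], "p2": ["d1", "d2"], "p3": ["d3"]}
disease_to_phenotypes = {"d1": ["fever", "cough"], "d2": ["fever"], "d3": ["fever", "rash"]}
v_a = phenotype_vector({"p1", "p2"}, protein_to_diseases, disease_to_phenotypes)
v_b = phenotype_vector({"p3"}, protein_to_diseases, disease_to_phenotypes)
similarity = cosine(v_a, v_b)
```

Storing the vectors sparsely avoids allocating one element per phenotype for modules that touch only a few of them.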