Module organization and variance in protein-protein interaction networks

A module is a group of closely related proteins that act in concert to perform specific biological functions through protein–protein interactions (PPIs) that occur in time and space. However, the underlying module organization and variance remain unclear. In this study, we collected module templates to infer their respective module families, comprising 58,041 homologous modules in 1,678 species, and PPI families by searching complete genomic databases. We then derived PPI evolution scores and interface evolution scores to describe the module elements, including core and ring components. Functions of core components were highly correlated with those of essential genes. Compared with ring components, core proteins and PPIs were more conserved across multiple species. Subsequently, protein/module variance of PPI networks confirmed that core components form dynamic network hubs and play key roles in various biological functions. Based on the analyses of gene essentiality, module variance, and gene co-expression, we summarize the observations of module organization and variance as follows: 1) a module consists of core and ring components; 2) core components perform major biological functions and, in some cases, collaborate with ring components to execute certain functions; 3) core components are more conserved and essential during organizational changes in different biological states or conditions.


Supplementary Text 2: connectivity of modules
A module is relatively autonomous and often has high connectivity (C_t) within a PPI network. To observe the connectivity of a module in a PPI network, we quantified it by $C_t = \frac{2m}{n(n-1)}$, where n and m are the numbers of connected proteins and PPIs in a module, respectively. A C_t value of 1 indicates that the proteins in a module are completely interconnected. For the C_t of core (or ring) components, n and m are the numbers of connected core (or ring) proteins and PPIs in a module. In this study, the C_t of core (or ring) components was evaluated only when n was larger than 3. Here, we computed the C_t of modules using the human PPI network. Supplementary Fig. S2A shows the C_t of core and ring components, module templates, and their respective extended modules. An extended module was obtained by extending the module template (M) by one layer of PPIs and proteins. We assume that the module M consists of a set (P) of proteins and a set (I) of protein-protein interactions (PPIs). The one-layer-extended module of M includes a set (P ∪ P') of proteins and a set (I') of PPIs, where P' consists of the interacting proteins of each protein in set P, and I' consists of the PPIs among the proteins in the set P ∪ P'. Among 1,519 module templates, C_t values of more than 0.6 were observed in 71% (1,081) of cases.
In contrast, C_t values were more than 0.6 for only 5% (71) of extended modules. Moreover, 90% of core components and 81% of ring components had C_t values of ≥0.6. Similarly, 58,041 modules that were homologous to module templates had C_t values of ≥0.6 in 76% of cases (44,319), whereas only 1% (842) of their extended modules had C_t values of ≥0.6 (Supplementary Fig. S3A). These results indicate that core components have the highest connectivity, and that the modules also have high connectivity.
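As an illustration of the connectivity measure, the following Python sketch computes C_t as the fraction of possible protein pairs that are linked by a PPI and builds a one-layer-extended module from a toy network. The density form 2m / (n(n − 1)) is an assumption consistent with C_t = 1 for a completely interconnected module; the function names, the toy proteins A to F, and the exclusion of self-interactions are hypothetical choices and not part of the original analysis.

```python
def connectivity(proteins, ppis):
    """Return C_t = 2m / (n(n-1)) for a protein set (assumed density form).

    n counts proteins that take part in at least one PPI inside the set,
    m counts distinct PPIs inside the set; self-interactions are ignored.
    """
    members = set(proteins)
    edges = {
        frozenset((a, b))
        for a, b in ppis
        if a != b and a in members and b in members
    }
    connected = {p for edge in edges for p in edge}
    n, m = len(connected), len(edges)
    if n < 2:
        return 0.0
    return 2 * m / (n * (n - 1))


def extend_one_layer(proteins, network):
    """Return (P ∪ P', I') for a module with protein set P.

    P' holds every interaction partner of a protein in P, and I' holds
    every PPI of the network whose two proteins both lie in P ∪ P'.
    """
    base = set(proteins)
    partners = set()
    for a, b in network:
        if a in base:
            partners.add(b)
        if b in base:
            partners.add(a)
    extended = base | partners
    extended_ppis = [(a, b) for a, b in network if a in extended and b in extended]
    return extended, extended_ppis


# Toy network (hypothetical): a fully interconnected module A-D
# plus two outside partners, E and F.
MODULE = {"A", "B", "C", "D"}
NETWORK = [
    ("A", "B"), ("A", "C"), ("A", "D"),
    ("B", "C"), ("B", "D"), ("C", "D"),
    ("A", "E"), ("B", "F"),
]

if __name__ == "__main__":
    ext_proteins, ext_ppis = extend_one_layer(MODULE, NETWORK)
    print("module C_t:", connectivity(MODULE, NETWORK))
    print("extended module C_t:", connectivity(ext_proteins, ext_ppis))
```

In this toy case the module is completely interconnected, whereas adding the partners E and F dilutes the connectivity of the extended module, which mirrors the direction of the drop reported above.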

Supplementary Text 3: biological functions of modules
Through assembly and cooperation of proteins in a PPI network, components of a module simultaneously perform certain biological functions. Functional similarity was scored using the relative specificity similarity (RSS) 10 and is reported below as AvgRSS scores. To elucidate biological functions of modules, we compared module templates, their core and ring components, and their extended modules. For 1,519 module templates, BP and CC AvgRSS scores were more than 0.6 in 89% and 97% of cases, respectively (Supplementary Figs. S2B and S2C), and these scores were significantly higher than those of extended modules (Mann-Whitney U test, P ≈ 0). In addition, the average BP and CC AvgRSS scores of core components were higher than those of the other groups, including ring components (Mann-Whitney U test, P = 2e-7 for BP; P = 2e-21 for CC), whole module templates (P = 1e-7 for BP; P = 1e-14 for CC), and extended modules (P = 3e-239 for BP; P = 5e-262 for CC). The proportion of templates with CC AvgRSS scores of ≥0.6 (97%) was slightly higher than that of their ring components (94%). Furthermore, BP and CC AvgRSS scores were more than 0.6 for 81% and 94% of homologous modules, respectively (Supplementary Figs. S3B and S3C). Similarly, the average BP and CC AvgRSS scores for core components of homologous modules were also significantly higher than those of ring components (P = 0.0036 for BP; P = 3e-16 for CC).
For example, BP and CC AvgRSS scores for the CDC2-PCNA-CCNB1-GADD45B homologous module in H. sapiens were 0.79 and 0.84, but for extended modules they were only 0.43 and 0.25, respectively. The core components of this module had high BP and CC AvgRSS scores of 0.89 and 0.85, respectively. These results indicate that homologous modules of a template have highly similar biological functions and that their core components regulate similar biological processes and are often localized to the same cellular compartment.

Supplementary Text 4: GO term analysis of essential proteins
GO terms describe the biological process (BP), cellular component (CC), and molecular function (MF) of a protein 11. According to a modified term frequency-inverse document frequency (TF-IDF) scoring scheme 12, we identified 160 essential MF terms that describe the functional relationships of essential proteins and core proteins of the module families (Supplementary Table S1). First, we collected 8,364 essential proteins, called EP8364, from the DEG database and 160,598 proteins, called CG27, from 27 completed genomes. Each protein in these two sets had at least one GO MF or GO BP term. The "occurrence ratio" (CR_t) of a GO MF term (t) was defined as CR_t = P_t / T, where P_t is the number of proteins with term t, and T is the total number of proteins in the given set. For example, the occurrence ratio of the term "rRNA binding" was 0.0497 in the EP8364 set for P_t = 416 and T = 8,364. The unique ratio (UR) of a term is its occurrence ratio in a given set divided by its occurrence ratio in the CG27 set. The distribution of the occurrence ratios of and p-value ≤0.05 (hypergeometric distribution). We discarded species-specific terms (e.g., "azobenzene reductase activity") and terms that are widely used but nonspecific (e.g., "protein binding").
Among the 160 essential GO MF terms, 33 terms (21%; e.g., "acetyl-CoA carboxylase activity", UR = 9.25) were recorded for Carbohydrate and Lipid metabolism, which mediate the energy balance of organisms and constitute various biochemical processes responsible for the formation, breakdown, and interconversion of carbohydrates and lipids 14,15. Further, 16 essential GO MF terms were included in Amino acid metabolism (e.g., "cysteine desulfurase activity", UR = 6.89) and RNA degradation (e.g., "3′-5′ exonuclease activity", UR = 5.27), which play an important role in energy balance through the reuse of RNA and amino acids. Purine (e.g., "ATP-dependent RNA helicase activity", UR = 5.04) and Pyrimidine (e.g., "thymidylate kinase activity", UR = 6.98) metabolism are regarded as parts of a modular minimal cell model 16. Generation of biological energy occurs mainly through the pathways contained in the Oxidative phosphorylation group 17. These results demonstrate that a majority of these 160 essential GO MF terms are indispensable for the survival of an organism.
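The occurrence ratio and unique ratio used above can be sketched in a few lines of Python. The sketch reproduces the "rRNA binding" example (P_t = 416, T = 8,364) and adds an upper-tail hypergeometric probability as one plausible form of the enrichment test. The function names, the choice of the upper tail, and the toy counts in the last call are assumptions for illustration; the source does not give the corresponding CG27 counts, so none are filled in here.

```python
from math import comb


def occurrence_ratio(p_t, total):
    """CR_t = P_t / T: proteins annotated with term t over all proteins in the set."""
    return p_t / total


def unique_ratio(cr_in_set, cr_in_background):
    """UR: occurrence ratio in a set divided by the ratio in the background set."""
    return cr_in_set / cr_in_background


def hypergeom_upper_tail(k, annotated, drawn, population):
    """P(X >= k) when `drawn` proteins are sampled without replacement from a
    `population` in which `annotated` proteins carry the term (assumed test form)."""
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(population, drawn)


if __name__ == "__main__":
    # Worked example from the text: "rRNA binding" in the EP8364 set.
    cr = occurrence_ratio(416, 8364)
    print(f"CR_t = {cr:.4f}")

    # Toy enrichment check with hypothetical counts: 4 annotated proteins in a
    # population of 10, and 2 or more of them among 3 sampled proteins.
    print(f"p-value = {hypergeom_upper_tail(2, 4, 3, 10):.4f}")
```

A term would then be retained when its unique ratio is high and the p-value is ≤0.05; the unique-ratio cutoff is not recoverable from the text and is therefore left out of the sketch.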

Supplementary Text 5: Microarray expression data sets of 9 tumor types
To identify genes with significant expression changes between tumor and corresponding normal tissues, we collected 6 gene expression data sets, covering 9 different tumor types, from GEO 18.
Each expression data set comprised ≥3 tumor samples and corresponding normal samples and was obtained using the most comprehensive human expression array platform (HG U133 Plus 2.0; Supplementary …).

Supplementary Figure S5. Occurrence ratios of 160 essential GO MF terms between essential proteins, core proteins, and ring proteins. Occurrence ratios of each set are labeled only where enrichment is significant, as determined by p-values of ≤0.05 (hypergeometric distribution) for each GO term. Legend entry: ring component proteins (interface evolution score < 7).
The occurrence ratio of a GO MF term is defined as the number of proteins annotated with this term divided by the total number of proteins in the set.
c The unique ratio of a GO MF term is defined as the occurrence ratio of the GO MF term divided by its occurrence ratio in the 27-species genome set.
d The proteins of module templates represent the core component proteins in module families with an interface evolution score (IES) ≥ 7 and at least one GO MF term annotation in the GO database.

The occurrence ratio of a GO MF term is defined as the number of proteins annotated with this term divided by the total number of proteins in the set.
c The unique ratio of a GO MF term is defined as the occurrence ratio of the GO MF term divided by its occurrence ratio in the 27-species genome set.
d The proteins of module templates represent the core component proteins in module families with an interface evolution score (IES) ≥ 8 and at least one GO MF term annotation in the GO database.