Internetwork connectivity of molecular networks across species of life

Molecular interactions are studied as independent networks in systems biology. However, molecular networks do not exist independently of each other. In a network of networks approach (called multiplex), we study the joint organization of transcriptional regulatory network (TRN) and protein–protein interaction (PPI) network. We find that TRN and PPI are non-randomly coupled across five different eukaryotic species. Gene degrees in TRN (number of downstream genes) are positively correlated with protein degrees in PPI (number of interacting protein partners). Gene–gene and protein–protein interactions in TRN and PPI, respectively, also non-randomly overlap. These design principles are conserved across the five eukaryotic species. Robustness of the TRN–PPI multiplex is dependent on this coupling. Functionally important genes and proteins, such as essential, disease-related and those interacting with pathogen proteins, are preferentially situated in important parts of the human multiplex with highly overlapping interactions. We unveil the multiplex architecture of TRN and PPI. Multiplex architecture may thus define a general framework for studying molecular networks. This approach may uncover the building blocks of the hierarchical organization of molecular interactions.


Supplementary Information: Internetwork connectivity of molecular networks across species of life
[13], [9], [7] protein degrees for these networks are labeled K1 and K2 respectively. We have also given the corresponding references for each network dataset used in this study.

Abbreviations:
TRN: Transcriptional Regulatory Network PPI: Protein-Protein Interaction PPI1: PPI network from the BioGRID database [9] or other species-specific publications [2,4,6]. PPP2: PPI network from the HINT database [7]. LCC: Largest Connected Component is the subset of genes/proteins in TRN/PPI network where every gene/protein is reachable from every other gene/protein.

Definitions:
% Overlapping Proteome Coverage: Percentage of the proteome included in the multiplexes consisting of TRN and PPI1 networks or TRN and PPI2 networks, respectively. Edges: Number of edges in the TRN, PPI1 or PPI2 networks.

Supplementary methods
Simulating multiplex degree-degree coupling and redundancy Multiplexes with different degree-degree coupling and redundancy values were generated using a simulated annealing sampling approach [14]. This is achieved by randomly shuffling gene labels in the TRN network and the sampling is biased iteratively to match degree-degree coupling and redundancy with the desired values. Protein labels in the PPI network are kept fixed. The steps are described in Algorithm 1 below. 3. Calculate the difference Δ = ν(diffcor cur − diffcor prop ) + ( diffRed cur − diffRed prop ), where ν is a scaling factor; we use ν = 10. If Δ >= 0, save the labeling proposed in step 2 as the current labeling for TRN. Otherwise, accept the labeling proposed in step 2 with probability e Δ T ⁄ , where T = T 0 e −λL is a temperature variable, T 0 is the initial temperature (we use T 0 = 1000), λ is a rate parameter (we use λ = 0.01) and L is the iteration number for the simulated annealing loop. 4. Repeat steps 2-3 until diffcor cur and diffRed cur reach below pre-defined thresholds (we use diffcor cur = 0.001 and diffRed cur = 3/σ(Edges 12 null ) as the thresholds).
The node labeling for TRN achieved at the end of the simulated annealing loop has the desired degree-degree coupling and redundancy between the TRN and PPI layers. Since simulated annealing is a stochastic process, we repeat this procedure multiple times.
We can also match in-degree and out-degree couplings using this algorithm.

Multiplex null model
We use two null models to assess robustness of species multiplex--Zero-Coupling-Zero-Redundancy and Multiplex-Configuration models. The former is used to assess the effect of degree-degree coupling and redundancy on robustness. While the latter model helps to assess the impact of individual network (TRN or PPI) structure on multiplex robustness. These models are generated as follows.

Zero-Coupling-Zero-Redundancy
Under this null model, we generate null multiplexes which have no degree-degree coupling and no redundancy. We generate multiplexes from this model by setting cor , des = cor , des = Redundancy 12 des = 0 in Algorithm 1. This procedure is repeated 100 times to generate a distribution of multiplexes in the null model.

Multiplex-Configuration
Under this model, we generate null multiplex using the well-known configuration model prevalent in network research. The one-to-one correspondence between the genes in TRN and proteins in PPI is fixed. Further, the degree distribution in TRN and PPI is also unchanged. However, the edges are randomly shuffled such that the degree distribution is undisturbed. If the edge shuffling is performed for sufficient iterations, we get a randomized network structure.
Sampling a subset of multiplex with specified degree-degree coupling and redundancy.
Subsets of a multiplex with different degree-degree coupling and redundancy values were sampled using a simulated annealing sampling approach [14]. This is achieved by randomly sampling gene-protein pairs and the sampling is biased iteratively to match degree-degree coupling and redundancy with the desired values. The steps are described in Algorithm 2 below.
Algorithm 2: 1. Randomly sample gene-protein pairs from the multiplex. Calculate absolute difference between the sampled and observed degree-degree coupling and redundancy. Let us call these differences diffcor cur = |cor k,K cur − cor k,K des | for degree-degree coupling and diffRed cur = |Redundancy 12 cur − Redundancy 12 des | for redundancy, where cor k,K cur is the correlation between k (degree) in the sampled TRN network and K (degree) in the sampled PPI network and cor k,K des is the desired correlation between k and K, Redundancy 12 cur is the redundancy in the sampled multiplex and Redundancy 12 des is the desired Redundancy.
Superscript 'cur' represents that these are the current values of the variables. Save the current labels of the sampled gene-protein pairs in the multiplex. 4. Repeat steps 2-3 until diffcor cur and diffRed cur reach below pre-defined thresholds (we use diffcor cur = 0.001 and diffRed cur = 10 as the thresholds).
The sampled set achieved at the end of the simulated annealing loop has the desired degree-degree coupling and redundancy between the TRN and PPI layers of the sampled subset of the multiplex. Since simulated annealing is a stochastic process, we repeat this procedure multiple times.
We can also match in-degree and out-degree couplings using this algorithm. Sampling a random subset of gene-protein pairs from a multiplex with specified degree distributions and redundancy.

Matching gene and protein degrees to a functional set of gene-protein pairs
Subsets of a multiplex were sampled using a simulated annealing sampling approach [14]. This is achieved by randomly sampling gene-protein pairs and the sampling is biased iteratively to match gene and protein degrees to a functional set of gene-protein pairs. Further, redundancy can also be matched to a desired value. This method was used to generate the RanDP and RanDP-RZ null models.
Steps for the method are described in Algorithm 3 below.  The sampled set achieved at the end of the simulated annealing loop has the desired network degrees and redundancy between the TRN and PPI layers of the sampled subset of the multiplex. Since simulated annealing is a stochastic process, we repeat this procedure multiple times.
An alternative method to sample subsets that match sampled network degrees to a functional subset is to bin the degrees in a 3D space. However, this becomes problematic as the number of geneprotein pairs in each bin drastically reduces as we move to a 3D space. For Algorithm 3, we bin kin, kout and K separately into 1D bins and then sample subsets.

Matching gene and protein degree distributions to a functional set of gene-protein pairs
Algorithm 3 is restrictive in the sense that it ensures that the degree tuple (kin, kout, K) in the sampled subset matches a functional set of gene-protein pairs. This ensures that the sampled subset matches the functional set in CD. For the RanDP-noCD null model, we relax this restriction. To achieve this, we replace degree vectors in Algorithm 3 with degree distributions. Further, RanDP-noCD does not impose a redundancy condition on the sampled subsets. Therefore, redundancyrelated conditions are omitted from Algorithm 3 for RanDP-noCD. With these modifications, Algorithm 3 samples subsets which match only in degree distributions to the functional set. Figure S1: Degree-degree coupling across species for HINT. Scatter plot for degree (K) of proteins in the PPI network versus out-degree (kout) of genes in TRN for eukaryotes, A) yeast, B) C. elegans, C) fly, D) mouse and E) human. Both K and kout values have been logtransformed after adding 1. The PPI networks are for the HINT database (Methods). Linear interpolated fits between K and kout are also shown (green line) with 95% confidence region shaded in gray. Degree-degree coupling (CD) values and corresponding p-values (two-tailed z-test using Fisher's z-transformation) are also shown. Figure S2: Redundancy coupling across species for HINT. Distribution of redundancy (CR) for gene-protein pairs for species multiplex and the randomly shuffled null model for eukaryotes, A) yeast, B) C. elegans, C) fly, D) mouse and E) human. For each gene-protein pair, CR is quantified by the number of redundant edges incident on that gene-protein pair. Shuffled null model is generated by randomly shuffling labels on genes in TRN, while keeping protein labels fixed in PPI. Jensen Shannon divergence (JSD) (along with 95% CI) between distributions of CR in organismal and shuffled multiplexes is also shown. Attack curves are shown for six species multiplexes which were not considered in our final analysis. PPI database for a given panel is annotated next to the species name. Along with the attack curves for the species (red), attack curves for the Zero-Coupling-Zero-Redundancy null model are also shown (gray). Under this null model, multiplexes have no degree-degree coupling and no redundancy. Figure S4: Robustness, degree-degree coupling and redundancy coupling versus number of genes in the multiplex. A) Degree-degree coupling (CD), redundancy coupling (CR) and robustness (R) versus number of genes in the multiplex. Spearman's correlation coefficient (with p-values) with number of genes is also mentioned. B) Degree-degree coupling (CD), redundancy coupling (CR) and robustness (R) versus number of fraction coverage (number of genes in the multiplex divided by the total number of protein-coding genes). Spearman's correlation coefficient (with p-values) with number of genes is also mentioned. In all the panels, error bars show 95% CI. Figure 4B (main text). A) Lower interval for the 95% CI for Figure 4B. Upper interval for the 95% CI for Figure 4B. Figure S6: Robustness vs degree-degree and redundancy couplings with different relationship between in-degree and out-degree based degree-degree couplings. We sample a subset of gene-protein pairs from the species multiplexes (sizes for the subsets are: S. cerevisiae-1000, C. elegans-500, D. melanogaster-2000, M. musculus-300, H. sapiens-500) with specific CD and CR values. We repeat the sampling 100 times. We explore CD and CR over a grid. For each point over the 2D grid, the heatmap shows the robustness (R) value. R is computed by comparing RobustArea of any point over the grid against the lower-left point of the grid (with CD = 0, CR = 0). For each point over the grid, for each of the sampled subset, targeted attack is performed. Mean R values are shown here. A) CD in = 0.5 CD out . BioGRID PPI networks are used. A) CD in = 2 CD out . HINT PPI networks are used. CD out : degree-degree coupling between kout and K, CD in : degree-degree coupling between kin and K. Figure S7: Redundancy and robustness versus number of pathogen-related genes in the human multiplex against RanDP null model. A) CR for pathogen-related genes (against RanDP null model) versus number of pathogen genes. B) R for pathogen-related genes (against RanDP null model) versus number of pathogen genes. C) R versus CR for pathogen-related genes against RanDP null model. D) R versus CR for pathogen-related genes while controlling for number of genes. R and CR are regressed against number of genes using loess regression and residuals are plotted. Pathogen names are annotated with the PPI database used for the analysis. For all the panels, Spearman rank correlation values, ρ, are also listed along with p-values (twotailed z-test using Fisher's z-transformation) are also shown. Figure S8: Redundancy and robustness versus number of pathogen-related genes in the human multiplex against RanDP-RZ null model. A) CR for pathogen-related genes (against RanDP-RZ null model) versus number of pathogen genes. B) R for pathogen-related genes (against RanDP-RZ null model) versus number of pathogen genes. C) R versus CR for pathogenrelated genes against RanDP-RZ null model. D) R versus CR for pathogen-related genes while controlling for number of genes. R and CR are regressed using loess regression against number of genes and residuals are plotted. Pathogen names are annotated with the PPI database used for the analysis. For all the panels, Spearman rank correlation values, ρ, are also listed along with p-values (two-tailed z-test using Fisher's z-transformation) are also shown. Figure S9: Redundancy and robustness versus number of pathogen-related genes in the human multiplex against RanDP-noCD null model. A) CR for pathogen-related genes (against RanDP-noCD null model) versus number of pathogen genes. B) R for pathogenrelated genes (against RanDP-noCD null model) versus number of pathogen genes. C) R versus CR for pathogen-related genes against RanDP-noCD null model. D) R versus CR for pathogen-related genes while controlling for number of genes. R and CR are regressed using loess regression against number of genes and residuals are plotted. Pathogen names are annotated with the PPI database used for the analysis. For all the panels, Spearman rank correlation values, ρ, are also listed along with p-values (two-tailed z-test using Fisher's z-transformation) are also shown. Figure S10: Redundancy and robustness versus number of disease-related genes in the human multiplex against RanDP null model. A) CR for disease-related genes (against RanDP null model) versus number of disease genes. B) R for disease-related genes (against RanDP null model) versus number of disease genes. C) R versus CR for disease-related genes (against RanDP null model). D) R versus CR for disease-related genes while controlling for number of genes. R and CR are regressed using loess regression against number of genes and residuals are plotted. Disease names are annotated with the PPI database used for the analysis. For all the panels, Spearman rank correlation values, ρ, are also listed along with p-values (two-tailed z-test using Fisher's z-transformation) are also shown. Figure S11: Redundancy and robustness versus number of disease-related genes in the human multiplex against RanDP-RZ null model. A) CR for disease-related genes (against RanDP-RZ null model) versus number of disease genes. B) R for disease-related genes (against RanDP-RZ null model) versus number of disease genes. C) R versus CR for disease-related genes (against RanDP-RZ null model). D) R versus CR for disease-related genes while controlling for number of genes. R and CR are regressed using loess regressions against number of genes and residuals are plotted. Disease names are annotated with the PPI database used for the analysis. For all the panels, Spearman rank correlation values, ρ, are also listed along with p-values (two-tailed z-test using Fisher's z-transformation) are also shown. We sample 100 gene-protein pairs from the human multiplex. We performed the sampling experiment multiple times, each time generating a set with a different value of redundancy coupling (CR). 'Random' refers to a set of randomly sampled gene-protein pairs. For each value of CR, sampling was repeated 100 times. Partial attack curves for gene-protein pairs sampled from the human multiplex with different values of CR are shown. Increasing CR reduces area under the attack curve and also reduces robustness (inset). Error bars show 95% confidence intervals. HINT PPI network was used for the simulations. Figure S16: Impact of size of multiplex on robustness. We compared robustness between two subsets of gene-protein pairs, with different sizes, extracted from the same species multiplex over a grid of values for CD and CR. For each pair of CD and CR values over the grid, we sample subsets of two different sizes for each species such that CD and CR values are the same for the two sampled subsets. Each cell in the heatmap shows relative robustness as the cohen's d difference between the RobustAreas of the larger and smaller subsets with the same CD and CR values. The sizes of the two subsets for each species are: S. cerevisiae-1000 and 1250, D. melanogaster-2000 and 2500, H. sapiens-500 and 750. For each subset, sampling was performed 100 times. A), B) and C) represent the mean, lower limit for the 95% CI and upper limit for the 95% CI for relative robustness between the two subsets. All panels use the BioGRID PPI. Figure S18: Comparison between multiplexes of different sizes can reveal the impact of redundancy coupling on robustness. We compared robustness between two subsets of gene-protein pairs, with different sizes, extracted from the same species multiplex over a grid of values for CR for the two subsets. Values on the y-axis are the CR values for the smaller subset against which we compare robustness of the larger subset. CR values for the larger subset are shown on the x-axis. For each pair of CR values over the grid, the smaller sized subset has CR value shown on the y-axis, while the larger subset has CR value shown on the x-axis. Each heatmap corresponds to one value of CD, which is same for both the subsets. Each cell in the heatmap shows relative robustness as the cohen's d difference between the RobustAreas of the larger and smaller subsets. The sizes of the two subsets for each species are: S. cerevisiae-1000 and 1250, D. melanogaster-2000 and 2500, H. sapiens-500 and 750. For each subset, sampling was performed 100 times. Moving along the x-axis while keeping the y-axis value fixed, we see that robustness (R) increases with increasing CR of the larger sized multiplex. For each row in a heatmap, R is computed by comparing the larger subset against the smaller subset. All panels use the BioGRID PPI. A), B) and C) show heatmaps for S. cerevisiae, D. melanogaster and H. sapiens respectively. Only the lower triangular part of the matrix is shown. Figure S19: Comparing species degree-degree coupling against randomly shuffled multiplexes for BioGRID PPI. We compared species degree-degree coupling (dashed line) against the distribution (histogram) of degree-degree coupling for randomly shuffled multiplexes (PPI fixed and gene labels randomly shuffled in TRN). Z-score values show that species degree-degree couplings are statistically significant. Figure S20: Comparing species degree-degree coupling against randomly shuffled multiplexes for HINT PPI. We compared species degree-degree coupling (dashed line) against the distribution (histogram) of degree-degree coupling for randomly shuffled multiplexes (PPI fixed and gene labels randomly shuffled in TRN). Z-score values show that species degreedegree couplings are statistically significant.

Supplementary Figure S21: Multiplex attack curves for different species against the
Multiplex-Configuration null model. Relative size of Mutually Connected Giant Component (MCGC) is plotted as a function of the fraction of gene-protein pairs attacked and removed from the multiplex (Main Text, Methods). Attack curves are shown for five species using PPI networks from either BioGRID or HINT databases (Main Text, Methods); database for a given panel are annotated next to the species name. Along with the attack curves for the species (red), attack curves for the Multiplex-Configuration null model are also shown (gray). Under this null model, gene and protein degrees in TRN and PPI are kept fixed, while edges are randomly shuffled. Further, the one-to-one correspondence between TRN genes and PPI proteins is kept fixed. Figure S22: Degree-degree and redundancy coupling for species against the Multiplex-Configuration null model. (Left) Robustness (R) to targeted attack against the configuration null model. (Middle) Size robustness (rather than RobustArea, RobustSize is used for robustness. RobustSize is the number of gene-protein pairs that need to be attacked to break the multiplex) to targeted attack against the configuration null model. (Right) Degree-degree coupling against the configuration null model. Error bars show 95% CI. Figure S23: Functionally important genes and proteins are topologically important in TRN against RanDP null model. A) Fly and human essential genes and proteins have lower robustness (R) to targeted attack in TRN against RanDP null model. However, yeast essential genes and proteins are more robust. Panels B-D show results for human multiplex. B) Pathogen-related genes and proteins have lower R to targeted attack in TRN for most pathogens against RanDP null model. C) Disease-related genes and proteins have lower R to targeted attack for most diseases against RanDP null model. D) Oncogenes (tumor suppressor genes (TSGs)) and proteins have lower R to targeted attack against RanDP null model. In all the panels, database used for PPI networks is annotated on the y-axis. In all the panels, relative robustness (R) is measured against RanDP null model. Error bars show 95% CIs. Figure S33: Comparison between kout distribution for functionally important genes and RanDP null model: Jensen Shannon Divergence (JSD) between kout distributions for functionally important genes-A) essential, B) pathogen-related, C) disease-related and D) oncogenes and TSG-and RanDP null model sampled genes (red). As a control, we also show JSD between functionally important genes and uniform distribution (blue). JSD is computed by first binning kout in the species multiplex such that each bin has at least 30 genes. Binned kout values are used for JSD computation. Error bars show 95% CI. Statistical significance, for α = 0.05, is marked by stars. Statistical significance is estimated using a two-tailed z-test without correction for multiple hypothesis testing. Figure S34: Comparison between K distribution for functionally important genes and RanDP null model: Jensen Shannon Divergence (JSD) between K distributions for functionally important genes-A) essential, B) pathogen-related, C) disease-related and D) oncogenes and TSG-and RanDP null model sampled genes (red). As a control, we also show JSD between functionally important genes and uniform distribution (blue). JSD is computed by first binning K in the species multiplex such that each bin has at least 100 genes. Binned K values are used for JSD computation. Error bars show 95% CI. Statistical significance, for α = 0.05, is marked by stars. Statistical significance is estimated using a two-tailed z-test without correction for multiple hypothesis testing. genes and RanDP-noC null model: Jensen Shannon Divergence (JSD) between K distributions for functionally important genes-A) essential, B) pathogen-related, C) disease-related and D) oncogenes and TSG-and RanDP-noC null model sampled genes (red). As a control, we also show JSD between functionally important genes and uniform distribution (blue). JSD is computed by first binning K in the species multiplex such that each bin has at least 100 genes. Binned K values are used for JSD computation. Error bars show 95% CI. Statistical significance, for α = 0.05, is marked by stars. Statistical significance is estimated using a two-tailed z-test without correction for multiple hypothesis testing. by higher CR for pathogen genes and proteins against RanDP null model. C) (Left) Disease-related genes and proteins have lower R to targeted attack for most diseases against RanDP null model. (Right) Lower R is accompanied by higher CR for disease genes and proteins against RanDP null model. D) (Left) Oncogenes (tumor suppressor genes (TSGs)) and proteins have lower R to targeted attack against RanDP null model. (Right) Lower R is accompanied by higher CR for oncogenes (TSGs) and proteins against RanDP null model. In all the panels, database used for PPI networks is annotated on the y-axis. In all the panels, relative robustness (R) is measured against RanDP null model, and CR values are calculated as the difference in redundancy against the null model. Error bars show 95% CIs. For all the panels, CR is calculated as the mean difference in the number of redundant edges between the functionally important gene-protein pairs and the null model.