Critical controllability in proteome-wide protein interaction network integrating transcriptome

Recently, the number of essential gene entries has considerably increased. However, little is known about the relationships between essential genes and their functional roles in critical network control at both the structural (protein interaction network) and dynamic (transcriptional) levels, in part because the large size of the network prevents extensive computational analysis. Here, we present an algorithm that identifies the critical control set of nodes by reducing the computational time by 180 times and by expanding the computable network size up to 25 times, from 1,000 to 25,000 nodes. The developed algorithm allows a critical controllability analysis of large integrated systems composed of a transcriptome- and proteome-wide protein interaction network for the first time. The data-driven analysis captures a direct triad association of the structural controllability of genes, lethality and dynamic synchronization of co-expression. We believe that the identified optimized critical network control subsets may be of interest as drug targets; thus, they may be useful for drug design and development.


Results
Efficient computation of optimized critical and redundant subsets in scale-free networks. The novel algorithm for determining optimized sets of nodes with critical and redundant control roles has striking advantages compared with existing methods. These advances are based on two mathematical propositions that can be applied to real-world scale-free networks (see Methods for details). The network structure plays a key role in determining the control categories of the nodes. Random networks, for example, exhibit a controllability pattern that is drastically different from that of the scale-free networks (see Fig. 1(a,b)). The following two main features are essential for decreasing the complexity of the MDS and the critical set computation in scale-free networks. (1) Using algorithmic procedures developed in 15 for determining the critical and redundant nodes in a network of size |V| = n, it is necessary to solve ILPs n times, in addition to the original computation for the MDS of size |M|, in the original network. By contrast, using the novel propositions that automatically pre-determine a subset of critical and redundant nodes, we had to solve the ILPs with a much reduced number. (2) In addition, the ILPs solved with the newly proposed method have a significantly smaller number of constraints (equations), in addition to the trivial constraints (compare Eq. 1 with Eqs 2-5 in Supplementary Information (SI)). These two features are crucial for decreasing the computational time and for identifying critical and redundant sets in large networks. Figure 2(a-f) shows an example of the new algorithm computation. For a detailed explanation, see methods section and SI. Fig. 2(i) shows the results of the computational experiments on scale-free networks with degree exponent γ = 2 and average degree < k> = 2. The result shows that the algorithm not only is able to efficiently determine the critical and redundant nodes in large networks of up to 10,000 nodes but also 180 times faster than existing algorithmic procedures. In order to examine the efficiency in more detail, we fitted the relation between the number of nodes and CPU time t in terms of t = a · b n for each of the algorithms by using Mathematica's FindFit function after taking t log 10 because the original problem is NP-hard and thus it is expected that CPU time grows exponentially with the number of nodes, and Fig. 2(i) also suggests such a dependency. The results were a = 41399.0, b = 1.00116 for the previous algorithm, and a = 4800.43, b = 1.00038 for the proposed algorithm, which also suggests that the proposed algorithm is much faster than the previous one. Note that difference between b = 1.00116 and b = 1.00038 is far from small because they are bases of exponential functions. Further computational experiments using scale-networks constructed with average degree < k> = 3 and different values for the degree exponent γ also confirm this improvement (see Fig. S8, in SI). In particular, in the case of scale-free networks with γ = 2.5 and average degree < k> = 3, the algorithm could compute the MDS and critical and redundant sets simultaneously for a network size of up to 25,000 nodes (see Table S3 in SI). Provided that the graph problem is NP-hard, these achievements are remarkable. The relationships with network structures from Propositions 1 and 2.2 (see Ref. 15) (see methods section) suggest that scale-free networks have the ideal structure in which the proposed algorithm can work most efficiently, as explained in the Methods section. Because most real-world networks follow a scale-free topology, in particular the analysed protein networks in this work (see Fig.  S9 and Table S4), the proposed algorithm may potentially lead to multiple applications in various fields.
The optimized subset of proteins involved in critical control is enriched with the most highly connected proteins and largely depleted of the less connected ones. Using the newly discovered critical set algorithm, we first evaluated the extent to which the set of critical proteins is enriched with highly connected proteins. The analysis was performed using protein interaction networks corresponding to eight organisms as shown in Fig. 3. The data used to build the networks were obtained from the High-quality INTeractomes (HINT) database V2.0 22 for all organisms, except for E. coli, as the data for this bacterium were collected from The example shows that the network structure strongly affects the controllability roles of the nodes. Whereas in random networks the critical nodes are almost absent, in scale-free networks, they have a noticeable representation impacting the controllability of the network. Random networks, by contrast, are more flexible because they depend on intermittent control nodes. The proposed algorithmic procedure relies on the structure of the scale-free networks and on the existence of highly connected nodes to pre-determine a large number of critical and redundant nodes without using ILP. (c) The results of the algorithm computed on the human protein interaction network. (d) The fraction of proteins for each control category in all analysed PPI networks. The set of proteins involved in critical control is very small (less than 10% in almost all cases). The smallest critical control set is found in the H. sapiens and represents 6% of all proteins.
ref. 23. The number of proteins and interactions in each organism are summarized in Table S1 (see the Methods section for details). The critical set is systematically enriched for higher-degree nodes. By contrast, the analysis of the set of redundant nodes reveals a depletion of nodes with high degrees and a small enrichment in nodes for low degrees. The remaining nodes are classified as intermittent and accordingly are sometimes engaged in network control. The computational analysis shows that in almost all cases, this set is also moderately enriched when the degree of nodes increases. However, the degree of enrichment is in most cases several times lower than that of the maximum degree observed for critical nodes. The case of the human protein network clearly illustrates the observed general tendency in all organisms. Taken together, high-degree nodes tend to be critical, medium-degree nodes may tend to engage in network control intermittently and low-degree nodes are more associated with redundant and occasionally with intermittent network control roles.
Proteins engaged in critical network control are also enriched in essential gene products. The finding that the most highly connected proteins in the cell are associated with critical network control poses the question of whether these proteins are also directly required to sustain life. To answer this question, we collected essential genes from the Database of Essential Genes (DEG) 3 . To assess humans, we also compiled data from the Online Gene Essentiality Database (OGEE) 24 . The numbers of essential genes compiled for each organism and the databases from which they were extracted are summarized in Table S2 (see SI). For each organism, we then mapped a gene product for each essential gene to the protein interaction network. To precisely map the gene product onto the proteins, we used the UniProt (Swiss) and UniProt (Swiss & TrEMBL) database annotations. Using the resulting data, we defined a contingency table and evaluated the statistical significance of the association between critical proteins and essential genes by applying Fisher's exact test. The two-tailed exact p-values, presented in Table 1, showed that optimized proteins engaged in critical control were statistically significantly enriched with essential genes for all analysed networks (P < 0.05) except for that of S. pombe, which is not surprising given the few statistics available for this organism, as shown in Fig. S9 and Table S1. The enrichment factors for all the analysed organisms are shown in Fig.  S10. In consistency with the p-value analysis, all organisms except for the S. pombe exhibit high enrichment factors.
As shown above, one of the main findings of this work is the high association between the critical control node set and the essential gene set. Another is the high association between the critical control genes and the high degree genes. These results raise question of to what extent the critical control accounts for the association with the essentiality, when hubness (degree) is controlled. To address this question, we defined and computed an enrichment factor that captures this dependency. The results for all analyzed organisms are shown in Fig. S11. It is clear from the figure that critical nodes are enriched in essential genes for most degree intervals except very low  To measure the statistical proportion of proteins engaged in a given control set S (critical, intermittent and redundant) according to their degrees, we performed the following enrichment computation, which was also used by Wuchty 9 . First, proteins were classified according to their degree in logarithmic bins of increasing size. For each bin class i, we computed the frequency of proteins with degree k as , . Next, we calculated the fraction of proteins with degree k that also appeared in a given set S as , . Then, the enrichment of proteins with degree k that appear in the control set S (critical, intermittent and redundant) in bin i was computed as , . Positive values of this function indicate the enrichment of degree k for the critical, intermittent and redundant set, respectively. Negative values indicate depletion of degree k. degree and high degree nodes. It is to be noted that most of high degree nodes are also critical nodes and thus the enrichment factor would be 0 or very close to 0. It should also be noted that the number of essential nodes is small for low degree nodes and thus the enrichment factor is sensitive to small perturbations.
It is also important to note that the analysis of biological data, such as the protein interaction network has always suffered from some research induced by the fact that important gene (such as possibly essential genes) are studied better so they may have more interactions identified than those of less important genes. Therefore, this fact should also be taken into account when interpreting the results. Our study uses data from eight organisms, including H. sapiens. In particular, we compiled three different datasets for essential genes in humans to minimize this statistical issue.
Essential proteins engaged in critical control are also enriched in each Gene Ontology annotation category. To further understand the biological associations and implications of the proteins engaged in network control, we compiled all the Gene Ontology (GO) annotations (biological process BP, molecular function MF and cellular component CC) using the UniProt database. Each protein was then associated with its annotated GO terms. Each protein was also classified into one of five categories based on whether it is engaged in critical, intermittent, or redundant network control; represents an essential gene product; or is an essential gene product and simultaneously involved in critical control. The categories for the critical, intermittent and redundant are mutually exclusive. The category for the essential genes can contain genes from any control category. Finally, essential genes that are classified as critical are also examined in a separate category. The enrichment of each category was then calculated as shown in the Methods section. The essential proteins engaged in critical control (orange) also show the highest enrichment in biological process, molecular function and cellular component annotations (see Fig. 4). However, it is unclear which is the second best, essential proteins (green) or proteins engaged in critical control (red), because it largely depends on organisms. In particular, the second best differs even within E. coli essential genes datasets. The above mentioned observations suggest that the currently available PPI network data combined with essential genes datasets are useful to derive strong or general tendencies but are not enough to derive detailed or weak signals. The observed enrichment decreases when the network control engagement decreases from critical, intermittent and redundant, the latter showing the lowest values and even depletion scores (Fig. 4). This tendency is universal to all the analysed organisms and suggests evidence of a triad association of lethality, network control and biological functionality.
To further analyse the biological functions associated to the critical control in more detail, we examine each annotated term and functional class inside each GO category. The compiled results are available as supplementary files (see Excel files). We have selected the top-five functional classes for each category BP, MF and CC ordered by the number of critical and essential critical genes. The associated enrichment factors and the two-tailed p-values for the Fisher's exact test are also displayed. The results are shown in Table 2, which corresponds to the H. sapiens 1 dataset for essential genes. For the H. sapiens 2 and H. sapiens OGEE datasets see SI (Tables S6 and S7).
On the other hand, we have examined data corresponding to functional and evolutionary patterns for the eukaryotic orthologous groups or KOGs 25,26 . The results are shown in Fig. S12 and indicate that the essential genes involved in critical control are quite localized and mainly enriched for transcription (K) (P = 4.37 × 10 −4 ), signal transduction mechanisms (T) (P = 5.28 × 10 −5 ) and replication (L) (P = 6.68 × 10 −2 ) functional classes. A similar analysis was done only for MDS in Khuri & Wuchty in Fig. 5 19 , although they did not investigate critical control and only examined the category of MDS and MDS with essential genes. Interestingly, a similar set of highly enriched functional classes (K,L,T) were identified in their analysis for the MDS enriched with essential genes. For statistical details related to Fig. S12  Dominance of critical co-expressed genes across human tissues. Using the available microarray data from samples of 36 different healthy human tissues 27 , we investigated whether the fraction of co-expressed genes involved in critical control is homogeneously enriched across human tissues. The results shown in Fig. 5 indicate that, although subject to noticeable variations, co-expressed critical genes are enriched in all 36 healthy tissue samples. Moreover, the subset of co-expressed critical genes that are also annotated as essential genes exhibits the highest enrichment across human tissues. By contrast, fractions of redundant co-expressed genes exhibit depletion in almost all human tissues. The ubiquitous feature of the critical and essential critical nodes can be observed in Fig. 6(a-d). This figure shows the fraction of genes in a particular subset S among those whose transcripts are expressed in t tissues, from 1 to 36. The fraction of critical and essential critical proteins increases when the number of tissues in which they  Table S2. Additional statistical data is shown in Table 2  are expressed also increases. This tendency shows that network control roles also require an optimized subset of housekeeping genes that are present in the majority of tissues. When the fraction of intermittent and redundant genes is examined, the correlation is reverted progressively, indicating that they are not ubiquitous in all tissues.
Although the analysis shown in Fig. 6 indicates that network control roles require a set of genes that are always present (co-expressed) in the majority of tissues, it does not answer the question on the specific evolutionary

Figure 5. Enrichment of co-expressed genes for each category: critical (red), intermittent (blue), redundant (purple), essential (green) and simultaneously essential and critical control roles (orange) across 36 different healthy human tissues.
The subset of co-expressed critical genes that are also annotated as essential genes exhibit the highest enrichment in almost all tissues. The analysis is performed for three different datasets of essential genes as shown in Table S2. The results for H. sapiens (1) dataset are shown in figure. The results for the rest of datasets for H. sapiens are shown in Fig. S20 in SI.
conserved biological functions associated to each of these genes. To address this question we proceed as follows: First, we have selected all genes associated to critical control (which also includes essential critical genes). We then only extracted from this set, those genes that are expressed in 35 or more different tissues simultaneously (all samples include 36 different healthy tissues). Then, we then mapped the KOG biological functions on the resulting set of genes. Genes present in the KOG set but absent in the compiled PPI network were discarded in the analysis. The gene names as well as the KOG functions are displayed in Table 3 for the H. sapiens 1 dataset. See SI in Tables S8 and S9 for the rest of datasets. Moreover, the number of critical co-expressed genes in each functional class is visualized using a bar plot in Figs S13 and S14. The Information processing and storage (red) and cellular processes and signalling (blue) are the KOG categories with the highest number of critical genes simultaneously co-expressed in most tissues.
Linking structural controllability with dynamic co-expression synchronization. To perform complex functions, genes associated with control roles must coordinate the activity of numerous other genes in a synchronized fashion. Essential genes, for example, have been reported to show an expression pattern that is synchronized with a large number of other genes 28 . Here, we find that the average co-expression coefficient of the optimized set of essential genes involved in critical network control shows the greatest synchronization with all other genes in the cell (see Fig. 7(b,f)). The intermittent and redundant nodes show a lower average correlation with other genes, consistent with what we would expect for a less specialized functional role in cellular control ( Fig. 7(f)). This result is important because it captures a direct link between the structural controllability of genes and dynamic transcriptional synchronization features for the first time.

Discussion
In this work, we have presented an efficient computation of optimized critical control and redundant control subsets in scale-free networks that drastically surpasses existing algorithms; the computational time is reduced by as much as 180 times, and the network size for application is expanded by as much as 25 times, from 1,000 to 25,000 nodes in the best cases. Because most real-world networks follow a scale-free topology, in particularly the protein networks discussed in this work, the proposed algorithm may potentially lead to multiple applications in various fields. The developed algorithmic procedure allowed a detailed examination of the control categories of proteins and their relationships with human orthologues of mouse essential genes in proteome-wide protein interaction networks for eight organisms. The optimized subset of proteins involved in critical control was found to be enriched for the most highly connected proteins and largely depleted of less connected ones. Proteins engaged in critical network control were also enriched in essential gene products. The optimized subset of essential proteins engaged in critical control was also enriched in each Gene Ontology annotation category. This observed tendency is universal to all the analysed organisms and suggests evidence of a triad association of lethality, network control and biological functionality. The analysis of transcriptome-wide gene expression profiling corresponding to 36 Figure 6. The fraction of genes in a particular set of critical (a), essential and critical (b), essential (c), intermittent (d) or redundant (e) genes among genes whose transcripts are expressed in t tissues, from 1 to 36. For each figure (a-d), the scale for the grey bars is marked on the right axis and indicates the number of proteins found in each tissue. The fraction of critical and essential critical proteins increases when the number of tissues in which they are expressed also increases. The essential gene dataset corresponds to H. sapiens (1). The results from two other human essential gene datasets are shown in SI.
Scientific RepoRts | 6:23541 | DOI: 10.1038/srep23541 types of normal human tissues also produced a novel observation correlating structural controllability properties and dynamic transcription features. The fraction of genes associated with critical roles and their overlap with essential genes increased when the number of tissues in which they are expressed also increased. This tendency showed that control roles also require the presence of an optimized subset of housekeeping genes in the majority of tissues. Moreover, our findings revealed that the average co-expression coefficient of the optimized subset of essential genes associated with critical control showed the highest synchronization with all other genes in the cell, capturing a direct link between structural controllability of genes, the gene's lethality and the synchronization of dynamic co-expression for the first time. We believe that the identified optimized critical network control subsets may be of interest as drug targets; thus, they may be useful in drug design and development.

Methods
Protein Interaction Database. The data corresponding to protein-protein interactions for the organisms shown in Table 1 were compiled from the High-quality INTeractomes (HINT) database 22 using binary interactomes for all organisms. The protein interactions for H. sapiens correspond to the high-throughput interactome. In addition, the interactions corresponding to E. coli were collected from a separate data source 23 . Databases of Essential Genes. The essential genes for the analysed organisms were compiled from the Database of Essential Genes (DEG) database 3 . For humans, we considered three datasets in our study. Two of them are accessible from the DEG: the results from Liao and Zhang in 29 , with 118 genes, and the results from Georgi et al. in 5 , with 2,452 genes. The latter dataset is referred to in the results as H. sapiens 1, and the two datasets together form H. sapiens 2. These genes represent human orthologues of mouse essential genes, which are associated with 46 phenotypic categories with prenatal, perinatal and postnatal lethality. An additional dataset from OGEE (Online Gene Essentiality Database) was also included in the analysis 24 , and it is referred to as H. sapiens (OGEE). The numbers of essential genes compiled for each organism are summarized in Table S2.

UniProt (Swiss and Swiss & TrEMBL) Database.
The annotations for the proteins were retrieved from the UniProt database. We used both the Swiss and Swiss-TrEMBL repositories for the mapping of essential genes on proteins. These are used to compute the two-tailed P-values using Fisher's exact test (see Table 1). The Gene Ontology annotations for biological processes, molecular function and cellular component were also compiled from the UniProt database.
. Then, the enrichment of proteins in a GO(i) for the critical set C was computed as Next, each value for enrichment was classified into one of three groups according to whether its GO term was related to biological process, molecular function or cellular component. Finally, for all the enrichment values E GO i C ( ) corresponding to the GO terms in each annotation class, a box-and-whisker plot was constructed. In each box, the bottom and top indicate the first q 1 and third q 3 quartiles (the 25 th and 75 th percentiles, also called hinges), and the bold blue band inside the box represents the second quartile (the median). The upper (lower) whisker extends from the hinge to the highest (lowest) value that is within 1.5 x IQR of the hinge, where IQR is the inter-quantile range or distance between q 1 and q 3 . Data not included between the upper and lower whiskers are plotted with dots (also called outliers). The separations between the different parts of the box-plot denote the degree of dispersion (spread) in the data set.
Gene expression data and analysis. In our analysis, we used data from genome-wide expression profiling corresponding to 36 types of normal human tissues 27 . The number of genes in a particular subset S (essential, critical, critical and essential, intermittent and redundant) with an average gene co-expression coefficient < ρ> is denoted as ρ N S . The total number of genes with an average gene co-expression coefficient < ρ> is N ρ . Then, the fraction of genes in a particular subset S, among those whose average gene co-expression coefficient with other genes is < ρ> , reads as = . This measure is used in Fig. 7 (vertical axis). The average gene co-expression coefficient of gene i with respect to all other genes j reads as ρ ρ , where ρ ij is the Pearson Correlation Coefficient (PCC) between genes i and j.< ρ> i is presented in Fig. 7 (horizontal axis), where index i is dropped for readability purposes.
Two key procedures are performed to process the data, which correspond to the analogous protocol used in 28 .
(1) Microarray Data Normalization: The raw gene expression data for 36 tissues 27 were first normalized to mean-centre for different tissues. The normalization factor was obtained for each tissue by summing up the data for the probes that are "present" (significant p-value) in all 36 tissues. The term present refers to genes with an expression level that has sufficient statistical significance (P-value < 0.01). (2) Handling multi-probe genes: Some genes are tagged by multiple probes in the data from Ge et al. 27 . To assign a single value of gene expression correlation, we took the best (largest) Pearson correlation coefficient among  (1). The results for two other human essential gene datasets are shown in SI. The linear fit to the data (black line) also shows the 95% confidence band in the same colour as the data points. When it is too small, it may not be visible. The exact p-values for the analysis of statistical significance analysis are shown in SI, Table S5.
Scientific RepoRts | 6:23541 | DOI: 10.1038/srep23541 possible pairs of two gene probes to capture the best possible coordinated activities of the two genes among different tissues. The PCC was calculated with the log (normalized data). The probes that are "absent" (no significant p-value) in all 36 tissues were discarded from the analysis. For a few cases of probes mapped to multiple gene annotations, we simply prioritized the annotation for the largest expression value observed (the more reliable observation) and its role category, for example, essential and critical.
An algorithmic procedure for efficiently computing optimized critical and redundant subsets in scale-free networks. Nacher and Akutsu presented an algorithm to compute the optimal critical and redundant sets of nodes 15 . The algorithm, although providing the optimal and exact solution using ILP, needs to solve the MDS problem n times, where n is the number of nodes (see Supplementary Information (SI) for details). Therefore, if compared with the computation of a single MDS, the method requires much more CPU time, and the feasible size of networks is limited to approximately 1,000 nodes. This severe limitation is problematic in the analysis of critical network control features in many real-world networks, including cellular networks. To by-pass the strong computational requirements of the problem, we designed an algorithm that exploits the scale-freeness property observed in most of the real-world networks. First, the mathematical analysis presented by Nacher and Akutsu 15 derived an important proposition related to the nodes engaged in critical control:

Proposition 2.2 (Critical proposition).
If node v has two or more neighbouring nodes with degree k = 1, v is a critical node 15 .
Here, we present a new proposition to identify nodes engaged in a redundant control role.

Proposition 1 (Redundant proposition).
If all neighbours of a node v are critical nodes, v is a redundant node.
Using the above two propositions and taking advantage of the heterogeneous degree distribution observed in scale-free networks, we can substantially enlarge the computable size of networks. The degree distribution P(k) of a scale-free network for nodes of degree k follows a power law with k −γ functional form, where γ indicates the degree exponent of the network. This means that most of the nodes have a low degree; many of them are degree k = 1, and only a small fraction of nodes are highly connected. Therefore, using proposition 2.2 15 , (1) we can efficiently pre-compute an optimal subset of critical nodes. This allows us to substantially reduce the number of times we need to solve ILPs. In addition, (2) because the number of constraints, in addition to the trivial ones, is significantly smaller, the ILP can solve each ILP faster and therefore provide optimal and exact solutions for larger networks (see SI for details).
The algorithm for computing the optimized set of redundant nodes benefits from Proposition 1. Because highly connected nodes tend to be critical nodes, there are many low-degree nodes connected only to critical nodes. Therefore, (1) by direct application of proposition 1, a large number of redundant nodes can be pre-determined before computing the ILPs. As in the critical set case, (2) the number of constraints to be solved in the model for each ILP is drastically reduced, which allows faster computations in larger networks.
Taken together, our results show that a simple procedure surprisingly pre-determines a large number of critical (c) and redundant (r) nodes m = c + r in a network of size n. Then, the role of the remaining unassigned nodes (n-m) can be determined using the algorithm proposed by Nacher and Akutsu 15 , which only requires (n-m) ILP computations with lower constraint complexity.