Introduction

Genome sequencing can, at least in an idealized world, list the repertoire of what a cell could possibly do; expression profiling, on the other hand, reflects what the cell actually is doing. Selective or differential gene expression profiles in specific cells, therefore, add valuable contextual information. It is quite natural to connect the differential gene expression profiles to disease states, whether they are genetic diseases or not. An overwhelming number of studies in this vein have been published: e.g., Refs. 1,2,3,4,5,6 to name just a few. Essentially all of these approaches make the assumption that genes are the units of biological functionality.

Even if the assumption cannot be denied, it has recently been pointed out that the relationships among proteins, not just properties of individual proteins, are essential ingredients in characterizing the entity of biological functions. The relationships can be binary protein-protein interactions (PPIs)7,8,9,10 or formation of stable structural and functional units called protein complexes11,12,13,14,15. Proteins tend to function as members of complexes and dysfunctions of different proteins in the same complex generally lead to similar disorders. Research has been conducted trying to identify disease-associated protein-protein interactions, signaling pathways and protein complexes by the integrated computational analysis of heterogeneous data sources16,17,18,19,20,21,22.

Human diseases usually occur in one or more specific tissues and organs, while different types of organs and tissues make use of selective sets of expressed genes, protein-protein interactions and protein complexes23. Genes predominantly expressed in one or a few biologically similar tissue types are defined as tissue-selective genes24. Similarly, protein complexes showing significantly higher abundance levels in one or a limited number of tissues are considered tissue-selective complexes. Tissue-selective genes and complexes could be disease markers and potential drug targets. Although many approaches have been developed to identify tissue-selective genes and their relationships to diseases24,25,26,27,28,29, the identification of tissue- and disease-selective complexes is still in its infancy because experimental proteomic data lack adequate coverage, so gene expression levels have been used instead of protein abundance20,30,31.

In this paper, by using the optimization algorithm for estimating differential abundance levels of protein complexes introduced in Ref. 15, we attempt to define the human tissue- and cancer-selective protein complexes. More specifically, we use the recently released E-MTAB-62 gene expression profile dataset32 and focus on 39 solid tissue cancers and 25 different normal tissues, some of which are the tissues in which the cancers originated (Table 1). From the abundance profiles of complexes, we classify the complexes associated with cancers and tissues into four different categories called Patterns 1–4, where the complexes over-expressed in cancers but under-expressed in originated normal tissues are considered the most relevant and are analyzed in terms of the bipartite relation between cancers and complexes. Finally, we show that the correlation structures of different cancers and tissues are preserved in our complex-based study, in comparison to the results from individual gene expression levels.

Table 1 List of solid cancers and their originated normal tissues. Cancers were selected from the file “E-MTAB-62.sdrf.txt” whose columns “Characteristics [4 meta-groups]” and “Characteristics [Blood/NonBlood meta-groups]” are “neoplasm” and “non blood”, respectively. Cancer name and its originated normal tissue are taken from “Characteristics [DiseaseState]” and “Characteristics [OrganismPart]” of the file, respectively

Results

Differentially expressed protein complexes in normal tissues

First, we present our results on the differentially expressed protein complexes in normal tissues. For each of the 25 solid tissues under study, using the average abundance levels over all the other tissues as the control set, we extracted over- (under-) expressed complexes whose abundance changed by more than a factor of two (less than a factor of 1/2) (Tables S1 and S2). A total of 106 and 209 distinct protein complexes were found to be over- and under-expressed in normal tissues, respectively. See Table S3 for the number of complexes differentially expressed in each tissue. The distributions of the number of different tissues in which complexes are over- or under-expressed are shown in Fig. 1. It can be seen that most complexes are over- or under-expressed in only a small number of tissues, suggesting that a large fraction of the complexes predicted by our method exhibit a high degree of tissue selectivity. Note that the tissues are (of course) not completely independent from one another, which may explain why some complexes are differentially expressed in multiple tissues.

Figure 1

Distributions of number of overlapped tissues for over-expressed (a) and under-expressed (b) complexes, in normal tissues.

For each over- or under-expressed complex in normal tissues, we count the number of tissues where it is over- or under-expressed and define the number as the number of overlapped tissues.
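The counting behind Fig. 1 can be illustrated with a minimal sketch. The tissue names, complex identifiers and the helper `overlap_distribution` below are hypothetical and serve only to show how the number of overlapped tissues per complex, and its distribution, could be tallied.

```python
from collections import Counter

def overlap_distribution(calls):
    """calls maps each tissue to the set of complexes called over-expressed there.

    Returns (per_complex, histogram): per_complex[c] is the number of overlapped
    tissues of complex c; histogram[k] is the number of complexes with k
    overlapped tissues.
    """
    per_complex = Counter()
    for complexes in calls.values():
        per_complex.update(complexes)
    histogram = Counter(per_complex.values())
    return per_complex, histogram

# Hypothetical toy input (not taken from Tables S1 or S2).
calls = {
    "thymus": {"cplx_A", "cplx_B"},
    "liver": {"cplx_B"},
    "brain": {"cplx_B", "cplx_C"},
}
per_complex, histogram = overlap_distribution(calls)
print(dict(per_complex))
print(dict(histogram))
```

The same tally applies unchanged to under-expressed complexes and to cancers in place of tissues.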

In the CORUM (Comprehensive Resource of Mammalian protein complexes33) database, which we use for our complex list, functions of protein complexes are annotated by the Functional Catalogue (FunCat) scheme, whose hierarchical structure allows browsing for protein complexes with particular cellular functions or localizations33,34. However, among all the 2837 mammalian protein complexes in the CORUM database, only 148 have information concerning specific animal tissue of the complex. Because of this lack of tissue-specific annotation, only 5 of the 106 over-expressed complexes predicted by our method have tissue annotation. As shown in Table 2, among the 5 complexes, 4 complexes are consistent with the annotation, suggesting the validity of our result. For instance, “thymus” (our predicted tissue) and “bone marrow” (CORUM) are compatible, as both of those are hot spots of T cell production and maturation35. They are both considered (the only) “primary lymphoid organs”35.

Table 2 Comparison of our results with tissue information of complexes in CORUM. Boldface marks consistent results

Differentially expressed protein complexes in solid cancers

As in the normal tissue case, for each of 39 solid tissue cancers, using the abundance levels in the originated normal tissue as the control set, we extract over(under)-expressed complexes with more (less) than 2-fold (1/2-fold) changes, respectively (Tables S4 and S5). A total of 283 and 294 distinct complexes were identified over- and under-expressed in the cancers, respectively. We call these complexes cancer-associated complexes. Again, from the distributions of the number of different cancers in which complexes are over- or under-expressed, shown in Fig. 2, we can observe the high degree of cancer selectivity of the complexes. The fact that several cancers are derived from the same normal tissues seems to be responsible for the larger number of overlapped cancers compared to the number of overlapped normal tissues in Fig. 1 and in fact, such cancer-cancer correlations will be presented later.

Figure 2

Distributions of number of overlapped cancers for over-expressed (a) and under-expressed (b) complexes, in cancers.

For each over- or under-expressed complex in cancers, we count the number of cancers where it is over- or under-expressed and define the number as the number of overlapped cancers.

The most fundamental assumption of our approach is that complexes, rather than their individual component proteins, are the functional units. In other words, differential abundance profiles for complexes are more relevant than those for individual genes, since each gene may play different functional roles in different complexes, with the result that its expression levels over different contexts are effectively “averaged out.” In Table 3, we compare the over-expressed protein complexes of brain tumor with their up-regulated component genes that are reported in GeneCards36 to be associated with nerve system cancers. We use the t-test to test whether a gene is differentially expressed between the brain tumor and control samples. Because such a large number of genes are tested simultaneously, FDR37-corrected p-values are used for screening differentially expressed genes. We consider genes with at least a 2-fold change in average expression level (measured by the log ratio) and an FDR of at most 0.05 as up-regulated in brain tumor. It can be seen that, in the complexes identified as over-expressed in brain tumor by our algorithm, only a small fraction of the component genes associated with nerve system cancers were up-regulated. Such a large difference is strong evidence for the fundamental assumption that complexes are more relevant to biological functions and dysfunctions than individual genes.
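The gene-level screening step can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure used for Table 3: the FDR correction is assumed to be the Benjamini-Hochberg step-up procedure, the 2-fold criterion is read as a log2 ratio of at least 1, the per-gene p-values are taken as given (for example from a t-test), and the gene labels and function names are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def up_regulated(genes, log2_ratios, p_values, min_log2=1.0, max_fdr=0.05):
    """Genes passing both the fold-change and the FDR criterion (assumed thresholds)."""
    q_values = benjamini_hochberg(p_values)
    return [g for g, lr, q in zip(genes, log2_ratios, q_values)
            if lr >= min_log2 and q <= max_fdr]

# Hypothetical toy input.
genes = ["G1", "G2", "G3", "G4"]
log2_ratios = [1.5, 0.4, 2.0, 3.0]
p_values = [0.001, 0.02, 0.04, 0.5]
q_values = benjamini_hochberg(p_values)
hits = up_regulated(genes, log2_ratios, p_values)
print(hits)
```

In this toy input only G1 passes both filters: G2 fails the fold-change criterion, while G3 and G4 exceed the FDR threshold after correction.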

Table 3 Comparison of protein complexes over-expressed in brain tumor and their up-regulated component genes

Considering that the E-MTAB-62 dataset we used integrates data generated in different laboratories, we conducted a within-laboratory comparison of the complexes over-expressed in brain tumor to see to what extent our result is replicated across studies. The samples of brain tumor and normal brain tissue came from 2 and 6 different laboratories, respectively. By combining the brain tumor samples from one laboratory with the normal brain samples from another, we obtained 2 × 6 = 12 different sample sets. We ran our algorithm on each sample set and identified the complexes over-expressed in brain tumor. As shown in Figure 3, most complexes identified by our algorithm are also identified in at least half of the sample sets. We then ran our algorithm on each individual brain tumor sample and each individual normal brain tissue sample. By applying the t-test with multiple testing correction to the resulting sample-by-complex abundance matrix, we identified the complexes that are statistically over-expressed in brain tumor with FDR < 0.05. A total of 29 complexes identified as over-expressed by this sample replication method are also identified by our method, which uses the average over samples as input (see Figure 3). These comparisons suggest that our algorithm is robust across different data sources.

Figure 3

Cross-validation of complexes over-expressed in brain tumor identified by our method by within-laboratory comparison, sample replication method and GSEA.

We also compare our algorithm to a gene set testing approach, Gene Set Enrichment Analysis (GSEA)38. Using the CORUM complexes as gene sets, we conducted GSEA on the expression data of brain tumor and normal brain tissue. This method identifies 227 complexes that were significantly enriched in brain tumor tissue (FDR < 25%). As shown in Figure 3 and Table 3, 9 of the 34 complexes identified as over-expressed in brain tumor by our method are also identified by GSEA. From Table 3 we see that relatively more up-regulated genes appear in the overlapping complexes, which is consistent with the principle by which GSEA identifies enriched gene sets. The complexes identified as over-expressed only by our algorithm include genes reported to be associated with nerve system cancers, suggesting that they may be related to brain tumor. However, these complexes are not detected by GSEA because few of their genes were up-regulated. This comparison suggests that our algorithm, which considers the stoichiometry of complexes from a global point of view, could add new information to complex prediction.

From Figure 3 we can see that several complexes, such as the Anti-HDAC2 complex, SMN complex, EIF3 complex and CDC2-CCNA2 complex, are identified as over-expressed in brain tumor by all four methods, suggesting strong expression signals of these complexes in brain tumor. Complexes such as the CDC2-CCNA2 complex, Anti-HDAC2 complex and WINAC complex are more clearly associated with brain tumor because a high fraction of their component proteins is related to nerve system cancer (see Table 3). However, according to the GeneCards and GoPubMed databases, none of the five component proteins of the SMN complex (small nuclear ribonucleoprotein B, D, E, F, G) is associated with nerve system cancer, although they are highly associated with neurologic manifestations and neurodegenerative diseases. Our computations found that this complex and its five component proteins are significantly over-expressed in brain tumor, indicating a possible relationship with brain tumor. Further research is needed to validate such results.

Expression patterns of cancer-associated complexes in normal tissues

For complexes differentially expressed in a cancer, we compare their abundance levels in the cancer tissue with those in the originated normal tissue and in the other normal tissues. Specifically, we mapped the differentially expressed complexes in each cancer to each normal tissue and classified differential expressions of these complexes according to the following four patterns:

  • Pattern 1: over-expressed in the cancer tissue but under-expressed in the normal tissue

  • Pattern 2: over-expressed in the cancer tissue as well as in the normal tissue

  • Pattern 3: under-expressed in the cancer tissue but over-expressed in the normal tissue

  • Pattern 4: under-expressed in the cancer tissue as well as in the normal tissue
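The four patterns listed above amount to a sign-based classification of two log-ratios. The sketch below assumes base-2 log-ratios and the cut-off of ±1 given in the caption of Figure 4; the function name `classify_pattern` is hypothetical.

```python
def classify_pattern(cancer_log2, tissue_log2, threshold=1.0):
    """Return 1, 2, 3 or 4 for Patterns 1-4, or None if either value is 'normal'.

    cancer_log2: log2 abundance ratio of the complex in the cancer vs. its control.
    tissue_log2: log2 abundance ratio of the complex in the normal tissue vs. its control.
    Values strictly inside (-threshold, threshold) count as not differentially expressed.
    """
    if abs(cancer_log2) < threshold or abs(tissue_log2) < threshold:
        return None
    over_in_cancer = cancer_log2 > 0
    over_in_tissue = tissue_log2 > 0
    if over_in_cancer and not over_in_tissue:
        return 1  # over in cancer, under in normal tissue
    if over_in_cancer and over_in_tissue:
        return 2  # over in both
    if not over_in_cancer and over_in_tissue:
        return 3  # under in cancer, over in normal tissue
    return 4      # under in both

# Hypothetical values for illustration.
print(classify_pattern(1.5, -2.0))
print(classify_pattern(0.5, 2.0))
```

A complex that returns None for a given cancer-tissue pair is simply not counted in any pattern for that pair.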

For each cancer, we count the number of complexes of each Pattern in each tissue (see Table S6). Then, for each cancer, we list the number of complexes of each Pattern in the tissue from which it originated, along with the largest such number among the other tissues (see Table S7). Figure 4 shows the distribution of the four differential expression patterns of cancer-associated complexes in their originated normal tissues. It can be seen that the dominant expression patterns are Patterns 1 (57.2%) and 3 (27.1%), whereas Pattern 2 and 4 complexes in originated normal tissues (1.15% and 3.87%) are minorities. Table S7 compares the four patterns in the cancers' originated normal tissues with those in the other normal tissue that has the maximum number of cancer-associated complexes. It shows that, compared with those in the other normal tissues, Pattern 1 complexes in originated normal tissues are much more numerous (57.2% vs. 22.6%); Pattern 2 and 4 complexes in originated normal tissues are much fewer (1.15% vs. 17.5% for Pattern 2; and 3.87% vs. 22.94% for Pattern 4); and Pattern 3 complexes show no significant difference (27.1% vs. 26.9%). Moreover, by the t-test, the expression of Pattern 3 complexes in originated normal tissues is not significantly different from that in other normal tissues, whereas the expression of Pattern 1, 2 and 4 complexes is significantly different from that in the other normal tissues.

Figure 4

Differentially expressed complexes in cancers and originated normal tissues.

The log-ratios of abundance in cancers (vertical axis) are defined with respect to the originated normal tissues, and those in normal tissues (horizontal axis) are defined with respect to all the other normal tissues. The log-ratio values in the “normal” range (–1, 1) are excluded for both cancers and normal tissues. Four different patterns are noted according to their differential abundance levels in cancers and their originated tissues.

From these observations, we can conclude that solid cancers tend to over-express complexes that are under-expressed in the normal tissues of the cancers' origin (Pattern 1). In other words, complexes that are not supposed to be expressed in a specific tissue but are over-expressed in this tissue can be related to cancers. Furthermore, solid cancers could over-express (or under-express) part of complexes that are over-expressed (or under-expressed) in normal tissues other than the cancer's tissue of origin (Patterns 2 and 4). These patterns could complement earlier findings on single gene expression pattern in cancers. For example, it was reported that genes over-expressed in human leukemias were rarely over-expressed in hematopoietic tissues39. Generally, cancers over-express only a fairly small part of genes that are selectively expressed in their originated tissues25. On the other hand, under-expressed complexes in cancers do not have statistically significant tendency to be over-expressed in the originated normal tissues (Pattern 3), which can be interpreted to mean that the lack of necessary complexes does not tend to cause cancers, in contrast to the existence of unnecessary complexes in Pattern 1.

It is known that one form of cancer can affect many tissues, not only the tissue from which it originated. The expression patterns of cancer-associated complexes may indicate these cancer-tissue relations. One way to verify the cancer-tissue relations from an external source is to use a Web search engine40. Our basic assumption is that the more Web pages Google finds for the search query ‘[cancer name][tissue name]’, the more likely the tissue is to be related to the cancer. We measure this cancer-tissue “Google correlation” (‘Google page’ column in Table S6). For a specific cancer A, the Google correlation value of the ‘[cancer A][originated tissue of cancer A]’ pair mostly ranks near the top among all the ‘[cancer A][tissue name]’ pairs. More precisely, 14 of the 39 cancers have their largest Google correlation value with their originated tissues. This result supports our assumption. In addition, from Table S6, for each cancer, we calculated the Pearson correlation coefficient between the ‘Google pages’ column and each of the ‘Patterns 1–4’ columns, as shown in Table S8.
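For one cancer, the comparison between page counts and pattern counts reduces to a Pearson correlation over tissues. The sketch below uses made-up numbers; the variable names and values are hypothetical and are not taken from Table S6.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Hypothetical values for one cancer, one entry per normal tissue.
google_pages = [1200, 300, 50, 800]   # page counts for '[cancer][tissue]' queries
pattern1_counts = [30, 5, 1, 12]      # Pattern 1 complexes mapped to each tissue
r = pearson(google_pages, pattern1_counts)
print(round(r, 3))
```

Repeating the calculation for the counts of each Pattern would give one row of a table such as Table S8.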

The statistical significance test suggested that cancer-associated complexes are expressed according to Patterns 1, 2 or 4. Thus, we took the maximum value of the Pearson correlation coefficients for Patterns 1, 2 and 4 and show it in the last column of Table S8. Most (about 3/4) of the Pearson correlation coefficients in the last column are positive, suggesting a positive correlation between the cancer-tissue relations derived from the Google correlation and those derived from the number of cancer-associated complexes with differential abundance levels.

Bipartite complex-cancer relations and common complexes associated with the same cluster of cancers

The previous subsection suggests that most cancer-associated complexes are Pattern 1 complexes in the originated normal tissues, i.e., over-expressed in the cancer tissue but under-expressed in the originated normal tissue. Thus we focus on these Pattern 1 complexes and investigate the bipartite network between cancers and Pattern 1 complexes in cancer tissues. We constructed a bipartite network between cancers and Pattern 1 complexes, in which a cancer node is connected to a complex node if and only if this complex is a Pattern 1 complex of this cancer. In the bipartite network, we measured the topological similarity of the vertices according to the following Jaccard similarity index:

$$ J(u, v) = \frac{|N_u \cap N_v|}{|N_u \cup N_v|} $$

where Nu is the set of neighbors of node u. Then Ward's clustering, a hierarchical agglomerative clustering method, was used to cluster the nodes in the network41. The hierarchical clustering starts off with each node being its own cluster, and the distance between nodes u and v is defined as d(u, v) = 1 − J(u, v). At each step, the pair of clusters (u, v) with the smallest distance d(u, v) is selected and merged into a single cluster, and the distances between clusters are updated as weighted sums of distances according to the Lance-Williams algorithm42. The process is repeated until all nodes have been combined into one cluster, and the result is represented as a dendrogram with a hierarchical structure. In our case, d(u, v) = 2 is used as the threshold for cutting the hierarchical tree to yield the clustering structure. Figure 5 shows that some cancers are clustered because of their common over-expressed complexes and also some complexes are clustered together.
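As a small illustration of the similarity measure, the sketch below computes the Jaccard index, taken in its standard form as the size of the intersection of two neighbor sets divided by the size of their union, and the distance d(u, v) = 1 − J(u, v) for cancer nodes of a toy bipartite network. The cancer and complex labels are hypothetical, and returning 0 for two empty neighbor sets is a convention assumed here.

```python
def jaccard(neighbors_u, neighbors_v):
    """Jaccard similarity of two neighbor sets (0.0 if both are empty, by convention)."""
    union = neighbors_u | neighbors_v
    if not union:
        return 0.0
    return len(neighbors_u & neighbors_v) / len(union)

# Hypothetical bipartite network: each cancer maps to its Pattern 1 complexes.
pattern1 = {
    "cancer_X": {"c1", "c2", "c3"},
    "cancer_Y": {"c2", "c3", "c4"},
    "cancer_Z": {"c5"},
}
names = sorted(pattern1)
distance = {
    (u, v): 1.0 - jaccard(pattern1[u], pattern1[v])
    for u in names for v in names if u < v
}
for pair, d in sorted(distance.items()):
    print(pair, d)
```

The resulting pairwise distances are the input a hierarchical clustering routine with Ward linkage would need; complex nodes can be treated the same way by using the sets of cancers they are linked to.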

Figure 5

Bipartite network of cancers and protein complexes of Pattern 1.

Triangles (circles) represent cancers (complexes), respectively. The numbers (and corresponding colors) on vertices show the clustering structure defined with the Jaccard similarity index (see the text).

We classify the 39 cancers under study into six categories according to Medical Subject Headings (MeSH43) annotation of their originated tissue categories: nerve tissue neoplasm, connective and soft tissue neoplasm, head and neck neoplasm, urogenital tissue neoplasm, digestive system neoplasm and respiratory tract neoplasm. Biologically, cancers originated from same tissue should be correlated to some extent. In Table 4, we list the cluster indexes of the cancers in Figure 5 and their originated tissues. It can be seen that cancers originated from the same tissue category are clustered together. Figure 5 shows that cancers in the clusters 4, 5, 6 tend to link with complexes in clusters 10, 20 and 18 respectively, suggesting the association of these complex groups with nerve tissue cancers (cluster 4) and connective tissue cancers (cluster 5 and 6) respectively. To verify the correlation of the complexes in cluster 10 with nerve tissue cancers (cluster 4), we searched GoPubMed44 with complex names or gene names of the complex component proteins (in January of 2012) and listed the results in Table 5. A total of 13 of the 17 complexes show rank 1 association with cancers compared with all diseases, implying the important functions of these complexes in the occurrence or development of cancers. The associations of most complexes with nerve system diseases and nerve system cancers rank on the top of “All of Diseases” (more than 20 disease items) and “Neoplasms by Site” (more than 10 cancer tissue items) lists, respectively, demonstrating a high degree of correlation of complexes in cluster 10 with nerve system cancers. Moreover, proteins in some complexes such as cell cycle kinase complex CDK5, SMARCA2/BRM-BAF57-MECP2 complex and SMARCA2/BRM-BAF57-MECP2 complex have been extensively reported to be associated with eye cancer retinoblastoma, specifically implying the functions of these complexes in nerve systems cancers. 
In addition, 5 complexes in Table 5, CDC2-CCNA2 complex, Cell cycle kinase complex CDK5, Anti-HDAC2 complex, Emerin complex 52 and WINAC complex, are also identified over-expressed in brain tumor by GSEA (see Table 3), which cross-validates the correlation of these complexes with nerve system cancer. Similarly, the associations of complexes in cluster 20 with connective tissue cancers were shown in Table S9.

Table 4 Cancers classified by categories of their originated tissues and topology of the cancer-complex association network in Fig. 5
Table 5 GoPubMed search results for the associations of complexes in cluster 10 with nerve tissue cancer. (Complexes with higher specificity are shown in the boldface.) Disease hits: number of PubMed papers indicating the association of searched item with diseases; neoplasms hits/rank: number of PubMed papers indicating the association of searched item with cancers and the rank of paper numbers in “All of Diseases” item of GoPubMed results. Association with nerve tissue cancer: number of PubMed papers indicating the association of searched item with nerve system diseases (box in the first row) and nerve system cancers (box in the second row) and the rank of paper numbers in “All of Diseases” and “Neoplasms by Site,” respectively

Cancer-cancer correlations deduced from gene expression and complex abundance profiles

From our results, we see that many complexes predicted by our algorithm are important biological modules involved in the occurrence and development of solid cancers, and that these modules suggest correlations between cancers to some extent. To verify whether the predicted complexes reflect the relationships between different cancers as the original gene expression data do, we hierarchically clustered the gene expression profiles and the complex abundance profiles of all cancers and normal tissues under study, respectively. Similarity between groups is defined as the mean Pearson correlation coefficient between the sample profiles (hierarchical clustering trees in Figs. S1 and S2). Three large tissue categories that include more cancers (soft tissue, nerve tissue and urogenital tissue) are clustered together in both cases; i.e., both clustering results show the correlations of cancers and normal tissues of similar tissue categories.

Similarly, we hierarchically clustered the cancers according to the relative gene expression levels and the relative complex abundances of the cancers against their originated normal tissues, both measured by log-ratio values (Figs. 6 and 7, respectively). Figure 7 shows the heatmap of the hierarchical clustering of the 39 cancers compared to each other, according to the relative complex abundance of each cancer against its originated normal tissue. Similar to the heatmap in Fig. 6, the clusters of cancers in Fig. 7 are mostly consistent with their tissue categories. We partitioned the cancers into 4 clusters according to the hierarchical trees in Figs. 6 and 7, respectively. Then we applied the overlap score to quantify the similarity between the two partitions of cancers generated from the gene expression and protein complex profiles45,46 and obtained an overlap score of 0.72. We then generated 200 pairs of random clusters of the cancers, in which the cluster sizes are the same as in the real data. The average overlap score of the random ensemble was calculated as 0.24, while the z-score46 for the overlap score of the two real partitions was 8.15, suggesting a fairly high and statistically significant extent of overlap between the two partitions of cancers. These results suggest that our predictions of complexes extract cancer modules from the expression data while not changing the inherent correlations of the data. Therefore, we can see that they reflect the intrinsic relationships among different cancers.
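The random-ensemble comparison can be sketched as below. Several points are assumptions rather than the procedure of Refs. 45 and 46: random partitions are produced by shuffling cluster labels (which keeps every cluster size fixed), the z-score is the usual (observed − ensemble mean) / ensemble standard deviation with the population standard deviation, and `pair_agreement` is a simple stand-in similarity (the Rand index), not the overlap score defined in the Methods. All function names and the toy partitions are hypothetical.

```python
import random
import statistics
from itertools import combinations

def shuffled_partition(labels, rng):
    """Randomly reassign cluster labels while keeping every cluster size fixed."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled

def z_score(observed, random_scores):
    """Standard z-score of an observed value against a random ensemble."""
    return (observed - statistics.mean(random_scores)) / statistics.pstdev(random_scores)

def pair_agreement(a, b):
    """Stand-in similarity (Rand index): fraction of item pairs treated alike by a and b."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def ensemble_z(labels_a, labels_b, score, n_pairs=200, seed=0):
    """z-score of score(labels_a, labels_b) against n_pairs size-preserving random pairs."""
    rng = random.Random(seed)
    random_scores = [
        score(shuffled_partition(labels_a, rng), shuffled_partition(labels_b, rng))
        for _ in range(n_pairs)
    ]
    return z_score(score(labels_a, labels_b), random_scores)

# Hypothetical partitions of 12 items into 4 clusters.
part_a = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
part_b = [0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 0]
print(ensemble_z(part_a, part_b, pair_agreement))
```

Under this definition of the z-score, the reported values (0.72 observed, 0.24 ensemble mean, z = 8.15) would correspond to an ensemble standard deviation of roughly 0.48 / 8.15 ≈ 0.06.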

Figure 6

Heat map and hierarchical clustering of 39 cancers.

Similarity between cancers is defined as the Pearson correlation coefficient between the log-ratio expression profiles of genes that cancers contain.

Figure 7

Heat map and hierarchical clustering of 39 cancers.

Similarity between cancers is defined as the Pearson correlation coefficient between the log-ratio abundance profiles of complexes that cancers contain.

Discussion

Studies of differential gene expression levels have, owing to their condition-dependent dynamic nature, added significant value to genome-wide analyses that had focused on genome sequencing. In other words, they indicate how biological functions are phenomenologically realized for given “blueprints” of genome sequences and different environments. Our method can successfully identify cancer-associated complexes. We believe that, starting from the assumption that protein complexes are real biological functional units, it brings us one step closer to biological reality.

Our optimization procedure is based on linear programming (polynomial in computational time), implying that our method is feasible for future, larger studies. The method, as we apply it in this paper, rests on the assumption that expression levels are strongly correlated with protein abundance. Although signals from the Affymetrix arrays used in our data sets can differ from the absolute protein abundance, we believe that, given the dataset's broad coverage of both cancers and various tissues, this study provides a novel approach that can be adopted by other researchers who possess better datasets now or in the future. Moreover, the advantages of protein-complex-based approaches, beyond the identification of cancer-specific complexes, could be investigated further in the future.

Methods

Gene expression dataset

For gene expression data, we use the recently released E-MTAB-62 dataset in the ArrayExpress repository32. It is an integration of 206 different experiments and 5372 samples generated in 163 different laboratories, including 369 different cell and normal tissue types, diseases and cell lines. The most important aspect of this dataset is that all the data are from the same platform, have passed data quality checks and have been normalized, so that we can compare the expression levels across different cancers/tissues. CEL files of samples that did not pass quality checks were removed. The remaining 5372 CEL files were normalized by the Robust Multi-array Average (RMA) method, i.e., the raw intensity values were background corrected, log2 transformed and then quantile normalized47. In this work, we studied 39 solid tissue cancers (708 samples) and 25 normal solid tissue types (440 samples), of which 18 normal tissue types were the tissues in which these cancers originated and were thus used as control sets (see Table 1).
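To make the normalization steps concrete, the following sketch applies a log2 transform followed by quantile normalization to two toy samples. It is a simplified illustration and not the RMA implementation used for E-MTAB-62: background correction is omitted, ties are broken by position, and the function name and intensity values are hypothetical.

```python
import math

def quantile_normalize(columns):
    """Quantile-normalize samples given as equally long lists (one list per array).

    Each value is replaced by the mean, across samples, of the values that share
    its rank, so all samples end up with the same distribution.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(col[k] for col in sorted_cols) / len(columns) for k in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda k: col[k])  # probe indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized

# Hypothetical raw intensities for three probes on two arrays.
raw = [[120.0, 30.0, 500.0], [200.0, 90.0, 60.0]]
log2_values = [[math.log2(v) for v in col] for col in raw]
normalized = quantile_normalize(log2_values)
print(normalized)
```

After normalization both samples contain the same set of values, while the rank order of the probes within each sample is preserved.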

Protein complex dataset

For the list of human protein complexes, we use the Comprehensive Resource of Mammalian protein complexes (CORUM) database33, whose core data set lists 1343 complexes and 2315 component proteins in total (expression profiles of 2064 of these 2315 proteins are available in the E-MTAB-62 data). Among the core data, the 1338 complexes for which at least one component protein has an expression profile in the E-MTAB-62 data are used in our analysis.

Estimation of abundance levels of complexes based on optimization

The detailed background and procedure of our optimization algorithm are described in Ref. 15. Assume that the copy number of protein i (i = 1, …, N, where N is the number of proteins) and the number of copies of complex j (j = 1, …, M, where M is the number of complexes) are given by Pi and cj, respectively. Also, denote the number of copies of protein i in complex j by Sij, where Sij = 0 if complex j does not include protein i as a component. In the ideal situation where all the proteins in a cell are present in exactly the amount needed to form complexes, the variable sets {Pi} and {cj} satisfy

$$ P_i = \sum_{j=1}^{M} S_{ij} c_j \qquad (i = 1, \dots, N). \qquad (1) $$

The question is how to determine {cj} (variables) from known values of {Pi} and {Sij} (constants). However, since the number of proteins N is usually larger than the number of complexes M, the set of linear equations above is over-determined, so in general it is not possible to satisfy all the equations in Eq. (1). In reality, therefore, the number of proteins in a cell should be greater than or equal to that necessary to form complexes, i.e., Pi ≥ Σj Sij cj, which is the basic constraint of our optimization scheme. Instead of finding an exact solution satisfying Eq. (1), we try to minimize the deviation from the ideal situation in Eq. (1), given by the objective function

$$ D_A = \sum_{i} \frac{P_i - \sum_{j=1}^{M} S_{ij} c_j}{P_i}, \qquad (2) $$

where the summation is only for indices i where Pi > 0. Now, for the given values of Pi and Sij, our basic strategy is to determine the cj values that minimize DA in Eq. (2), and this problem is solved numerically by the linear programming (LP) technique. Moreover, after the determination of the cj values, if some values of Pi are unknown, we can assign those values of Pi using Eq. (1) for the ideal situation. This optimization is based on the assumption that organisms have evolved in a way that increases efficiency by reducing wasted resources.
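To see why the problem is a linear program, it can help to write it in standard form. The derivation below is a sketch under assumptions that are not stated explicitly in the text: the objective is taken to be the sum of relative deviations (Pi − Σj Sij cj)/Pi over proteins with Pi > 0, the cj are taken to be non-negative, and the symbols N_+ (number of proteins with Pi > 0) and w_j are introduced here only for illustration.

```latex
\begin{aligned}
D_A &= \sum_{i:\,P_i>0} \frac{P_i - \sum_{j=1}^{M} S_{ij} c_j}{P_i}
     = N_{+} - \sum_{j=1}^{M} w_j c_j,
\qquad w_j = \sum_{i:\,P_i>0} \frac{S_{ij}}{P_i}, \\[4pt]
\min_{c} D_A \;&\Longleftrightarrow\; \max_{c} \sum_{j=1}^{M} w_j c_j
\quad \text{subject to} \quad
\sum_{j=1}^{M} S_{ij} c_j \le P_i \;\; (i = 1, \dots, N), \qquad c_j \ge 0 .
\end{aligned}
```

As a toy example under the same assumptions, take two proteins with P_1 = 10 and P_2 = 4 and a single complex containing one copy of each (S_11 = S_21 = 1). The constraints give c_1 ≤ 10 and c_1 ≤ 4, so the optimum is c_1 = 4 and D_A = (10 − 4)/10 + (4 − 4)/4 = 0.6: the scarcer protein limits the abundance of the complex.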

In this work, the average expression level of the gene encoding protein i is used as the Pi-value33 and the composition matrix Sij is approximated by a binary value (Sij = 1 if protein i is included in complex j, 0 otherwise). Ideally, protein complex levels would be estimated from protein abundance, as the mRNA expression level cannot completely represent the true protein abundance. However, although several large proteomics data sets are available48,49, currently there are no equally rich genome-wide protein abundance data sets for tumor versus normal tissue samples. Several studies have found mRNA and protein expression levels to be well correlated50,51. It is reported that approximately 40% of the variation in mammalian protein abundance is explained by mRNA levels51. It is known that signals from the Affymetrix arrays used in our data sets can differ from the absolute protein abundance. However, our method does not, strictly speaking, need absolute abundances; it is sufficient that the relative abundances are accurately measured, since all the objective functions and constraints in our linear programming (LP) optimization are strictly linear by definition. Therefore, the direct usage of gene expression levels as protein abundance is not free from errors, but it could yield reasonable results.

Identification of differentially expressed complexes in cancers and normal tissues

For each cancer or tissue case, individual genes' expression profiles are averaged over the different samples in the E-MTAB-62 dataset and the resulting set is used as the input data {Pi}. Our optimization procedure minimizing Eq. (2) yields the set {cj}, i.e., the complexes' abundance levels for the cancer or tissue. Then the abundance levels of all complexes in each cancer are compared with the abundance levels in the corresponding normal tissue in which the cancer originated, while the abundance levels of all complexes in each normal tissue are compared with the average abundance levels in all the other normal tissues. Over-expression (under-expression) of a complex is defined as at least a 2-fold (at most a 1/2-fold) change of abundance level.
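The fold-change rule can be illustrated with a short sketch. It assumes that abundance levels are compared as ratios on a linear scale and that a non-positive control level yields no call; both choices, as well as the function names and the toy values, are assumptions made for illustration.

```python
def call_complex(abundance, control, fold=2.0):
    """Return 'over', 'under' or None for one complex given its control abundance."""
    if control <= 0:
        return None  # ratio undefined; treating this as 'no call' is an assumption
    ratio = abundance / control
    if ratio >= fold:
        return "over"
    if ratio <= 1.0 / fold:
        return "under"
    return None

def tissue_calls(abundance_by_tissue, tissue):
    """Call every complex in a normal tissue against the average of all other tissues."""
    others = [t for t in abundance_by_tissue if t != tissue]
    calls = {}
    for cplx, level in abundance_by_tissue[tissue].items():
        control = sum(abundance_by_tissue[t][cplx] for t in others) / len(others)
        calls[cplx] = call_complex(level, control)
    return calls

# Hypothetical complex abundance levels in three normal tissues.
abundance = {
    "liver": {"A": 8.0, "B": 1.0},
    "brain": {"A": 2.0, "B": 3.0},
    "lung": {"A": 2.0, "B": 5.0},
}
print(tissue_calls(abundance, "liver"))
```

For a cancer, the same `call_complex` helper would be applied with the abundance in the originated normal tissue as the control.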

Overlap score

We use the overlap score to measure the extent of overlap between the cancer clusters generated from the gene expression and protein complex profiles45,46. Consider two different categories A and B (for example, two partitions of cancers obtained by different clustering methods) and assume each cancer is associated with one subset (cluster) in each of the partitions A and B. Let ϕA(i) and ϕB(j) denote the fractions of cancers in cluster i of A and cluster j of B (i = 1,2,…, m; j = 1,2,…, n), respectively. Let ϕAB(i,j) denote the joint frequency of i and j, i.e., the fraction of cancers that fall in both cluster i of A and cluster j of B. In a random distribution of clusters, the expectation value of ϕAB(i,j) is ϕA(i)ϕB(j). If the clusters of the different partitions overlap, some ϕAB(i,j), the ones that overlap, will be larger than ϕA(i)ϕB(j), while for the others, ϕAB(i,j) will be smaller than ϕA(i)ϕB(j). Thus, the overlap of clusters in partitions A and B can be quantitatively measured by:

Since the value of μ is affected by finite sizes, it is hard to judge if a μ-value indicates a good or bad overlap. Therefore, we normalize the μ-value against those of the perfect overlaps and define overlap score of partitions A and B as follows:

The value of v is between 0 and 1 and it is 1 for perfect matches.