Article | Open

Cross-tissue Analysis of Gene and Protein Expression in Normal and Cancer Tissues

  • Scientific Reports 6, Article number: 24799 (2016)
  • doi:10.1038/srep24799
  • Download Citation
Published online:


The central dogma of molecular biology describes the translation of genetic information from mRNA to protein, but does not specify the quantitation or timing of this process across the genome. We have analyzed protein and gene expression in a diverse set of human tissues. To study concordance and discordance of gene and protein expression, we integrated mass spectrometry data from the Human Proteome Map project and RNA-Seq measurements from the Genotype-Tissue Expression project. We analyzed 16,561 genes and the corresponding proteins in 14 tissue types across nearly 200 samples. A comprehensive tissue- and gene-specific analysis revealed that across the 14 tissues, correlation between mRNA and protein expression was positive and ranged from 0.36 to 0.5. We also identified 1,012 genes whose RNA and protein expression was correlated across all the tissues and examined genes and proteins that were concordantly and discordantly expressed for each tissue of interest. We extended our analysis to look for genes and proteins that were differentially correlated in cancer compared to normal tissues, showing higher levels of correlation in normal tissues. Finally, we explored the implications of these findings in the context of biomarker and drug target discovery.


In recent years, techniques used to conduct tissue-wide analysis of gene expression, such as microarrays and RNA sequencing technologies (RNA-Seq), have become widely used1,2. An example of an effort in this area is the Genotype-Tissue Expression (GTEx) project, which contains RNA-Seq measurements from 43 different tissues in hundreds of samples3. As central functional units in many complex biological pathways, proteins are also a subject of much interest in various areas of translational medicine, including diagnostic biomarker discovery, drug discovery and personalized medicine. Measuring global protein levels directly in human tissue samples, however, has traditionally presented many challenges, as well as major difficulties including reproducibility4. Recent advances in protein analysis technology provide methods for exploring the relationship between mRNA expression and protein abundance. For example, mass spectrometry (MS)5 has been used to map the proteomes of yeasts, worms, and flies6. MS-based comparative analysis of several human cell lines has also been conducted7. Other technologies, such as immunohistochemistry, produce images that aid in measuring levels of protein expression8. However, these methods do not scale to capture genome-wide protein measurements. Thus, despite advances in high-throughput methods for proteomics, the relationship between gene expression and protein abundance is still unclear.

Comparative studies have found that correlations between mRNA and protein levels in model organisms can be relatively weak and uncertain or moderately positive9, and that they vary between experiments and organisms. For example, Gygi et al. observed a moderately positive Pearson correlation of R = 0.48 when studying a subset of proteins in S. cerevisiae. Correlations vary greatly among genes, depending on regulatory processes that govern the rates of translation and protein degradation10. Schwanhausser et al. obtained similar results in mouse fibroblasts, where expression of a subset of 5000 genes was moderately correlated with protein levels (R = 0.44)11. In humans, Gry et al. described a similar lower relationship in 23 cell lines, where R values ranged from 0.25 to 0.52, depending on the methodology that was applied12. Studies of gene and protein expression correlation in cancer tissues are less common, and findings are often contradictory. For example, in bone osteosarcoma, squamous cell carcinoma and brain glioblastoma, concordance has been shown to be high (R values ranging from 0.58 and 0.63)7, while in lung adenocarcinomas, R values ranged from −0.467 to 0.44213. In the aforementioned projects, sample sizes were small and analysis focused on a subset of genes.

To date, a comprehensive study of the correlation between gene expression and protein abundance in the human body has yet to be performed. With the release of two independent MS-based draft maps of the human proteome in May 2014, researchers now have unprecedented access to a database of system-wide tissue-specific protein levels14,15. One of these efforts, the Human Proteome Map Project (HPM), cataloged proteins corresponding to 84% of the known protein-encoding genes in the human genome across 30 different human tissues14. Combined with the existing transcriptomic data libraries from GTEx, these seminal advances have created an opportunity to explore the central dogma of biology and document correlations between gene expression and protein levels.

Here, we present a large-scale analysis of protein abundance and gene expression across a diverse set of human tissues. We examined ~80% of human genes for tissue- and gene-specific correlations. Furthermore, we extend our initial analysis to determine which genes and tissues are correlated in their protein abundance and gene expression in cancer tissues to inform the identification of new drug targets and biomarkers.


We analyzed the GTEx and HPM datasets and found that 16,561 genes and their corresponding proteins were represented in both repositories across 14 tissues. More than 39,000 transcripts, consisting mainly of non-coding RNAs, were present only in the GTEx dataset, while 688 proteins were present only in the HPM data (Supplementary Fig. S1). These data were excluded from our analysis. The number of genes expressed in each tissue in this combined dataset varied from 9,937 (heart) to 13,054 (testis). The number of proteins present per tissue varied from 5,022 (esophagus) to 11,030 (testis; Supplementary Table S1).

Correlating Gene and Protein Expression Across Tissues

For each tissue, GTEx transcriptomic samples were paired with a corresponding proteomic measurement from the HPM dataset. A Spearman correlation was calculated for each pair. Fig. 1A illustrates this analysis (comparison 1), while Fig. 2 shows the distributions of correlations per tissue. Correlations ranged from 0.36 to 0.50, with a median of 0.45. Esophageal tissue had the lowest score distribution and pancreatic tissue had the highest (supplementary Fig. S2). These results agree with previous attempts to correlate mRNA expression with protein abundance10,11,12.

Figure 1: Analysis overview.
Figure 1

(A) Analysis of normal tissues based on GTEx and HPM expression data. (1) Tissue-specific correlations (2) Gene-specific correlations (3) Concordance/discordance analysis (4) Functional analysis including drug targets and biomarkers. (B) Cancer and normal tissues analysis based on TCGA data. (1) Tissue-specific correlation (2) Differential ranking analysis (3) Therapeutic drug target analysis.

Figure 2: Tissue-specific correlations of protein-gene expression.
Figure 2

Tissues are ranked by their average Spearman correlation value.

We compared tissue similarity by performing principal component analysis (PCA) and by comparing the topology of hierarchical clusters based on either GTEx or HPM data (Fig. 3). In both PCA plots, there were similarities between different tissue types, with prostate, colon, and urinary bladder samples clustering together in both gene and protein expression analysis. Testis, frontal cortex, and spinal cord samples were outliers. PCA revealed a high similarity between GTEx samples from the same tissues (also shown previously16), which led us to use medians for all relevant samples per tissue in the next step of the analysis.

Figure 3: Tissue-specific correlations of protein-gene expression.
Figure 3

(A) PCA based on gene expression. (B) PCA based on protein expression.

We found a matching clustering pattern through hierarchical clustering approaches (Supplementary Fig. S4A,B) based on gene and protein expression. The gene expression dendrogram in Fig. S4A shows three main clusters with p-values  < 0.05 (marked in red). They include the following-cluster one: ovary, colon, urinary bladder, prostate; cluster two: lung, kidney; cluster three: spinal cord, testis. The HPM dendrogram in Supplementary Fig. S4B shows four main groups with significant correlations for all tree branches (p-value < 0.05, significant clusters marked in red): cluster one: heart, esophagus; cluster two: frontal cortex, spinal cord; cluster three: prostate, colon, urinary bladder; and cluster four: liver, adrenal gland, kidney, ovary, testis, lung, pancreas. To measure similarity between the two trees, we used cophenetic correlation based on Spearman correlations. The analysis returned a weak value (c = 0.25), illustrating the different gene and protein expression landscapes. The only significant similarity between the dendrograms was the clustering of the prostate gland, colon and urinary bladder tissues.

We used sample similarities to examine the relationships of gene and protein expression between different tissues. We first correlated all tissues based on gene expression (Fig. 4A) and protein expression (Fig. 4B) using the Spearman correlation metric (see Supplementary Fig. S2 for data distributions). The tissues with the highest correlation based on gene expression were observed on the diagonal of the matrix, which was expected. For many tissues in the GTEx dataset, gene expression correlation values were very similar among tissues forming a large cluster (rho values: 0.81–1). Expression in the frontal cortex was very similar to that of the spinal cord (rho values: 0.85–1; Fig. 4A). Gene expression correlation in testicular samples was low with all tissues except itself.

Figure 4
Figure 4

(A) Tissue-specific correlations of gene expression (GTEx). (B) Tissue-specific correlations of protein expression (HPM). (C) Tissue-specific correlations of protein-gene expression.

Highly correlated protein expression among tissues was once again on the diagonal of the matrix. We saw high concordance in the following tissues: colon, urinary bladder, and prostate (rho values: 0.83–1), spinal cord and frontal cortex (rho values: 0.765–1), and ovary and testis (rho values: 0.87–1; Fig. 4B).

Finally, we correlated the tissues based on their protein abundance and gene expression across all tissues analyzed (Fig. 4C). Overall correlations were lower than gene-gene correlations (highest values), and protein-protein comparisons. In many tissues, we found that a gene-protein pairing in a single tissue had one of the highest correlation values; however, we found many other tissue pairings in which one tissue’s gene expression profile was correlated more strongly with protein expression from another tissue, suggesting similarities between tissues (Fig. 4C for median correlation values and Supplementary Fig. S3 for sample pairwise correlation values). An example is the frontal cortex and spinal cord. Both are part of the central nervous system, and both share similar gene and protein expression profiles. In other cases, such as testis, connections between gene and protein expression were asymmetric. A gene-protein correlation heatmap (Fig. 4C) shows that while GTEx testis gene expression had similarities to many tissues in HPM, the HPM protein expression of the testis only seemed to correlate with the GTEx testis and ovary samples.

Identifying significantly correlated genes and proteins across all tissues

In order to identify genes that were highly correlated with protein expression across all tissues, we computed Spearman correlations for each gene, correlating mRNA and protein expression across all 14 tissues. Figure 1A has an illustration of the analysis (comparison 2). Overall, statistically significant correlation between mRNA expression and protein abundance was observed in only 1,012 genes out of 16,561 (6.1%). Rho values ranged from 0.77 to 1 (p-value <0.05; Supplementary Fig. S2). Of these genes, 262 (~25%) were expressed in only one tissue. They were predominantly in the testis (~47%) and the frontal cortex (~21%), both of which are known for unique expression patterns17,18. The remaining 750 genes showed a significant correlation between mRNA expression and protein abundance across at least two tissues (Supplementary Table S2). Furthermore, there was correlation across all tissues in 169 out of 1,012 genes (17%).

In this group of genes and proteins with highly correlated expression, we found representatives of different cellular classes, such as the splicing factor SRSF6, the demethylase KDM5A, the ribosome-binding protein of the endoplasmic reticulum RRBP1, and the immune system regulator HLA-A. The genes in these examples share basic essential cellular functions. GO annotation analysis revealed a high enrichment of genes involved in oxidation reduction (Benjamini adjusted p-value: 9.3E-09), and various transport activities, including ion transport (Benjamini adjusted p-value: 2.6E-05) and transmembrane transporter activity (Benjamini adjusted p-value: 1.3E-04), both performing basic functions of the cell.

This highly correlated group of genes and proteins is of special interest because it contains genes and proteins that are known biomarkers and drug targets. We carried out functional analysis of highly correlated genes and the full gene set by using Ingenuity Pathway Analysis (IPA). We identified 101 genes known to be biomarkers in our 1,012 highly correlated genes set (Table 1), comprising 10% of the highly correlated genes. When the same analysis was performed on the full gene set, only 6.1% of genes were identified as biomarkers. This comparison shows enrichment for biomarker genes in the highly correlated gene set (p-value: 8.1E-06, chi-square test). One example is CKB, Creatine Kinase chain B (rho: 0.84, p-value: 2.8E-04). CKB is a biomarker used in drug safety experiments and a suggested biomarker for cardiovascular diseases19 and squamous cell lung cancer20. It is quantified in serum for accurate measurement by different proteomic methods.

Table 1: Highly correlated biomarkers, with correlation score and corrected p-value, and number of tissues used for correlation calculation.

We also studied known drug targets in the full and highly correlated gene sets. The highly correlated set was enriched for drug targets (142 genes out of 1,012, p-value: 0.02, chi-square test) compared to the full set (1,551 genes out of 16,551). The 142 highly correlated genes are targets for 449 different drugs according to the DrugBank21 (Table 2). Drug targets include genes such as Protein Kinase C, Alpha (PRKCA, rho: 0.77, p-value: 0.04) which is targeted by tamoxifen, and Gamma-Aminobutyric Acid A Receptor, Alpha 3 (GABRA3, rho; 1, p-value: < 2.2E-16) which is targeted by diazepam.

Table 2: Complete list of highly correlated drug targets across all tissues.

Tissue-specific concordance-discordance analysis

In the full dataset, we identified a set of 983 genes, which were not expressed in any of the tissues, but for which protein expression was observed. These genes were highly enriched in sensory perception annotations (see Supplementary Table S3 for full annotations list with p-values). We also identified 1,200 proteins which were not expressed in our dataset, but corresponding mRNA expression was observed. These genes were also highly enriched in the regulation of transcription (see Supplementary Table S4 for full annotations list with p-values).

We further characterized several groups of interest with respect to gene and protein expression by tissue. We analyzed four “corner case” groups: high gene expression-high protein expression, high gene expression-low protein expression, low gene expression-high protein expression, and low gene expression-low protein expression (see Materials and Methods and Supplementary Figs S2 and S5). We examined genes whose over- or under-expression was in the top 10% in a given tissue, and found that, on average, 240 genes fell into the high gene expression-high protein expression group, 20 fell into the low gene expression-high protein expression group, 25 fell into the high gene expression-low protein expression group and 138 fell into the low gene expression-low protein expression group (Supplementary Figs S2 and S5). The functions of genes in these groups are well-characterized11. For example, genes whose expression was low tend to have longer 3′ UTRs with AU rich elements and specific TF binding sites.

We studied known drug targets in the context of these four gene sets (Supplementary Fig. S6, Supplementary Tables S5–S8). Most targets had high and concordant gene-protein expression (70–91% across tissues), showing that expression of target genes as measured with microarrays corresponded well to protein levels in the treated tissue. This finding was also true for genes with low and concordant gene-protein expression, the second biggest group (8–19% across tissues). Targets with high gene expression-low protein expression and low gene expression-high protein expression were poorly represented (0–4% and 0–11% across tissues, respectively). These two discordant groups should get close attention, as therapeutic decisions based on gene expression can lead to critical mistakes.

Identifying differentially correlated genes based on protein and gene expression in cancer

We studied differential gene-protein correlation in cancers, using mRNA and protein data from the Cancer Proteome Atlas (TCPA). TCPA is a cancer functional proteomics database and is a part of the Cancer Genome Atlas (TCGA) project22. Like our analysis for normal tissues, we compared the Spearman correlation between mRNA and protein expression patterns for 153 genes in 10 types of cancer from TCPA with 8 corresponding normal tissues (adrenal gland, urinary bladder, colon, frontal cortex, lung, ovary, pancreas, and testis) from the GTEx and HPM datasets (Fig. 1B, marked as comparison number 1). Figure 5 shows our results. Interestingly, Spearman RNA-protein correlations for this set of genes were lower in almost all cancer types compared to corresponding normal tissues. The correlation was similar in only one cancer (lung adenocarcinoma, LUAD).

Figure 5: Tissue-specific correlations of protein-gene expression for 10 tumor types and corresponding normal tissues: adenoid cystic carcinoma (ACC—adrenal gland: pheochromocytoma and paraganglioma (PCPG)—adrenal gland; bladder urothelial carcinoma (BLCA)—urinary bladder; colon adenocarcinoma (COAD)—colon; lower grade glioma (LGG)—frontal cortex; lung adenocarcinoma (LUAD)—lung; lung squamous cell carcinoma (LUSC)—lung; ovarian serous cystadenocarcinoma (OV)—ovary; pancreatic adenocarcinoma (PAAD)—pancreas.
Figure 5

To further explore differences between cancer and normal tissues, we investigated changes in the relationship between gene and protein expression by gene, in tissue pairs. For each cancer-normal tissue pair, we compared the ranking of genes based on the correlation of gene and protein expression. Table 3 summarizes genes that showed differential behavior between cancer and normal tissues (See Supplementary Tables S9 and S10 for full data on genes that were correlated in cancer). One highly correlated gene in cancer vs. normal was Y box binding 1 (YBX1), which was differentially correlated in ACC, COAD, LUAD and LUSC. YBX1 is a transcription factor and a splicing factor. It controls many genes involved with cancer, such as p5323. It is known for overexpression in cancer, and is involved in malignant progression of colorectal adenocarcinoma24,25 and lung cancer26, among other tumors. It is also associated with poor survival. A recent paper suggests that tRNA-derived fragments can suppress breast cancer progression via binding to YBX1, and by that mechanism, prevent it from binding pro-oncogenic transcripts27. RAC-alpha serine/threonine-protein kinase (AKT1) is another gene that was highly correlated in cancer (LGG, PAAD and TGCT). AKT1 is a target in cancer therapy28 via several drugs29. There was a strong correlation between its mRNA and protein levels in some of the cancer types in this study, but not in normal tissues. Because it has three known isoforms30, we hypothesize that targeting AKT1 at the mRNA level and identifying a cancer-specific isoform may lead to better chemotherapies.

Table 3: Genes with higher or lower gene and protein correlation in cancer vs. normal tissues.

Finally, we compared the genes with differential gene-protein expression relationships to drug targets in the DrugBank database21. Table 4 lists all drug targets that were differentially correlated between cancer and normal tissues. An example of a gene with highly correlated protein expression in cancer is receptor tyrosine-protein kinase erbB-2 (ERBB2), which is targeted in LUAD and LUSC that by ado-trastuzumab emtansine, afatinib, and trastuzumab. Another example is Proto-oncogene Tyrosine-protein Kinase (SRC), which is targeted by bosutinib, dasatinib, and ponatinib in COAD. Gene expression measurements in both of these cases and others can be used (or are already in use) as biomarkers and drug targets. Other genes on our list showing differential protein-gene expression correlation in cancer have the potential to be used as new biomarkers and drug targets.

Table 4: Drug targets with higher or lower gene and protein correlation in cancer vs. normal tissues, and related drugs.


The publication of datasets such as the Human Proteome Map (HPM)14 enables researchers to ask new questions regarding proteins in different tissues. Although it is clear that the number of detected proteins does not resemble the full scope of the protein landscape, due to alternative splicing and other similar processes28, the HPM dataset provides an informative glimpse into the human proteome. Our first goal in the current analysis was to explore connections between mRNA levels and protein abundances, in a large-scale study, using publically available data in normal human tissues. We achieved this goal by combining proteomic data from the HPM project and mRNA expression data from the GTEx project.

Our analysis revealed a positive gene-protein expression correlation (0.36–0.5) for the majority of human tissues. For a subset of genes, there was a statistically significant relationship between mRNA expression and protein abundance in all measured tissues. Furthermore, for each tissue we identified a set of genes and proteins that were concordantly expressed at high and low levels, and a much smaller set of genes and proteins that were discordantly expressed. Surprisingly, for genes that were highly correlated across all tissues, or highly concordantly expressed in individual tissues, we could not find an overall strong functional enrichment within the group. This result suggested that a diverse set of genes is under similar regulatory pressure that shapes their correlation. Furthermore, we identified two interesting groups, those with gene expression but no protein expression, and vice versa. Non-detectable expression may be due to technical limitations, a short half-life of the mRNA or protein, or, in the case of RNA expression only, non-coding RNAs. Surprisingly, non-detectable levels of gene expression had no effect on the levels of the detected protein expression, suggesting fast translation in the case of a short half-life or efficient translation from a small amount of mRNA in the case of a low level of mRNA.

We performed functional analysis of gene sets that are known biomarkers and drug targets. A bottleneck in the usage of biomarkers is the long and expensive process required for proteomic measuring of each sample. Measuring mRNA expression levels is cheaper, but is insufficient to determine protein levels because correlations between mRNA expression and protein abundance are relatively low. We found that gene-protein expression of biomarkers and drug targets was more correlated than expected across all tissues and in tissue-specific analysis. These biomarkers are already in use for diagnosis, prognosis, to monitor disease progression, drug efficacy, drug safety and/or response to treatment. mRNA expression and protein abundance was highly correlated in these biomarkers, implying that mRNA expression levels may, in some cases, be sufficient to determine protein abundance. This approach would allow for easier estimates of protein levels.

Although the highly correlated biomarkers and drug targets were the vast majority, we found that gene and protein expression was uncorrelated for some biomarkers and drug targets. These genes are important for further study and for consideration in the context of drug discovery or therapeutic decisions. Gene expression measurements that do not correspond to protein measurements could lead to inaccurate therapeutic decisions. Clearly, many processes can affect drug efficacy, such as gene and protein expression for drug transport29.

In order to investigate the relationships of gene and protein expression in disease, we used TCPA, a cancer functional proteomics dataset that is a part of the TCGA project. By choosing genes based on their differential correlation of mRNA and protein expression, we were able to identify a set of such genes across all cancer types and within each cancer tissue that contains a number of known biomarkers and therapeutic targets. This list can be used in combination with the traditional differential expression analysis for new biomarker and drug target discovery. We hypothesize that testing a patient for highly correlated mRNA transcript before treating him with a drug against the corresponding protein could improve treatment outcomes by improving treatment target specificity.

Our study has several limitations that should be recognized. The first is the source of our data. Unfortunately, the GTEx and HPM data were not collected in a single experiment on the same samples. This fact may affect measurements of expression and correlation across the two experiments. To overcome this limitation, we used the Spearman correlation metric in our tissue-specific and gene-specific analysis. This approach is based on rank rather than on expression values. Another limitation is our ability to work with available protein data. Recently, Protein Atlas, another source of protein expression based on antibody staining to determine protein expression, was published8. Due to collection methodology, protein expression data in that resource is categorical and cannot be used in the same way as the continuous HPM data. For this reason, it was omitted from our analysis. Finally, our cancer analysis examined only 153 genes. These genes were carefully chosen for the study. Therefore, it is not surprising to find a connection between them and cancer. As more untargeted proteomics-based normal and tumor datasets are generated in the next several years, the approaches presented here can be applied more extensively.

In conclusion, we have performed a systematic analysis to examine the central dogma of biology and look for similarities and differences between protein and gene expression in a large number of healthy and cancer tissues. We found that different tissues are differently correlated in their protein and gene expression. We also found that gene and protein expression was highly correlated across tissues in a small set of genes. We further identified a set of genes that were concordantly abundant in each tissue based, on gene and protein levels. This analysis found that a statistically significant proportion of these genes are known biomarkers and therapeutic targets. We also examined genes with discordant gene and protein expressions levels, and hypothesize that those with opposite behavior should be under proper consideration if being detected only by gene expression levels. We finally extended our analysis to diseased tissues, focusing on cancer and identified lower gene protein concordance in cancer in comparison to normal tissues. We posit that concordance and discordance in gene and protein expression has important implications for therapeutics and diagnostics.

Materials and Methods

Dataset sources

The mRNA data in this analysis was extracted from the Genotype-Tissue Expression (GTEx) Project (, which is based on RNA-seq data expression levels in various human tissue samples3. The GTEx project contains data from a combination of sources and technologies, including gene expression microarrays and RNA sequencing. Of the 53 tissue types in the project, we chose 18 tissues corresponding to 14 HPM tissues. The final subset of the GTEx data included RNA sequencing of 921 tissue samples, each of which belonged to one of the 18 tissue subtypes, while esophagus, colon, and heart had more than one tissue. To remove bias, transcript reads on a gene-level basis were normalized for gene length, resulting in transcript data that was in the form of reads per kilobase per million (RPKM).

Protein data was extracted from the Human Proteome Map (HPM) Project ( Data in this project is based on mass spectroscopy (MS) protein levels in various tissue samples across the systems of the human body14. As part of the HPM project, 30 histologically normal tissue types were profiled in totality, 17 of which were adult human tissues. For each tissue type, samples from three individuals were pooled and analyzed using MS. The project detected proteins corresponding to 17,294 genes, or ~84% of the protein-encoding regions of the human genome. In this work, 14 of the 17 adult tissues corresponding to tissue gene expression data existed in the GTEx database are in use. A subset of 16,571 genes that were commonly measured in HPM and GTEx datasets were used in this work.

Both datasets were evaluated in terms of data quality, detection limit, and number of samples per category. Expression values <1 in both datasets were counted as zero values.

For the cancer analysis, RNA-seq and reverse-phase protein arrays (RPPA) data from the Cancer Proteome Atlas (TCPA) was extracted for ten cancer types: adenoid cystic carcinoma (ACC), pheochromocytoma and paraganglioma (PCPG), bladder urothelial carcinoma (BCLA), colon adenocarcinoma (COAD), lower grade glioma (LGG), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), ovarian serous cystadenocarcinoma (OV), pancreatic adenocarcinoma (PAAD), and testicular germ cell tumors (TGCT). These cancers correspond to eight normal tissues: adrenal gland (ACC and PCPG), urine bladder (BCLA), colon (COAD), frontal cortex (LGG), lung (LUAD and LUSC), ovary (OV), pancreas (PAAD), and testis (TGCT). A total of 153 genes were compared between TCPA and GTEx-HPM data.

Tissue correlation

The Spearman correlation was calculated between gene and protein measurements across all tissues. Correlation values are presented in a heatmap. Clustering of data was done by pvclust30, based on correlation as the distance method and Ward’s method as the hierarchical clustering method. Statistically significant clusters (p-value:  < 0.05) are marked in red. Dendrogram comparison using cophenetic correlation was performed using the R package dendextend

Identifying and analyzing highly correlated genes across and within different tissues

An adjusted p-value of the Spearman correlation was used on the full dataset to create a list of 1,012 highly correlated genes across all tissues. From this set, 750 genes with correlations of two or more tissues were chosen for further annotation.

Functional analysis for highly correlated genes across and within different tissues

Functional annotation was performed for the group of highly correlated genes across all tissues, and for 14 groups of tissue-specific highly correlated genes, using several tools. The GO annotation Molecular function was conducted using DAVID31. REVIGO32 was used for summarizing GO annotation terms. Biomarker gene annotation was performed using IPA ( A comparison with drug targets was performed using data from the DrugBank database21, while confining the search to drug targets.

Concordant and discordant expression analysis in GTEx and HPM dataset

We divided the data into four groups by using a top and bottom 10% criteria for each tissue. A high gene expression-high protein expression group included all data points whose gene and protein expression were in the top 10%. The high gene expression-low protein expression group contained all data points in the top 10% in the gene expression dataset and those in the bottom 10% in the protein expression dataset. The low gene expression-high protein expression group included all data points in the bottom 10% for gene expression and the top 10%, while the low gene expression-low protein expression group contains data points in the bottom 10% in both gene and proteins expression datasets. Drug target analysis was performed as explained for highly correlated genes across all tissues.

Characterization of correlation and differential ranking in the cancer dataset

Spearman correlation between protein and gene expression was calculated across all tissues and within each tissue. For ranking comparison, the GTEx-HPM dataset was reduced to the same 153 genes as the cancer dataset. We sorted both datasets based on Spearman’s rho and ranked the results. We then compared the ranking based on gene-protein expression in normal and cancer datasets. A gene was considered as differentially correlated if the difference between the ranking based on the cancer dataset and the ranking based on the normal expression set was at least 30.

Functional annotation for cancer and normal data

Drug target analysis was performed as explained for highly correlated genes across all tissues only for genes showing differential correlation in at least one cancer tissue.

Additional Information

How to cite this article: Kosti, I. et al. Cross-tissue Analysis of Gene and Protein Expression in Normal and Cancer Tissues. Sci. Rep. 6, 24799; doi: 10.1038/srep24799 (2016).


  1. 1.

    , & RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10, 57–63, doi: 10.1038/nrg2484 (2009).

  2. 2.

    , , , & Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5, 621–628, doi: 10.1038/nmeth.1226 (2008).

  3. 3.

    The Genotype-Tissue Expression (GTEx) project. Nat Genet 45, 580–585, doi: 10.1038/ng.2653 (2013).

  4. 4.

    et al. A HUPO test sample study reveals common problems in mass spectrometry-based proteomics. Nat Methods 6, 423–430, doi: 10.1038/nmeth.1333 (2009).

  5. 5.

    & Mass spectrometry-based proteomics. Nature 422, 198–207, doi: 10.1038/nature01511 (2003).

  6. 6.

    , , & Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biol 4, 117, doi: 10.1186/gb-2003-4-9-117 (2003).

  7. 7.

    et al. Defining the transcriptome and proteome in three functionally different human cell lines. Mol Syst Biol 6, 450, doi: 10.1038/msb.2010.106 (2010).

  8. 8.

    et al. Proteomics. Tissue-based map of the human proteome. Science 347, 1260419, doi: 10.1126/science.1260419 (2015).

  9. 9.

    & Insights into the regulation of protein abundance from proteomic and transcriptomic analyses. Nat Rev Genet 13, 227–232, doi: 10.1038/nrg3185 (2012).

  10. 10.

    , , & Correlation between protein and mRNA abundance in yeast. Mol Cell Biol 19, 1720–1730 (1999).

  11. 11.

    et al. Global quantification of mammalian gene expression control. Nature 473, 337–342, doi: 10.1038/nature10098 (2011).

  12. 12.

    et al. Correlations between RNA and protein expression profiles in 23 human cell lines. BMC Genomics 10, 365, doi: 10.1186/1471-2164-10-365 (2009).

  13. 13.

    et al. Discordant protein and mRNA expression in lung adenocarcinomas. Mol Cell Proteomics 1, 304–313 (2002).

  14. 14.

    et al. A draft map of the human proteome. Nature 509, 575–581, doi: 10.1038/nature13302 (2014).

  15. 15.

    et al. Mass-spectrometry-based draft of the human proteome. Nature 509, 582–587, doi: 10.1038/nature13319 (2014).

  16. 16.

    et al. Human genomics. The human transcriptome across tissues and individuals. Science 348, 660–665, doi: 10.1126/science.aaa0355 (2015).

  17. 17.

    Unique chromatin remodeling and transcriptional regulation in spermatogenesis. Science 296, 2176–2178, doi: 10.1126/science.1070963 (2002).

  18. 18.

    et al. Temporal dynamics and genetic control of transcription in the human prefrontal cortex. Nature 478, 519–523, doi: 10.1038/nature10524 (2011).

  19. 19.

    et al. Accurate quantification of cardiovascular biomarkers in serum using Protein Standard Absolute Quantification (PSAQ) and selected reaction monitoring. Mol Cell Proteomics 11, M111 008235, doi: 10.1074/mcp.M111.008235 (2012).

  20. 20.

    et al. Identification of candidate biomarkers for early detection of human lung squamous cell cancer by quantitative proteomics. Mol Cell Proteomics 11, M111 013946, doi: 10.1074/mcp.M111.013946 (2012).

  21. 21.

    et al. DrugBank 4.0: shedding new light on drug metabolism. Nucleic Acids Res 42, D1091–1097, doi: 10.1093/nar/gkt1068 (2014).

  22. 22.

    et al. TCPA: a resource for cancer functional proteomics data. Nat Methods 10, 1046–1047, doi: 10.1038/nmeth.2650 (2013).

  23. 23.

    et al. Y-box factor YB1 controls p53 apoptotic function. Oncogene 24, 8314–8325, doi: 10.1038/sj.onc.1208998 (2005).

  24. 24.

    et al. Identification of Y-box binding protein 1 as a core regulator of MEK/ERK pathway-dependent gene signatures in colorectal cancer cells. PLoS Genet 6, e1001231, doi: 10.1371/journal.pgen.1001231 (2010).

  25. 25.

    et al. Knockdown of Yboxbinding protein1 inhibits the malignant progression of HT29 colorectal adenocarcinoma cells by reversing epithelialmesenchymal transition. Mol Med Rep 10, 2720–2728, doi: 10.3892/mmr.2014.2545 (2014).

  26. 26.

    et al. YB-1, the E2F pathway, and regulation of tumor cell growth. J Natl Cancer Inst 104, 133–146, doi: 10.1093/jnci/djr512 (2012).

  27. 27.

    et al. Endogenous tRNA-Derived Fragments Suppress Breast Cancer Progression via YBX1 Displacement. Cell 161, 790–802, doi: 10.1016/j.cell.2015.02.053 (2015).

  28. 28.

    & Expansion of the eukaryotic proteome by alternative splicing. Nature 463, 457–463, doi: 10.1038/nature08909 (2010).

  29. 29.

    et al. Endogenous gene and protein expression of drug-transporting proteins in cell lines routinely used in drug discovery programs. Drug Metab Dispos 37, 2275–2283, doi: 10.1124/dmd.109.028654 (2009).

  30. 30.

    & Pvclust: an R package for assessing the uncertainty in hierarchical clustering. Bioinformatics 22, 1540–1542, doi: 10.1093/bioinformatics/btl117 (2006).

  31. 31.

    et al. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 4, P3 (2003).

  32. 32.

    , , & REVIGO summarizes and visualizes long lists of gene ontology terms. Plos One 6, e21800, doi: 10.1371/journal.pone.0021800 (2011).

Download references


Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R01GM079719. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information


  1. Institute for Computational Health Sciences, University of California, San Francisco, California, United States of America

    • Idit Kosti
    • , Nishant Jain
    • , Dvir Aran
    • , Atul J. Butte
    •  & Marina Sirota


  1. Search for Idit Kosti in:

  2. Search for Nishant Jain in:

  3. Search for Dvir Aran in:

  4. Search for Atul J. Butte in:

  5. Search for Marina Sirota in:


M.S. and A.J.B. conceived of the study. I.K. and N.J. carried out the analysis. D.A. helped with the cancer related analysis. I.K., N.J., D.A., M.S. and A.J.B. participated in discussions, wrote and edited the manuscript.

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Marina Sirota.

Supplementary information


By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Creative Commons BYThis work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit