Introduction

High-throughput gene expression profiling technologies make it possible to screen the expression levels of thousands of genes simultaneously. One of the main objectives of analyzing gene expression profiles is to identify genes that are differentially expressed (DE) in cancer compared with normal controls1. Many methods have been proposed to identify DE genes2,3,4,5, and a popular choice is Significance Analysis of Microarrays (SAM), which is based on the t-test statistic6. It has been reported that the t-test is biased toward genes with low expression levels3,6 because a gene with a low expression level may have a large absolute t-statistic due to its small variance, even when its mean difference between two conditions is small7. SAM was proposed to correct this bias. However, because SAM works on logarithmically transformed data, the differences of log-scaled expression levels between two conditions are actually the logarithms of their fold change (FC) ratios. Because genes with low expression levels are more likely to reach large FCs than genes with high expression levels, SAM is also biased toward genes with low expression levels8. Compared with genes expressed at low levels, genes expressed at high levels are more likely to be involved in functionally conserved pathways of important biological significance, such as oxidative phosphorylation9, glutathione metabolism10,11,12 and proteasome13.

In a recent study8, we proposed an algorithm, named the pairwise difference (PD) algorithm, to identify DE genes in small-scale cell line experiments, which typically measure only two or three technical replicates for each of two experimental conditions. Briefly, by pairing technical replicates under the two conditions, the algorithm identifies DE genes with top-ranked absolute expression differences between the two types of cells that are significantly reproducible in independent paired technical replicates8. Compared with SAM and other commonly used methods, PD can exclusively identify many DE genes with high expression levels in both types of cells8. However, this algorithm cannot be used directly to identify DE genes between two types of tissue samples (e.g., cancer and normal control) because the biological replicates of each type of tissue may have large between-individual differences.

In this study, considering that tissue samples are biological replicates with between-individual differences, we averaged the gene expression profiles separately for the two types of samples in a dataset to construct a cancer-normal pair, and then applied PD to identify DE genes using multiple cancer-normal pairs constructed separately from several independent datasets or from sub-datasets of a single dataset. Using datasets for lung cancer and esophagus cancer, we demonstrated the applicability and power of this strategy in finding functionally important DE genes that are highly expressed in both cancer and normal tissues and tend to be missed by SAM.

Results

The applicability of the PD algorithm to multiple datasets

Firstly, for each of the three datasets for lung cancer and normal samples (see Table 1), we separately averaged the gene expression profiles of the cancer samples and of the normal samples to construct a pair of average gene expression profiles, referred to as a cancer-normal pair. Then, for every cancer-normal pair, all genes were ranked in descending order according to their absolute average differences (AD) of expression levels between cancer and normal samples. As shown in Fig. 1a, the consistency scores of the deregulation directions of the top n (n = 1000, 2000, 3000, 4000, 5000) genes between every two cancer-normal pairs were all higher than 91.8%, which were all significantly higher than expected by chance (binomial test, all p < 2.2 × 10^−16) (see Methods for details). We did similar analyses in two datasets for esophagus cancer (Table 1) and found that the consistency scores of the deregulation directions of the top n (n = 1000, 2000, 3000, 4000, 5000) genes between the two datasets were all higher than 96.42%, as shown in Fig. 1b. These results suggested that the differential expression signals between every two independent cancer-normal pairs for a particular cancer were significantly reproducible.
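The construction of a cancer-normal pair and the consistency score of the top n genes can be illustrated with a minimal Python sketch. It assumes that the score is computed over the genes shared by the two top-n lists, as described in Methods; the function names and the toy expression values are hypothetical and are not taken from the datasets in Table 1.

```python
def cancer_normal_pair(normal, cancer):
    """Average each gene over the normal and the cancer samples.

    normal, cancer: dict mapping gene -> list of expression values.
    Returns dict gene -> (mean_normal, mean_cancer).
    """
    return {g: (sum(normal[g]) / len(normal[g]),
                sum(cancer[g]) / len(cancer[g])) for g in normal}


def average_difference(pair):
    """AD = mean(cancer) - mean(normal); positive means up-regulated in cancer."""
    return {g: c - n for g, (n, c) in pair.items()}


def top_genes(ad, n):
    """The n genes with the largest absolute AD, in descending order."""
    return sorted(ad, key=lambda g: abs(ad[g]), reverse=True)[:n]


def consistency_score(ad1, ad2, n):
    """Fraction of genes shared by both top-n lists with the same AD sign."""
    shared = set(top_genes(ad1, n)) & set(top_genes(ad2, n))
    if not shared:
        return float("nan")
    same = sum(1 for g in shared if (ad1[g] > 0) == (ad2[g] > 0))
    return same / len(shared)


# Hypothetical toy data: two datasets, three genes, two samples per group.
normal_a = {"G1": [100, 120], "G2": [50, 70], "G3": [10, 12]}
cancer_a = {"G1": [300, 320], "G2": [20, 20], "G3": [11, 13]}
normal_b = {"G1": [90, 110], "G2": [60, 80], "G3": [14, 16]}
cancer_b = {"G1": [280, 300], "G2": [30, 30], "G3": [12, 14]}

ad_a = average_difference(cancer_normal_pair(normal_a, cancer_a))
ad_b = average_difference(cancer_normal_pair(normal_b, cancer_b))
score_top2 = consistency_score(ad_a, ad_b, 2)
score_top3 = consistency_score(ad_a, ad_b, 3)
```

In this toy case the two top-ranked genes agree in direction in both pairs, whereas the third gene, which has a very small AD, changes sign between the pairs and lowers the score when n = 3.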

Table 1 Description of the datasets used in this study.
Figure 1 Consistency scores between two datasets for a cancer.

The consistency scores between the top n (n = 1000, 2000, 3000, 4000, 5000) genes ranked by absolute average expression differences for every two cancer-normal pairs were evaluated in (a) three datasets for lung cancer (GSE19188, GSE19804 and GSE27262) and (b) two datasets for esophagus cancer (GSE20347 and GSE29001).

We further did a random experiment to show that the differential expression signals were irreproducible when there were no phenotype differences between two groups of samples. Using the GSE19804 dataset with 60 lung cancer samples and 60 normal samples, we randomly permuted the sample labels twice to produce two datasets of artificial “cancer” and “normal” samples, and then calculated the consistency score of the deregulation directions of the top 1000 genes sorted by the average expression difference between the two artificial cancer-normal pairs. The random experiment was repeated 1000 times. As expected, the average of the 1000 consistency scores was 49.83%, with a standard deviation of 0.1954, which is what would be expected if the deregulation directions agreed only by chance.
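The logic of this random experiment can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses a small synthetic matrix of Gaussian noise (300 genes, 20 samples) rather than GSE19804, and the function names, matrix size, number of top genes and number of repeats are all hypothetical choices made so that the example runs quickly.

```python
import random


def mean_difference(values, labels):
    """Mean of samples labeled 1 ("cancer") minus mean of samples labeled 0 ("normal")."""
    cancer = [v for v, lab in zip(values, labels) if lab == 1]
    normal = [v for v, lab in zip(values, labels) if lab == 0]
    return sum(cancer) / len(cancer) - sum(normal) / len(normal)


def consistency(d1, d2, n):
    """Sign agreement among genes shared by the two top-n lists (None if no overlap)."""
    top1 = set(sorted(range(len(d1)), key=lambda i: abs(d1[i]), reverse=True)[:n])
    top2 = set(sorted(range(len(d2)), key=lambda i: abs(d2[i]), reverse=True)[:n])
    shared = top1 & top2
    if not shared:
        return None
    return sum((d1[i] > 0) == (d2[i] > 0) for i in shared) / len(shared)


def permutation_experiment(data, n_top, repeats, rng):
    """Permute the sample labels twice per repeat and collect the consistency scores."""
    n_samples = len(data[0])
    labels = [1] * (n_samples // 2) + [0] * (n_samples - n_samples // 2)
    scores = []
    for _ in range(repeats):
        lab1 = rng.sample(labels, n_samples)  # first random relabeling
        lab2 = rng.sample(labels, n_samples)  # second, independent relabeling
        d1 = [mean_difference(row, lab1) for row in data]
        d2 = [mean_difference(row, lab2) for row in data]
        score = consistency(d1, d2, n_top)
        if score is not None:
            scores.append(score)
    return scores


rng = random.Random(0)
# Synthetic expression matrix with no phenotype signal: rows are genes, columns are samples.
data = [[rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(300)]
scores = permutation_experiment(data, n_top=60, repeats=200, rng=rng)
average_score = sum(scores) / len(scores)
```

Because the labels carry no information, the average score should fall close to 0.5, although individual repeats can deviate substantially from it.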

Then, regarding every cancer-normal pair as an independent pair of technical replicates, we used the PD algorithm to identify reproducible DE genes between lung cancer and normal controls across the three datasets. The two parameters of the algorithm, the initial step and the consistency threshold, were set to 300 and 95%, respectively, as suggested previously8. With these two parameters, PD identified a list of 6,092 DE genes for lung cancer, denoted as C3. In comparison, 10,865, 12,287 and 10,945 DE genes were identified by SAM with 5% FDR control in the GSE19188, GSE19804 and GSE27262 datasets, respectively. The consistency scores of the overlapping DE genes between C3 and the DE genes identified by SAM in the three datasets were 99.83%, 100% and 100%, respectively (Table 2). Similarly, PD identified 3,498 DE genes based on the two datasets of esophagus cancer, denoted as C2, and the consistency scores with the DE genes identified by SAM in the two datasets were both 100% (Table 2).

Table 2 The consistency scores of the DE genes identified by both PD and SAM.

On the other hand, approximately 9.3–22.1% of the DE genes in C3 identified by PD were not identified by SAM. As shown in Fig. 2, the average expression levels of the DE genes exclusively identified by PD were rather high in both cancer and normal samples of the three datasets, while the average expression levels of most DE genes exclusively identified by SAM were quite low in cancer and/or normal samples. Similar results were observed based on the two datasets for esophagus cancer (Supplementary Figure S1). Thus, the PD algorithm can identify DE genes expressed highly in both cancer and normal tissues, which tend to be missed by SAM.

Figure 2 The distributions of the average expression levels for DE genes identified exclusively by PD or SAM for lung cancer.

Red crosses represent the DE genes exclusively identified by PD in C3, and black dots represent the DE genes exclusively identified by SAM in datasets (a) GSE19188, (b) GSE19804, (c) GSE27262, respectively. The average expression levels of DE genes in normal samples (x-axis) and cancer samples (y-axis) were plotted. The average expression levels above 5,000 were set to 5,000.

The applicability of the PD algorithm to a single dataset

We used the dataset GSE27262, with 25 lung cancer samples and 25 normal samples, to exemplify the feasibility of the PD algorithm in analyzing a single dataset. Firstly, we divided this dataset evenly into two sub-datasets according to the GSM series numbers of the samples: set 1 and set 2, with 12 and 13 pairs of cancer and normal samples, respectively (Table 3). Then, we transformed the two sub-datasets into two independent cancer-normal pairs of averaged gene expression profiles. With the same parameter setting as above, PD identified 3,789 DE genes, denoted as S2. Of these 3,789 DE genes, 3,386 overlapped with C3, and the consistency score between S2 and C3 was 100%. When GSE27262 was divided into four small sub-datasets (Table 3), PD identified 4,157 DE genes, denoted as S4. The consistency score between S4 and C3 was 99.94%.
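A minimal Python sketch of the sub-dataset construction is given below. It assumes that the samples are already ordered by their GSM numbers and that any remainder goes to the last parts, so that 25 sample pairs split into 12 and 13; the sample identifiers, the function names and the toy expression values are hypothetical.

```python
def split_evenly(sample_ids, k):
    """Split an ordered list of sample IDs into k consecutive, nearly equal parts."""
    base, extra = divmod(len(sample_ids), k)
    parts, start = [], 0
    for i in range(k):
        # Assumption: the larger parts come last (e.g. 12 then 13 for 25 samples).
        size = base + (1 if i >= k - extra else 0)
        parts.append(sample_ids[start:start + size])
        start += size
    return parts


def mean_profile(expr, sample_ids):
    """Average each gene over the given samples.

    expr: dict mapping gene -> dict mapping sample ID -> expression value.
    """
    return {gene: sum(values[s] for s in sample_ids) / len(sample_ids)
            for gene, values in expr.items()}


# Hypothetical identifiers standing in for 25 samples ordered by GSM number.
ids = [f"sample_{i:02d}" for i in range(1, 26)]
two_parts = split_evenly(ids, 2)
four_parts = split_evenly(ids, 4)

# Toy expression values for one gene: sample i has value i.
expr = {"G1": {sid: float(i) for i, sid in enumerate(ids, start=1)}}
profile_set1 = mean_profile(expr, two_parts[0])
profile_set2 = mean_profile(expr, two_parts[1])
```

Applying `mean_profile` separately to the cancer and the normal samples of each part would give one cancer-normal pair per sub-dataset.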

Table 3 The consistency scores of DE genes identified by PD from sub-datasets of GSE27262 and three datasets for lung cancer.

Similarly, when the dataset GSE29001 was divided evenly into two and four small sub-datasets, PD identified 1,738 and 2,298 DE genes for esophagus cancer, respectively. The consistency scores of these two lists of DE genes with the DE genes in C2 were 100% and 99.88%, respectively (Table 4).

Table 4 The consistency scores of DE genes identified by PD from sub-datasets of GSE29001 and two datasets for esophagus cancer.

Taken together, the above results demonstrated that PD can work well when a dataset is divided evenly into several sub-datasets with sample sizes as small as about six for each type of sample.

Significant functional pathways detected by the PD algorithm

Here, we used the above dataset GSE27262 for lung cancer and the dataset GSE29001 for esophagus cancer to demonstrate that most of the pathways significantly enriched with DE genes found by PD tend to be missed by SAM.

With 10% FDR control, the DE genes in S2 found by PD for lung cancer were significantly enriched in 14 pathways (Fig. 3a). However, none of these pathways was identified as significant by enrichment analysis with the same FDR control for the 10,945 DE genes found by SAM with 5% FDR control. When focusing on the most significant DE genes found by SAM, with the same number of DE genes as in S2, 13 of the 14 significant pathways were still not detected (Fig. 3a). Besides the DNA replication pathway19, which was found by both PD and SAM, the other 13 significant pathways are mainly associated with lung cancer, including the pentose phosphate pathway20, oxidative phosphorylation9,21, cysteine and methionine metabolism22, glutathione metabolism10,11,12, biosynthesis of amino acids23, ribosome24, proteasome13, protein processing in endoplasmic reticulum25,26, phagosome27 and the TNF signaling pathway28. These conserved pathways included many DE genes highly expressed in both cancer and normal tissues, which tended to be missed by SAM. For example, among the 16 DE genes found exclusively by PD in the TNF signaling pathway, the average expression level of CCL2 was ranked in the top 3.2% and 1% of all the measured genes in the cancer and normal samples, respectively. The difference between the average expression level of this gene in the cancer samples and its average expression level in the normal samples was as large as 1678.72, whereas the average of the corresponding differences for all the DE genes identified by SAM was only 245.03. It has been reported that this gene may play an important role in the development of lung cancer29. As another example, the average expression level of TNFAIP3 was ranked in the top 7.4% and 3.2% of all the measured genes in the cancer and normal samples, respectively. The difference between the average expression level of this gene in the cancer samples and its average expression level in the normal samples was 625.43. This gene has been reported as a negative regulator of NF-kappa B activation as well as TNF-mediated apoptosis30, and its underexpression can promote the progression of lung cancer31. Detailed information about these 16 DE genes is shown in Supplementary Table S1.

Figure 3 The comparison of functional pathways enriched with DE genes separately identified by PD and SAM.

The biological pathways significantly enriched with DE genes identified by PD (using two subsets of each dataset, S2) and by SAM in (a) GSE27262 for lung cancer, (b) GSE29001 for esophagus cancer. The most significant DE genes identified by SAM, with the same number of the DE genes found by PD, were used for pathway enrichment analyses. The p values of the KEGG pathways were adjusted by Benjamini and Hochberg (FDR = 10%), and −log10(p) was used to generate the heat map.

Similarly for esophagus cancer, the four pathways significantly enriched with DE genes in S2 identified by PD were all missed by SAM (Fig. 3b). These significant pathways included pathways for oxidative phosphorylation32, glutathione metabolism33, ribosome34,35 and proteasome36.

The above pathway enrichment analyses demonstrated that the PD algorithm can capture important cancer-associated pathways with highly expressed DE genes, including many housekeeping genes (see Discussion), which might play important roles in oncogenesis, whereas most of these pathways tend to be missed by SAM. The results also provided extra evidence supporting the reliability of the DE genes found by PD because a list of DE genes can be significantly enriched in pathways only when it contains sufficient real DE genes37,38.

Discussion

In this paper, we extended the application of the PD algorithm to the identification of DE genes between cancer and normal tissue samples based on several independent datasets or on sub-datasets of a single dataset. The results for lung and esophageal cancer showed that PD can exclusively identify many DE genes with high expression levels in both cancer and normal samples, which tend to be missed by the commonly used SAM. Functional enrichment analyses of the DE genes identified by PD showed that it can exclusively identify many significant biological pathways related to the development of cancers. In particular, the results demonstrated that the PD algorithm could efficiently identify DE genes when a dataset was divided evenly into several sub-datasets with sample sizes as small as about six for each type of sample. In general, for researchers with their own experimental data, we recommend making use of independent datasets from public data sources, where such data exist, to increase the power and accuracy of biological discovery.

Notably, in our functional analysis examples for lung cancer and esophagus cancer, four pathways were identified by PD in both cancers but missed by SAM. These four pathways were the well-known cancer-related pathways for oxidative phosphorylation, glutathione metabolism, ribosome and proteasome. These biological pathways are related to two important cancer hallmarks, the metabolic network (the oxidative phosphorylation and glutathione metabolism pathways) and the genome duplication network (ribosome), according to the cancer hallmarks network framework proposed by Wang et al.39. Reprogramming of metabolism is an important mechanism supporting the growth and division of cancer cells40. Genome duplication plays an important role in tumor formation and can activate several cancer hallmark networks41,42. These conserved cancer hallmarks or pathways all included many highly expressed housekeeping genes that play essential roles in the pathogenesis of cancer. For example, in the ribosome pathway, among the 45 DE genes found exclusively by PD in the GSE27262 dataset for lung cancer (Supplementary Table S2), 35 genes were housekeeping genes reported by Zhu et al.43. The average expression levels of these 35 housekeeping genes were all ranked among the top 20% of all the measured genes in both the cancer and normal samples. It is known that housekeeping genes maintain the basic needs for a cell to survive44,45,46, and thus their deregulation tends to induce human diseases including cancer47,48. For example, the overexpression of RPSA may be positively correlated with the angiogenesis of lung cancer49,50, the overexpression of RPL19 promotes malignant proliferation of lung cancer cells51, and the underexpression of RPS3, a critical regulator of DNA repair and apoptosis52, might accelerate the development of lung cancer. Such cancer-related housekeeping genes tend to be evolutionarily conserved and play critical roles in carcinogenesis together with tissue-specific, less conserved cancer-related genes53.

Although PD can exclusively identify many important cancer-associated genes with high expression levels that play important functional roles in carcinogenesis, it has its own shortcomings. A major limitation is that it still cannot identify DE genes with FDR control. The higher the consistency threshold is set, the lower the rate of false positives among the DE genes identified between two independent sample pairs. However, the FDR has a complex relationship with the consistency threshold parameter. Besides, some DE genes and pathways identified by SAM were missed by PD, which is biased toward genes with high expression levels. For example, DE genes identified by SAM from the dataset GSE27262 for lung cancer were enriched in the Fanconi anemia pathway, which is related to the risk of lung adenocarcinoma54,55. However, this pathway was not significantly enriched with the DE genes identified by PD. In this pathway, 13 DE genes were identified by SAM but not by PD. The average expression levels of these 13 genes were among the bottom 70% and 61% of all the measured genes in the cancer and normal samples, respectively. These results demonstrate that, unlike SAM, PD tends to miss DE genes with low expression levels. Therefore, the PD algorithm is not a substitute for, but an effective complement to, current approaches for analyzing DE genes in tissue datasets with biological replicates.

Methods

Data and data pre-processing

Multiple gene expression datasets for lung cancer and esophageal cancer were collected from the Gene Expression Omnibus (GEO)56. Detailed information about the datasets used in this study is given in Table 1. For each dataset, the raw data (.CEL files) were pre-processed using the robust multi-array average (RMA) algorithm57,58. Then each probe-set ID was matched to its Entrez gene ID. If multiple probe sets were matched to the same gene, the expression value for the gene was taken as the arithmetic mean of the values of the multiple probe sets (on the log2 scale).
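The probe-set-to-gene summarization step can be sketched in Python as follows. The probe-set and gene identifiers are hypothetical placeholders, and the choice to skip probe sets without a gene mapping is an assumption of this sketch rather than something stated above.

```python
from collections import defaultdict


def collapse_probesets(probe_values, probe_to_gene):
    """Collapse probe-set values (log2 scale) to gene level by arithmetic mean.

    probe_values: dict mapping probe-set ID -> log2 expression value.
    probe_to_gene: dict mapping probe-set ID -> gene ID.
    """
    grouped = defaultdict(list)
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:  # assumption: unmapped probe sets are skipped
            grouped[gene].append(value)
    return {gene: sum(vals) / len(vals) for gene, vals in grouped.items()}


# Hypothetical example: two probe sets map to gene_1, one to gene_2, one is unmapped.
probe_values = {"probe_a": 8.0, "probe_b": 10.0, "probe_c": 5.5, "probe_d": 7.0}
probe_to_gene = {"probe_a": "gene_1", "probe_b": "gene_1", "probe_c": "gene_2"}
gene_values = collapse_probesets(probe_values, probe_to_gene)
```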

Identification of reproducible DE genes

The pairwise difference (PD) algorithm8 was originally designed for analyzing small-scale cell line data with two or three technical replicates for each of two different cell lines. Since technical replicates for a cell line have no biological difference, every two independent pairs of technical replicates for two different cell lines can be regarded as independent experiments to identify DE genes through reproducibility evaluation. However, because tissue samples from different individuals are biological replicates with large biological variations among individuals, every two paired samples for two types of tissues cannot be regarded as reproducible independent experiments. In order to reduce the influence of biological variations among samples with the same phenotype, we used several independent datasets to construct multiple cancer-normal pairs by averaging a set of gene expression profiles separately for each of the two phenotypes. Specifically, for each dataset, we calculated the mean non-log-transformed expression values of each gene in the normal samples (type N) and cancer samples (type C), respectively, to form a pair of average gene expression profiles for cancer and normal tissues. For a given pair j consisting of one averaged type N profile and one averaged type C profile, the mean values of gene i in type N and type C, denoted as $\bar{x}_{ij}^{N}$ and $\bar{x}_{ij}^{C}$, respectively, were calculated as follows:

$$\bar{x}_{ij}^{N} = \frac{1}{n_1}\sum_{k=1}^{n_1} x_{ik}, \qquad \bar{x}_{ij}^{C} = \frac{1}{n_2}\sum_{k=1}^{n_2} x_{ik}$$

where $n_1$ and $n_2$ were the numbers of samples in type N and type C, respectively, and $x_{ik}$ was the expression value of gene i in the k-th type N or type C sample.

Then, for gene i, the average expression difference between the two phenotypes of a given cancer-normal pair j, denoted as $D_{ij}$, was calculated as follows:

$$D_{ij} = \bar{x}_{ij}^{C} - \bar{x}_{ij}^{N}$$

If the value of $D_{ij}$ was larger (or smaller) than zero, then gene i was defined as up-regulated (or down-regulated) in type C. Regarding multiple cancer-normal pairs constructed from independent datasets as independent experiments, we could identify DE genes through reproducibility evaluation with the same PD algorithm described in detail in our original paper8. Briefly, all genes in each cancer-normal pair were sorted in descending order by their absolute pairwise expression differences between the two phenotypes and divided into blocks using an initial step of 300. The DE gene lists between the decreasingly ranked blocks of every two independent pairs were identified as significantly reproducible if their consistency scores were higher than a preset consistency threshold (here, 95%).
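The block-wise idea can be illustrated with a deliberately simplified Python sketch. It is not the published PD algorithm, whose details are given in the original paper8: the sketch assumes a cumulative variant for two cancer-normal pairs in which the top-ranked list grows by one step at a time and stops at the first list whose consistency score falls below the threshold. The function names and the toy differences are hypothetical.

```python
def ranked(diff):
    """Genes sorted by absolute expression difference, largest first."""
    return sorted(diff, key=lambda g: abs(diff[g]), reverse=True)


def pd_sketch(diff1, diff2, step=300, threshold=0.95):
    """Simplified, cumulative illustration of reproducibility-based selection.

    diff1, diff2: dict mapping gene -> expression difference (cancer minus normal)
    for two independent cancer-normal pairs.
    Returns the direction-consistent shared genes of the largest accepted top-n lists.
    """
    r1, r2 = ranked(diff1), ranked(diff2)
    limit = min(len(r1), len(r2))
    accepted = []
    n = step
    while True:
        n_eff = min(n, limit)
        shared = set(r1[:n_eff]) & set(r2[:n_eff])
        consistent = [g for g in shared if (diff1[g] > 0) == (diff2[g] > 0)]
        # Stop as soon as the shared genes are not reproducible enough.
        if not shared or len(consistent) / len(shared) < threshold:
            break
        accepted = sorted(consistent)
        if n_eff == limit:
            break
        n += step
    return accepted


# Hypothetical toy differences for six genes in two cancer-normal pairs.
diff1 = {"A": 500, "B": -400, "C": 300, "D": -200, "E": 20, "F": -10}
diff2 = {"A": 450, "B": -420, "C": 310, "D": -150, "E": -25, "F": 15}
de_genes = pd_sketch(diff1, diff2, step=2, threshold=0.95)
```

In the toy example the four genes with the largest differences agree in direction in both pairs, whereas the two genes with the smallest differences do not, so the selection stops before including them.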

Reproducibility evaluation of two DE gene lists

For two DE gene lists from two different datasets sharing k DE genes, of which s genes had consistent directions (either up-regulation or down-regulation) in type C samples, the consistency score was calculated as s/k. The cumulative binomial distribution model59 was used to estimate the probability of observing at least s of k DE genes with consistent directions by chance:

$$p = \sum_{i=s}^{k} \binom{k}{i}\, p_e^{\,i}\, (1 - p_e)^{k-i}$$

in which $p_e$ is the probability of one gene having the consistent direction in two DE gene lists by random chance (here, $p_e$ = 0.5). A DE gene list is considered significantly reproducible if the p value of the consistency score is <0.01.
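The binomial tail probability can be evaluated with the following Python sketch, which sums the terms in log space to avoid overflow when k is in the thousands. The function names are hypothetical, and the example values of s and k are illustrative only (918 consistent genes out of 1000 shared genes, roughly matching a consistency score of 91.8%); they are not counts reported in this study.

```python
from math import exp, lgamma, log


def log_binomial_pmf(i, k, p):
    """Natural log of C(k, i) * p**i * (1 - p)**(k - i), for 0 < p < 1."""
    return (lgamma(k + 1) - lgamma(i + 1) - lgamma(k - i + 1)
            + i * log(p) + (k - i) * log(1 - p))


def binomial_tail(s, k, pe=0.5):
    """Probability of at least s consistent genes among k shared genes by chance."""
    if s > k:
        return 0.0
    logs = [log_binomial_pmf(i, k, pe) for i in range(max(s, 0), k + 1)]
    largest = max(logs)
    # Factor out the largest term so the remaining sum is numerically stable.
    return exp(largest) * sum(exp(v - largest) for v in logs)


p_small = binomial_tail(9, 10)        # 11 / 1024 for a small hand-checkable case
p_example = binomial_tail(918, 1000)  # illustrative values, far below 0.01
```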

Pathway enrichment analysis

Functional enrichment analysis was done based on the Kyoto Encyclopedia of Genes and Genomes (KEGG)60. The hypergeometric distribution model was used to identify biological pathways that were significantly enriched with DE genes61. The probability of observing at least k DE genes in a pathway by chance can be computed as follows:

$$p = 1 - \sum_{i=0}^{k-1} \frac{\binom{m}{i}\binom{N-m}{n-i}}{\binom{N}{n}}$$

where n is the number of DE genes identified from the N genes in a dataset, and k of them are annotated in a pathway with m genes.

The p values were adjusted using the Benjamini and Hochberg procedure62, controlling the False Discovery Rate (FDR) at the 10% level.
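The enrichment test and the multiple-testing adjustment can be sketched in Python as follows. The hypergeometric function sums the upper tail directly, which is mathematically equivalent to one minus the lower tail, and the adjustment is a plain implementation of the Benjamini and Hochberg procedure. The function names and all numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_tail(k, n, m, N):
    """P(at least k of the n DE genes fall in a pathway of m genes, out of N genes)."""
    upper = min(n, m)
    favorable = sum(comb(m, i) * comb(N - m, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    total = len(pvalues)
    order = sorted(range(total), key=lambda i: pvalues[i])
    adjusted = [0.0] * total
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(total, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * total / rank)
        adjusted[idx] = running_min
    return adjusted


# Hand-checkable case: 10 genes, pathway of 4, 3 DE genes, at least 2 in the pathway.
p_pathway = hypergeom_tail(k=2, n=3, m=4, N=10)  # (6*6 + 4*1) / 120 = 1/3

# Hypothetical raw p values for four pathways.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
significant = [q <= 0.10 for q in adjusted]  # 10% FDR level
```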

Additional Information

How to cite this article: Huang, H. et al. Identifying reproducible cancer-associated highly expressed genes with important functional significances using multiple datasets. Sci. Rep. 6, 36227; doi: 10.1038/srep36227 (2016).