  • Original Article

Selection of differentially expressed genes in microarray data analysis

Abstract

One common objective in microarray experiments is to identify a subset of genes that are differentially expressed among experimental conditions, for example, between drug treatment and no drug treatment. Often, the goal is to determine the underlying relationship between poor versus good gene signatures for identifying biological functions or predicting specific therapeutic outcomes. Because of the complexity of studying hundreds or thousands of genes in an experiment, selecting a subset of genes to enhance relationships among the underlying biological structures or to improve prediction accuracy of clinical outcomes has been an important issue in microarray data analysis. Selection of differentially expressed genes is a two-step process. The first step is to select an appropriate test statistic and compute the P-value for each gene; the genes are then ranked according to their P-values as evidence of differential expression. The second step is to assign a significance level, that is, to determine a cutoff threshold for the P-values in accordance with the study objective. In this paper, we consider four commonly used statistics, the t-, S- (SAM), U- (Mann–Whitney) and M-statistics, to compute the P-values for gene ranking. For assigning the significance level, we consider procedures that control the false-positive error, namely the family-wise error rate and the false discovery rate, to select a limited number of genes, and a receiver-operating characteristic (ROC) approach to select a larger number of genes. The ROC approach is particularly useful in genomic/genetic profiling studies. The well-known colon cancer data set, containing 22 normal and 40 tumor tissues, is used to illustrate the different gene ranking and significance level assignment methods for applications to genomic/genetic profiling studies. The P-values computed from the t-, U- and M-statistics are very similar.
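As an illustration of the second step, the following minimal sketch applies two standard cutoff rules to a list of per-gene P-values: a Bonferroni bound for family-wise error control and the Benjamini–Hochberg step-up rule for false discovery rate control. It assumes the P-values have already been computed by some test statistic; the function names, the default levels and the toy P-values are illustrative assumptions and are not taken from the paper, whose exact procedures may differ.

```python
def bonferroni_select(p_values, alpha=0.05):
    """Indices of genes with P <= alpha / m (family-wise error control)."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]


def benjamini_hochberg_select(p_values, q=0.05):
    """Indices of genes selected by the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    # Rank genes from smallest to largest P-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Every gene ranked at or above k_max is selected.
    return sorted(order[:k_max])


# Hypothetical P-values for ten genes (not data from the paper).
toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
fwe_genes = bonferroni_select(toy_p)
fdr_genes = benjamini_hochberg_select(toy_p)
print("Bonferroni:", fwe_genes)
print("Benjamini-Hochberg:", fdr_genes)
```

On these toy values the false discovery rate rule keeps one more gene than the Bonferroni bound, which reflects the general tendency of family-wise error control to be the more conservative of the two.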
We discuss the common practice of using the P-value, a false-positive error probability, as the primary criterion and then using the fold-change as a surrogate measure of biological significance for gene selection. The P-value and the fold-change can be displayed together in a volcano plot. We also address several issues in gene selection.
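The two criteria can be sketched as the coordinates of a point in a volcano plot: the log2 fold-change on the horizontal axis and the negative log10 P-value on the vertical axis. The snippet below is a minimal sketch under stated assumptions: group means are on the raw intensity scale, and the cutoffs (P ≤ 0.01 and a twofold change) are hypothetical defaults chosen for illustration, not values from the paper.

```python
import math


def volcano_point(mean_control, mean_treated, p_value):
    """Return (log2 fold-change, -log10 P) for one gene.

    Assumes both group means are positive, raw-scale intensities.
    """
    log2_fc = math.log2(mean_treated / mean_control)
    neg_log10_p = -math.log10(p_value)
    return log2_fc, neg_log10_p


def passes_both(log2_fc, neg_log10_p, p_cutoff=0.01, fc_cutoff=2.0):
    """True if the gene meets the P-value cutoff and the fold-change cutoff."""
    significant = neg_log10_p >= -math.log10(p_cutoff)
    large_change = abs(log2_fc) >= math.log2(fc_cutoff)
    return significant and large_change


# Hypothetical gene: fourfold increase with P = 0.001.
x, y = volcano_point(100.0, 400.0, 0.001)
selected = passes_both(x, y)
print(x, y, selected)
```

Genes that satisfy both cutoffs fall in the upper-left and upper-right corners of the plot; a gene with a small P-value but a small fold-change lies in the upper middle and would be excluded by the second criterion.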

Corresponding author

Correspondence to J J Chen.

Additional information

The views presented in this paper are those of the authors and do not necessarily represent those of the US Food and Drug Administration.

About this article

Cite this article

Chen, J., Wang, SJ., Tsai, CA. et al. Selection of differentially expressed genes in microarray data analysis. Pharmacogenomics J 7, 212–220 (2007). https://doi.org/10.1038/sj.tpj.6500412

  • DOI: https://doi.org/10.1038/sj.tpj.6500412
