  • Review Article

From patterns to pathways: gene expression data analysis comes of age

Abstract

Many different biological questions are routinely studied using transcriptional profiling on microarrays. A wide range of approaches are available for gleaning insights from the data obtained from such experiments. The appropriate choice of data-analysis technique depends both on the data and on the goals of the experiment. This review summarizes some of the common themes in microarray data analysis, including detection of differential expression, clustering, and predicting sample characteristics. Several approaches to each problem, and their relative merits, are discussed and key areas for additional research highlighted.

Figure 1: The advantage of permutation-based adjustment for multiple testing.
Figure 2: Two pattern-discovery techniques.
Figure 3: An overview of the process for building a prediction model to classify samples.
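Figure 1 concerns permutation-based adjustment for multiple testing. The following is a minimal sketch of one such adjustment, a single-step max-T procedure in the style of Westfall and Young's resampling approach. It is an illustration under stated assumptions, not necessarily the procedure shown in the figure: it assumes a two-class design, a genes-by-samples matrix, and a Welch t-statistic, and the function names `welch_t` and `maxt_adjusted_pvalues` are hypothetical.

```python
import numpy as np


def welch_t(data, labels):
    """Per-gene Welch t-statistic.

    data   : array of shape (genes, samples)
    labels : boolean array of length samples (True = class A, False = class B)
    """
    a, b = data[:, labels], data[:, ~labels]
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1]
                 + b.var(axis=1, ddof=1) / b.shape[1])
    return (a.mean(axis=1) - b.mean(axis=1)) / se


def maxt_adjusted_pvalues(data, labels, n_perm=1000, seed=0):
    """Single-step max-T adjusted p-values (assumed procedure, see lead-in).

    For each permutation of the class labels, record the largest |t| over
    all genes. A gene's adjusted p-value is the fraction of permutations
    whose maximum reaches or exceeds that gene's observed |t|.
    """
    rng = np.random.default_rng(seed)
    observed = np.abs(welch_t(data, labels))
    exceed = np.zeros(data.shape[0])
    for _ in range(n_perm):
        permuted = rng.permutation(labels)           # shuffle class labels
        max_stat = np.abs(welch_t(data, permuted)).max()
        exceed += max_stat >= observed               # compare to every gene
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

Because the whole expression matrix is permuted at once, the null distribution of the maximum statistic reflects the correlation between genes, which is why such adjustments are typically less conservative than a Bonferroni-style correction on correlated data.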

Katie Ris

Acknowledgements

I thank Gene Brown, Lenore Cowen, Steve Haney, Andrew Hill, Steve Rozen and Timm Triplett for helpful discussions and comments.

Ethics declarations

Competing interests

The author declares no competing financial interests.

Cite this article

Slonim, D. From patterns to pathways: gene expression data analysis comes of age. Nat Genet 32 (Suppl 4), 502–508 (2002). https://doi.org/10.1038/ng1033

  • DOI: https://doi.org/10.1038/ng1033
