Review Article

From patterns to pathways: gene expression data analysis comes of age

Nature Genetics volume 32, pages 502–508 (2002)

Abstract

Many different biological questions are routinely studied using transcriptional profiling on microarrays. A wide range of approaches are available for gleaning insights from the data obtained from such experiments. The appropriate choice of data-analysis technique depends both on the data and on the goals of the experiment. This review summarizes some of the common themes in microarray data analysis, including detection of differential expression, clustering, and predicting sample characteristics. Several approaches to each problem, and their relative merits, are discussed and key areas for additional research highlighted.

Acknowledgements

I thank Gene Brown, Lenore Cowen, Steve Haney, Andrew Hill, Steve Rozen and Timm Triplett for helpful discussions and comments.

Author information

Affiliations

  1. Department of Genomics, Wyeth Research, 35 Cambridge Park Drive, Cambridge, Massachusetts 02140, USA dslonim@wyeth.com

    • Donna K. Slonim

Competing interests

The author declares no competing financial interests.

About this article

DOI

https://doi.org/10.1038/ng1033