The use and analysis of microarray data

Key Points

  • Functional genomics is the study of gene function through the parallel expression measurements of genomes. The tools used to carry out these measurements most commonly include complementary DNA microarrays, oligonucleotide microarrays or serial analysis of gene expression (SAGE). Regardless of the specific technique, the end result is 4,000–50,000 measurements of gene expression per sample. As a complete experiment might involve up to hundreds of microarrays, the resultant RNA expression data sets can vary greatly in size.

  • In addition to their use in basic research and target discovery, there are many other uses of functional genomics in drug discovery, including biomarker determination, pharmacology, toxicogenomics, target selectivity, development of prognostic tests and disease subclass determination.

  • Current methodologies to analyse RNA expression data sets can be roughly divided into two categories: supervised approaches, or analysis to determine genes that fit a specified pattern; and unsupervised approaches, or analysis looking for characterization of the components of a data set, without the a priori input of a training signal.

  • Hierarchical clustering is particularly advantageous in representing all the expression patterns seen in an experiment in a compact way. Self-organizing maps provide a two-dimensional visual survey of expression patterns with fewer computational requirements compared with hierarchical clustering. Relevance networks provide networks constructed from pairs of genes with strong positive or negative correlation, and can include phenotypic measurements. Principal components are used for visualization, by displaying samples on coordinate axes that capture the most variance in the data.

  • Nearest-neighbour methods find those genes that are most similar to an ideal gene pattern. Support vector machines are used to separate biological samples from differing conditions or diseases, by finding a plane to separate them in a higher-dimensional feature-rich space.

  • Challenges after analysis can include linking probes to genes and other biological knowledge, a process that never ends. Operationally, one is never done analysing a set of microarray data. The analysis of microarray data sets in a setting devoid of biological knowledge will be less rewarding than tapping into that knowledge. Finally, in the application of functional genomics to drug discovery, to extract the most information from microarrays, an open mind is needed with regard to the choices of analytical methods, using supervised methods, unsupervised methods and methods yet to be invented.
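The nearest-neighbour idea mentioned in the key points can be illustrated with a minimal sketch. It assumes Euclidean distance as the dissimilarity measure and uses made-up gene names and expression values; the function names `euclidean` and `nearest_genes` are hypothetical and are not taken from the article.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length expression profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_genes(profiles, ideal, k=2):
    """Return the k gene names whose profiles lie closest to the ideal pattern."""
    ranked = sorted(profiles, key=lambda gene: euclidean(profiles[gene], ideal))
    return ranked[:k]


# Hypothetical toy data: four genes measured across four samples, where the
# first two samples belong to one condition and the last two to another.
profiles = {
    "geneA": [0.1, 0.2, 0.9, 1.0],
    "geneB": [1.0, 0.9, 0.1, 0.0],
    "geneC": [0.0, 0.1, 1.0, 0.9],
    "geneD": [0.5, 0.5, 0.5, 0.5],
}

# Idealized pattern: "off" in the first condition, "on" in the second.
ideal = [0.0, 0.0, 1.0, 1.0]

closest = nearest_genes(profiles, ideal, k=2)
print(closest)
```

Other dissimilarity measures, such as one based on a correlation coefficient, could be swapped in for `euclidean` without changing the ranking step; which measure is appropriate depends on the question being asked of the data.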


Functional genomics is the study of gene function through the parallel expression measurements of genomes, most commonly using the technologies of microarrays and serial analysis of gene expression. Microarray usage in drug discovery is expanding, and its applications include basic research and target discovery, biomarker determination, pharmacology, toxicogenomics, target selectivity, development of prognostic tests and disease-subclass determination. This article reviews the different ways to analyse large sets of microarray data, including the questions that can be asked and the challenges in interpreting the measurements.


Figure 1: Schematized experimental process using a microarray.
Figure 2: Dissimilarity measures.
Figure 3: Clustering and network-determination methods used in microarray analysis.




The author wishes to thank T. Deshpande, A. Kho, M. Ramoni and I. Kohane for critical comments and interesting discussions on the manuscript. During the writing of this work, the author has been funded by and wishes to thank the Endocrine Fellows Foundation, the Genentech Centre for Clinical Research and Education, the Lawson Wilkins Paediatric Endocrinology Society, the Harvard Centre for Neurodegenerative Research and the Merck–Massachusetts Institute of Technology partnership. The author was also supported in part by grants from the National Heart, Lung and Blood Institute, the National Institute of Diabetes and Digestive and Kidney Diseases, and the National Institute of Neurological Disorders and Stroke.


Glossary

Splines
Instead of fitting a complex polynomial curve to data, splines allow the fitting of data by putting together smaller, less complex curves.

Northern blot
Different RNA molecules are separated by mass on a gel, then radioactively labelled complementary DNA or RNA molecules are used to quantify specific RNA amounts.

Reverse transcription
The synthesis of a strand of DNA from RNA, which is used to make a complementary DNA copy of sample RNA.

Bayesian network
A graphical representation in which variables (that is, genes) are represented as nodes. Arrows between nodes represent conditional dependence, which is interpretable as causal associations.

Correlation coefficient
A measurement of the degree of fit of a linear-regression line to data points, calculated as the average distance of points from the regression line normalized to the standard deviations of the individual coordinates.

Rank correlation coefficient
Points are restated in terms of their ordinal rank (for example, first, second, third) before calculation of the correlation coefficient.

Dendrogram
A visual representation of hierarchical clusters.
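The correlation coefficient and its rank-based variant defined above can be made concrete with a short sketch. It assumes the standard Pearson formula (covariance divided by the product of the standard deviations) and average ranks for tied values; the function names and the toy expression values are illustrative and are not taken from the article.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two profiles of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Restate values as 1-based ordinal ranks, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def rank_correlation(x, y):
    """Rank correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical profiles: gene_y rises monotonically but not linearly with gene_x.
gene_x = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_y = [1.0, 4.0, 9.0, 16.0, 25.0]

print(pearson(gene_x, gene_y))           # below 1: the relationship is not linear
print(rank_correlation(gene_x, gene_y))  # the two rank orderings are identical
```

In this toy case the rank-based coefficient reports a perfect monotonic association, while the ordinary coefficient is slightly lower because the relationship is curved rather than linear.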

About this article

Cite this article

Butte, A. The use and analysis of microarray data. Nat Rev Drug Discov 1, 951–960 (2002).
