The use and analysis of microarray data

Butte, Atul

doi:10.1038/nrd961

Review Article
Published: 01 December 2002

Nature Reviews Drug Discovery volume 1, pages 951–960 (2002)

Key Points

Functional genomics is the study of gene function through the parallel expression measurements of genomes. The tools used to carry out these measurements most commonly include complementary DNA microarrays, oligonucleotide microarrays or serial analysis of gene expression (SAGE). Regardless of the specific technique, the end result is 4,000–50,000 measurements of gene expression per sample. As a complete experiment might involve up to hundreds of microarrays, the resultant RNA expression data sets can vary greatly in size.
In addition to their use in basic research and target discovery, there are many other uses of functional genomics in drug discovery, including biomarker determination, pharmacology, toxicogenomics, target selectivity, development of prognostic tests and disease subclass determination.
Current methodologies to analyse RNA expression data sets can be roughly divided into two categories: supervised approaches, or analysis to determine genes that fit a specified pattern; and unsupervised approaches, or analysis looking for characterization of the components of a data set, without the a priori input of a training signal.
Hierarchical clustering is particularly advantageous in representing all the expression patterns seen in an experiment in a compact way. Self-organizing maps provide a two-dimensional visual survey of expression patterns with fewer computational requirements than hierarchical clustering. Relevance networks provide networks constructed from pairs of genes with strong positive or negative correlation, and can include phenotypic measurements. Principal components are used for visualization, by displaying samples on coordinate axes that capture the most variance in the data.
Nearest-neighbour methods find those genes that are most similar to an ideal gene pattern. Support vector machines are used to separate biological samples from differing conditions or diseases, by finding a plane to separate them in a higher-dimensional feature-rich space.
Challenges after analysis can include linking probes to genes and other biological knowledge, a process that never ends. Operationally, one is never done analysing a set of microarray data. The analysis of microarray data sets in a setting devoid of biological knowledge will be less rewarding than tapping into that knowledge. Finally, in the application of functional genomics to drug discovery, to extract the most information from microarrays, an open mind is needed with regard to the choices of analytical methods, using supervised methods, unsupervised methods and methods yet to be invented.
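
As an illustration of two of the methods summarized in the key points above, the following minimal Python sketch builds a relevance network by keeping pairs whose Pearson correlation is strongly positive or negative, and ranks genes by similarity to an idealized pattern in the nearest-neighbour sense. The toy profiles, the 0.9 threshold, the use of Pearson correlation as the similarity measure in the nearest-neighbour step, and all function names are assumptions made for this example; they are not values or code from the article.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def relevance_edges(profiles, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    edges = []
    for (name_a, a), (name_b, b) in combinations(sorted(profiles.items()), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:  # keep strong positive AND strong negative links
            edges.append((name_a, name_b, r))
    return edges


def nearest_to_ideal(profiles, ideal, k=1):
    """Return the k names whose profiles correlate best with an ideal pattern."""
    ranked = sorted(
        profiles,
        key=lambda name: pearson(profiles[name], ideal),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy data: three genes and one phenotype measured in five samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],  # rises with geneA
    "geneC": [5.0, 4.1, 2.9, 2.0, 1.1],  # falls as geneA rises
    "drug_sensitivity": [0.3, 0.9, 0.2, 0.8, 0.4],  # phenotypic measurement
}

edges = relevance_edges(profiles, threshold=0.9)
top = nearest_to_ideal(profiles, ideal=[0, 0, 1, 1, 1], k=1)

for name_a, name_b, r in edges:
    print(f"{name_a} -- {name_b}: r = {r:+.3f}")
print("closest to ideal pattern:", top)
```

In this toy data set the three genes end up linked to one another (including the negatively correlated pairs), whereas the phenotype column stays unconnected because none of its correlations reaches the threshold. In practice the threshold would need to be chosen with the number of samples and the noise level in mind.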

Abstract

Functional genomics is the study of gene function through the parallel expression measurements of genomes, most commonly using the technologies of microarrays and serial analysis of gene expression. Microarray usage in drug discovery is expanding, and its applications include basic research and target discovery, biomarker determination, pharmacology, toxicogenomics, target selectivity, development of prognostic tests and disease-subclass determination. This article reviews the different ways to analyse large sets of microarray data, including the questions that can be asked and the challenges in interpreting the measurements.

**Figure 1: Schematized experimental process using a microarray.**

**Figure 3: Clustering and network-determination methods used in microarray analysis.**

Acknowledgements

The author wishes to thank T. Deshpande, A. Kho, M. Ramoni and I. Kohane for critical comments and interesting discussions on the manuscript. During the writing of this work, the author has been funded by and wishes to thank the Endocrine Fellows Foundation, the Genentech Centre for Clinical Research and Education, the Lawson Wilkins Paediatric Endocrinology Society, the Harvard Centre for Neurodegenerative Research and the Merck–Massachusetts Institute of Technology partnership. The author was also supported in part by grants from the National Heart, Lung and Blood Institute, the National Institute of Diabetes and Digestive and Kidney Diseases, and the National Institute of Neurological Disorders and Stroke.

Author information

Authors and Affiliations

Children's Hospital Informatics Program and Division of Endocrinology, Children's Hospital, 300 Longwood Avenue, Boston, 02115, Massachusetts, USA
Atul Butte

Glossary

SPLINES: Instead of fitting a complex polynomial curve to data, splines allow the fitting of data by putting together smaller, less complex curves.
NORTHERN BLOT: Different RNA molecules are separated by mass on a gel, then radioactively labelled complementary DNA or RNA molecules are used to quantify specific RNA amounts.
REVERSE TRANSCRIPTION: The synthesis of a strand of DNA from RNA, which is used to make a complementary DNA copy of sample RNA.
BAYESIAN NETWORK: A graphical representation in which variables (that is, genes) are represented as nodes. Arrows between nodes represent conditional dependence, which is interpretable as causal associations.
PEARSON CORRELATION COEFFICIENT: A measurement of the degree of fit of a linear-regression line to data points, calculated as the average distance of points from the regression line normalized to the standard deviations of the individual coordinates.
RANK CORRELATION COEFFICIENT: Points are restated in terms of their ordinal rank (for example, first, second, third) before calculation of the correlation coefficient.
DENDROGRAM: A visual representation of hierarchical clusters.
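
To make the dendrogram entry concrete, here is a minimal sketch of agglomerative hierarchical clustering that returns the tree as nested tuples. Euclidean distance, average linkage, the toy profiles and the function names are assumptions chosen for illustration; other distance measures (for example, one based on a correlation coefficient) and other linkage rules are equally possible.

```python
from itertools import combinations
from math import dist


def leaves(tree):
    """Flatten a nested-tuple dendrogram into its list of leaf names."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def average_linkage(tree_a, tree_b, profiles):
    """Mean Euclidean distance over all cross-cluster pairs of members."""
    pairs = [(a, b) for a in leaves(tree_a) for b in leaves(tree_b)]
    return sum(dist(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)


def hierarchical_cluster(profiles):
    """Agglomerative clustering; returns the dendrogram as nested tuples."""
    trees = sorted(profiles)  # start with every gene as its own cluster
    while len(trees) > 1:
        # Find the two clusters that are currently closest together ...
        left, right = min(
            combinations(trees, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], profiles),
        )
        # ... and replace them with a single merged node.
        trees = [t for t in trees if t not in (left, right)]
        trees.append((left, right))
    return trees[0]


# Hypothetical toy data: four genes measured in three samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [1.1, 2.1, 2.9],
    "geneC": [9.0, 8.0, 7.0],
    "geneD": [9.2, 7.9, 7.1],
}

dendrogram = hierarchical_cluster(profiles)
print(dendrogram)
```

With these toy profiles, the two rising genes are joined first, then the two falling genes, and the final merge joins the two pairs, giving `(('geneA', 'geneB'), ('geneC', 'geneD'))`. Each nested tuple corresponds to one branch point of the drawn dendrogram.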

About this article

Cite this article

Butte, A. The use and analysis of microarray data. Nat Rev Drug Discov 1, 951–960 (2002). https://doi.org/10.1038/nrd961

Issue Date: 01 December 2002
DOI: https://doi.org/10.1038/nrd961

Related links

DATABASES

Cancer.gov

LocusLink
