Key Points
-
Functional genomics is the study of gene function through the parallel expression measurements of genomes. The tools used to carry out these measurements most commonly include complementary DNA microarrays, oligonucleotide microarrays or serial analysis of gene expression (SAGE). Regardless of the specific technique, the end result is 4,000–50,000 measurements of gene expression per sample. As a complete experiment might involve up to hundreds of microarrays, the resultant RNA expression data sets can vary greatly in size.
-
In addition to their use in basic research and target discovery, there are many other uses of functional genomics in drug discovery, including biomarker determination, pharmacology, toxicogenomics, target selectivity, development of prognostic tests and disease subclass determination.
-
Current methodologies to analyse RNA expression data sets can be roughly divided into two categories: supervised approaches, or analysis to determine genes that fit a specified pattern; and unsupervised approaches, or analysis looking for characterization of the components of a data set, without the a priori input of a training signal.
-
Hierarchical clustering is particularly advantageous in representing all the expression patterns seen in an experiment in a compact way. Self-organizing maps provide a two-dimensional visual survey of expression patterns with fewer computational requirements than hierarchical clustering. Relevance networks are constructed from pairs of genes with strong positive or negative correlation, and can include phenotypic measurements. Principal components analysis is used for visualization, displaying samples on coordinate axes that capture the most variance in the data.
-
Nearest-neighbour methods find those genes that are most similar to an ideal gene pattern. Support vector machines are used to separate biological samples from differing conditions or diseases, by finding a plane to separate them in a higher-dimensional feature-rich space.
-
Challenges after analysis can include linking probes to genes and other biological knowledge, a process that never ends. Operationally, one is never done analysing a set of microarray data. The analysis of microarray data sets in a setting devoid of biological knowledge will be less rewarding than tapping into that knowledge. Finally, in the application of functional genomics to drug discovery, to extract the most information from microarrays, an open mind is needed with regard to the choices of analytical methods, using supervised methods, unsupervised methods and methods yet to be invented.
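The supervised and unsupervised approaches named in the key points above can be illustrated with a minimal sketch. It is not the article's implementation: the gene names, expression values, the 0.9 correlation threshold and the function names are all hypothetical, and Pearson correlation is assumed as the similarity measure. The first function ranks genes by similarity to an ideal pattern (a nearest-neighbour method), and the second links strongly correlated or anticorrelated gene pairs (a relevance network).

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    denom = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / denom


def nearest_neighbours(expression, ideal, k=2):
    """Supervised: return the k genes most correlated with an ideal pattern."""
    scored = [(gene, pearson(profile, ideal)) for gene, profile in expression.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def relevance_network(expression, threshold=0.9):
    """Unsupervised: link gene pairs whose |r| meets the (assumed) threshold."""
    edges = []
    for (g1, p1), (g2, p2) in combinations(expression.items(), 2):
        r = pearson(p1, p2)
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges


# Toy data (hypothetical): four genes measured in six samples.
expression = {
    "geneA": [1.0, 1.2, 0.9, 5.1, 4.8, 5.3],
    "geneB": [0.8, 1.1, 1.0, 4.9, 5.2, 5.0],
    "geneC": [5.0, 4.7, 5.2, 1.1, 0.9, 1.0],
    "geneD": [2.0, 3.5, 1.0, 2.2, 3.1, 1.4],
}
# Ideal pattern: low in the first three samples, high in the last three.
ideal = [0, 0, 0, 1, 1, 1]

top = nearest_neighbours(expression, ideal)
edges = relevance_network(expression)
```

On this toy data, the nearest-neighbour search returns geneA and geneB, which follow the ideal pattern. The relevance network links geneA, geneB and geneC, where geneC is joined by negative correlations, and leaves geneD unconnected. A real analysis would also need normalization and an assessment of statistical significance, which this sketch omits.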
Abstract
Functional genomics is the study of gene function through the parallel expression measurements of genomes, most commonly using the technologies of microarrays and serial analysis of gene expression. Microarray usage in drug discovery is expanding, and its applications include basic research and target discovery, biomarker determination, pharmacology, toxicogenomics, target selectivity, development of prognostic tests and disease-subclass determination. This article reviews the different ways to analyse large sets of microarray data, including the questions that can be asked and the challenges in interpreting the measurements.
Acknowledgements
The author wishes to thank T. Deshpande, A. Kho, M. Ramoni and I. Kohane for critical comments and interesting discussions on the manuscript. During the writing of this work, the author has been funded by and wishes to thank the Endocrine Fellows Foundation, the Genentech Centre for Clinical Research and Education, the Lawson Wilkins Paediatric Endocrinology Society, the Harvard Centre for Neurodegenerative Research and the Merck–Massachusetts Institute of Technology partnership. The author was also supported in part by grants from the National Heart, Lung and Blood Institute, the National Institute of Diabetes and Digestive and Kidney Diseases, and the National Institute of Neurological Disorders and Stroke.
Related links
DATABASES
Cancer.gov
LocusLink
Glossary
- SPLINES
-
Instead of fitting a complex polynomial curve to data, splines allow the fitting of data by putting together smaller, less complex curves.
- NORTHERN BLOT
-
Different RNA molecules are separated by mass on a gel, then radioactively labelled complementary DNA or RNA molecules are used to quantify specific RNA amounts.
- REVERSE TRANSCRIPTION
-
The synthesis of a strand of DNA from RNA, which is used to make a complementary DNA copy of sample RNA.
- BAYESIAN NETWORK
-
A graphical representation in which variables (that is, genes) are represented as nodes. Arrows between nodes represent conditional dependence, which under additional assumptions can be interpreted as causal association.
- PEARSON CORRELATION COEFFICIENT
-
A measurement of the strength of the linear association between two variables, calculated as their covariance divided by the product of their standard deviations. It ranges from −1 (perfect negative linear relationship) to +1 (perfect positive linear relationship).
- RANK CORRELATION COEFFICIENT
-
Points are restated in terms of their ordinal rank (for example, first, second, third) before calculation of the correlation coefficient.
- DENDROGRAM
-
A visual representation of hierarchical clusters.
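The difference between the Pearson and rank correlation coefficients defined in the glossary above can be shown with a short sketch. The data and function names here are hypothetical, and ties are assumed to receive the mean of their rank positions, as is conventional for Spearman's coefficient.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def rank_correlation(x, y):
    """Rank correlation: Pearson correlation computed on the ordinal ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical measurements: monotonic but far from linear.
dose = [1, 2, 3, 4, 5]
response = [1, 4, 9, 16, 100]

linear_r = pearson(dose, response)
rank_r = rank_correlation(dose, response)
```

Because the response rises with every increase in dose, the rank correlation is 1, whereas the Pearson coefficient is noticeably lower because the final point departs from a straight line. This is why rank-based measures are often preferred when a single outlying measurement could dominate a linear fit.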
Cite this article
Butte, A. The use and analysis of microarray data. Nat Rev Drug Discov 1, 951–960 (2002). https://doi.org/10.1038/nrd961