Computational analysis of microarray data

Quackenbush, John

doi:10.1038/35076576

Review Article
Published: 01 June 2001

Computational genetics

Nature Reviews Genetics volume 2, pages 418–427 (2001)

Key Points

The completion of the sequencing of a large number of prokaryotic and eukaryotic genomes presents several challenges and opportunities, including the functional classification of predicted genes.
Microarray analysis promises to contribute to the functional annotation of genomes and has already provided a wealth of genome-wide expression data.
Much attention has been focused on experimental protocols for microarray studies, but the strategies for data analysis have a profound (and perhaps underappreciated) effect on the interpretation of the results.
Expression data from each experiment must first be normalized to account for systematic experimental variation, including unequal dye incorporation and detection efficiencies.
For comparison between experiments, data are often first filtered to select a subset or to exclude genes for which many values are missing. A distance metric must then be chosen, which determines how we measure similarity between gene-expression patterns. Genes and experiments can then be grouped using various computational methods. Each step can influence how the expression data are grouped.
Clustering algorithms, which are the most widely used approaches to analysing gene expression, can be classified as hierarchical or non-hierarchical (self-organizing maps (SOMs), k-means clustering and principal component analysis), agglomerative (hierarchical) or divisive (k-means, SOMs), and supervised (support vector machine) or non-supervised (hierarchical and k-means clustering, SOMs).
A synthetic data set with well-defined relationships between genes is used to show the differences between some of these methods.
The choice of data analysis strategy should be influenced by the purpose of the microarray experiment, and the user's knowledge of the biology of the system under investigation.
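
The steps listed in these key points (normalize, filter, choose a distance metric, cluster) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used in the article: median centring of log2 ratios stands in for normalization, one minus the Pearson correlation stands in for the distance metric, and average linkage stands in for the agglomerative rule. The function names, the Cy5/Cy3 channel labels and the toy profiles are all hypothetical.

```python
import math
from statistics import median

def log_ratios(cy5, cy3):
    """Median-centred log2(Cy5/Cy3) ratios for one two-colour array.

    Subtracting the median is one simple global correction for unequal
    dye incorporation and detection efficiency (an assumed choice here).
    """
    raw = [math.log2(r / g) for r, g in zip(cy5, cy3)]
    centre = median(raw)
    return [x - centre for x in raw]

def drop_sparse(profiles, max_missing=0):
    """Keep only genes with at most max_missing missing (None) values."""
    return {gene: values for gene, values in profiles.items()
            if sum(v is None for v in values) <= max_missing}

def pearson_distance(x, y):
    """1 - Pearson correlation: near 0 for similar shapes, near 2 for opposite.

    Raises ZeroDivisionError for a flat profile (zero variance).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

def average_linkage(profiles, n_clusters):
    """Agglomerative clustering: merge the two closest clusters until
    only n_clusters remain. Cluster distance is the mean pairwise distance."""
    clusters = [[gene] for gene in profiles]

    def dist(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        total = sum(pearson_distance(profiles[a], profiles[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # j > i, so index i stays valid
        del clusters[j]
    return [sorted(c) for c in clusters]

if __name__ == "__main__":
    # Toy, made-up profiles: two rising genes, two falling genes, one sparse gene.
    profiles = {
        "geneA": [1.0, 2.0, 3.0, 4.0],
        "geneB": [2.0, 4.0, 6.0, 8.0],
        "geneC": [4.0, 3.0, 2.0, 1.0],
        "geneD": [8.0, 6.0, 4.0, 2.0],
        "geneE": [1.0, None, None, 2.0],
    }
    usable = drop_sparse(profiles)
    print(average_linkage(usable, n_clusters=2))
```

Swapping `pearson_distance` for a Euclidean distance, or changing the filter threshold, would change which genes group together, which is the point the key points make about each step influencing the final grouping.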

Abstract

Microarray experiments are providing unprecedented quantities of genome-wide data on gene-expression patterns. Although this technique has been enthusiastically developed and applied in many biological contexts, the management and analysis of the millions of data points that result from these experiments have received less attention. Sophisticated computational tools are available, but the methods that are used to analyse the data can have a profound influence on the interpretation of the results. A basic understanding of these computational tools is therefore required for optimal experimental design and meaningful data analysis.

**Figure 2: A synthetic gene-expression data set.**

**Figure 4: Principal component analysis.**

**Figure 5: The effect of data filtering.**

Acknowledgements

Cluster analysis was done using The Institute for Genomic Research MeV software package developed by A. Sturn, A. I. Saeed and J.Q., which is available at http://pga.tigr.org/tools.shtml, along with the sample data set used here. The author also thanks A. Sturn, N. H. Lee, R. L. Malek and E. Snesrud for valuable discussions and comments. This work is supported by grants from the US National Science Foundation, the US National Cancer Institute, and the US National Heart, Lung, and Blood Institute.

Author information

Authors and Affiliations

The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA
John Quackenbush

Supplementary information

Supplementary figure 1 k-means clustering and Supplementary box 1 Distance metrics (PDF 242 kb)

Glossary

CLUSTER ANALYSIS: The term 'cluster analysis' actually encompasses several different classification algorithms that can be used to develop taxonomies (typically as part of exploratory data analysis). Note that in this classification, the higher the level of aggregation, the less similar are members in the respective class.
CENTROID: The centroid of a cluster is the weighted average point in the multidimensional space; in a sense, it is the centre of gravity for the respective cluster.
DENDROGRAM: A branching 'tree' diagram representing a hierarchy of categories on the basis of degree of similarity or number of shared characteristics, especially in biological taxonomy. The results of hierarchical clustering are presented as dendrograms, in which the distance along the tree from one element to the next represents their relative degree of similarity.
NEURAL NETWORKS: Neural networks are analytic techniques modelled after the (proposed) processes of learning in cognitive systems and the neurological functions of the brain. Neural networks use a data 'training set' to build rules capable of making predictions or classifications on data sets.
FACTOR ANALYSIS: Factor analysis is a data reduction and exploratory method similar to principal component analysis. Factor analysis techniques seek to reduce the number of variables and to detect structure in the relationships between elements in an analysis.
PRINCIPAL COORDINATE ANALYSIS: Like principal component analysis, principal coordinate analysis seeks to reduce the dimensionality of a spatial representation of a data set by creating new coordinate axes that are a combination of the originals, and projecting the data onto those new axes.
HYPERPLANE: A hyperplane is an N-dimensional analogue of a line or plane, which divides an (N + 1)-dimensional space into two.
KERNEL FUNCTION: In support vector machines, the kernel function is a generalization of the distance metric; it measures the distance between two expression vectors as the data are projected into higher-dimensional space.
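
Three of the glossary terms above (centroid, hyperplane and kernel function) have short numerical definitions, sketched here in Python. This is an illustration only: the function names are hypothetical, and the Gaussian (RBF) kernel is just one common kernel choice, assumed here rather than taken from the article.

```python
import math

def centroid(vectors, weights=None):
    """Weighted average point of a cluster; equal weights give the plain mean."""
    if weights is None:
        weights = [1.0] * len(vectors)
    total = sum(weights)
    dims = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) / total
            for d in range(dims)]

def side_of_hyperplane(w, b, x):
    """Side of the hyperplane w.x + b = 0 on which x lies: +1, -1, or 0 if on it."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return (s > 0) - (s < 0)

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: 1 for identical vectors, decaying towards 0
    as the squared Euclidean distance between x and y grows."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

if __name__ == "__main__":
    # Made-up two-condition expression vectors for illustration.
    cluster = [[0.0, 0.0], [2.0, 4.0]]
    print(centroid(cluster))
    print(side_of_hyperplane([1.0, -1.0], 0.0, [3.0, 1.0]))
    print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))
```

In two dimensions the "hyperplane" is simply a line, so `side_of_hyperplane([1.0, -1.0], 0.0, x)` reports whether a point lies above or below the diagonal; a support vector machine searches for the `w` and `b` that best separate two labelled groups of expression vectors.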

About this article

Cite this article

Quackenbush, J. Computational analysis of microarray data. Nat Rev Genet 2, 418–427 (2001). https://doi.org/10.1038/35076576

Issue Date: 01 June 2001
DOI: https://doi.org/10.1038/35076576
