We explore an ensemble of multivariate statistical methods for the analysis of gene expression data from cDNA microarray experiments. The statistical questions we investigate are motivated by the experimental programme carried out in the laboratories of Professors Brown and Botstein at Stanford University to characterise the molecular variations among cancers of the breast, prostate, liver and brain, based on transcript abundance for tens of thousands of genes in several hundred tumour samples. The aims of the statistical analysis are to assist the biologists in (i) developing a classification or “taxonomy” of tumours based on gene expression data, and (ii) identifying a subset of “marker” genes that characterise the different tumour types defined in (i).

Addressing aims (i) and (ii) involves a synthesis of approaches from the fields of cluster analysis and discriminant analysis, known in the pattern recognition literature as unsupervised and supervised learning, respectively. Unsupervised methods (e.g. projection methods, partitioning methods such as k-medoids, hierarchical clustering and self-organising maps) are investigated to identify possible tumour types, some already recognised (e.g. tumour site of origin), others (most?) not. A concomitant question is the development of statistics for comparing different clusterings. Supervised methods (e.g. linear discriminant analysis, nearest neighbour methods, neural networks and tree-structured classifiers) are used to examine in greater detail the tumour types defined by the unsupervised learning methods and to identify a subset of marker genes that provides a more compact description of these tumour types. Our analysis focuses on the CART (Classification and Regression Trees) method and its extensions to perform variable (marker gene) selection and to develop a classifier for tumour types. We apply re-sampling methods such as bagging and boosting to improve the accuracy of the classifier and of the variable selection process.
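
To make the idea of bagging a tree-structured classifier for marker gene selection concrete, the following is a minimal sketch in Python and not the implementation used in this work. It rests on several assumptions: each tree is reduced to a single split (a "stump") chosen by Gini impurity, the data are synthetic (40 samples, 6 genes, with only gene 2 differing between two hypothetical tumour types "A" and "B"), and all function names are illustrative. The number of bootstrap rounds in which a gene is chosen for the split serves as a crude measure of how stable that gene is as a marker.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y):
    """Return (gene, threshold, left_label, right_label) for the single
    split with the lowest weighted Gini impurity."""
    n, p = len(X), len(X[0])
    best = None
    for j in range(p):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between distinct observed values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, t, majority(left), majority(right))
    if best is None:  # no split possible: every gene is constant
        label = majority(y)
        return (None, 0.0, label, label)
    return best[1:]


def predict_stump(stump, row):
    gene, t, left_label, right_label = stump
    if gene is None:
        return left_label
    return left_label if row[gene] <= t else right_label


def bagged_stumps(X, y, n_rounds, rng):
    """Fit one stump per bootstrap sample (drawn with replacement)."""
    n = len(X)
    stumps = []
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return stumps


def predict_bagged(stumps, row):
    """Aggregate the stumps by majority vote."""
    return majority([predict_stump(s, row) for s in stumps])


def selection_counts(stumps):
    """How often each gene was chosen as the split variable."""
    return Counter(s[0] for s in stumps if s[0] is not None)


def demo(seed=0, n_rounds=25):
    """Synthetic illustration: only gene 2 separates the two classes."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(40):
        label = "A" if i < 20 else "B"
        row = [rng.gauss(0.0, 1.0) for _ in range(6)]
        if label == "B":
            row[2] += 4.0  # assumed shift in expression for the marker gene
        X.append(row)
        y.append(label)
    stumps = bagged_stumps(X, y, n_rounds, rng)
    correct = sum(predict_bagged(stumps, r) == lab for r, lab in zip(X, y))
    return selection_counts(stumps), correct / len(X)


counts, training_accuracy = demo()
print("split-variable counts by gene index:", dict(counts))
print("training accuracy of bagged stumps:", training_accuracy)
```

Stumps are used here only to keep the sketch short; a full CART procedure would grow and prune deeper trees, and the accuracy reported above is measured on the training samples, so it is optimistic compared with a cross-validated estimate.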